Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
Just checked from where the script is submitted, i.e. with respect to the
driver: the Python envs are different. The Jupyter one is running within a
virtual environment with Python 2.7.5, and the spark-submit one uses 2.6.6.
But the executors have the same Python version, right? I tried doing a
spark-submit from the Jupyter shell; it fails to find Python 2.7, which is
not there, and hence throws an error.
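If the two environments really do differ, it may be worth pinning the interpreter explicitly so both paths run under the same Python. A hedged sketch (the virtualenv path is illustrative, and the executors must have an interpreter at whatever path you point them to):

```shell
# Make spark-submit use the same interpreter as the Jupyter virtualenv,
# for both the driver and the executors, so both runs use Python 2.7.5.
# /path/to/venv is a placeholder for the actual virtualenv location.
export PYSPARK_PYTHON=/path/to/venv/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python2.7
spark-submit \
  --conf spark.pyspark.python=/path/to/venv/bin/python2.7 \
  my_script.py
```

If the executor hosts don't have that path, shipping the environment (e.g. via a conda/virtualenv archive) would be needed instead.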

Here is the udf which might take time:

import base64
import zlib

def decompress(data):
    # base64-decode the string back into compressed bytes
    bytecode = base64.b64decode(data)
    # 32 + MAX_WBITS lets zlib auto-detect a gzip or zlib header
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')
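As a sanity check outside Spark, the decompression logic round-trips with plain Python. A self-contained sketch; constructing the sample payload with zlib.compress + b64encode is an assumption about how the upstream data was produced:

```python
import base64
import zlib

def decompress(data):
    # base64-decode, then inflate; 32 + MAX_WBITS auto-detects gzip/zlib headers
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')

# Hypothetical payload built the way the pipeline presumably does:
# compress with zlib, then base64-encode.
original = 'hello spark'
payload = base64.b64encode(zlib.compress(original.encode('utf-8')))

assert decompress(payload) == original
```

If this holds, the UDF body itself is unlikely to explain a per-task time difference between the two submission modes.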


Could this be because of the Python environment mismatch on the driver
side? But doesn't the processing happen on the executor side?




*Regards,Dhrub*

On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari 
wrote:

> Maybe you can try running it in a python shell or jupyter-console/ipython
> instead of a spark-submit and check how much time it takes too.
>
> Compare the env variables to check that no additional env configuration is
> present in either environment.
>
> Also is the python environment for both the exact same? I ask because it
> looks like you're using a UDF and if the Jupyter python has (let's say)
> numpy compiled with blas it would be faster than a numpy without it. Etc.
> I.E. Some library you use may be using pure python and another may be using
> a faster C extension...
>
> What python libraries are you using in the UDFs? If you don't use UDFs at
> all and use some very simple pure spark functions does the time difference
> still exist?
>
> Also are you using dynamic allocation or some similar spark config which
> could vary performance between runs because the same resources were not
> utilized on Jupyter / spark-submit?
>
>
> On Wed, Sep 11, 2019, 08:43 Stephen Boesch  wrote:
>
>> Sounds like you have done your homework to properly compare .   I'm
>> guessing the answer to the following is yes .. but in any case:  are they
>> both running against the same spark cluster with the same configuration
>> parameters especially executor memory and number of workers?
>>
>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
>> dhruba.w...@gmail.com>:
>>
>>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>>> compressed base64 encoded text data from a hive table and decompressing and
>>> decoding in one of the udfs. Also the time compared is from Spark UI not
>>> how long the job actually takes after submission. Its just the running time
>>> i am comparing/mentioning.
>>>
>>> As mentioned earlier, all the spark conf params even match in two
scripts and that's why I am puzzled about what's going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
It's not obvious from what you pasted, but perhaps the jupyter notebook
 already is connected to a running spark context, while spark-submit needs
 to get a new spot in the (YARN?) queue.

 I would check the cluster job IDs for both to ensure you're getting new
 cluster tasks for each.

 On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
 wrote:

> Hi,
>
> I am facing a weird behaviour while running a python script. Here is
> what the code looks like mostly:
>
> def fn1(ip):
>some code...
> ...
>
> def fn2(row):
> ...
> some operations
> ...
> return row1
>
>
> udf_fn1 = udf(fn1)
> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
> ~4500 partitions
> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
> .drop("colz") \
> .withColumnRenamed("colz", "coly")
>
> edf = ddf \
> .filter(ddf.colp == 'some_value') \
> .rdd.map(lambda row: fn2(row)) \
> .toDF()
>
> print edf.count() // simple way for the performance test in both
> platforms
>
> Now when I run the same code in a brand new jupyter notebook it runs
> 6x faster than when I run this python script using spark-submit. The
> configurations are printed and  compared from both the platforms and they
> are exact same. I even tried to run this script in a single cell of 
> jupyter
> notebook and still have the same performance. I need to understand if I am
> missing something in the spark-submit which is causing the issue.  I tried
> to minimise the script to reproduce the same error without much code.
>
> Both are run in client mode on a yarn based spark cluster. The
> machines from which both are executed are also the same and from same 
> user.
>
> What I found is that the quantile values for the median for the one run with
> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
> able to figure out why this 

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Abdeali Kothari
Maybe you can try running it in a python shell or jupyter-console/ipython
instead of a spark-submit and check how much time it takes too.

Compare the env variables to check that no additional env configuration is
present in either environment.
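One way to follow that suggestion concretely is to dump `dict(os.environ)` in each environment and diff the dumps. A minimal sketch; the two dictionaries below are hypothetical stand-ins for the captured dumps:

```python
import os

def env_diff(env_a, env_b):
    """Return keys whose values differ, or that exist in only one env."""
    keys = set(env_a) | set(env_b)
    return {k: (env_a.get(k), env_b.get(k))
            for k in keys if env_a.get(k) != env_b.get(k)}

# Hypothetical dumps, e.g. captured in each environment via dict(os.environ)
jupyter_env = {'PYSPARK_PYTHON': '/venv/bin/python2.7', 'SPARK_HOME': '/opt/spark'}
submit_env  = {'PYSPARK_PYTHON': '/usr/bin/python2.6',  'SPARK_HOME': '/opt/spark'}

# Only the keys that actually differ are reported
print(env_diff(jupyter_env, submit_env))
```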

Also is the python environment for both the exact same? I ask because it
looks like you're using a UDF and if the Jupyter python has (let's say)
numpy compiled with blas it would be faster than a numpy without it. Etc.
I.E. Some library you use may be using pure python and another may be using
a faster C extension...

What Python libraries are you using in the UDFs? If you don't use UDFs at
all and use some very simple pure Spark functions, does the time difference
still exist?

Also, are you using dynamic allocation or some similar Spark config which
could vary performance between runs because the same resources were not
utilized on Jupyter / spark-submit?
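To rule that out, the allocation can be pinned explicitly. A hedged sketch using the resource figures mentioned later in this thread (128 executors, 2 cores, 8 GB each); the script name is a placeholder:

```shell
# Disable dynamic allocation and fix the resources for the spark-submit run,
# so both runs are guaranteed identical allocations.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 128 \
  --executor-cores 2 \
  --executor-memory 8g \
  my_script.py
```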


On Wed, Sep 11, 2019, 08:43 Stephen Boesch  wrote:

> Sounds like you have done your homework to properly compare .   I'm
> guessing the answer to the following is yes .. but in any case:  are they
> both running against the same spark cluster with the same configuration
> parameters especially executor memory and number of workers?
>
> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
> dhruba.w...@gmail.com>:
>
>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>> compressed base64 encoded text data from a hive table and decompressing and
>> decoding in one of the udfs. Also the time compared is from Spark UI not
>> how long the job actually takes after submission. Its just the running time
>> i am comparing/mentioning.
>>
>> As mentioned earlier, all the spark conf params even match in two scripts
>> and that's why I am puzzled about what's going on.
>>
>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
>> wrote:
>>
>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>> already is connected to a running spark context, while spark-submit needs
>>> to get a new spot in the (YARN?) queue.
>>>
>>> I would check the cluster job IDs for both to ensure you're getting new
>>> cluster tasks for each.
>>>
>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>>> wrote:
>>>
 Hi,

 I am facing a weird behaviour while running a python script. Here is
 what the code looks like mostly:

 def fn1(ip):
some code...
 ...

 def fn2(row):
 ...
 some operations
 ...
 return row1


 udf_fn1 = udf(fn1)
 cdf = spark.read.table("") //hive table is of size > 500 Gigs with
 ~4500 partitions
 ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
 .drop("colz") \
 .withColumnRenamed("colz", "coly")

 edf = ddf \
 .filter(ddf.colp == 'some_value') \
 .rdd.map(lambda row: fn2(row)) \
 .toDF()

 print edf.count() // simple way for the performance test in both
 platforms

 Now when I run the same code in a brand new jupyter notebook it runs 6x
 faster than when I run this python script using spark-submit. The
 configurations are printed and  compared from both the platforms and they
 are exact same. I even tried to run this script in a single cell of jupyter
 notebook and still have the same performance. I need to understand if I am
 missing something in the spark-submit which is causing the issue.  I tried
 to minimise the script to reproduce the same error without much code.

 Both are run in client mode on a yarn based spark cluster. The machines
 from which both are executed are also the same and from same user.

What I found is that the quantile values for the median for the one run with
jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
able to figure out why this is happening.

 Any one faced this kind of issue before or know how to resolve this?

 *Regards,*
 *Dhrub*

>>>
>>>
>>> --
>>>
>>>
>>> *Patrick McCarthy  *
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Ok. Can't think of why that would happen.

Am Di., 10. Sept. 2019 um 20:26 Uhr schrieb Dhrubajyoti Hati <
dhruba.w...@gmail.com>:

> As mentioned in the very first mail:
> * same cluster it is submitted.
> * from same machine they are submitted and also from same user
> * each of them has 128 executors and 2 cores per executor with 8Gigs of
> memory each and both of them are getting that while running
>
> to clarify more let me quote what I mentioned above. *These data is taken
> from Spark-UI when the jobs are almost finished in both.*
> "What I found is that the quantile values for the median for the one run with
> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins," which
> means the per-task time is much higher in the spark-submit script than in
> the jupyter script. This is where I am really puzzled, because they are the
> exact same code. Why does running them in two different ways vary so much in
> execution time?
>
>
>
>
> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch  wrote:
>
>> Sounds like you have done your homework to properly compare .   I'm
>> guessing the answer to the following is yes .. but in any case:  are they
>> both running against the same spark cluster with the same configuration
>> parameters especially executor memory and number of workers?
>>
>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
>> dhruba.w...@gmail.com>:
>>
>>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>>> compressed base64 encoded text data from a hive table and decompressing and
>>> decoding in one of the udfs. Also the time compared is from Spark UI not
>>> how long the job actually takes after submission. Its just the running time
>>> i am comparing/mentioning.
>>>
>>> As mentioned earlier, all the spark conf params even match in two
scripts and that's why I am puzzled about what's going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
It's not obvious from what you pasted, but perhaps the jupyter notebook
 already is connected to a running spark context, while spark-submit needs
 to get a new spot in the (YARN?) queue.

 I would check the cluster job IDs for both to ensure you're getting new
 cluster tasks for each.

 On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
 wrote:

> Hi,
>
> I am facing a weird behaviour while running a python script. Here is
> what the code looks like mostly:
>
> def fn1(ip):
>some code...
> ...
>
> def fn2(row):
> ...
> some operations
> ...
> return row1
>
>
> udf_fn1 = udf(fn1)
> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
> ~4500 partitions
> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
> .drop("colz") \
> .withColumnRenamed("colz", "coly")
>
> edf = ddf \
> .filter(ddf.colp == 'some_value') \
> .rdd.map(lambda row: fn2(row)) \
> .toDF()
>
> print edf.count() // simple way for the performance test in both
> platforms
>
> Now when I run the same code in a brand new jupyter notebook it runs
> 6x faster than when I run this python script using spark-submit. The
> configurations are printed and  compared from both the platforms and they
> are exact same. I even tried to run this script in a single cell of 
> jupyter
> notebook and still have the same performance. I need to understand if I am
> missing something in the spark-submit which is causing the issue.  I tried
> to minimise the script to reproduce the same error without much code.
>
> Both are run in client mode on a yarn based spark cluster. The
> machines from which both are executed are also the same and from same 
> user.
>
> What I found is that the quantile values for the median for the one run with
> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
> able to figure out why this is happening.
>
> Any one faced this kind of issue before or know how to resolve this?
>
> *Regards,*
> *Dhrub*
>


 --


 *Patrick McCarthy  *

 Senior Data Scientist, Machine Learning Engineering

 Dstillery

 470 Park Ave South, 17th Floor, NYC 10016

>>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
As mentioned in the very first mail:
* it is submitted to the same cluster
* they are submitted from the same machine and by the same user
* each of them has 128 executors and 2 cores per executor with 8 GB of
memory each, and both of them are getting that while running

To clarify further, let me quote what I mentioned above. *These data are
taken from the Spark UI when the jobs are almost finished in both.*
"What I found is that the quantile values for the median for the one run with
jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins," which
means the per-task time is much higher in the spark-submit script than in
the jupyter script. This is where I am really puzzled, because they are the
exact same code. Why does running them in two different ways vary so much in
execution time?




*Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*


On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch  wrote:

> Sounds like you have done your homework to properly compare .   I'm
> guessing the answer to the following is yes .. but in any case:  are they
> both running against the same spark cluster with the same configuration
> parameters especially executor memory and number of workers?
>
> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
> dhruba.w...@gmail.com>:
>
>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>> compressed base64 encoded text data from a hive table and decompressing and
>> decoding in one of the udfs. Also the time compared is from Spark UI not
>> how long the job actually takes after submission. Its just the running time
>> i am comparing/mentioning.
>>
>> As mentioned earlier, all the spark conf params even match in two scripts
>> and that's why I am puzzled about what's going on.
>>
>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
>> wrote:
>>
>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>> already is connected to a running spark context, while spark-submit needs
>>> to get a new spot in the (YARN?) queue.
>>>
>>> I would check the cluster job IDs for both to ensure you're getting new
>>> cluster tasks for each.
>>>
>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>>> wrote:
>>>
 Hi,

 I am facing a weird behaviour while running a python script. Here is
 what the code looks like mostly:

 def fn1(ip):
some code...
 ...

 def fn2(row):
 ...
 some operations
 ...
 return row1


 udf_fn1 = udf(fn1)
 cdf = spark.read.table("") //hive table is of size > 500 Gigs with
 ~4500 partitions
 ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
 .drop("colz") \
 .withColumnRenamed("colz", "coly")

 edf = ddf \
 .filter(ddf.colp == 'some_value') \
 .rdd.map(lambda row: fn2(row)) \
 .toDF()

 print edf.count() // simple way for the performance test in both
 platforms

 Now when I run the same code in a brand new jupyter notebook it runs 6x
 faster than when I run this python script using spark-submit. The
 configurations are printed and  compared from both the platforms and they
 are exact same. I even tried to run this script in a single cell of jupyter
 notebook and still have the same performance. I need to understand if I am
 missing something in the spark-submit which is causing the issue.  I tried
 to minimise the script to reproduce the same error without much code.

 Both are run in client mode on a yarn based spark cluster. The machines
 from which both are executed are also the same and from same user.

What I found is that the quantile values for the median for the one run with
jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
able to figure out why this is happening.

 Any one faced this kind of issue before or know how to resolve this?

 *Regards,*
 *Dhrub*

>>>
>>>
>>> --
>>>
>>>
>>> *Patrick McCarthy  *
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to compare properly. I'm
guessing the answer to the following is yes, but in any case: are they
both running against the same Spark cluster with the same configuration
parameters, especially executor memory and number of workers?

Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
dhruba.w...@gmail.com>:

> No, i checked for that, hence written "brand new" jupyter notebook. Also
> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
> compressed base64 encoded text data from a hive table and decompressing and
> decoding in one of the udfs. Also the time compared is from Spark UI not
> how long the job actually takes after submission. Its just the running time
> i am comparing/mentioning.
>
> As mentioned earlier, all the spark conf params even match in two scripts
> and that's why I am puzzled about what's going on.
>
> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
> wrote:
>
>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>> already is connected to a running spark context, while spark-submit needs
>> to get a new spot in the (YARN?) queue.
>>
>> I would check the cluster job IDs for both to ensure you're getting new
>> cluster tasks for each.
>>
>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>> wrote:
>>
>>> Hi,
>>>
>>> I am facing a weird behaviour while running a python script. Here is
>>> what the code looks like mostly:
>>>
>>> def fn1(ip):
>>>some code...
>>> ...
>>>
>>> def fn2(row):
>>> ...
>>> some operations
>>> ...
>>> return row1
>>>
>>>
>>> udf_fn1 = udf(fn1)
>>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
>>> ~4500 partitions
>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>> .drop("colz") \
>>> .withColumnRenamed("colz", "coly")
>>>
>>> edf = ddf \
>>> .filter(ddf.colp == 'some_value') \
>>> .rdd.map(lambda row: fn2(row)) \
>>> .toDF()
>>>
>>> print edf.count() // simple way for the performance test in both
>>> platforms
>>>
>>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>>> faster than when I run this python script using spark-submit. The
>>> configurations are printed and  compared from both the platforms and they
>>> are exact same. I even tried to run this script in a single cell of jupyter
>>> notebook and still have the same performance. I need to understand if I am
>>> missing something in the spark-submit which is causing the issue.  I tried
>>> to minimise the script to reproduce the same error without much code.
>>>
>>> Both are run in client mode on a yarn based spark cluster. The machines
>>> from which both are executed are also the same and from same user.
>>>
>>> What I found is that the quantile values for the median for the one run with
>>> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
>>> able to figure out why this is happening.
>>>
>>> Any one faced this kind of issue before or know how to resolve this?
>>>
>>> *Regards,*
>>> *Dhrub*
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
No, I checked for that, hence the words "brand new" jupyter notebook. Also,
the times taken by the two are 30 mins and ~3 hrs, as I am reading ~500 GB of
compressed, base64-encoded text data from a Hive table and decompressing and
decoding it in one of the UDFs. Also, the time compared is from the Spark UI,
not how long the job actually takes after submission; it's just the running
time I am comparing/mentioning.

As mentioned earlier, all the Spark conf params match in the two scripts,
and that's why I am puzzled about what's going on.

On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
wrote:

> It's not obvious from what you pasted, but perhaps the jupyter notebook
> already is connected to a running spark context, while spark-submit needs
> to get a new spot in the (YARN?) queue.
>
> I would check the cluster job IDs for both to ensure you're getting new
> cluster tasks for each.
>
> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
> wrote:
>
>> Hi,
>>
>> I am facing a weird behaviour while running a python script. Here is what
>> the code looks like mostly:
>>
>> def fn1(ip):
>>some code...
>> ...
>>
>> def fn2(row):
>> ...
>> some operations
>> ...
>> return row1
>>
>>
>> udf_fn1 = udf(fn1)
>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
>> ~4500 partitions
>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>> .drop("colz") \
>> .withColumnRenamed("colz", "coly")
>>
>> edf = ddf \
>> .filter(ddf.colp == 'some_value') \
>> .rdd.map(lambda row: fn2(row)) \
>> .toDF()
>>
>> print edf.count() // simple way for the performance test in both platforms
>>
>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>> faster than when I run this python script using spark-submit. The
>> configurations are printed and  compared from both the platforms and they
>> are exact same. I even tried to run this script in a single cell of jupyter
>> notebook and still have the same performance. I need to understand if I am
>> missing something in the spark-submit which is causing the issue.  I tried
>> to minimise the script to reproduce the same error without much code.
>>
>> Both are run in client mode on a yarn based spark cluster. The machines
>> from which both are executed are also the same and from same user.
>>
>> What I found is that the quantile values for the median for the one run with
>> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
>> able to figure out why this is happening.
>>
>> Any one faced this kind of issue before or know how to resolve this?
>>
>> *Regards,*
>> *Dhrub*
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Request for contributor permissions

2019-09-10 Thread Takeshi Yamamuro
Hi, Alaa

Thanks for reaching out!
You can file a JIRA without any special permission.

By the way, have you checked the contribution guide?
https://spark.apache.org/contributing.html
It is worth reading before contributing.

Bests,
Takeshi

On Wed, Sep 11, 2019 at 4:37 AM Alaa Zbair  wrote:

> Hello dev,
>
> I am interested in contributing to the Spark project; please add me to the
> contributors list. My Jira username is: Chilio
>
> Thanks.
>
> Alaa Zbair.
>
>

-- 
---
Takeshi Yamamuro


Request for contributor permissions

2019-09-10 Thread Alaa Zbair
Hello dev,

I am interested in contributing to the Spark project; please add me to the
contributors list. My Jira username is: Chilio

Thanks.

Alaa Zbair.


[jira] Lantao Jin shared "SPARK-29038: SPIP: Support Spark Materialized View" with you

2019-09-10 Thread Lantao Jin (Jira)
Lantao Jin shared an issue with you


SPIP: Support Spark Materialized View

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMSs to cache data to
> accelerate queries. By creating a materialized view through SQL, the data
> that can be cached is very flexible and can be configured arbitrarily
> according to specific usage scenarios. The Materialization Manager
> automatically updates the cached data according to changes in the detail
> source tables, simplifying user work. When a user submits a query, the Spark
> optimizer rewrites the execution plan based on the available materialized
> views to determine the optimal execution plan.
> Details in [design
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming some new committers and PMC members

2019-09-10 Thread Stavros Kontopoulos
Congrats! Well deserved.

On Tue, Sep 10, 2019 at 1:20 PM Driesprong, Fokko 
wrote:

> Congrats all, well deserved!
>
>
> Cheers, Fokko
>
> Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi <
> gabor.g.somo...@gmail.com>:
>
>> Congrats Guys!
>>
>> G
>>
>>
>> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add several new committers and one PMC
>>> member. Join me in welcoming them to their new roles!
>>>
>>> New PMC member: Dongjoon Hyun
>>>
>>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
>>> Weichen Xu, Ruifeng Zheng
>>>
>>> The new committers cover lots of important areas including ML, SQL, and
>>> data sources, so it’s great to have them here. All the best,
>>>
>>> Matei and the Spark PMC
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

--


Re: Welcoming some new committers and PMC members

2019-09-10 Thread Driesprong, Fokko
Congrats all, well deserved!


Cheers, Fokko

Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi :

> Congrats Guys!
>
> G
>
>
> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers and one PMC
>> member. Join me in welcoming them to their new roles!
>>
>> New PMC member: Dongjoon Hyun
>>
>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
>> Weichen Xu, Ruifeng Zheng
>>
>> The new committers cover lots of important areas including ML, SQL, and
>> data sources, so it’s great to have them here. All the best,
>>
>> Matei and the Spark PMC
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[DISCUSS][SPIP][SPARK-29031] Materialized columns

2019-09-10 Thread Jason Guo
Hi,

I'd like to propose a feature named materialized columns. This feature will
speed up queries on complex-type columns.


https://docs.google.com/document/d/186bzUv4CRwoYY_KliNWTexkNCysQo3VUTLQVrVijyl4/edit?usp=sharing

*Background*
In the data warehouse domain, there is a common requirement to add new fields
to existing tables. In practice, data engineers usually use a complex type,
such as Map (or they may use JSON), and put all subfields into it.
However, it may hurt query performance dramatically because:

   1. It is a waste of IO. The whole column (in Map format) has to be read,
   and Spark extracts the required keys from the map, even though the query
   requires only one or a few keys in the map.
   2. Vectorized reads cannot be exploited. Currently, vectorized reads can
   be enabled only when all required columns are of atomic types. When a
   query reads a subfield of a complex-type column, vectorized reads cannot
   be used.
   3. Filter pushdown cannot be utilized. Only when all required fields
   are of atomic types can filter pushdown be enabled.
   4. CPU is wasted because of duplicated computation. When JSON is chosen
   to store all keys, JSON parsing happens each time we query a subfield in
   it. However, JSON parsing is a CPU-intensive operation, especially when
   the JSON string is very long.

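The per-row JSON-parsing cost in point 4 is easy to see outside Spark. A hedged micro-benchmark sketch (the row shape and key names are illustrative; absolute timings are machine-dependent, the repeated-parse overhead is the point):

```python
import json
import timeit

# A wide-ish row stored as one JSON string, like the "params" column
row = {'k%d' % i: 'v%d' % i for i in range(50)}
row['city'] = 'NYC'
params = json.dumps(row)   # the complex "params" column
city = row['city']         # what a materialized "city" column would hold

# Extracting one key by parsing the JSON blob on every access...
parse_each_time = timeit.timeit(
    lambda: json.loads(params)['city'], number=10000)

# ...versus reading an already-materialized plain value
read_direct = timeit.timeit(lambda: city, number=10000)

assert json.loads(params)['city'] == 'NYC'
print('parse per access: %.4fs, direct read: %.4fs'
      % (parse_each_time, read_direct))
```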

*Goal*

   - Add a new SQL grammar for materialized columns
   - Implicitly rewrite SQL queries on complex-type columns if there
   is a materialized column for them
   - If the data type of the materialized column is an atomic type, even
   though the origin column is of a complex type, enable vectorized reads
   and filter pushdown to improve performance


*Usage*
*#1 Add materialized columns to an existing table*
Step 1: Create a normal table

> CREATE TABLE x (
> name STRING,
> age INT,
> params STRING,
> event MAP<STRING, STRING>
> ) USING parquet;


Step 2: Add materialized columns to an existing table

> ALTER TABLE x ADD COLUMNS (
> new_age INT *MATERIALIZED* age + 1,
> city STRING *MATERIALIZED* get_json_object(params, '$.city'),
> label STRING *MATERIALIZED* event['label']
> );


*#2 Create a new table with materialized table*

> CREATE TABLE x (
> name STRING,
> age INT,
> params STRING,
> event MAP<STRING, STRING>,
> new_age INT MATERIALIZED age + 1,
> city STRING MATERIALIZED get_json_object(params, '$.city'),
> label STRING MATERIALIZED event['label']
> ) USING parquet;



When issuing a query on the complex-type columns as below:
SELECT name, age+1, get_json_object(params, '$.city'), event['label']
FROM x
WHERE event['label']='newuser';

It is equivalent to
SELECT name, new_age, city, label
FROM x
WHERE label = 'newuser'

The query performance improves dramatically because:

   1. The new query (after rewriting) will read the new column city (of
   string type) instead of reading the whole params map. Much less data
   needs to be read.
   2. Vectorized reads can be utilized in the new query but not in the old
   one, because vectorized reads can only be enabled when all required
   columns are of atomic types.
   3. Filters can be pushed down. Only filters on atomic columns can be
   pushed down. The original filter event['label'] = 'newuser' is on a
   complex column, so it cannot be pushed down.
   4. The new query does not need to parse JSON any more. JSON parsing is a
   CPU-intensive operation which impacts performance dramatically.

-- 


Thanks & Best Regards,
Jason Guo


Re: Welcoming some new committers and PMC members

2019-09-10 Thread Gabor Somogyi
Congrats Guys!

G


On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers and one PMC
> member. Join me in welcoming them to their new roles!
>
> New PMC member: Dongjoon Hyun
>
> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
> Weichen Xu, Ruifeng Zheng
>
> The new committers cover lots of important areas including ML, SQL, and
> data sources, so it’s great to have them here. All the best,
>
> Matei and the Spark PMC
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


RE: Welcoming some new committers and PMC members

2019-09-10 Thread Dilip Biswal
Congratulations !! Very well deserved.
 
-- Dilip
 
- Original message -
From: Kazuaki Ishizaki
To: Matei Zaharia
Cc: dev
Subject: [EXTERNAL] Re: Welcoming some new committers and PMC members
Date: Mon, Sep 9, 2019 9:25 PM

Congrats! Well deserved.

Kazuaki Ishizaki

From: Matei Zaharia
To: dev
Date: 2019/09/10 09:32
Subject: [EXTERNAL] Welcoming some new committers and PMC members

Hi all,

The Spark PMC recently voted to add several new committers and one PMC
member. Join me in welcoming them to their new roles!

New PMC member: Dongjoon Hyun

New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
Weichen Xu, Ruifeng Zheng

The new committers cover lots of important areas including ML, SQL, and
data sources, so it’s great to have them here. All the best,

Matei and the Spark PMC

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org