Re: script running in jupyter 6-7x faster than spark submit
Just checked from where the script is submitted, i.e. on the Driver side: the python envs are different. The Jupyter one is running within a virtual environment which is Python 2.7.5, and the spark-submit one uses 2.6.6. But the executors have the same python version, right? I tried doing a spark-submit from the jupyter shell; it fails to find python 2.7, which is not there, and hence throws an error.

Here is the udf which might take time:

import base64
import zlib

def decompress(data):
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')

Could this be because of the mismatch between the two python environments on the Driver side? But the processing happens on the executor side?

*Regards,
Dhrub*

On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari wrote:

> Maybe you can try running it in a python shell or jupyter-console/ipython
> instead of a spark-submit and check how much time it takes too.
>
> Compare the env variables to check that no additional env configuration is
> present in either environment.
>
> Also, is the python environment for both the exact same? I ask because it
> looks like you're using a UDF, and if the Jupyter python has (let's say)
> numpy compiled with blas it would be faster than a numpy without it. Etc.
> I.e. some library you use may be using pure python and another may be
> using a faster C extension...
>
> What python libraries are you using in the UDFs? If you don't use UDFs at
> all and use some very simple pure spark functions, does the time
> difference still exist?
>
> Also, are you using dynamic allocation or some similar spark config which
> could vary performance between runs because the same resources were not
> utilized on Jupyter / spark-submit?
>
>
> On Wed, Sep 11, 2019, 08:43 Stephen Boesch wrote:
>
>> Sounds like you have done your homework to properly compare. I'm
>> guessing the answer to the following is yes ..
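For what it's worth, the udf above is self-contained enough to benchmark outside Spark. A minimal sketch with a round-trip check follows; the compression side is my assumption about how the source data was produced (gzip-wrapped deflate, which the 32 + zlib.MAX_WBITS flag auto-detects), not something stated in the thread:

```python
import base64
import zlib

def decompress(data):
    # base64-decode, then inflate; 32 + MAX_WBITS auto-detects zlib/gzip headers
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    return d.decompress(bytecode).decode('utf-8')

def compress(text):
    # hypothetical producer side, only for testing the udf locally:
    # gzip-format deflate (16 + MAX_WBITS), then base64
    c = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    payload = c.compress(text.encode('utf-8')) + c.flush()
    return base64.b64encode(payload)

print(decompress(compress("some row payload")))  # some row payload
```

Timing decompress() alone (e.g. with timeit) under each interpreter would show whether the 2.6 vs 2.7 difference matters for this udf at all.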
but in any case: are they
>> both running against the same spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
>>
>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
>> dhruba.w...@gmail.com>:
>>
>>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>>> the time taken by both is 30 mins and ~3hrs, as i am reading 500 gigs of
>>> compressed base64 encoded text data from a hive table and decompressing
>>> and decoding it in one of the udfs. Also, the time compared is from the
>>> Spark UI, not how long the job actually takes after submission. It's
>>> just the running time i am comparing/mentioning.
>>>
>>> As mentioned earlier, all the spark conf params even match in the two
>>> scripts, and that's why i am puzzled about what's going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
It's not obvious from what you pasted, but perhaps the jupyter notebook
already is connected to a running spark context, while spark-submit needs
to get a new spot in the (YARN?) queue. I would check the cluster job IDs
for both to ensure you're getting new cluster tasks for each.

On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati wrote:

> Hi,
>
> I am facing a weird behaviour while running a python script. Here is
> what the code looks like mostly:
>
> def fn1(ip):
>     some code...
>     ...
>
> def fn2(row):
>     ...
>     some operations
>     ...
> return row1
>
>
> udf_fn1 = udf(fn1)
> cdf = spark.read.table("")  // hive table is of size > 500 Gigs with
> ~4500 partitions
> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>     .drop("colz") \
>     .withColumnRenamed("colz", "coly")
>
> edf = ddf \
>     .filter(ddf.colp == 'some_value') \
>     .rdd.map(lambda row: fn2(row)) \
>     .toDF()
>
> print edf.count()  // simple way for the performance test in both
> platforms
>
> Now when I run the same code in a brand new jupyter notebook it runs
> 6x faster than when I run this python script using spark-submit. The
> configurations are printed and compared from both the platforms and they
> are the exact same. I even tried to run this script in a single cell of
> jupyter notebook and still have the same performance. I need to understand
> if I am missing something in the spark-submit which is causing the issue.
> I tried to minimise the script to reproduce the same error without much
> code.
>
> Both are run in client mode on a yarn based spark cluster. The
> machines from which both are executed are also the same and from the same
> user.
>
> What i found is that the quantile values for median for the one run with
> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am
> not able to figure out why this is happening.
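If the Python that spark-submit picks up differs from the Jupyter kernel's (as discussed at the top of this thread), one way to make the two launch modes comparable is to pin the same interpreter for both. A sketch using Spark's standard environment variables — the venv path below is a placeholder, not taken from this thread:

```python
import os

# Point both the driver and the executors at the same interpreter.
# spark-submit reads PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON at launch time.
venv_python = "/opt/venv27/bin/python"  # placeholder path
os.environ["PYSPARK_PYTHON"] = venv_python
os.environ["PYSPARK_DRIVER_PYTHON"] = venv_python

print(os.environ["PYSPARK_PYTHON"] == os.environ["PYSPARK_DRIVER_PYTHON"])  # True
```

Exporting the same two variables in the shell before running spark-submit has the same effect.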
Re: script running in jupyter 6-7x faster than spark submit
Maybe you can try running it in a python shell or jupyter-console/ipython instead of a spark-submit and check how much time it takes too.

Compare the env variables to check that no additional env configuration is present in either environment.

Also, is the python environment for both the exact same? I ask because it looks like you're using a UDF, and if the Jupyter python has (let's say) numpy compiled with blas it would be faster than a numpy without it. Etc. I.e. some library you use may be using pure python and another may be using a faster C extension...

What python libraries are you using in the UDFs? If you don't use UDFs at all and use some very simple pure spark functions, does the time difference still exist?

Also, are you using dynamic allocation or some similar spark config which could vary performance between runs because the same resources were not utilized on Jupyter / spark-submit?

On Wed, Sep 11, 2019, 08:43 Stephen Boesch wrote:

> Sounds like you have done your homework to properly compare. I'm
> guessing the answer to the following is yes .. but in any case: are they
> both running against the same spark cluster with the same configuration
> parameters, especially executor memory and number of workers?
>
> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
> dhruba.w...@gmail.com>:
>
>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>> the time taken by both is 30 mins and ~3hrs, as i am reading 500 gigs of
>> compressed base64 encoded text data from a hive table and decompressing
>> and decoding it in one of the udfs. Also, the time compared is from the
>> Spark UI, not how long the job actually takes after submission. It's just
>> the running time i am comparing/mentioning.
>>
>> As mentioned earlier, all the spark conf params even match in the two
>> scripts, and that's why i am puzzled about what's going on.
>> >> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, >> wrote: >> >>> It's not obvious from what you pasted, but perhaps the juypter notebook >>> already is connected to a running spark context, while spark-submit needs >>> to get a new spot in the (YARN?) queue. >>> >>> I would check the cluster job IDs for both to ensure you're getting new >>> cluster tasks for each. >>> >>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati >>> wrote: >>> Hi, I am facing a weird behaviour while running a python script. Here is what the code looks like mostly: def fn1(ip): some code... ... def fn2(row): ... some operations ... return row1 udf_fn1 = udf(fn1) cdf = spark.read.table("") //hive table is of size > 500 Gigs with ~4500 partitions ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \ .drop("colz") \ .withColumnRenamed("colz", "coly") edf = ddf \ .filter(ddf.colp == 'some_value') \ .rdd.map(lambda row: fn2(row)) \ .toDF() print edf.count() // simple way for the performance test in both platforms Now when I run the same code in a brand new jupyter notebook it runs 6x faster than when I run this python script using spark-submit. The configurations are printed and compared from both the platforms and they are exact same. I even tried to run this script in a single cell of jupyter notebook and still have the same performance. I need to understand if I am missing something in the spark-submit which is causing the issue. I tried to minimise the script to reproduce the same error without much code. Both are run in client mode on a yarn based spark cluster. The machines from which both are executed are also the same and from same user. What i found is the the quantile values for median for one ran with jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not able to figure out why this is happening. Any one faced this kind of issue before or know how to resolve this? 
*Regards,* *Dhrub* >>> >>> >>> -- >>> >>> >>> *Patrick McCarthy * >>> >>> Senior Data Scientist, Machine Learning Engineering >>> >>> Dstillery >>> >>> 470 Park Ave South, 17th Floor, NYC 10016 >>> >>
Re: script running in jupyter 6-7x faster than spark submit
Ok. Can't think of why that would happen. Am Di., 10. Sept. 2019 um 20:26 Uhr schrieb Dhrubajyoti Hati < dhruba.w...@gmail.com>: > As mentioned in the very first mail: > * same cluster it is submitted. > * from same machine they are submitted and also from same user > * each of them has 128 executors and 2 cores per executor with 8Gigs of > memory each and both of them are getting that while running > > to clarify more let me quote what I mentioned above. *These data is taken > from Spark-UI when the jobs are almost finished in both.* > "What i found is the the quantile values for median for one ran with > jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins." which > means per task time taken is much higher in spark-submit script than > jupyter script. This is where I am really puzzled because they are the > exact same code. why running them two different ways vary so much in the > execution time. > > > > > *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028* > > > On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch wrote: > >> Sounds like you have done your homework to properly compare . I'm >> guessing the answer to the following is yes .. but in any case: are they >> both running against the same spark cluster with the same configuration >> parameters especially executor memory and number of workers? >> >> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati < >> dhruba.w...@gmail.com>: >> >>> No, i checked for that, hence written "brand new" jupyter notebook. Also >>> the time taken by both are 30 mins and ~3hrs as i am reading a 500 gigs >>> compressed base64 encoded text data from a hive table and decompressing and >>> decoding in one of the udfs. Also the time compared is from Spark UI not >>> how long the job actually takes after submission. Its just the running time >>> i am comparing/mentioning. >>> >>> As mentioned earlier, all the spark conf params even match in two >>> scripts and that's why i am puzzled what going on. 
>>> >>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, < >>> pmccar...@dstillery.com> wrote: >>> It's not obvious from what you pasted, but perhaps the juypter notebook already is connected to a running spark context, while spark-submit needs to get a new spot in the (YARN?) queue. I would check the cluster job IDs for both to ensure you're getting new cluster tasks for each. On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati wrote: > Hi, > > I am facing a weird behaviour while running a python script. Here is > what the code looks like mostly: > > def fn1(ip): >some code... > ... > > def fn2(row): > ... > some operations > ... > return row1 > > > udf_fn1 = udf(fn1) > cdf = spark.read.table("") //hive table is of size > 500 Gigs with > ~4500 partitions > ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \ > .drop("colz") \ > .withColumnRenamed("colz", "coly") > > edf = ddf \ > .filter(ddf.colp == 'some_value') \ > .rdd.map(lambda row: fn2(row)) \ > .toDF() > > print edf.count() // simple way for the performance test in both > platforms > > Now when I run the same code in a brand new jupyter notebook it runs > 6x faster than when I run this python script using spark-submit. The > configurations are printed and compared from both the platforms and they > are exact same. I even tried to run this script in a single cell of > jupyter > notebook and still have the same performance. I need to understand if I am > missing something in the spark-submit which is causing the issue. I tried > to minimise the script to reproduce the same error without much code. > > Both are run in client mode on a yarn based spark cluster. The > machines from which both are executed are also the same and from same > user. > > What i found is the the quantile values for median for one ran with > jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am > not > able to figure out why this is happening. > > Any one faced this kind of issue before or know how to resolve this? 
> > *Regards,* > *Dhrub* > -- *Patrick McCarthy * Senior Data Scientist, Machine Learning Engineering Dstillery 470 Park Ave South, 17th Floor, NYC 10016 >>>
Re: script running in jupyter 6-7x faster than spark submit
As mentioned in the very first mail:
* it is submitted to the same cluster
* they are submitted from the same machine and also by the same user
* each of them has 128 executors and 2 cores per executor with 8 Gigs of memory each, and both of them are getting that while running

To clarify more, let me quote what I mentioned above. *This data is taken from the Spark UI when the jobs are almost finished in both.* "What i found is that the quantile values for median for the one run with jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins." — which means the per-task time taken is much higher in the spark-submit script than in the jupyter script. This is where I am really puzzled, because they are the exact same code. Why does running them in two different ways vary so much in execution time?

*Regards,
Dhrubajyoti Hati.
Mob No: 9886428028/9652029028*

On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch wrote:

> Sounds like you have done your homework to properly compare. I'm
> guessing the answer to the following is yes .. but in any case: are they
> both running against the same spark cluster with the same configuration
> parameters, especially executor memory and number of workers?
>
> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
> dhruba.w...@gmail.com>:
>
>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>> the time taken by both is 30 mins and ~3hrs, as i am reading 500 gigs of
>> compressed base64 encoded text data from a hive table and decompressing
>> and decoding it in one of the udfs. Also, the time compared is from the
>> Spark UI, not how long the job actually takes after submission. It's just
>> the running time i am comparing/mentioning.
>>
>> As mentioned earlier, all the spark conf params even match in the two
>> scripts, and that's why i am puzzled about what's going on.
>> >> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, >> wrote: >> >>> It's not obvious from what you pasted, but perhaps the juypter notebook >>> already is connected to a running spark context, while spark-submit needs >>> to get a new spot in the (YARN?) queue. >>> >>> I would check the cluster job IDs for both to ensure you're getting new >>> cluster tasks for each. >>> >>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati >>> wrote: >>> Hi, I am facing a weird behaviour while running a python script. Here is what the code looks like mostly: def fn1(ip): some code... ... def fn2(row): ... some operations ... return row1 udf_fn1 = udf(fn1) cdf = spark.read.table("") //hive table is of size > 500 Gigs with ~4500 partitions ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \ .drop("colz") \ .withColumnRenamed("colz", "coly") edf = ddf \ .filter(ddf.colp == 'some_value') \ .rdd.map(lambda row: fn2(row)) \ .toDF() print edf.count() // simple way for the performance test in both platforms Now when I run the same code in a brand new jupyter notebook it runs 6x faster than when I run this python script using spark-submit. The configurations are printed and compared from both the platforms and they are exact same. I even tried to run this script in a single cell of jupyter notebook and still have the same performance. I need to understand if I am missing something in the spark-submit which is causing the issue. I tried to minimise the script to reproduce the same error without much code. Both are run in client mode on a yarn based spark cluster. The machines from which both are executed are also the same and from same user. What i found is the the quantile values for median for one ran with jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not able to figure out why this is happening. Any one faced this kind of issue before or know how to resolve this? 
*Regards,* *Dhrub* >>> >>> >>> -- >>> >>> >>> *Patrick McCarthy * >>> >>> Senior Data Scientist, Machine Learning Engineering >>> >>> Dstillery >>> >>> 470 Park Ave South, 17th Floor, NYC 10016 >>> >>
Re: script running in jupyter 6-7x faster than spark submit
Sounds like you have done your homework to properly compare . I'm guessing the answer to the following is yes .. but in any case: are they both running against the same spark cluster with the same configuration parameters especially executor memory and number of workers? Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati < dhruba.w...@gmail.com>: > No, i checked for that, hence written "brand new" jupyter notebook. Also > the time taken by both are 30 mins and ~3hrs as i am reading a 500 gigs > compressed base64 encoded text data from a hive table and decompressing and > decoding in one of the udfs. Also the time compared is from Spark UI not > how long the job actually takes after submission. Its just the running time > i am comparing/mentioning. > > As mentioned earlier, all the spark conf params even match in two scripts > and that's why i am puzzled what going on. > > On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, > wrote: > >> It's not obvious from what you pasted, but perhaps the juypter notebook >> already is connected to a running spark context, while spark-submit needs >> to get a new spot in the (YARN?) queue. >> >> I would check the cluster job IDs for both to ensure you're getting new >> cluster tasks for each. >> >> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati >> wrote: >> >>> Hi, >>> >>> I am facing a weird behaviour while running a python script. Here is >>> what the code looks like mostly: >>> >>> def fn1(ip): >>>some code... >>> ... >>> >>> def fn2(row): >>> ... >>> some operations >>> ... 
>>> return row1 >>> >>> >>> udf_fn1 = udf(fn1) >>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with >>> ~4500 partitions >>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \ >>> .drop("colz") \ >>> .withColumnRenamed("colz", "coly") >>> >>> edf = ddf \ >>> .filter(ddf.colp == 'some_value') \ >>> .rdd.map(lambda row: fn2(row)) \ >>> .toDF() >>> >>> print edf.count() // simple way for the performance test in both >>> platforms >>> >>> Now when I run the same code in a brand new jupyter notebook it runs 6x >>> faster than when I run this python script using spark-submit. The >>> configurations are printed and compared from both the platforms and they >>> are exact same. I even tried to run this script in a single cell of jupyter >>> notebook and still have the same performance. I need to understand if I am >>> missing something in the spark-submit which is causing the issue. I tried >>> to minimise the script to reproduce the same error without much code. >>> >>> Both are run in client mode on a yarn based spark cluster. The machines >>> from which both are executed are also the same and from same user. >>> >>> What i found is the the quantile values for median for one ran with >>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not >>> able to figure out why this is happening. >>> >>> Any one faced this kind of issue before or know how to resolve this? >>> >>> *Regards,* >>> *Dhrub* >>> >> >> >> -- >> >> >> *Patrick McCarthy * >> >> Senior Data Scientist, Machine Learning Engineering >> >> Dstillery >> >> 470 Park Ave South, 17th Floor, NYC 10016 >> >
Re: script running in jupyter 6-7x faster than spark submit
No, i checked for that, hence written "brand new" jupyter notebook. Also the time taken by both is 30 mins and ~3hrs, as i am reading 500 gigs of compressed base64 encoded text data from a hive table and decompressing and decoding it in one of the udfs. Also, the time compared is from the Spark UI, not how long the job actually takes after submission. It's just the running time i am comparing/mentioning.

As mentioned earlier, all the spark conf params even match in the two scripts, and that's why i am puzzled about what's going on.

On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, wrote:

> It's not obvious from what you pasted, but perhaps the jupyter notebook
> already is connected to a running spark context, while spark-submit needs
> to get a new spot in the (YARN?) queue.
>
> I would check the cluster job IDs for both to ensure you're getting new
> cluster tasks for each.
>
> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati
> wrote:
>
>> Hi,
>>
>> I am facing a weird behaviour while running a python script. Here is what
>> the code looks like mostly:
>>
>> def fn1(ip):
>>     some code...
>>     ...
>>
>> def fn2(row):
>>     ...
>>     some operations
>>     ...
>>     return row1
>>
>>
>> udf_fn1 = udf(fn1)
>> cdf = spark.read.table("")  // hive table is of size > 500 Gigs with
>> ~4500 partitions
>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>     .drop("colz") \
>>     .withColumnRenamed("colz", "coly")
>>
>> edf = ddf \
>>     .filter(ddf.colp == 'some_value') \
>>     .rdd.map(lambda row: fn2(row)) \
>>     .toDF()
>>
>> print edf.count()  // simple way for the performance test in both platforms
>>
>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>> faster than when I run this python script using spark-submit. The
>> configurations are printed and compared from both the platforms and they
>> are the exact same. I even tried to run this script in a single cell of jupyter
>> notebook and still have the same performance.
I need to understand if I am >> missing something in the spark-submit which is causing the issue. I tried >> to minimise the script to reproduce the same error without much code. >> >> Both are run in client mode on a yarn based spark cluster. The machines >> from which both are executed are also the same and from same user. >> >> What i found is the the quantile values for median for one ran with >> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not >> able to figure out why this is happening. >> >> Any one faced this kind of issue before or know how to resolve this? >> >> *Regards,* >> *Dhrub* >> > > > -- > > > *Patrick McCarthy * > > Senior Data Scientist, Machine Learning Engineering > > Dstillery > > 470 Park Ave South, 17th Floor, NYC 10016 >
Re: Request for contributor permissions
Hi, Alaa

Thanks for your contact! You can file a jira without any permission. By the way, have you checked the contribution guide?
https://spark.apache.org/contributing.html
It'd be better to check that before contributing.

Bests,
Takeshi

On Wed, Sep 11, 2019 at 4:37 AM Alaa Zbair wrote:

> Hello dev,
>
> I am interested in contributing to the Spark project, please add me to the
> contributors list. My Jira username is: Chilio
>
> Thanks.
>
> Alaa Zbair.

--
---
Takeshi Yamamuro
Request for contributor permissions
Hello dev,

I am interested in contributing to the Spark project, please add me to the contributors list. My Jira username is: Chilio

Thanks.

Alaa Zbair.
[jira] Lantao Jin shared "SPARK-29038: SPIP: Support Spark Materialized View" with you
Lantao Jin shared an issue with you: SPIP: Support Spark Materialized View

> SPIP: Support Spark Materialized View
> -------------------------------------
>
>                 Key: SPARK-29038
>                 URL: https://issues.apache.org/jira/browse/SPARK-29038
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> A materialized view is an important approach in DBMSs to cache data to
> accelerate queries. By creating a materialized view through SQL, the data
> that can be cached is very flexible and can be configured for specific
> usage scenarios. The Materialization Manager automatically updates the
> cached data according to changes in the detail source tables, simplifying
> user work. When a user submits a query, the Spark optimizer rewrites the
> execution plan based on the available materialized views to determine the
> optimal execution plan.
> Details in the [design
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
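As a toy illustration of the rewrite step the issue describes — not the actual design: the names below are made up, and the real optimizer would match logical plans, not query strings:

```python
# Toy sketch: if a query exactly matches a registered materialized view's
# defining query, serve it from the cached table instead.
materialized_views = {
    "SELECT city, COUNT(*) FROM visits GROUP BY city": "mv_visits_by_city",
}

def rewrite(query):
    # fall back to the original query when no materialized view matches
    cached = materialized_views.get(query)
    return "SELECT * FROM " + cached if cached else query

print(rewrite("SELECT city, COUNT(*) FROM visits GROUP BY city"))
# SELECT * FROM mv_visits_by_city
print(rewrite("SELECT 1"))  # SELECT 1
```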
Re: Welcoming some new committers and PMC members
Congrats! Well deserved. On Tue, Sep 10, 2019 at 1:20 PM Driesprong, Fokko wrote: > Congrats all, well deserved! > > > Cheers, Fokko > > Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi < > gabor.g.somo...@gmail.com>: > >> Congrats Guys! >> >> G >> >> >> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia >> wrote: >> >>> Hi all, >>> >>> The Spark PMC recently voted to add several new committers and one PMC >>> member. Join me in welcoming them to their new roles! >>> >>> New PMC member: Dongjoon Hyun >>> >>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, >>> Weichen Xu, Ruifeng Zheng >>> >>> The new committers cover lots of important areas including ML, SQL, and >>> data sources, so it’s great to have them here. All the best, >>> >>> Matei and the Spark PMC >>> >>> >>> - >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> --
Re: Welcoming some new committers and PMC members
Congrats all, well deserved! Cheers, Fokko Op di 10 sep. 2019 om 10:21 schreef Gabor Somogyi : > Congrats Guys! > > G > > > On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia > wrote: > >> Hi all, >> >> The Spark PMC recently voted to add several new committers and one PMC >> member. Join me in welcoming them to their new roles! >> >> New PMC member: Dongjoon Hyun >> >> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, >> Weichen Xu, Ruifeng Zheng >> >> The new committers cover lots of important areas including ML, SQL, and >> data sources, so it’s great to have them here. All the best, >> >> Matei and the Spark PMC >> >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >>
[DISCUSS][SPIP][SPARK-29031] Materialized columns
Hi,

I'd like to propose a feature named materialized column. This feature will boost queries on complex type columns.

https://docs.google.com/document/d/186bzUv4CRwoYY_KliNWTexkNCysQo3VUTLQVrVijyl4/edit?usp=sharing

*Background*

In the data warehouse domain, there is a common requirement to add new fields to existing tables. In practice, data engineers usually use a complex type, such as Map (or they may use JSON), and put all subfields into it. However, this may impact query performance dramatically because:

1. It is a waste of IO. The whole column (in Map format) has to be read, and Spark extracts the required keys from the map, even though the query requires only one or a few keys in the map.
2. Vectorized read cannot be exploited. Currently, vectorized read can be enabled only when all required columns are of atomic type. When a query reads a subfield of a complex type column, vectorized read cannot be exploited.
3. Filter pushdown cannot be utilized. Only when all required fields are of atomic type can filter pushdown be enabled.
4. CPU is wasted because of duplicated computation. When JSON is chosen to store all keys, JSON parsing happens each time we query a subfield in it. However, JSON parsing is a CPU intensive operation, especially when the JSON string is very long.

*Goal*

- Add a new SQL grammar for materialized columns
- Implicitly rewrite SQL queries on complex type columns if there is a materialized column for them
- If the data type of the materialized column is atomic, even though the origin column type is complex, enable vectorized read and filter pushdown to improve performance

*Usage*

*#1 Add materialized columns to an existing table*

Step 1: Create a normal table

> CREATE TABLE x (
>   name STRING,
>   age INT,
>   params STRING,
>   event MAP
> ) USING parquet;

Step 2: Add materialized columns to the existing table

> ALTER TABLE x ADD COLUMNS (
>   new_age INT *MATERIALIZED* age + 1,
>   city STRING *MATERIALIZED* get_json_object(params, '$.city'),
>   label STRING *MATERIALIZED* event['label']
> );

*#2 Create a new table with materialized columns*

> CREATE TABLE x (
>   name STRING,
>   age INT,
>   params STRING,
>   event MAP,
>   new_age INT MATERIALIZED age + 1,
>   city STRING MATERIALIZED get_json_object(params, '$.city'),
>   label STRING MATERIALIZED event['label']
> ) USING parquet;

When a query is issued on complex type columns as below:

SELECT name, age + 1, get_json_object(params, '$.city'), event['label']
FROM x
WHERE event['label'] = 'newuser';

it is equivalent to:

SELECT name, new_age, city, label
FROM x
WHERE label = 'newuser';

The query performance improves dramatically because:

1. The new query (after rewriting) will read the new column city (of string type) instead of reading the whole map in params. Much less data needs to be read.
2. Vectorized read can be utilized in the new query and cannot be used in the old one, because vectorized read can only be enabled when all required columns are of atomic type.
3. Filters can be pushed down. Only filters on atomic columns can be pushed down. The original filter event['label'] = 'newuser' is on a complex column, so it cannot be pushed down.
4. The new query does not need to parse JSON any more. JSON parsing is a CPU intensive operation which impacts performance dramatically.

--
Thanks & Best Regards,

Jason Guo
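The effect of the proposal can be mimicked in plain Python to see where the savings come from: extract the JSON key once at write time, so queries only touch an atomic column. The column and key names below are illustrative, loosely matching the example above:

```python
import json

# Rows as they would be stored: params is a JSON blob.
rows = [
    {"name": "a", "age": 30, "params": json.dumps({"city": "NYC"})},
    {"name": "b", "age": 40, "params": json.dumps({"city": "SF"})},
]

# "Materialize" the extracted key once, at write time: every row gains an
# atomic string column, so later queries never re-parse the JSON.
for row in rows:
    row["city"] = json.loads(row["params"])["city"]

# The rewritten query reads the atomic column directly -- no per-query
# get_json_object / JSON parsing, which is saving #4 above.
result = [r["name"] for r in rows if r["city"] == "NYC"]
print(result)  # ['a']
```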
Re: Welcoming some new committers and PMC members
Congrats Guys! G On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers and one PMC > member. Join me in welcoming them to their new roles! > > New PMC member: Dongjoon Hyun > > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, > Weichen Xu, Ruifeng Zheng > > The new committers cover lots of important areas including ML, SQL, and > data sources, so it’s great to have them here. All the best, > > Matei and the Spark PMC > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
RE: Welcoming some new committers and PMC members
Congratulations !! Very well deserved.

--
Dilip

----- Original message -----
From: "Kazuaki Ishizaki"
To: Matei Zaharia
Cc: dev
Subject: [EXTERNAL] Re: Welcoming some new committers and PMC members
Date: Mon, Sep 9, 2019 9:25 PM

Congrats! Well deserved.

Kazuaki Ishizaki

From: Matei Zaharia
To: dev
Date: 2019/09/10 09:32
Subject: [EXTERNAL] Welcoming some new committers and PMC members

Hi all,

The Spark PMC recently voted to add several new committers and one PMC member. Join me in welcoming them to their new roles!

New PMC member: Dongjoon Hyun

New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, Weichen Xu, Ruifeng Zheng

The new committers cover lots of important areas including ML, SQL, and data sources, so it’s great to have them here. All the best,

Matei and the Spark PMC

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org