[pyspark 2.3+] Bucketing with sort - incremental data load?
Hi All,

Can we use bucketing with the sorting functionality to save data incrementally (say, daily)? I understand bucketing is supported in Spark only with saveAsTable, but can this be used with mode "append" instead of "overwrite"? My understanding was that with bucketing you need to rewrite the entire table every time. Can someone advise?

--
Regards,
Rishi Shah
dynamic allocation in spark-shell
Sometimes it's convenient to start a spark-shell on the cluster, like:

./spark/bin/spark-shell --master yarn --deploy-mode client --num-executors 100 --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue myqueue

However, with a command like this, the allocated resources stay occupied until the console exits. Just wondering, is it possible to start a spark-shell with dynamicAllocation enabled? If so, how do I specify the configs? Can anyone give a quick example? Thanks!
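One possible shape for such a command (the queue name and limits below just mirror the example above; the external shuffle service must be running on the NodeManagers for executors to be removed safely):

```shell
# Replace the fixed --num-executors with dynamic allocation; idle
# executors are then released instead of being held until the shell exits.
./spark/bin/spark-shell \
  --master yarn \
  --deploy-mode client \
  --queue myqueue \
  --driver-memory 10g \
  --executor-memory 15g \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=100 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s
```

With minExecutors=0 the shell consumes essentially nothing while you are not running a job, and scales up to maxExecutors under load.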
Re: Should python-2 be supported in Spark 3.0?
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many widely used Python packages such as pandas and NumPy will drop Python 2 support in or before 2020/01/01. Apache Spark has supported both Python 2 and 3 since the Spark 1.4 release in 2015. However, maintaining Python 2/3 compatibility is an increasing burden and it essentially limits the use of Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is coming, we plan to eventually drop Python 2 support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support. PySpark users will see a deprecation warning if Python 2 is used. We will publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after the Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
* For releases that support Python 2, e.g., Spark 2.4, their patch releases will continue supporting Python 2. However, after the Python 2 EOL, we might not take patches that are specific to Python 2.
===

Sean helped make a pass. If it looks good, I'm going to upload it to the Spark website and announce it here. Let me know if you think we should do a VOTE instead.

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng wrote:
> I created https://issues.apache.org/jira/browse/SPARK-27884 to track the work.
>
> On Thu, May 30, 2019 at 2:18 AM Felix Cheung wrote:
>> We don’t usually reference a future release on the website.
>>
>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>
>> I suspect people will then ask when Spark 3.0 is coming out. Might need to provide some clarity on that.
>
> We can say "the next major release in 2019" instead of Spark 3.0. The Spark 3.0 timeline certainly requires a new thread to discuss.
Re: Should python-2 be supported in Spark 3.0?
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the work.

On Thu, May 30, 2019 at 2:18 AM Felix Cheung wrote:
> We don’t usually reference a future release on the website.
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when Spark 3.0 is coming out. Might need to provide some clarity on that.

We can say "the next major release in 2019" instead of Spark 3.0. The Spark 3.0 timeline certainly requires a new thread to discuss.

On Thu, May 30, 2019 at 7:55 AM shane knapp wrote:
> from the build/test side, it will actually be pretty easy to continue support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can delete all of the ansible and anaconda configs for python2.x. :)

On the development side, in a future release that drops Python 2 support we can remove the code that maintains Python 2/3 compatibility and start using Python 3-only features, which is also quite exciting.
Re: Upsert for hive tables
Since Parquet doesn't support updates, you have to backfill your dataset. If that is a regular scenario for you, you should partition your Parquet files so that backfilling becomes easier. As the data is structured now, you have to rewrite everything just to upsert quite a small amount of changed data.

Look at your data and your use case, and use partitioning (and bucketing, if you want to eliminate or reduce shuffle joins) to store your data in a more optimal way. Let's say your large table is a timeline of events stretching three years back, but your updated data is only from the last week or month. If you partitioned by year/month/week/day, you could backfill just the partitions that were updated. Adapt the pattern to your particular scenario and data size.

If everything is random and there is no sure way to decide which partition an update will land in, you could just break down your dataset by key % (suitable_partition_size/assumed_total_size_of_dataset). There are a lot of partitioning schemes, but the point is that you have to limit the amount of data to read from disk, filter, and write back in order to get better performance.

regards,
Magnus

On Wed, May 29, 2019 at 7:20 PM Tomasz Krol wrote:
> Hey Guys,
>
> I am wondering what would be your approach to the following scenario:
>
> I have two tables - one (Table A) is relatively small (e.g. 50GB) and the second one (Table B) much bigger (e.g. 3TB). Both are Parquet tables.
>
> I want to ADD all records from Table A to Table B which don't exist in Table B yet. I use only one field (e.g. key) to check the existence of a specific record.
>
> Then I want to UPDATE (with values from Table A) all records in Table B which also exist in Table A. To determine if a specific record exists, I use the same "key" field.
>
> To achieve the above I run the following SQL queries:
>
> 1. Find existing records and insert them into a temp table:
>
> insert into temp_table select a.cols from Table A a left semi join Table B b on a.key = b.key
>
> 2. Find new records and insert them into the temp table:
>
> insert into temp_table select a.cols from Table A a left anti join Table B b on a.key = b.key
>
> 3. Find existing records in Table B which don't exist in Table A:
>
> insert into temp_table select b.cols from Table B b left anti join Table A a on a.key = b.key
>
> In that way I build Table B updated with records from Table A. However, the problem is step 3, because I am inserting almost 3 TB of data, which obviously takes some time. I was trying different approaches, but no luck.
>
> I am wondering, what are your ideas on how to perform this scenario efficiently in Spark?
>
> Cheers
>
> Tom
> --
> Tomasz Krol
> patric...@gmail.com
Re: Should python-2 be supported in Spark 3.0?
We don’t usually reference a future release on the website.

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when Spark 3.0 is coming out. Might need to provide some clarity on that.

From: Reynold Xin
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp wrote:
> > I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0?
>
> from the build/test side, it will actually be pretty easy to continue support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can delete all of the ansible and anaconda configs for python2.x. :)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
Re: Upsert for hive tables
Unfortunately, I don't have timestamps in those tables :( Only a key on which I can check the existence of a specific record. But even with the timestamp, how would you make the update? When I say update, I mean overwriting the existing record. For example, you have the following in Table A:

key | field1 | field2
1   | a      | b

and in Table B:

key | field1 | field2
1   | c      | d

so after the update I want to have in Table B:

1 | a | b

I don't want to insert a new row in this case, just overwrite the existing one.

Thanks

On Thu 30 May 2019 at 05:10, Aakash Basu wrote:
> Don't you have a date/timestamp to handle updates? So, you're talking about CDC? If you have a datestamp you can check if that/those key(s) exist; if they exist, check if the timestamp matches; if it matches, ignore; if it doesn't, update.
>
> On Thu 30 May, 2019, 7:11 AM Genieliu wrote:
>> Isn't step 1 and step 2 producing a copy of Table A?

--
Tomasz Krol
patric...@gmail.com
Re: Should python-2 be supported in Spark 3.0?
+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp wrote:
> > I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0?
>
> from the build/test side, it will actually be pretty easy to continue support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can delete all of the ansible and anaconda configs for python2.x. :)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu