[pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-30 Thread Rishi Shah
Hi All,

Can we use bucketing with sorting functionality to save data incrementally
(say daily)? I understand bucketing is supported in Spark only with
saveAsTable; however, can this be used with mode "append" instead of
"overwrite"?

My understanding of bucketing was that you need to rewrite the entire table
every time. Can someone advise?
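
For reference, a minimal sketch of the write I have in mind (table and
column names are made up):

    # Sketch only: bucketing requires saveAsTable; names are hypothetical.
    df.write \
        .mode("append") \
        .bucketBy(16, "account_id") \
        .sortBy("account_id") \
        .saveAsTable("daily_events_bucketed")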

-- 
Regards,

Rishi Shah


dynamic allocation in spark-shell

2019-05-30 Thread Qian He
Sometimes it's convenient to start a spark-shell on a cluster, like
./spark/bin/spark-shell --master yarn --deploy-mode client --num-executors
100 --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue
myqueue
However, with command like this, those allocated resources will be occupied
until the console exits.

Just wondering if it is possible to start a spark-shell with
dynamicAllocation enabled? If it is, how do I specify the configs? Can anyone
give a quick example? Thanks!
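
Would something along these lines be the right way (the config keys are from
the Spark configuration docs; the values here are placeholders)?

    ./spark/bin/spark-shell --master yarn --deploy-mode client \
      --executor-memory 15g --executor-cores 4 --driver-memory 10g \
      --queue myqueue \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=0 \
      --conf spark.dynamicAllocation.maxExecutors=100 \
      --conf spark.dynamicAllocation.executorIdleTimeout=60s

I understand that on YARN, dynamic allocation also assumes the external
shuffle service is running on the NodeManagers.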


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many widely
used Python packages such as pandas and NumPy will drop Python 2 support on
or before 2020/01/01. Apache Spark has supported both Python 2 and 3 since
the Spark 1.4 release in 2015. However, maintaining Python 2/3 compatibility
is an increasing burden and it essentially limits the use of Python 3
features in Spark. Given that Python 2's end of life (EOL) is approaching, we
plan to eventually drop Python 2 support as well. The current plan is as
follows:

* In the next major release in 2019, we will deprecate Python 2 support.
PySpark users will see a deprecation warning if Python 2 is used. We will
publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after Python 2
EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
* Patch releases of branches that support Python 2, e.g., Spark 2.4, will
continue to support Python 2. However, after Python 2 EOL, we might not
accept patches that are specific to Python 2.
===

Sean helped make a pass. If it looks good, I'm going to upload it to the
Spark website and announce it here. Let me know if you think we should do a
VOTE instead.

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:

> I created https://issues.apache.org/jira/browse/SPARK-27884 to track the
> work.
>
> On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
> wrote:
>
>> We don’t usually reference a future release on the website
>>
>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>
>> I suspect people will then ask when Spark 3.0 is coming out. Might
>> need to provide some clarity on that.
>>
>
> We can say "next major release in 2019" instead of Spark 3.0. The Spark
> 3.0 timeline certainly requires a new thread to discuss.
>
>
>>
>>
>> --
>> *From:* Reynold Xin 
>> *Sent:* Thursday, May 30, 2019 12:59:14 AM
>> *To:* shane knapp
>> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
>> Fen; Xiangrui Meng; dev; user
>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>
>> +1 on Xiangrui’s plan.
>>
>> On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
>>
>>>> I don't have a good sense of the overhead of continuing to support
>>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>
>>> from the build/test side, it will actually be pretty easy to continue
>>> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>>>
>>> that being said, i will be cracking a bottle of champagne when i can
>>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
> On the development side, in a future release that drops Python 2 support
> we can remove code that maintains Python 2/3 compatibility and start using
> Python 3-only features, which is also quite exciting.
>
>
>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the
work.

On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
wrote:

> We don’t usually reference a future release on the website
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when Spark 3.0 is coming out. Might
> need to provide some clarity on that.
>

We can say "next major release in 2019" instead of Spark 3.0. The Spark 3.0
timeline certainly requires a new thread to discuss.


>
>
> --
> *From:* Reynold Xin 
> *Sent:* Thursday, May 30, 2019 12:59:14 AM
> *To:* shane knapp
> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
> Fen; Xiangrui Meng; dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1 on Xiangrui’s plan.
>
> On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
>
>>> I don't have a good sense of the overhead of continuing to support
>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
>> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>>
>> that being said, i will be cracking a bottle of champagne when i can
>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
>
On the development side, in a future release that drops Python 2 support we
can remove code that maintains Python 2/3 compatibility and start using
Python 3-only features, which is also quite exciting.


>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: Upsert for hive tables

2019-05-30 Thread Magnus Nilsson
Since Parquet doesn't support updates, you have to backfill your dataset. If
that is your regular scenario, you should partition your Parquet files so
that backfilling becomes easier.

As the data is structured now, you have to rewrite everything just to upsert
quite a small amount of changed data. Look at your data, look at your use
case, and use partitioning (and bucketing if you want to eliminate/reduce
shuffle joins) to store your data in a more optimal way.

Let's say your large table is a timeline of events stretching three years
back, but your updated data is only from the last week or month. If you
partitioned by year/month/week/day, you could just backfill the partitions
that were updated. Adapt the pattern to your particular scenario and data
size.
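
A rough sketch of that layout (paths and column names are made up):

    from pyspark.sql.functions import year, month

    # Partition the big table by date parts so a backfill only rewrites
    # the partitions that actually changed. Names are hypothetical.
    (events
        .withColumn("year", year("event_ts"))
        .withColumn("month", month("event_ts"))
        .write
        .mode("overwrite")
        .partitionBy("year", "month")
        .parquet("/warehouse/events"))

On Spark 2.3+ you can also set spark.sql.sources.partitionOverwriteMode to
"dynamic" so that an overwrite only replaces the partitions present in the
data being written.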

If updates are random and there is no sure way to decide which partition an
update will land in, you could just break down your dataset by key %
(assumed_total_size_of_dataset / suitable_partition_size). There are a lot of
partitioning schemes, but the point is that you have to limit the amount of
data to read from disk, filter, and write back to get better performance.
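
For instance (assuming a non-negative numeric key; the bucket count is a
placeholder):

    from pyspark.sql.functions import col

    num_buckets = 256  # roughly total dataset size / desired partition size
    (events
        .withColumn("part", col("key") % num_buckets)
        .write
        .mode("overwrite")
        .partitionBy("part")
        .parquet("/warehouse/events_by_key"))

Then an upsert only has to rewrite the "part" values that appear in the
changed data.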

regards,

Magnus

On Wed, May 29, 2019 at 7:20 PM Tomasz Krol  wrote:

> Hey Guys,
>
> I am wondering what would be your approach to following scenario:
>
> I have two tables - one (Table A) is relatively small (e.g. 50GB) and the
> second one (Table B) much bigger (e.g. 3TB). Both are parquet tables.
>
> I want to ADD all records from Table A to Table B which don't exist in
> Table B yet. I use only one field (e.g. key) to check existence for a
> specific record.
>
> Then I want to UPDATE (with values from Table A) all records in Table B
> which also exist in Table A. To determine if a specific record exists, I
> also use the same "key" field.
>
> To achieve the above I run the following SQL queries:
>
> 1. Find existing records and insert into temp table
>
> insert into temp_table select a.cols from Table A a left semi join Table B
> b on a.key = b.key
>
> 2. Find new records and insert them into temp table
>
> insert into temp_table select a.cols from Table A a left anti join Table B
> b on a.key = b.key
>
> 3. Find existing records in Table B which don't exist in Table A
>
> insert into temp_table select b.cols from Table B b left anti join Table A
> a on a.key = b.key
>
> In that way I build Table B updated with records from Table A.
> However, the problem here is step 3, because I am inserting almost 3
> TB of data, which obviously takes some time.
> I was trying different approaches but no luck.
>
> I am wondering what your ideas are. How can we perform this scenario
> efficiently in Spark?
>
> Cheers
>
> Tom
> --
> Tomasz Krol
> patric...@gmail.com
>


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Felix Cheung
We don’t usually reference a future release on the website

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when Spark 3.0 is coming out. Might need to
provide some clarity on that.



From: Reynold Xin 
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; 
Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
I don't have a good sense of the overhead of continuing to support
Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support 
for python2.7 for spark 2.x as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all 
of the ansible and anaconda configs for python2.x.  :)

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Upsert for hive tables

2019-05-30 Thread Tomasz Krol
Unfortunately, I don't have timestamps in those tables :( Only a key on which
I can check the existence of a specific record.

But even with a timestamp, how would you make the update? When I say update,
I mean overwriting the existing record.

For example, you have the following in Table A:

key | field1 | field2
 1  |   a    |   b

and in Table B:

key | field1 | field2
 1  |   c    |   d

so after the update I want to have in Table B:

 1  |   a    |   b

I don't want to insert a new row in this case, just overwrite the existing
one.
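
In DataFrame terms, what I'm trying to compute is something like this sketch
(variable names are made up):

    # B rows whose key does not appear in A, plus all rows of A, so A's
    # values win for matching keys. unionByName needs Spark 2.3+.
    updated_b = (table_b.join(table_a, "key", "left_anti")
                        .unionByName(table_a))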

Thanks



On Thu 30 May 2019 at 05:10, Aakash Basu  wrote:

> Don't you have a date/timestamp to handle updates? So, you're talking
> about CDC? If you have a timestamp, you can check if that/those key(s)
> exist; if they exist, check whether the timestamp matches; if it matches,
> then ignore; if it doesn't, then update.
>
> On Thu 30 May, 2019, 7:11 AM Genieliu,  wrote:
>
>> Aren't step 1 and step 2 producing a copy of Table A?
>>

--
Tomasz Krol
patric...@gmail.com


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:

>> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>
> from the build/test side, it will actually be pretty easy to continue
> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can
> delete all of the ansible and anaconda configs for python2.x.  :)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>