Re: dynamic allocation in spark-shell

2019-05-30 Thread Deepak Sharma
You can start spark-shell with these properties: --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.initialExecutors=2 --conf spark.dynamicAllocation.minExecutors=2 --conf spark.dynamicAllocation.maxExecutors=5 On Fri, May 31, 2019 at 5:30 AM Qian He wrote: > Sometimes it
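
A minimal PySpark sketch of the same settings, for anyone who prefers setting them in code rather than on the spark-shell command line (the app name is made up; on YARN, dynamic allocation in Spark 2.x also needs the external shuffle service):

    from pyspark.sql import SparkSession

    # Dynamic allocation settings must be in place before the SparkContext
    # starts, so set them on the builder (or via --conf as above).
    spark = (SparkSession.builder
             .appName("dyn-alloc-demo")                                # hypothetical
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.initialExecutors", "2")
             .config("spark.dynamicAllocation.minExecutors", "2")
             .config("spark.dynamicAllocation.maxExecutors", "5")
             .config("spark.shuffle.service.enabled", "true")          # required on YARN
             .getOrCreate())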

Re: [pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-30 Thread Gourav Sengupta
Hi Rishi, I think that if you are sorting and then appending data locally, there will be no need to bucket the data, and you are good with external tables that way. Regards, Gourav On Fri, May 31, 2019 at 3:43 AM Rishi Shah wrote: > Hi All, > > Can we use bucketing with sorting functionality to s
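
A rough sketch of the sort-then-append approach Gourav describes, with made-up column names and paths, writing Parquet under an external table's location:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    daily = spark.read.parquet("/staging/events/2019-05-30")  # hypothetical input

    # Sort within each output partition and append; no bucketing involved,
    # so plain mode("append") works with a table defined over this path.
    (daily.sortWithinPartitions("user_id")                    # hypothetical column
          .write
          .mode("append")
          .parquet("/warehouse/external/events"))             # hypothetical path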

[pyspark 2.3+] Bucketing with sort - incremental data load?

2019-05-30 Thread Rishi Shah
Hi All, Can we use bucketing with sorting functionality to save data incrementally (say daily)? I understand bucketing is supported in Spark only with saveAsTable; however, can this be used with mode "append" instead of "overwrite"? My understanding around bucketing was, you need to rewrite entir
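
For reference, the API combination being asked about looks like the sketch below (table name, column, and bucket count are made up; whether append keeps the per-bucket files usefully sorted across daily loads is exactly the open question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    daily = spark.read.parquet("/staging/2019-05-30")  # hypothetical daily load

    # Bucketing is only supported through saveAsTable; the question is whether
    # mode("append") can replace rewriting the whole table each day.
    (daily.write
          .bucketBy(16, "user_id")   # hypothetical bucket column and count
          .sortBy("user_id")
          .mode("append")
          .saveAsTable("events_bucketed"))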

dynamic allocation in spark-shell

2019-05-30 Thread Qian He
Sometimes it's convenient to start a spark-shell on the cluster, like ./spark/bin/spark-shell --master yarn --deploy-mode client --num-executors 100 --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue myqueue However, with a command like this, those allocated resources will be occupied u
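
One setting worth noting alongside the reply above: with dynamic allocation enabled, executors that sit idle are released back to YARN after spark.dynamicAllocation.executorIdleTimeout (60s by default), so an interactive shell stops holding the queue's resources between queries. A hedged sketch:

    from pyspark.sql import SparkSession

    # Idle executors are returned to the cluster after the timeout instead
    # of being held until the shell exits.
    spark = (SparkSession.builder
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.shuffle.service.enabled", "true")
             .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
             .getOrCreate())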

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
Here is the draft announcement: === Plan for dropping Python 2 support As many of you already know, the Python core development team and many widely used Python packages like Pandas and NumPy will drop Python 2 support on or before 2020/01/01. Apache Spark has supported both Python 2 and 3 since Spark 1

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the work. On Thu, May 30, 2019 at 2:18 AM Felix Cheung wrote: > We don’t usually reference a future release on website > > > Spark website and state that Python 2 is deprecated in Spark 3.0 > > I suspect people will then ask wh

Re: Upsert for hive tables

2019-05-30 Thread Magnus Nilsson
Since Parquet doesn't support updates, you have to backfill your dataset. If that is your regular scenario, you should partition your Parquet files so backfilling becomes easier. As the data is structured now, you have to update everything just to upsert quite a small amount of changed data. Look at yo
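
A sketch of the partition-and-backfill pattern Magnus describes, with made-up paths and a made-up partition column; dynamic partition overwrite is available from Spark 2.3 on:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # Overwrite only the partitions present in the incoming data,
             # leaving all other partitions untouched (Spark 2.3+).
             .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
             .getOrCreate())

    changed = spark.read.parquet("/staging/changed_rows")  # hypothetical input

    # Rewrites just the affected partitions instead of the whole dataset.
    (changed.write
            .partitionBy("event_date")                     # hypothetical column
            .mode("overwrite")
            .parquet("/warehouse/events"))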

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Felix Cheung
We don’t usually reference a future release on the website > Spark website and state that Python 2 is deprecated in Spark 3.0 I suspect people will then ask when Spark 3.0 is coming out. Might need to provide some clarity on that. From: Reynold Xin Sent: Thur

Re: Upsert for hive tables

2019-05-30 Thread Tomasz Krol
Unfortunately, I don't have timestamps in those tables :( Only a key on which I can check the existence of a specific record. But even with the timestamp, how would you make the update? When I say update, I mean to overwrite the existing record. For example, you have the following in table A: key | field1 | field2 1
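
One common workaround for a key-only upsert, sketched with made-up names: keep every row of the existing table whose key does not appear in the incoming batch, union the new versions in, and write the result out fresh (Parquet files cannot be updated in place):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    existing = spark.table("table_a")                      # existing records
    updates = spark.read.parquet("/staging/updates")       # hypothetical batch

    # Drop rows whose key is being replaced, then add the new versions.
    # Write to a new table/location: overwriting a table while reading
    # from it in the same job will fail.
    upserted = (existing.join(updates, on="key", how="left_anti")
                        .unionByName(updates))
    upserted.write.mode("overwrite").saveAsTable("table_a_upserted")  # hypothetical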

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan. On Thu, May 30, 2019 at 7:55 AM shane knapp wrote: >> I don't have a good sense of the overhead of continuing to support Python 2; is it large enough to consider dropping it in Spark 3.0? > from the build/test side, it will actually be pretty easy to continue suppo