Not sure what your needs are here.
If you can afford to wait, increase your micro-batch window to a longer
period, aggregate your data by key in each micro batch, and then apply
those changes to the Oracle database.
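Roughly like this, assuming Structured Streaming (2.4+) and foreachBatch; the input path, key parsing, and Oracle connection details below are all made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Stream the text files; each line arrives in a single "value" column.
lines = spark.readStream.format("text").load("/data/incoming")

# Hypothetical parse step: take the first comma-separated field as the key.
parsed = lines.select(F.split("value", ",").getItem(0).alias("key"))

def apply_batch(batch_df, batch_id):
    # Aggregate by key within the micro batch...
    agg = batch_df.groupBy("key").agg(F.count("*").alias("cnt"))
    # ...then apply those changes to Oracle over JDBC.
    (agg.write.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")
        .option("dbtable", "TARGET_TABLE")
        .mode("append")
        .save())

query = (parsed.writeStream
         .foreachBatch(apply_batch)
         .trigger(processingTime="10 minutes")  # the long micro batch window
         .start())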
Since you're using text files to stream, there's no way to pre-partition
your data.
The closest thing I can think of here is having both dataframes written
out using buckets. Hive uses this technique for join optimisation: both
datasets' portions of the same bucket are read by the same mapper, which
achieves a map-side join.
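In Spark that looks something like the below (a sketch only; the table names, bucket count, and join key "id" are placeholders):

# Write both sides bucketed (and sorted) on the join key.
(dfA.write
    .bucketBy(64, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("bucketed_a"))

(dfB.write
    .bucketBy(64, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("bucketed_b"))

# With identical bucketing on the join key, Spark can read matching
# buckets together and avoid the shuffle, much like Hive's map-side join.
joined = spark.table("bucketed_a").join(spark.table("bucketed_b"), "id")

Note that bucketBy only works with saveAsTable, so both dataframes end up as tables in the metastore.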
On Sat., 29 Jun. 2019, 9:10 pm jelmer, wrote:
> I have
You can use coalesce(1) or repartition on B, but it would be better to
cache A so that it is available in memory on all the executors, since it
contains only one row per partition.
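For example (untested; dfA, dfB, and the join key "id" are placeholders):

dfA.cache()
dfA.count()  # materialise the cache before the join

# Either collapse B down to a single partition...
joined = dfB.coalesce(1).join(dfA, "id")
# ...or repartition B on the join key instead:
# joined = dfB.repartition("id").join(dfA, "id")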
On Sat, Jun 29, 2019 at 4:10 PM jelmer wrote:
> I have 2 dataframes,
>
> Dataframe A which contains 1
Thanks Abdeali! Please find details below:
df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063
col1 is a string
Spark version 2.4.0
datasize:
I have 2 dataframes:
Dataframe A, which contains one element per partition, each gigabytes big
(an index).
Dataframe B, which is made up of millions of small rows.
I want to join B on A, but I want all the work to be done on the executors
holding the partitions of dataframe A.
Is there a way to
How large is the data frame and what data type are you counting distinct
for?
I use count distinct quite a bit and haven't noticed anything peculiar.
Also, which exact version in 2.3.x?
And, are you performing any operations on the DF before the countDistinct?
I recall there was a bug when I did