Not sure what your needs are here.
If you can afford to wait, increase your micro-batch window to a longer
period, aggregate your data by key in each micro batch, and then apply
those changes to the Oracle database.
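Roughly like this, assuming Structured Streaming (2.4+) and foreachBatch; the input path, key parsing, and Oracle connection details below are all made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Stream the text files; each line arrives in a single "value" column.
lines = spark.readStream.format("text").load("/data/incoming")

# Hypothetical parse step: take the first comma-separated field as the key.
parsed = lines.select(F.split("value", ",").getItem(0).alias("key"))

def apply_batch(batch_df, batch_id):
    # Aggregate by key within the micro batch...
    agg = batch_df.groupBy("key").agg(F.count("*").alias("cnt"))
    # ...then apply those changes to Oracle over JDBC.
    (agg.write.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")
        .option("dbtable", "TARGET_TABLE")
        .mode("append")
        .save())

query = (parsed.writeStream
         .foreachBatch(apply_batch)
         .trigger(processingTime="10 minutes")  # the long micro batch window
         .start())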
Since you're using text files to stream, there's no way to pre-partition
your data.
The closest thing I can think of here is having both dataframes written
out using buckets. Hive uses this technique for join optimisation: both
datasets' portions of the same bucket are read by the same mapper, which
achieves a map-side join.
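In Spark that looks something like the below (a sketch only; the table names, bucket count, and join key "id" are placeholders):

# Write both sides bucketed (and sorted) on the join key.
(dfA.write
    .bucketBy(64, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("bucketed_a"))

(dfB.write
    .bucketBy(64, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("bucketed_b"))

# With identical bucketing on the join key, Spark can read matching
# buckets together and avoid the shuffle, much like Hive's map-side join.
joined = spark.table("bucketed_a").join(spark.table("bucketed_b"), "id")

Note that bucketBy only works with saveAsTable, so both dataframes end up as tables in the metastore.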
On Sat., 29 Jun. 2019, 9:10 pm jelmer, wrote:
> I have
You can use coalesce(1) or repartition on B, but it would be better to
cache A so that it is available in memory on all the executors, since it
contains only one row per partition.
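For example (untested; dfA, dfB, and the join key "id" are placeholders):

dfA.cache()
dfA.count()  # materialise the cache before the join

# Either collapse B down to a single partition...
joined = dfB.coalesce(1).join(dfA, "id")
# ...or repartition B on the join key instead:
# joined = dfB.repartition("id").join(dfA, "id")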
On Sat, Jun 29, 2019 at 4:10 PM jelmer wrote:
> I have 2 dataframes,
>
> Dataframe A which contains 1
Thanks Abdeali! Please find details below:
df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063
col1 is a string
Spark version 2.4.0
datasize:
I have 2 dataframes:
Dataframe A, which contains one element per partition, each gigabytes big
(an index).
Dataframe B, which is made up of millions of small rows.
I want to join B on A, but I want all the work to be done on the executors
holding the partitions of dataframe A.
Is there a way to
How large is the data frame and what data type are you counting distinct
for?
I use count distinct quite a bit and haven't noticed anything peculiar.
Also, which exact version in 2.3.x?
And, are you performing any operations on the DF before the countDistinct?
I recall there was a bug when I did