Hi,

How do I get the filename from textFileStream when using streaming?

Thanks a mill
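One commonly cited workaround (textFileStream itself does not expose file names): use fileStream with a Hadoop input format and read the path from each partition's FileSplit. A minimal Scala sketch, assuming a local streaming context and a hypothetical input directory:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileNamesFromStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filenames").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // "hdfs:///in" is a placeholder input directory.
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///in")

    // The RDDs behind fileStream are NewHadoopRDDs, so we can reach the
    // input split of each partition and extract the file path from it.
    val linesWithFile = stream.transform { rdd =>
      rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
        .mapPartitionsWithInputSplit { (split, iter) =>
          val file = split.asInstanceOf[FileSplit].getPath.toString
          iter.map { case (_, line) => (file, line.toString) }
        }
    }
    linesWithFile.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This relies on the implementation detail that fileStream batches are NewHadoopRDDs, so it may need adjusting across Spark versions.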
> Not sure if the dynamic overwrite logic is implemented in Spark or in Hive

AFAIK I'm using the Spark implementation(s). Does the thread dump that I posted show that? I'd like to stay within the Spark implementation.

What I'm trying to ask is: do you Spark developers see any way to optimize this?
Otherwise,
There is probably a limit on the number of elements you can pass in the list of partitions for the listPartitionsWithAuthInfo API call. Not sure if the dynamic overwrite logic is implemented in Spark or in Hive, in which case using Hive 1.2.1 is probably the reason for the un-optimized logic, but also
Ok, I've verified that hive> SHOW PARTITIONS is using get_partition_names, which is always quite fast. Spark's insertInto uses get_partitions_with_auth, which is much slower (it also gets the location etc. of each partition).

I created a test in Java with a local metastore client to measure the
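A test along those lines might look like the following Scala sketch against the Hive metastore client (the database/table names, user, and the timing helper are hypothetical; the two calls are the ones compared above):

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import scala.collection.JavaConverters._

object MetastoreCallTimings {
  // Simple wall-clock timing helper.
  def time[T](label: String)(f: => T): T = {
    val t0 = System.nanoTime()
    val result = f
    println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val client = new HiveMetaStoreClient(new HiveConf())

    // Fast path: only partition name strings (what SHOW PARTITIONS uses).
    val names = time("get_partition_names") {
      client.listPartitionNames("mydb", "mytable", Short.MaxValue)
    }

    // Slow path: full Partition objects, including location and auth info
    // (what Spark's insertInto path ends up calling).
    val parts = time("get_partitions_with_auth") {
      client.listPartitionsWithAuthInfo(
        "mydb", "mytable", Short.MaxValue, "user", List.empty[String].asJava)
    }

    println(s"names=${names.size()}, partitions=${parts.size()}")
    client.close()
  }
}
```

The gap between the two timings should grow with the number of partitions in the target table, since only the second call fetches full partition metadata.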
Why do you need 1 partition when 10 partitions are doing the job?

Thanks,
Ankit
From: vincent gromakowski
Date: Thursday, 25. April 2019 at 09:12
To: Juho Autio
Cc: user
Subject: Re: [Spark SQL]: Slow insertInto overwrite if target table has many
partitions
Which metastore are you using?
On Thu, Apr 25, 2019 at 09:02, Juho Autio wrote:
Would anyone be able to answer this question about the non-optimal
implementation of insertInto?
On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote:
Hi,
My job is writing ~10 partitions with insertInto. With the same input /
output data the total duration of the job is very different depending on
how many partitions the target table has.
Target table with 10 partitions:
1 min 30 s
Target table with ~1 partitions:
13 min 0 s
It seems
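For reference, the write path under discussion might look like this sketch (the source and target table names are hypothetical; `spark.sql.sources.partitionOverwriteMode=dynamic` requires Spark 2.3+, and whether it avoids the expensive get_partitions_with_auth call depends on the Spark and Hive versions involved):

```scala
import org.apache.spark.sql.SparkSession

object InsertIntoOverwrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insertInto-overwrite")
      // Overwrite only the partitions present in the incoming data,
      // not the whole table.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .enableHiveSupport()
      .getOrCreate()

    // Let Hive resolve partition values from the data itself.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    // Hypothetical source: ~10 partitions' worth of new data.
    val df = spark.table("staging.events_hourly")

    // Column order must match the target table's schema for insertInto.
    df.write
      .mode("overwrite")
      .insertInto("warehouse.events_hourly")

    spark.stop()
  }
}
```

Even with dynamic overwrite, the observed slowdown suggests the partition listing for the whole target table is still fetched up front, which would explain why the job duration scales with the target table's total partition count rather than with the ~10 partitions being written.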