Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Juho Autio
> Not sure if the dynamic overwrite logic is implemented in Spark or in Hive. AFAIK I'm using the Spark implementation(s). Does the thread dump that I posted show that? I'd like to remain within the Spark implementation. What I'm trying to ask is: do you Spark developers see some way to optimize this? Otherwise, ...
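For reference, a minimal Scala sketch of staying on the Spark side of dynamic overwrite (the config key is a real Spark 2.3+ setting; the DataFrame and table names are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Spark-native dynamic overwrite: only the partitions present in the
      // incoming data are replaced, instead of the whole table.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .enableHiveSupport()
      .getOrCreate()

    // df is a hypothetical DataFrame whose columns match the target table,
    // with the partition columns last.
    df.write.mode("overwrite").insertInto("my_db.my_partitioned_table")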

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
There is probably a limit on the number of elements you can pass in the partition list for the listPartitionsWithAuthInfo API call. Not sure if the dynamic overwrite logic is implemented in Spark or in Hive; in the latter case, using Hive 1.2.1 is probably the reason for the un-optimized logic, but also ...
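Spark of this vintage bundles a Hive 1.2.1 metastore client by default; a hedged sketch of pointing it at a newer client instead (the config keys are real Spark settings; the version value is an assumption for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Replace the built-in Hive 1.2.1 client with a newer one; the client
      // jars are resolved from Maven here for simplicity.
      .config("spark.sql.hive.metastore.version", "2.3.3")
      .config("spark.sql.hive.metastore.jars", "maven")
      .enableHiveSupport()
      .getOrCreate()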

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Juho Autio
Ok, I've verified that hive> SHOW PARTITIONS uses get_partition_names, which is always quite fast. Spark's insertInto uses get_partitions_with_auth, which is much slower (it also fetches the location etc. of each partition). I created a test in Java with a local metastore client to measure the ...
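The original test was in Java; a hedged Scala sketch of the same measurement against a metastore (database, table, and user names are placeholders):

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    import scala.collection.JavaConverters._

    val client = new HiveMetaStoreClient(new HiveConf())
    val allParts: Short = -1  // -1 asks the metastore for all partitions

    def timed[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - t0) / 1e6}%.1f ms")
      result
    }

    // get_partition_names: names only -- what `hive> SHOW PARTITIONS` issues
    val names = timed("listPartitionNames") {
      client.listPartitionNames("my_db", "my_table", allParts)
    }

    // get_partitions_with_auth: full Partition objects (location, params, ...)
    val parts = timed("listPartitionsWithAuthInfo") {
      client.listPartitionsWithAuthInfo(
        "my_db", "my_table", allParts, "hive", List.empty[String].asJava)
    }

    println(s"${names.size} names, ${parts.size} partitions")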

Re: Different query result between spark thrift server and spark-shell

2019-04-25 Thread Jun Zhu
Never mind, I got the point: Spark replaces Hive's Parquet implementation with its own. One should set spark.sql.hive.convertMetastoreParquet=false to use Hive's. Thanks. On Thu, Apr 25, 2019 at 5:00 PM Jun Zhu wrote: > Hi, > We are using plugins from Apache Hudi, which defines its own Hive external > table ...
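A minimal sketch of the fix described (the config key is a real Spark setting; the session and table names are assumed):

    // Keep Hive's Parquet SerDe / InputFormat instead of Spark's native
    // Parquet reader, so the table's custom InputFormat (Hudi's) is used.
    spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
    spark.sql("SELECT COUNT(*) FROM my_db.my_hudi_table").show()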

Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
If I understand correctly, this would set the split size in the Hadoop configuration when reading the file. I can see that being useful when you want to create more partitions than the HDFS block size would dictate. Instead, what I want to do is create a single partition for each file written ...
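For context, a hedged sketch of the split-size approach being discussed (the Hadoop key is real; the size and path are assumptions):

    // A min split size larger than any input file discourages Hadoop from
    // splitting a file into multiple partitions; splits never span files,
    // so each sufficiently small file ends up as exactly one partition.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.minsize", 1024L * 1024 * 1024)
    val lines = sc.textFile("hdfs:///path/to/graph/edges")
    println(lines.getNumPartitions)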

Different query result between spark thrift server and spark-shell

2019-04-25 Thread Jun Zhu
Hi, We are using plugins from Apache Hudi, which define a Hive external table InputFormat, with: ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ('serialization.format' = '1') STORED AS INPUTFORMAT ...

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Khare, Ankit
Why do you need 1 partition when 10 partitions are doing the job? Thanks, Ankit From: vincent gromakowski Date: Thursday, 25 April 2019 at 09:12 To: Juho Autio Cc: user Subject: Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions Which metastore are you ...

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
Which metastore are you using? On Thu, Apr 25, 2019 at 09:02, Juho Autio wrote: > Would anyone be able to answer this question about the non-optimal > implementation of insertInto? > On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote: >> Hi, >> My job is writing ~10 partitions with ...

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Juho Autio
Would anyone be able to answer this question about the non-optimal implementation of insertInto? On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote: > Hi, > My job is writing ~10 partitions with insertInto. With the same input / > output data, the total duration of the job is very different ...
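A hedged sketch of the setup being reported, for reproduction (table and column names are hypothetical; the target table is assumed to exist, partitioned by dt, and to already contain many partitions):

    // ~10 partitions' worth of data; the slowdown reported scales with the
    // number of partitions already present in the target table, not with
    // the amount of data written.
    val df = spark.range(1000000L)
      .selectExpr("id AS value", "CAST(id % 10 AS STRING) AS dt")

    df.write.mode("overwrite").insertInto("my_db.target_table")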