Thank you, Subash. It works!
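For anyone who finds this thread later, this is roughly what I ended up running in spark-shell (a minimal sketch; `spark` is the session that spark-shell provides, and the query is the one from my first mail):

```
// Fall back to the Hive SerDe path instead of Spark's native Parquet
// conversion, so partitions are pruned via the Hive metastore rather
// than by listing every leaf directory.
// The setting can also be passed at launch time:
//   spark-shell --conf spark.sql.hive.convertMetastoreParquet=false
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

spark.sql("""
  SELECT rtb_platform_id, SUM(e_cpm)
  FROM raw_logs.fact_request
  WHERE day = '2019-08-01'
    AND hour = '00'
    AND platform = 'US'
    AND request_status = '3'
    AND is_sampled = 1
  GROUP BY rtb_platform_id
""").show
```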
On Tue, Aug 13, 2019 at 5:58 AM Subash Prabakar <subashpraba...@gmail.com> wrote:

> I had a similar issue reading an external parquet table. In my case I had
> a permission issue on one partition, so I added a filter to exclude that
> partition, but Spark still didn't prune it. Then I read that, in order for
> Spark to be aware of all the partitions, it first reads the folders and
> then updates its metastore, and only then is the SQL applied on top of it,
> instead of using the existing Hive SerDe. This property applies only to
> parquet files.
>
> Hive metastore Parquet table conversion
> <https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion>
>
> When reading from and writing to Hive metastore Parquet tables, Spark SQL
> will try to use its own Parquet support instead of Hive SerDe for better
> performance. This behavior is controlled by the
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by
> default.
>
> Reference:
> https://spark.apache.org/docs/2.3.0/sql-programming-guide.html
>
> Set the above property to false. It should work.
>
> If anyone has a better explanation, please let me know - I have the same
> question. Why does only parquet have this problem?
>
> Thanks
> Subash
>
> On Fri, 9 Aug 2019 at 16:18, Hao Ren <inv...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thank you for your reply.
>> I need to be more clear about the environment. I am using spark-shell to
>> run the query.
>> Actually, the query works even without core-site.xml and hdfs-site.xml
>> being under $SPARK_HOME/conf.
>> My problem is efficiency: all of the partitions are scanned instead of
>> just the one in question during the execution of the Spark SQL query,
>> which is why this simple query takes too much time.
>> I would like to know how to improve this by reading only the specific
>> partition in question.
>>
>> Feel free to ask more questions if I am not clear.
>>
>> Best regards,
>> Hao
>>
>> On Thu, Aug 8, 2019 at 9:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> You also need the others as well, via soft links (ls -l shows):
>>>
>>> cd $SPARK_HOME/conf
>>>
>>> hive-site.xml -> ${HIVE_HOME}/conf/hive-site.xml
>>> core-site.xml -> ${HADOOP_HOME}/etc/hadoop/core-site.xml
>>> hdfs-site.xml -> ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>> On Thu, 8 Aug 2019 at 15:16, Hao Ren <inv...@gmail.com> wrote:
>>>
>>>> ---------- Forwarded message ---------
>>>> From: Hao Ren <inv...@gmail.com>
>>>> Date: Thu, Aug 8, 2019 at 4:15 PM
>>>> Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive table
>>>> To: Gourav Sengupta <gourav.sengu...@gmail.com>
>>>>
>>>> Hi Gourav,
>>>>
>>>> I am using enableHiveSupport.
>>>> The table was not created by Spark. The table already exists in Hive.
>>>> All I did was read it with a SQL query in Spark.
>>>> FYI, I put hive-site.xml in the spark/conf/ directory to make sure that
>>>> Spark can access Hive.
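>>>> To be concrete, the session is built roughly like this (a minimal
>>>> sketch; the application name is illustrative, and spark-shell gives me
>>>> an equivalent `spark` session):
>>>>
>>>> ```
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> // Hive support is enabled so that Spark reads table and partition
>>>> // metadata from the Hive metastore configured in hive-site.xml.
>>>> val spark = SparkSession.builder()
>>>>   .appName("fact_request_query")  // illustrative
>>>>   .enableHiveSupport()
>>>>   .getOrCreate()
>>>> ```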
>>>>
>>>> Hao
>>>>
>>>> On Thu, Aug 8, 2019 at 1:24 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Just out of curiosity, did you start the Spark session using
>>>>> enableHiveSupport()?
>>>>>
>>>>> Or are you creating the table using Spark?
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Wed, Aug 7, 2019 at 3:28 PM Hao Ren <inv...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am using Spark SQL 2.3.3 to read a Hive table which is partitioned
>>>>>> by day, hour, platform, request_status and is_sampled. The underlying
>>>>>> data is in parquet format on HDFS.
>>>>>> Here is the SQL query to read just *one partition*:
>>>>>>
>>>>>> ```
>>>>>> spark.sql("""
>>>>>>   SELECT rtb_platform_id, SUM(e_cpm)
>>>>>>   FROM raw_logs.fact_request
>>>>>>   WHERE day = '2019-08-01'
>>>>>>     AND hour = '00'
>>>>>>     AND platform = 'US'
>>>>>>     AND request_status = '3'
>>>>>>     AND is_sampled = 1
>>>>>>   GROUP BY rtb_platform_id
>>>>>> """).show
>>>>>> ```
>>>>>>
>>>>>> However, from the Spark web UI, the stage description shows:
>>>>>>
>>>>>> ```
>>>>>> Listing leaf files and directories for 201616 paths:
>>>>>> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
>>>>>> ...
>>>>>> ```
>>>>>>
>>>>>> It seems the job is reading all of the partitions of the table, and it
>>>>>> takes too long for just one partition. One workaround is using the
>>>>>> `spark.read.parquet` API to read the parquet files directly, since
>>>>>> Spark has partition awareness for partitioned directories.
>>>>>>
>>>>>> But still, I would like to know if there is a way to leverage
>>>>>> partition awareness via Hive by using the `spark.sql` API?
>>>>>>
>>>>>> Any help is highly appreciated!
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> --
>>>>>> Hao Ren
>>>>>
>>>>
>>>> --
>>>> Hao Ren
>>>>
>>>> Software Engineer in Machine Learning @ Criteo
>>>>
>>>> Paris, France
>>>
>>
>> --
>> Hao Ren
>>
>> Software Engineer in Machine Learning @ Criteo
>>
>> Paris, France
>>
>
--
Hao Ren

Software Engineer in Machine Learning @ Criteo

Paris, France
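P.S. For anyone who prefers not to change the conversion setting, the `spark.read.parquet` workaround mentioned in my first mail looks roughly like this (a sketch only; the path is the table root from the listing above, and `spark` is the spark-shell session):

```
import org.apache.spark.sql.functions.{col, sum}

// Reading the partitioned directory tree directly: Spark discovers the
// partition columns (day, hour, platform, request_status, is_sampled)
// from the directory names and can prune on them when we filter.
// Pointing the path at a single partition directory also works, if the
// partition columns are not needed in the result.
val df = spark.read.parquet("viewfs://root/user/bilogs/logs/fact_request")

df.filter(
    col("day") === "2019-08-01" &&
    col("hour") === "00" &&
    col("platform") === "US" &&
    col("request_status") === "3" &&
    col("is_sampled") === 1)
  .groupBy("rtb_platform_id")
  .agg(sum("e_cpm"))
  .show()
```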