Re: General Question (Spark Hive integration)

Thanks for the response, Silvio. My table is not partitioned because my filter column is the primary key; I guess we can't partition on a primary-key column. The table has 600 million rows, and when I query a single record it seems to load the whole table by default, taking a long time just to return one record. Please suggest anything I can tune here. Mine is a 5-node Spark cluster.

Bala

> On Jan 21, 2016, at 7:07 PM, Silvio Fiorito wrote:
>
> Also, just to clarify: it doesn’t read the whole table into memory unless you
> specifically cache it.
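Since the filter column is the primary key, directory partitioning isn't a good fit, but columnar formats such as Parquet or ORC store min/max statistics per chunk of rows, and Spark can skip chunks whose range cannot contain the key. This works best when the table is written sorted by that key. Below is a minimal, hypothetical sketch of the min/max-skipping idea in plain Python (not Spark's actual implementation; all names are illustrative):

```python
# Hypothetical sketch of min/max chunk skipping, the idea behind
# Parquet/ORC predicate pushdown. Plain Python, NOT Spark code;
# all names here are illustrative.

def make_chunks(keys, chunk_size):
    """Split sorted keys into chunks, recording min/max stats per chunk
    (as a columnar file format does per row group)."""
    chunks = []
    for i in range(0, len(keys), chunk_size):
        block = keys[i:i + chunk_size]
        chunks.append({"min": block[0], "max": block[-1], "rows": block})
    return chunks

def lookup(chunks, key):
    """Read only chunks whose [min, max] range can contain `key`."""
    chunks_read = 0
    for c in chunks:
        if key < c["min"] or key > c["max"]:
            continue  # skipped using statistics alone; chunk never read
        chunks_read += 1
        if key in c["rows"]:
            return key, chunks_read
    return None, chunks_read

# 1,000 rows sorted by primary key, split into 10 chunks of 100
chunks = make_chunks(list(range(1000)), chunk_size=100)
hit, read = lookup(chunks, 437)  # only the 400-499 chunk is read
```

In Spark terms this roughly corresponds to writing the table in Parquet or ORC sorted by the key and leaving filter pushdown enabled; without partitioning, chunk-level statistics are usually the main lever for speeding up point lookups.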
Re: General Question (Spark Hive integration)

Also, just to clarify: it doesn’t read the whole table into memory unless you specifically cache it.

From: Silvio Fiorito <silvio.fior...@granturing.com>
Date: Thursday, January 21, 2016 at 10:02 PM
To: "Balaraju.Kagidala Kagidala" <balaraju.kagid...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: General Question (Spark Hive integration)
Re: General Question (Spark Hive integration)

Hi Bala,

It depends on how your Hive table is configured. If you used partitioning and you are filtering on a partition column, then it will only load the relevant partitions. If, however, you’re filtering on a non-partitioned column, then it will have to read all the data and filter as part of the Spark job.

Thanks,
Silvio

From: "Balaraju.Kagidala Kagidala" <balaraju.kagid...@gmail.com>
Date: Thursday, January 21, 2016 at 9:37 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: General Question (Spark Hive integration)
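To make the partition-pruning point concrete: a Hive-partitioned table is laid out on disk as one directory per partition value (e.g. `deptno=10/`), so a filter on the partition column lets the reader skip entire directories before opening any file. Here is a small, hypothetical stdlib-Python sketch of that layout and of pruning (not Spark/Hive code; the file names are made up):

```python
# Hypothetical sketch (plain Python, not Spark/Hive code) of how a
# Hive-partitioned table is laid out on disk and how a filter on the
# partition column prunes whole directories. File names are made up.
import os
import tempfile

def write_partitioned_table(root):
    # Hive stores one directory per partition value, e.g. deptno=10/
    for deptno, rows in {10: ["alice", "bob"], 20: ["carol"]}.items():
        part_dir = os.path.join(root, f"deptno={deptno}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "part-00000.txt"), "w") as f:
            f.write("\n".join(rows))

def scan(root, deptno_filter=None):
    """Return (rows, files_read). With a partition filter, directories
    for other partition values are skipped before any file is opened."""
    rows, files_read = [], 0
    for part in os.listdir(root):
        value = int(part.split("=")[1])
        if deptno_filter is not None and value != deptno_filter:
            continue  # partition pruned: nothing in this directory is read
        part_dir = os.path.join(root, part)
        for name in os.listdir(part_dir):
            files_read += 1
            with open(os.path.join(part_dir, name)) as f:
                rows.extend(f.read().splitlines())
    return rows, files_read

with tempfile.TemporaryDirectory() as root:
    write_partitioned_table(root)
    all_rows, n_all = scan(root)                     # full scan: 2 files
    dept10, n_pruned = scan(root, deptno_filter=10)  # pruned: 1 file
```

Filtering on a non-partitioned column corresponds to the full-scan path here: every directory and every file must be read before the filter can be applied.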
General Question (Spark Hive integration)

Hi,

I have a simple question regarding Spark Hive integration with DataFrames.

When we query a table, does Spark load the whole table into memory and apply the filter on top of it, or does it load only the data with the filter applied?

For example, for the query 'select * from employee where deptno=10', does my RDD load the whole employee data into memory and apply the filter, or will it load only dept number 10 data?

Thanks,
Bala