You need the table in an efficient format, such as ORC or Parquet. Sort the table appropriately (hint: by the most discriminating column in the WHERE clause). Do not use SAN storage or virtualization for the slave nodes.
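A rough sketch of what I mean (not code from this thread; all table and column names such as raw_events, event_date and customer_id are made-up placeholders, and the API shown is the Spark 1.6-era HiveContext that was current at the time):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ConvertToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ConvertToParquet"))
    val hiveContext = new HiveContext(sc)

    // Rewrite the raw Hive table as Parquet, partitioned on a coarse column
    // and sorted within partitions on the most discriminating filter column,
    // so that predicates can prune directories and skip Parquet row groups.
    hiveContext.table("raw_events")
      .repartition(200)                      // spread ~450 GB across tasks
      .sortWithinPartitions("customer_id")   // most selective WHERE column
      .write
      .format("parquet")
      .partitionBy("event_date")             // enables partition pruning
      .saveAsTable("raw_events_parquet")

    sc.stop()
  }
}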
Could you please post your query? I always recommend avoiding single updates where possible; they are very inefficient in analytics scenarios. This is also somewhat true in the traditional database world (depending on the use case, of course). A small sketch follows below the quoted message.

> On 07 Jan 2016, at 05:47, Balaraju.Kagidala Kagidala <balaraju.kagid...@gmail.com> wrote:
>
> Hi,
>
> I am a new Spark user. I am trying to use Spark to process huge Hive data using Spark DataFrames.
>
> I have a 5-node Spark cluster, each node with 30 GB of memory. I want to process a Hive table with 450 GB of data using DataFrames. Fetching a single row from the Hive table takes 36 minutes. Please suggest what is wrong here; any help is appreciated.
>
> Thanks
> Bala
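Purely as an illustration of the point above, here is a minimal hypothetical sketch of the difference between a point lookup that scans the whole table and a filter that can be pruned, assuming the table has been rewritten as partitioned Parquet as in my previous mail (same placeholder names, Spark 1.6 HiveContext API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PointLookupExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PointLookupExample"))
    val hiveContext = new HiveContext(sc)

    // Anti-pattern: a point lookup on an unsorted, unpartitioned table forces
    // a full scan of all 450 GB just to return one row.
    val slow = hiveContext.sql(
      "SELECT * FROM raw_events WHERE customer_id = 12345 LIMIT 1").collect()

    // Better: filter on the partition column so Spark prunes directories, then
    // on the sorted column so Parquet statistics can skip most row groups.
    val fast = hiveContext.table("raw_events_parquet")
      .filter("event_date = '2016-01-07'")
      .filter("customer_id = 12345")
      .limit(1)
      .collect()

    sc.stop()
  }
}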