Hello Nitin,

Thanks for suggesting partitions. There is one thing I forgot to mention earlier:

* I am using indexes on all the tables that are used again and again.
* The problem is that after execution I did not see any difference in performance between the runs before and after applying the indexes.

I have created the indexes as below:

    sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE";
    res2 = stmt2.executeQuery(sql);

    sql = new StringBuilder(" INSERT OVERWRITE TABLE Tmp select C1.Uid, C1.VisitDate, C1.ID from TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and C1.VisitDate=T.VisitDate").toString();
    stmt2.executeUpdate(sql);

    sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table TmpElementTable";
    stmt2.executeUpdate(sql);

    sql = "alter index clinical_index on TmpElementTable REBUILD";
    res2 = stmt2.executeQuery(sql);

*Did I use them in the correct way?*
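
One thing that stands out in the statements above is that the index is created as INDEX_VisitDate on Tmp, while the REBUILD targets clinical_index on TmpElementTable. An index created WITH DEFERRED REBUILD stays empty until that same index is rebuilt. Below is a minimal sketch, not a definitive fix, of generating both statements from one name and one table so they cannot drift apart. The class HiveIndexSql and its methods are hypothetical helpers, and the index name and table are simply reused from the code above.

```java
// Hypothetical helper (not part of Hive or JDBC): builds the HiveQL strings so
// that CREATE INDEX and ALTER INDEX ... REBUILD always use the same name and table.
final class HiveIndexSql {
    private HiveIndexSql() {}

    // CREATE INDEX ... WITH DEFERRED REBUILD leaves the index empty until it is rebuilt.
    static String createCompactIndex(String indexName, String table, String... columns) {
        return "CREATE INDEX " + indexName
             + " ON TABLE " + table + " (" + String.join(", ", columns) + ")"
             + " AS 'COMPACT' WITH DEFERRED REBUILD";
    }

    // The rebuild must name the index and base table used in the CREATE statement.
    static String rebuildIndex(String indexName, String table) {
        return "ALTER INDEX " + indexName + " ON " + table + " REBUILD";
    }

    public static void main(String[] args) {
        String index = "INDEX_VisitDate"; // taken from the statements above
        String table = "Tmp";             // taken from the statements above
        System.out.println(createCompactIndex(index, table, "Uid", "VisitDate"));
        // Run the rebuild after the data has been loaded into the table.
        System.out.println(rebuildIndex(index, table));
    }
}
```

The generated strings can be passed to the existing Statement in place of the hand-written ones. As far as I know, Hive versions of this period also only consider an index for filter predicates when hive.optimize.index.filter is set to true, so it is worth checking that setting before comparing timings.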

You also told me to try partitions. In my case I am altering the table at runtime to add a large number of columns. In such a situation, is it good to partition on all the columns?

So I want to know: after using partitions, will that alone improve the performance, or do I need to use both partitions and indexes?

--
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> It is definitely possible to increase your performance.
>
> I have run queries where more than 10 billion records were involved.
> If you are doing joins in your queries, you may want to look at the
> different kinds of joins supported by Hive.
> If one of your tables is very small compared to the other, then
> you may consider a map-side join etc.
>
> Also, the number of maps and reducers is decided by the split size you
> provide to the maps.
>
> I would suggest that before you go full speed, you decide how you want to
> lay out the data for Hive.
>
> You can try loading some data, partitioning it, and writing queries based
> on the partitions; then performance will improve, but in that case your
> queries will be in a batch-processing format. There are other approaches
> as well.
>
>
> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
>> I do not know how many maps and reducers there are, because for
>> some reason my instance got terminated :(
>> I want to know one thing: if we use multiple nodes, what should
>> the count of maps and reducers be?
>> Actually I am confused about that. How do I decide it?
>>
>> I also want to try different properties like block size, compressed
>> output, size of the in-memory buffer, parallel execution etc.
>> Will all these properties matter for increasing the performance?
>>
>> Nitin, you have read my whole use case. Is what I did to
>> implement it with the help of Hadoop correct?
>> Is it possible to increase the performance?
>>
>> Thanks Nitin for your reply. :)
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> With a 10 node cluster the performance should improve.
>>> How many maps and reducers are being launched?
>>>
>>>
>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah
>>> <bhavesh25s...@gmail.com> wrote:
>>>
>>>> I have about 1 billion records in my relational database.
>>>> Currently, locally, I am using just one cluster. But I also tried this on
>>>> Amazon Elastic MapReduce with 10 nodes, and the time taken to execute the
>>>> complete program is the same as on my single local machine.
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar
>>>> <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> How many records?
>>>>>
>>>>> What is your Hadoop cluster setup? How many nodes?
>>>>> If you are running Hadoop on a single-node setup with a normal desktop,
>>>>> I doubt it will be of any help.
>>>>>
>>>>> You need a stronger cluster setup for better query runtimes, and
>>>>> of course query optimization, which I guess you have already taken
>>>>> care of.
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>> bhavesh25s...@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> My use case is:
>>>>>> 1) I have a relational database (MS SQL Server) which holds a very
>>>>>> large amount of data.
>>>>>> 2) I want to analyze this data and generate reports from the
>>>>>> analysis.
>>>>>> In this way I have to generate various reports based on different
>>>>>> analyses.
>>>>>>
>>>>>> I tried to implement this using Hive. What I did is:
>>>>>> 1) I imported all tables into Hive from MS SQL Server using SQOOP.
>>>>>> 2) I wrote many queries in Hive, which are executed using JDBC on the
>>>>>> Hive Thrift Server.
>>>>>> 3) I am getting the correct result in table form, as I expect.
>>>>>> 4) But the problem is that the time required to execute is far too
>>>>>> long.
>>>>>> (My complete program takes about 3-4 hours to execute on a *small
>>>>>> amount of data*).
>>>>>>
>>>>>>
>>>>>> I decided to do this using Hive, and as I said above, Hive takes
>>>>>> a long time to execute. My organization expects this task to
>>>>>> complete in less than half an hour.
>>>>>>
>>>>>> Now, after spending so much time on the complete execution of this
>>>>>> task, what should I do?
>>>>>> I want to ask one thing:
>>>>>> *Is this use case possible with Hive?* If it is, what should I do in
>>>>>> my program to increase the performance?
>>>>>> *And if it is not possible, what is another good way to implement
>>>>>> this use case?*
>>>>>>
>>>>>>
>>>>>> Please reply.
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>
>
> --
> Nitin Pawar
>
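
On the question quoted above of how the number of maps and reducers is decided, here is a rough back-of-the-envelope sketch of the split-size rule mentioned in the reply. It assumes roughly one map task per input split, and a reducer count derived from a bytes-per-reducer target with an upper cap, which is the idea behind Hive's hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max settings. The class TaskCountEstimate and all the sizes used are illustrative assumptions, not values from this thread, and real counts also depend on file layout, input format, and cluster configuration.

```java
// Rough estimate only: a hypothetical helper for reasoning about task counts.
final class TaskCountEstimate {
    private TaskCountEstimate() {}

    // Ceiling division for positive values.
    static long ceilDiv(long numerator, long denominator) {
        return (numerator + denominator - 1) / denominator;
    }

    // Approximately one map task per input split, never fewer than one.
    static long estimateMapTasks(long inputBytes, long splitBytes) {
        return Math.max(1, ceilDiv(inputBytes, splitBytes));
    }

    // Input size divided by the per-reducer target, capped at maxReducers.
    static long estimateReduceTasks(long inputBytes, long bytesPerReducer, long maxReducers) {
        return Math.max(1, Math.min(maxReducers, ceilDiv(inputBytes, bytesPerReducer)));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        long input = 10 * gb; // example input size, an assumption for illustration

        System.out.println("maps:     " + estimateMapTasks(input, 64 * mb));
        System.out.println("reducers: " + estimateReduceTasks(input, gb, 999));
    }
}
```

With these example numbers the sketch prints 160 maps and 10 reducers. The point is only that a smaller split size or a smaller bytes-per-reducer target yields more tasks, so a 10 node cluster is only kept busy when the input is split into enough pieces.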