Re: Is my Use Case possible with Hive?

Bhavesh Shah Mon, 14 May 2012 04:09:28 -0700

Hello Nitin,
Thanks for suggesting partitioning.
One thing I forgot to mention earlier: *I am using indexes on all the
tables that are used again and again.*
The problem is that I did not see any difference in performance
before and after applying the indexes.
I created the indexes as below:
sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as "
    + "'COMPACT' WITH DEFERRED REBUILD stored as RCFILE";
res2 = stmt2.executeQuery(sql);
sql = "INSERT OVERWRITE TABLE Tmp select C1.Uid, C1.VisitDate, C1.ID from "
    + "TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and "
    + "C1.VisitDate=T.VisitDate";
stmt2.executeUpdate(sql);
sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table "
    + "TmpElementTable";
stmt2.executeUpdate(sql);
sql = "alter index clinical_index on TmpElementTable REBUILD";
res2 = stmt2.executeQuery(sql);
*Did I use it in the correct way?*
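
To make the question concrete, here is a minimal sketch of the sequence as I
understand it: create the index WITH DEFERRED REBUILD (it stays empty at that
point), load the data, then rebuild that same index on the same table. The
class and helper names (HiveIndexSketch, createIndex, rebuildIndex) are
hypothetical and only build the statement text; actually running the
statements would still go through a JDBC Statement such as stmt2.

```java
public class HiveIndexSketch {
    // Builds the CREATE INDEX statement. With DEFERRED REBUILD the index
    // is empty until an ALTER INDEX ... REBUILD is issued.
    static String createIndex(String index, String table, String columns) {
        return "CREATE INDEX " + index + " ON TABLE " + table + "(" + columns + ") "
             + "AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE";
    }

    // Builds the REBUILD statement; it must name the same index and the
    // same base table that were used in CREATE INDEX.
    static String rebuildIndex(String index, String table) {
        return "ALTER INDEX " + index + " ON " + table + " REBUILD";
    }

    public static void main(String[] args) {
        // Index and table names taken from the CREATE INDEX statement above.
        System.out.println(createIndex("INDEX_VisitDate", "Tmp", "Uid,VisitDate"));
        // ... load or insert the data into Tmp here ...
        System.out.println(rebuildIndex("INDEX_VisitDate", "Tmp"));
    }
}
```

The point of the two helpers is only that the index name and table passed to
the rebuild are the same ones passed to the create.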


You told me to try partitioning.
However, I am altering tables with a large number of columns at
runtime.
In that situation, is it a good idea to partition on
all columns?

So I want to know: will partitioning alone improve the
performance, or
do I need to use both partitions and indexes?
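
For reference, this is roughly what I understand by partitioning: a minimal
sketch that partitions on a single column (VisitDate) rather than on all
columns. The table name TmpPartitioned, the column types, and the date value
are assumptions for illustration only, and the helpers (hypothetical names)
just build the HiveQL text.

```java
public class HivePartitionSketch {
    // DDL for a table partitioned on one column. The column types are
    // assumed; the partition column is declared only in PARTITIONED BY.
    static String createPartitionedTable(String table) {
        return "CREATE TABLE " + table + " (Uid INT, ID INT) "
             + "PARTITIONED BY (VisitDate STRING)";
    }

    // Query that filters on the partition column, so Hive only needs to
    // read the files of the matching partition.
    static String selectOnePartition(String table, String visitDate) {
        return "SELECT Uid, ID FROM " + table
             + " WHERE VisitDate = '" + visitDate + "'";
    }

    public static void main(String[] args) {
        // Hypothetical table name and date value.
        System.out.println(createPartitionedTable("TmpPartitioned"));
        System.out.println(selectOnePartition("TmpPartitioned", "2012-05-14"));
    }
}
```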




-- 
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> It is definitely possible to improve your performance.
>
> I have run queries where more than 10 billion records were involved.
> If you are doing joins in your queries, have a look at the different
> kinds of joins supported by Hive.
> If one of your tables is very small compared to the other, you may
> consider a map-side join, etc.
>
> Also, the number of maps and reducers is decided by the split size you
> provide to the maps.
>
> I would suggest that before you go full speed, you decide how you want
> to lay out the data for Hive.
>
> You can try loading some data, partitioning it, and writing queries based
> on the partitions; performance will then improve, but in that case your
> queries will be in batch-processing format. There are other approaches as well.
>
>
> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
>> I don't know how many maps and reducers there were, because for some
>> reason my instance got terminated.   :(
>> One thing I want to know: if we use multiple nodes, what should the
>> count of maps and reducers be?
>> I am confused about that. How do I decide it?
>>
>> I also want to try different properties such as block size, compressed
>> output, size of the in-memory buffer, parallel execution, etc.
>> Will all of these properties matter for improving the performance?
>>
>> Nitin, you have read my whole use case. Is what I did to implement it
>> with Hadoop correct?
>> Is it possible to improve the performance?
>>
>> Thanks, Nitin, for your reply.   :)
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>>
>> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> With a 10-node cluster the performance should improve.
>>> How many maps and reducers are being launched?
>>>
>>>
>>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah
>>> <bhavesh25s...@gmail.com> wrote:
>>>
>>>> I have about 1 billion records in my relational database.
>>>> Locally I am currently using just one cluster, but I also tried this on
>>>> Amazon Elastic MapReduce with 10 nodes. The time taken to execute the
>>>> complete program was the same as on my single local machine.
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar
>>>> <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> How many records?
>>>>>
>>>>> What is your Hadoop cluster setup? How many nodes?
>>>>> If you are running Hadoop on a single-node setup on a normal desktop,
>>>>> I doubt it will be of any help.
>>>>>
>>>>> You need a stronger cluster setup for better query runtimes, and
>>>>> of course query optimization, which I guess you have already taken
>>>>> care of.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <
>>>>> bhavesh25s...@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> My use case is:
>>>>>> 1) I have a relational database (MS SQL Server) that holds a very
>>>>>> large amount of data.
>>>>>> 2) I want to analyze this data and generate reports from the
>>>>>> analysis.
>>>>>> In this way I have to generate various reports based on different
>>>>>> analyses.
>>>>>>
>>>>>> I tried to implement this using Hive. What I did is:
>>>>>> 1) I imported all tables into Hive from MS SQL Server using Sqoop.
>>>>>> 2) I wrote many Hive queries, which are executed over JDBC on the Hive
>>>>>> Thrift Server.
>>>>>> 3) I am getting the correct results in table form, as expected.
>>>>>> 4) But the problem is that the execution time is far too long.
>>>>>>    (My complete program takes about 3-4 hours on a *small
>>>>>> amount of data*).
>>>>>>
>>>>>>
>>>>>>    I decided to do this using Hive, and as I said above, Hive takes
>>>>>> a long time to execute. My organization expects this task to complete
>>>>>> in less than half an hour.
>>>>>>
>>>>>> Given that the complete execution of this task takes so long, what
>>>>>> should I do?
>>>>>> I want to ask one thing:
>>>>>> *Is this use case possible with Hive?* If so, what should I do in
>>>>>> my program to improve the performance?
>>>>>> *And if not, what is another good way to implement this use
>>>>>> case?*
>>>>>>
>>>>>>
>>>>>> Please reply.
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Bhavesh Shah
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Bhavesh Shah
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>>
>>
>>
>>
>>
>
>
> --
> Nitin Pawar
>
>
