External tables do not all support index creation.  If you are using a non-HDFS 
data store, say S3, you can not create an index against that table.

I am not sure about other non-HDFS data stores.

Pat

From: Soren Flexner [mailto:[email protected]]
Sent: Tuesday, April 26, 2011 11:18 AM
To: [email protected]
Subject: Re: External tables pros & cons


  The only real difference between a managed table and an external table is the 
location of the data file on HDFS.

  - Managed table:

  Say you have some data you want to analyze in Hive, with fields (foo, bar).

  So you CREATE a table "table_name", with fields (foo,bar) and LOAD the data 
into that table using the Hive cli client.  That's a 'managed' table, and hive 
will decide where the data gets stored on HDFS ( probably 
/hive/warehouse/table_name/...)

  - External table:

  Say that instead, you have some data already on HDFS, maybe in 
/user/login/data/..., also of the form (foo,bar).  Maybe you use this data in 
some MapReduce processes, so you wouldn't want to move it out of the directory 
it's in.  But maybe you DO want to analyze it using Hive.

  So you CREATE EXTERNAL a table "table_name", and include the path to the 
data.  In this case, you don't LOAD the data using the Hive client, because the 
data is already on HDFS.  You just point Hive to the location of the data.

  You can find out where Hive thinks the data for a table is by issuing the 
command: describe extended table_name;   The path is in the (somewhat messy) 
result.

  To convince yourself that there's really NO DIFFERENCE between a managed 
table and an external table, create a managed table and then hadoop fs -mv the 
data for it from an 'external' location into /hive/warehouse/table_name/...   
as long as the schema matches up with the file, hive will magically find the 
data.

  One important realization is that Hive doesn't "do" anything to your data in 
either case. The data is just stored as a set of files on HDFS (say, tab 
delimited), the contents of which look roughly like:

foo1\tbar1
foo2\tbar2
foo3\tbar3
foo4\tbar4
etc...

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL

  -s

P.S. As an aside, it should be clear by now that Ranjith's comment is correct:  
if you want multiple schemas for the same data, you'll need to make N-1 of them 
EXTERNAL tables, since Hive will create a copy of your data under 
/hive/warehouse/tableN/...  for each table you LOAD to.  You could create the 
first table as a managed table, and then point all the others to the directory 
which Hive creates.
On Tue, Apr 26, 2011 at 9:42 AM, Raghunath, Ranjith 
<[email protected]<mailto:[email protected]>> wrote:
>From my understanding of external tables they are great if we need multiple 
>schemas associated with the same base data. A place where this could come 
>handy is if we using custom SerDe implementation. The drawback that I have 
>noticed has to do with binary file formats. One will have to use a custom map 
>reduce job to generate the base data if you plan to use that as an external 
>table. Please feel free to correct me here if I am wrong.


Thanks,
Ranjith

________________________________
From: Shantian Purkad 
<[email protected]<mailto:[email protected]>>
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Sent: Tue Apr 26 11:18:46 2011
Subject: External tables pros & cons
Hi,

What is the difference between external table and managed tables (apart from 
data being stored outside hive warehouse and not deleted when table is dropped)

Are there any drawbacks of external tables? Or lack of support for certain 
features (such as indexes, joins, subqueries etc etc )?

When and why should one not use external tables?

Thanks and Regards,
Shantian

Reply via email to