Re: new feature in hive: links

2012-05-22 Thread Ashutosh Chauhan
To kick-start the review, I did a quick pass over the doc. A few questions
popped out at me, which I asked. Sambavi was kind enough to come back with
replies for them. I am continuing to look into it and would encourage other
folks to look into it as well.


Thanks,

Ashutosh


Begin Forwarded Message


Hi Ashutosh

Thanks for looking through the design and providing your feedback!

Responses below:

* What exactly is contained in tracking capacity usage? One part is disk
space, which I presume you are going to track by summing sizes under the
database directory. Are you also thinking of tracking resource usage in
terms of CPU/memory/network utilization for different teams?

Right now the capacity usage we will track in Hive is disk space (managed
tables that belong to the namespace + imported tables). We will track the
mappers and reducers that the namespace utilizes directly from Hadoop.
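
For reference, summing disk usage under a database directory is a one-liner
from the Hive CLI. A minimal sketch (the warehouse path is an assumption;
actual locations vary by deployment):

    -- Total bytes under one namespace's database directory; on newer
    -- Hadoop versions the flag is spelled -du -s rather than -dus:
    dfs -dus /user/hive/warehouse/team_a.db;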

* Each namespace (ns) will have exactly one database. If so, are users not
allowed to create/use additional databases in such a deployment? Not
necessarily a problem; I am just trying to understand the design.

Yes, you are correct – this is a limitation of the design. Introducing a
new concept seemed heavyweight, so you can instead think of this as
“self-contained” databases. But it means that a given namespace cannot have
sub-databases in it.

* How are you going to keep metadata consistent across two ns? If metadata
gets updated in the remote ns, will it be automatically updated in the
user's local ns? If yes, how will this be implemented? If not, then every
time the user needs to use data from the remote ns, she has to bring the
metadata up to date in her ns. How will she do that?

Metadata will be kept in sync for linked tables. We will make an ALTER TABLE
on the remote table (the source of the link) cause an update to the target
of the link. Note that from a Hive perspective, the metadata for the source
and target of a link lives in the same metastore.
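
As a concrete (hypothetical) illustration, with invented table and namespace
names:

    -- clicks in ns_a is the source of a link whose target lives in ns_b.
    -- Under the proposal, an ALTER on the source...
    ALTER TABLE clicks SET TBLPROPERTIES ('comment' = 'click stream, v2');
    -- ...is trapped by Hive and replayed onto the link target, which is
    -- cheap because both entries live in the same metastore.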

* Is it even required that the metadata of two linked tables be consistent?
It seems the user has to run ALTER LINK ADD PARTITION herself for each
partition. She can choose to add only a few partitions, in which case the
tables in the two ns have different numbers of partitions, and thus
different data.

What you say above is true for static links. For dynamic links, add and
drop partition on the source of the link will cause the target to get those
partitions as well (we trap ALTER TABLE ADD/DROP PARTITION to provide this
behavior).
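
To make the distinction concrete, here is a sketch using the ALTER LINK
syntax mentioned above (the exact keywords are proposal syntax, not shipped
Hive, and the names are invented):

    -- Static link: the user pulls each partition across explicitly.
    ALTER LINK clicks@ns_a ADD PARTITION (ds = '2012-05-21');

    -- Dynamic link: no per-partition commands. When the source side runs
    --   ALTER TABLE clicks ADD PARTITION (ds = '2012-05-22');
    -- Hive traps the call and adds the partition to the link target too.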

* Who is allowed to create links?

Any user who has CREATE/ALL privileges on the database. We could
potentially create a new privilege for this, but I think the CREATE
privilege should suffice. We can similarly map the ALTER and DROP
privileges to the appropriate operations.
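
In terms of Hive's existing authorization statements, the mapping might look
like this (a sketch: the GRANT grammar is today's, gating link operations on
it is the proposal's behavior, and the names are invented):

    -- A user holding CREATE on the database could create links into it:
    GRANT CREATE ON DATABASE team_a TO USER alice;
    -- ALTER and DROP would similarly gate ALTER LINK and DROP LINK.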

* Once a user creates a link, who can use it? If everyone is allowed
access, then I don't see how it is different from the problem you outline
in the first alternative design option, wherein a user having access to two
ns via roles has access to the data in both.

The link creates metadata in the target database, so you can only access
data that has been linked into this database (access is via the T@Y or Y.T
syntax, depending on the chosen design option). Note that this is different
from having a role that a user maps to, since in that case there is no
local metadata in the target database specifying whether the imported data
is accessible from this database.
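
For example, with a table clicks linked in from namespace ns_a (names
invented; which syntax applies depends on the chosen design option):

    -- Option 1: dedicated link syntax
    SELECT COUNT(*) FROM clicks@ns_a;
    -- Option 2: ordinary two-part names
    SELECT COUNT(*) FROM ns_a.clicks;
    -- Either way, the query succeeds only if link metadata for clicks
    -- exists in the local database.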

* If links are first-class concepts, then the authorization model also
needs to understand them, right? I don't see any mention of that.

Yes, you are correct. We need to account for the authorization model.

* I see there is an HDFS JIRA for implementing hard links of files at the
HDFS layer, so that takes care of linking physical data on HDFS. What about
tables whose data is stored in external systems, for example HBase? Does
HBase also need to implement hard-linking of its tables for Hive to make
use of this feature? What about other storage handlers like Cassandra,
MongoDB, etc.?

The link does not create a link on HDFS; it just points to the source
table/partitions. You can think of it as a Hive-level link, so there is no
need for any changes/features from the other storage handlers.
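
Hypothetically, then, linking an HBase-backed table needs nothing new from
the handler (the CREATE LINK syntax below is assumed from the proposal;
names are invented):

    -- The link is just new metastore metadata pointing at the source table;
    -- reads still go through the source's existing HBase storage handler.
    CREATE LINK TO hbase_events@ns_a;
    SELECT COUNT(*) FROM hbase_events@ns_a;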

* Migration will involve a two-step process of distcp'ing data from one
cluster to another and then replicating one MySQL instance to another. Are
there any other steps? Do you plan to (later) build tools to automate this
migration process?

Yes, we will be building tools to enable migration of a namespace.
Migration will involve replicating the metadata and the data, as you
mention above.

* When migrating a ns from one datacenter to another, will links be
dropped, or are they also preserved?

We will preserve them, by copying the data for the links to the other
datacenter.

Hope that helps. Please ask any more questions that come up as you continue
your review.

Re: new feature in hive: links

2012-05-22 Thread Carl Steinbach
I added the comments/questions to the wiki (
https://cwiki.apache.org/confluence/display/Hive/Links). I'm also copying
them here:

The first draft of this proposal is very hard to decipher because it relies
on terms that aren't well defined. For example, here's the second sentence
from the motivations section:

bq. Growth beyond a single warehouse (or) separation of capacity usage and
allocation requires the creation of multiple physical warehouses, i.e.,
separate Hive instances.

What's the difference between a warehouse and a physical warehouse? How do
you define a Hive instance? In the requirements section the term virtual
warehouse is introduced and equated to a namespace, but clearly it's more
than that because otherwise DBs/Schemas would suffice. Can you please
update the proposal to include definitions for these terms?


bq. Prevent access using two part name syntax (Y.T) if namespaces feature
is on in a Hive instance. This ensures the database is self-contained.

The cross-namespace HiveConf ACL proposed in HIVE-3016 doesn't prevent
anyone from doing anything because there is no way to keep users from
disabling it. I'm surprised to see this ticket mentioned here since three
committers have already gone on record saying that this is the wrong
approach, and one committer even -1'd it. If preventing cross-db references
in queries is a requirement for this project, then I think Hive's
authorization mechanism will need to be extended to support this
privilege/restriction.

From the design section:

bq. We are building a namespace service external to Hive that has metadata
on namespace location across the Hive instances, and allows importing data
across Hive instances using replication.

Does the work proposed in HIVE-2989 also include adding this Db/Table
replication infrastructure to Hive? If so, what is the timeline for adding
it?

Thanks.

Carl

new feature in hive: links

2012-05-21 Thread Namit Jain


There is a requirement for a new feature, for which Sambavi has written a
detailed overview: https://cwiki.apache.org/confluence/display/Hive/Links.

We would like to get the core concepts out of it and implement them in
open-source Hive, so that the whole community gets them. These are new
concepts, so please comment, and we can take it forward accordingly.


Thanks,
-namit