[jira] [Created] (HIVE-11023) Disable directSQL if datanucleus.identifierFactory = datanucleus2

Sushanth Sowmyan (JIRA) Tue, 16 Jun 2015 09:28:03 -0700

Sushanth Sowmyan created HIVE-11023:
---------------------------------------


             Summary: Disable directSQL if datanucleus.identifierFactory = 
datanucleus2
                 Key: HIVE-11023
                 URL: https://issues.apache.org/jira/browse/HIVE-11023
             Project: Hive
          Issue Type: Bug
          Components: Metastore
    Affects Versions: 1.3.0, 1.2.1, 2.0.0
            Reporter: Sushanth Sowmyan
            Assignee: Sushanth Sowmyan


We hit an interesting bug in a case where datanucleus.identifierFactory = 
datanucleus2 .

The problem is that directSql handgenerates SQL strings assuming "datanucleus1" 
naming scheme. If a user has their metastore JDO managed by 
datanucleus.identifierFactory = datanucleus2 , the SQL strings we generate are 
incorrect.

One simple example of what this results in is the following: whenever DN 
persists a field which is held as a List<T>, it winds up storing each T as a 
separate line in the appropriate mapping table, and has a column called 
INTEGER_IDX, which holds the position in the list. Then, upon reading, it 
automatically reads all relevant lines with an ORDER BY INTEGER_IDX, which 
results in the list retaining its order. In DN2 naming scheme, the column is 
called IDX, instead of INTEGER_IDX. If the user has run appropriate metatool 
upgrade scripts, it is highly likely that they have both columns, INTEGER_IDX 
and IDX.

Whenever they use JDO, such as with all writes, it will then use the IDX field, 
and when they do any sort of optimized reads, such as through directSQL, it 
will ORDER BY INTEGER_IDX.

An immediate danger is seen when we consider that the schema of a table is 
stored as a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX 
will contain 0,0,0,0,... and thus, any attempt to describe the table or fetch 
schema for the table can come up mixed up in the table's native hashing order, 
rather than sorted by the index.

This can then result in schema ordering being different from the actual table. 
For eg:, if a user has a (a:int,b:string,c:string), a describe on this may 
return (c:string, a:int, b: string), and thus, queries which are inserting 
after selecting from another table can have ClassCastExceptions when trying to 
insert data in the wong order - this is how we discovered this bug. This 
problem, however, can be far worse, if there are no type problems - it is 
possible, for eg., that if a,b&c were all strings, that that insert query would 
succeed but mix up the order, which then results in user table data being mixed 
up. This has the potential to be very bad.

We should write a tool to help convert metastores that use "datanucleus2" to 
"datanucleus1"(more difficult, needs more one-time testing) or change directSql 
to support both(easier to code, but increases test-coverage matrix 
significantly and we should really then be testing against both schemes). But 
in the short term, we should disable directSql if we see that the 
identifierfactory is "datanucleus2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (HIVE-11023) Disable directSQL if datanucleus.identifierFactory = datanucleus2

Reply via email to