[jira] [Created] (HIVE-13801) LLAP UI does not seem to accept credentials

2016-05-19 Thread Siddharth Seth (JIRA)
Siddharth Seth created HIVE-13801:
-

 Summary: LLAP UI does not seem to accept credentials 
 Key: HIVE-13801
 URL: https://issues.apache.org/jira/browse/HIVE-13801
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Siddharth Seth
Priority: Critical


The LLAP UI does not seem to accept credentials, effectively making it unusable on a secure cluster.

This could well be a misconfiguration of the cluster - but I tried using the 
same credentials against the YARN Timeline Server - and that worked fine.

Steps to reproduce: obtain credentials via kinit, start Firefox and configure it to 
use SPNEGO, then try accessing the UI.

cc [~gopalv]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-13800) Disable auth enabled by default on LLAP UI for secure clusters

2016-05-19 Thread Siddharth Seth (JIRA)
Siddharth Seth created HIVE-13800:
-

 Summary: Disable auth enabled by default on LLAP UI for secure 
clusters
 Key: HIVE-13800
 URL: https://issues.apache.org/jira/browse/HIVE-13800
 Project: Hive
  Issue Type: Task
Reporter: Siddharth Seth
Assignee: Siddharth Seth


There's no sensitive information that I'm aware of (the logs would be the most 
sensitive).
Similar to the HS2 UI, the LLAP UI can be left unprotected by default even on 
secure clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-13799) Optimize TableScanRule::checkBucketedTable

2016-05-19 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created HIVE-13799:
---

 Summary: Optimize TableScanRule::checkBucketedTable
 Key: HIVE-13799
 URL: https://issues.apache.org/jira/browse/HIVE-13799
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 46956: HIVE-13444 LLAP: add HMAC signatures to LLAP; verify them on LLAP side

2016-05-19 Thread Siddharth Seth

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46956/#review134076
---



I think it will be useful to add some tests around
1) signing / validation
2) the config parameter (assuming it stays), and that it behaves the way it's 
intended, to make sure tokens are created with the correct parameters. There 
are a lot of nested negation (! of !) checks happening.
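
For illustration, a signing/validation round-trip test could be as small as the 
following sketch (plain JDK javax.crypto; the class, key, and message names are 
hypothetical placeholders, not the patch's actual LlapSigner API):

{code}
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Minimal HMAC sign/verify round trip over serialized message bytes. */
public class HmacRoundTripSketch {
  private static final String ALGO = "HmacSHA256";

  static byte[] sign(byte[] key, byte[] message) throws Exception {
    Mac mac = Mac.getInstance(ALGO);
    mac.init(new SecretKeySpec(key, ALGO));
    return mac.doFinal(message);
  }

  static boolean verify(byte[] key, byte[] message, byte[] signature) throws Exception {
    // MessageDigest.isEqual is a constant-time comparison.
    return MessageDigest.isEqual(sign(key, message), signature);
  }

  public static void main(String[] args) throws Exception {
    byte[] key = "secret-key-material".getBytes(StandardCharsets.UTF_8);
    byte[] msg = "serialized-work-spec".getBytes(StandardCharsets.UTF_8);
    byte[] sig = sign(key, msg);
    System.out.println("valid: " + verify(key, msg, sig));     // true
    msg[0] ^= 1;                                               // tamper with one byte
    System.out.println("tampered: " + verify(key, msg, sig));  // false
  }
}
{code}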


common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (lines 2698 - 2699)


Is this primarily for config? Maybe rename it to have a positive connotation?



common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (line 2699)


Rename "user" to "llapuser"/"serviceowner" - something that implies this is 
only for the user owning the service.
Maybe call the other two settings "always", "never" - instead of "true", 
"false"



common/src/java/org/apache/hadoop/hive/conf/HiveConf.java (line 2700)


What should the default value here be?
"false" seems to imply sign always. If the client config is set up to 
obtain tokens remotely - instead of directly from ZK on the client side in HS2 
- would Tez end up obtaining tokens which require signing as well?



llap-common/src/java/org/apache/hadoop/hive/llap/security/LlapSigner.java (line 
29)


I'm not sure this will actually be usable, given that what is being signed 
is a protobuf-generated class.



llap-common/src/java/org/apache/hadoop/hive/llap/security/LlapTokenProvider.java
 (line 1)


Not used anywhere. Re-introduce it in the patch where it's required?



llap-common/src/java/org/apache/hadoop/hive/llap/security/SecretManager.java 
(line 126)


Can a second login be avoided? I'm guessing this is because the ZK 
principal may be different from the llap principal.
What was the reason for them to be different again? (Especially w.r.t. the 
SecretManager.) Not sure if the fallback to using the llap principal and keytab 
will work if they have to be different.



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/ContainerRunnerImpl.java
 (line 168)


Move this to after checking if vertexBinary is set? Potentially error out 
if both are set.

IIRC, vertexBinary will be set by external clients, and vertex will be set 
by Tez?



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/ContainerRunnerImpl.java
 (line 170)


Maybe move all of these checks into the RPC layer itself, i.e. 
LlapServiceServerImpl - as early as possible.



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/ContainerRunnerImpl.java
 (line 262)


Why is this required ? The signature will only exist if vertexBinary is 
present ?



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/ContainerRunnerImpl.java
 (line 267)


A follow-up jira may be to limit the age of keys,
i.e. if a keyId is older than a certain amount of time, fail the request. 
I'm not sure how ZKSecretManager rotates these keys, or when they are 
invalidated.

A user can potentially use an old (presumably compromised) key to generate 
requests, which will be valid if keys are not rotated/aged.
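
A sketch of the kind of age guard such a follow-up could add (the class name 
and the 24-hour limit are illustrative assumptions, not from the patch):

{code}
import java.util.concurrent.TimeUnit;

/** Illustrative guard: reject requests signed with keys past a maximum age. */
final class KeyAgeCheck {
  private static final long MAX_KEY_AGE_MS = TimeUnit.HOURS.toMillis(24);

  static void validateKeyAge(long keyCreationTimeMs) {
    long ageMs = System.currentTimeMillis() - keyCreationTimeMs;
    if (ageMs > MAX_KEY_AGE_MS) {
      throw new SecurityException("Signing key is " + ageMs
          + " ms old, exceeding the " + MAX_KEY_AGE_MS + " ms limit; rejecting request");
    }
  }
}
{code}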



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapProtocolServerImpl.java
 (line 66)


The meaning is a little unclear when considered along with the negative 
connotation of the config parameter. I don't actually know what a TRUE value 
here means, even more so when considered alongside the parameter called 
"isNoSigning".



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapProtocolServerImpl.java
 (line 144)


"user"



llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapProtocolServerImpl.java
 (line 284)


All of this logic should be invoked even when obtaining tokens from ZKSM 
directly.

Whether Tez is being used or an external client, as long as HS2 is 
obtaining a token, it can do it directly from ZK, so this code path is not 
likely to be exercised a lot.
Assuming that invocation (when it happens, and it likely needs another jira) 
will call into LlapTokenLocalClient.createToken directly, and will send in 
isSigningRequired based on all of the same configs.

Would be better to move the 

Re: Review Request 46754: HIVE-13391 add an option to LLAP to use keytab to authenticate to read data

2016-05-19 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46754/
---

(Updated May 20, 2016, 1:49 a.m.)


Review request for hive and Siddharth Seth.


Repository: hive-git


Description
---

see JIRA


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/common/UgiFactory.java PRE-CREATION 
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9cc8fbe 
  
llap-client/src/java/org/apache/hadoop/hive/llap/registry/impl/LlapZookeeperRegistryImpl.java
 cffa493 
  
llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/ContainerRunnerImpl.java
 2524dc2 
  llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapDaemon.java 
de817e3 
  
llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/TaskRunnerCallable.java
 74359fa 
  llap-server/src/java/org/apache/hadoop/hive/llap/io/api/impl/LlapIoImpl.java 
fea3dc7 
  
llap-server/src/java/org/apache/hadoop/hive/llap/io/decode/ColumnVectorProducer.java
 b3b571d 
  
llap-server/src/java/org/apache/hadoop/hive/llap/security/LlapUgiFactoryFactory.java
 PRE-CREATION 
  
llap-server/src/test/org/apache/hadoop/hive/llap/daemon/impl/TaskExecutorTestHelpers.java
 279baf1 
  
llap-server/src/test/org/apache/hadoop/hive/llap/daemon/impl/comparator/TestFirstInFirstOutComparator.java
 a250882 

Diff: https://reviews.apache.org/r/46754/diff/


Testing
---


Thanks,

Sergey Shelukhin



[jira] [Created] (HIVE-13798) Fix the unit test failure org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload

2016-05-19 Thread Aihua Xu (JIRA)
Aihua Xu created HIVE-13798:
---

 Summary: Fix the unit test failure 
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ivyDownload
 Key: HIVE-13798
 URL: https://issues.apache.org/jira/browse/HIVE-13798
 Project: Hive
  Issue Type: Sub-task
Reporter: Aihua Xu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] hive pull request: HIVE-9660 Add length to ORC indexes so that the...

2016-05-19 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/hive/pull/77

HIVE-9660 Add length to ORC indexes so that the reader knows how much to 
read.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/hive hive-9660

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/77.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #77


commit 014e9aaec1cb8f7257b997e953e6cc30d34a71cf
Author: Owen O'Malley 
Date:   2016-03-26T02:39:12Z

HIVE-11417. Move the ReaderImpl and RowReaderImpl to the ORC module,
by making shims for the row by row reader.

commit afda4610a8c1ed9fe3adc86c6fc1b08b5fdae7aa
Author: Owen O'Malley 
Date:   2016-05-13T21:44:34Z

HIVE-9660 Add length to ORC indexes so that the reader knows how much
to read.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (HIVE-13797) Provide a connection string example in beeline

2016-05-19 Thread Vihang Karajgaonkar (JIRA)
Vihang Karajgaonkar created HIVE-13797:
--

 Summary: Provide a connection string example in beeline
 Key: HIVE-13797
 URL: https://issues.apache.org/jira/browse/HIVE-13797
 Project: Hive
  Issue Type: Improvement
  Components: Beeline
Affects Versions: 2.0.0
Reporter: Vihang Karajgaonkar
Assignee: Vihang Karajgaonkar
Priority: Minor


It would save a bunch of googling if we could provide some examples of 
connection strings directly in the beeline help message.

Eg:
{code}
./bin/beeline --help
Usage: java org.apache.hive.cli.beeline.BeeLine
   -u <database url>   the JDBC URL to connect to
   -r  reconnect to last saved connect url (in conjunction with !save)
   -n <username>   the username to connect as
   -p <password>   the password to connect as
   -d <driver class>   the driver class to use
   -i <init file>  script file for initialization
   -e <query>  query that should be executed
   -f <exec file>  script file that should be executed
   -w (or) --password-file   the password file to read password 
from
   --hiveconf property=value   Use value for given property
   --hivevar name=value    hive variable name and value
   These are Hive-specific settings in which variables
   can be set at session level and referenced in Hive
   commands or queries.
   --color=[true/false]control whether color is used for display
   --showHeader=[true/false]   show column names in query results
   --headerInterval=ROWS;  the interval between which headers are displayed
   --fastConnect=[true/false]  skip building table/column list for 
tab-completion
   --autoCommit=[true/false]   enable/disable automatic transaction commit
   --verbose=[true/false]  show verbose error messages and debug info
   --showWarnings=[true/false] display connection warnings
   --showNestedErrs=[true/false]   display nested errors
   --numberFormat=[pattern]format numbers using DecimalFormat pattern
   --force=[true/false]continue running script even after errors
   --maxWidth=MAXWIDTH the maximum width of the terminal
   --maxColumnWidth=MAXCOLWIDTHthe maximum width to use when displaying 
columns
   --silent=[true/false]   be more silent
   --autosave=[true/false] automatically save preferences
   --outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv]  format mode for 
result display
   Note that csv and tsv are deprecated - use csv2, tsv2 instead
   --incremental=[true/false]  Defaults to false. When set to false, the 
entire result set
   is fetched and buffered before being 
displayed, yielding optimal
   display column sizing. When set to true, 
result rows are displayed
   immediately as they are fetched, yielding 
lower latency and
   memory usage at the price of extra display 
column padding.
   Setting --incremental=true is recommended if 
you encounter an OutOfMemory
   on the client side (due to the fetched 
result set size being large).
   --truncateTable=[true/false]truncate table column when it exceeds length
   --delimiterForDSV=DELIMITER specify the delimiter for 
delimiter-separated values output format (default: |)
   --isolation=LEVEL   set the transaction isolation level
   --nullemptystring=[true/false]  set to true to get historic behavior of 
printing null as empty string
   --addlocaldriverjar=DRIVERJARNAME Add driver jar file in the beeline client 
side
   --addlocaldrivername=DRIVERNAME Add driver name that needs to be supported on the beeline client side
   --showConnectedUrl=[true/false] Prompt HiveServer2's URI to which this beeline connected.
   Only works for HiveServer2 cluster mode.
   --help  display this message
 
   Example:
1. beeline -u jdbc:hive2://localhost:1 username password
2. beeline -n username -p password -u jdbc:hive2://hs2.local:10012
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-13796) fix some tests on branch-1

2016-05-19 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-13796:
---

 Summary: fix some tests on branch-1
 Key: HIVE-13796
 URL: https://issues.apache.org/jira/browse/HIVE-13796
 Project: Hive
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 47546: HIVE-13448 LLAP: check ZK acls for ZKSM and fail if they are too permissive

2016-05-19 Thread Siddharth Seth

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47546/#review134040
---


Ship it!




Ship It!

- Siddharth Seth


On May 18, 2016, 7:07 p.m., Sergey Shelukhin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47546/
> ---
> 
> (Updated May 18, 2016, 7:07 p.m.)
> 
> 
> Review request for hive, Prasanth_J and Siddharth Seth.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> 
> Diffs
> -
> 
>   
> llap-client/src/java/org/apache/hadoop/hive/llap/registry/impl/LlapZookeeperRegistryImpl.java
>  cffa493 
>   
> llap-common/src/java/org/apache/hadoop/hive/llap/security/SecretManager.java 
> 465b204 
> 
> Diff: https://reviews.apache.org/r/47546/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Sergey Shelukhin
> 
>



[jira] [Created] (HIVE-13795) TxnHandler should know if operation is using dynamic partitions

2016-05-19 Thread Eugene Koifman (JIRA)
Eugene Koifman created HIVE-13795:
-

 Summary: TxnHandler should know if operation is using dynamic 
partitions
 Key: HIVE-13795
 URL: https://issues.apache.org/jira/browse/HIVE-13795
 Project: Hive
  Issue Type: Bug
  Components: Transactions
Affects Versions: 1.3.0, 2.1.0
Reporter: Eugene Koifman


See TxnHandler.checkLock() and the comments around 
"isPartOfDynamicPartitionInsert". If TxnHandler knew whether it is being called 
as part of an op running with dynamic partitions, it could be more efficient. 
In that case we don't have to write to TXN_COMPONENTS at all during lock 
acquisition. Conversely, if not running with DynPart, we can kill the current 
txn on lock grant rather than wait until commit time.

If addDynamicPartitions() also knew about DynPart, it could eliminate the 
"Delete from Txn_components..." statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Becoming a contributor

2016-05-19 Thread Thejas Nair
I had modified the jira settings a while back to allow any jira user
to assign bugs to themselves. So having a jira account should be
sufficient to assign bugs to yourself.
However, recently INFRA had to put jira in lockdown mode due to spam,
and when it goes into that mode, only people in the contributor
category can perform most actions.
TLDR - being in the jira contributor list (which is what Alan would have
added you to) is useful in certain cases, but not necessary for the most
part.


On Thu, May 19, 2016 at 10:23 AM, Vihang Karajgaonkar
 wrote:
> Thanks Alan.
>
> On Wed, May 18, 2016 at 5:07 PM, Alan Gates  wrote:
>
>> You’ve been added, so you should now be able to assign JIRAs to yourself.
>>
>> Alan.
>>
>> > On May 18, 2016, at 13:38, Vihang Karajgaonkar 
>> wrote:
>> >
>> > Thanks Alan! My JIRA id is vihangk1
>> >
>> > -Vihang
>> >
>> >> On May 18, 2016, at 1:09 PM, Alan Gates  wrote:
>> >>
>> >> Nope, you’re good.  You can ignore the stuff about review board as we
>> generally only use that now for large and complex patches.  If you have a
>> JIRA in mind you’d like to work on that no one else has you can assign it
>> to yourself* and get started.  If someone else is working on it you should
>> coordinate via comments on that JIRA to see if there’s a way you can help.
>> >>
>> >> Welcome to the team.
>> >>
>> >> Alan.
>> >>
>> >> *You probably can’t assign JIRAs to yourself yet, but if you reply to
>> this email with your JIRA id I’ll make it so you can.
>> >>
>> >> Alan.
>> >>
>> >>> On May 18, 2016, at 11:42, Vihang Karajgaonkar 
>> wrote:
>> >>>
>> >>> Hello Everyone,
>> >>>
>> >>> I would like to start working on issues reported on Hive in JIRA. I
>> followed the steps mentioned in
>> https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-BecomingaContributor
>> <
>> https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-BecomingaContributor>.
>> Is there anything else that I need to do to get added as a contributor?
>> >>>
>> >>> Thanks,
>> >>> Vihang
>> >>
>> >
>>
>>


[jira] [Created] (HIVE-13794) HIVE_RPC_QUERY_PLAN should always be set when generating LLAP splits

2016-05-19 Thread Jason Dere (JIRA)
Jason Dere created HIVE-13794:
-

 Summary: HIVE_RPC_QUERY_PLAN should always be set when generating 
LLAP splits
 Key: HIVE-13794
 URL: https://issues.apache.org/jira/browse/HIVE-13794
 Project: Hive
  Issue Type: Sub-task
  Components: llap
Reporter: Jason Dere
Assignee: Jason Dere


This option was being added in the test, but it really should be set any time 
we are generating the LLAP input splits.
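
A minimal sketch of the intended setting (HIVE_RPC_QUERY_PLAN is the ConfVar 
named in the summary; the actual call site in the split-generation code is 
assumed):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;

public class LlapSplitConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Always send the query plan via RPC when generating LLAP input splits.
    HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVE_RPC_QUERY_PLAN, true);
    System.out.println(HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_RPC_QUERY_PLAN));
  }
}
{code}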



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-13793) Columns in GROUP BY gaining an alias?

2016-05-19 Thread Furcy Pin (JIRA)
Furcy Pin created HIVE-13793:


 Summary: Columns in GROUP BY gaining an alias?
 Key: HIVE-13793
 URL: https://issues.apache.org/jira/browse/HIVE-13793
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.0.0, 1.1.1
Reporter: Furcy Pin
Priority: Minor


I've found that Hive sometimes automatically gives an alias to
the columns that are in a GROUP BY.
I'm not sure if this is a bug or a feature, but it seems to be inconsistent
with the way it usually resolves column names:

How to reproduce:

This query is ok :
{code}
SELECT 
s.foo
FROM (SELECT NAMED_STRUCT("foo", 2) as s) T 
;
+--+--+
| foo  |
+--+--+
| 2|
+--+--+
{code}

This query fails (as expected) because the column 'foo' does not exist (but 
's.foo' does).
{code}
SELECT 
foo
FROM (SELECT NAMED_STRUCT("foo", 2) as s) T 
;
Error: Error while compiling statement: FAILED: SemanticException [Error 
10004]: Line 2:0 Invalid table alias or column reference 'foo': (possible 
column names are: s) (state=42000,code=10004)
{code}

But adding a GROUP BY seems to make Hive rename 's.foo' into 'foo',
and the following query works (which seems less normal to me).
{code}
SELECT 
foo
FROM (SELECT NAMED_STRUCT("foo", 2) as s) T 
GROUP BY s.foo
;
+--+--+
| foo  |
+--+--+
| 2|
+--+--+
{code}

Is this a bug or a feature?

In this example it is mostly harmless, but I thought perhaps it might help 
find a flaw in the query processor.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-13792) Show create table should not show stats info in the table properties

2016-05-19 Thread Aihua Xu (JIRA)
Aihua Xu created HIVE-13792:
---

 Summary: Show create table should not show stats info in the table 
properties
 Key: HIVE-13792
 URL: https://issues.apache.org/jira/browse/HIVE-13792
 Project: Hive
  Issue Type: Sub-task
  Components: Query Planning
Affects Versions: 2.1.0
Reporter: Aihua Xu
Assignee: Aihua Xu


>From the test 
>org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_queries 
>failure, we are printing table stats in show create table parameters. This 
>info should be skipped since it would be incorrect when you just copy them to 
>create a table. And also the format for TBLPROPERTIES is not well formed.

{noformat}
CREATE EXTERNAL TABLE `hbase_table_1_like`(
  `key` int COMMENT 'It is a column key',
  `value` string COMMENT 'It is the column string value')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'='cf:string',
  'serialization.format'='1')
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'hbase.table.name'='hbase_table_0',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='0',
{noformat}
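
For illustration, a filter of the intended kind could look like the sketch 
below; the stats keys are the ones visible in the output above, and the helper 
is hypothetical (the actual fix may hook into Hive's DDL output code 
differently):

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Drop stats entries before rendering TBLPROPERTIES. */
final class TblPropsFilter {
  private static final Set<String> STATS_KEYS = new HashSet<>(Arrays.asList(
      "COLUMN_STATS_ACCURATE", "numFiles", "numRows", "rawDataSize", "totalSize"));

  static Map<String, String> withoutStats(Map<String, String> params) {
    Map<String, String> shown = new TreeMap<>(params); // sorted for stable output
    shown.keySet().removeAll(STATS_KEYS);
    return shown;
  }
}
{code}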



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Ashutosh Chauhan


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?
> 
> Jesús Camacho Rodríguez wrote:
> I was thinking further about this, and indeed I made many assumptions 
> thinking that SPDO will not add the sort columns at the end of the additional 
> RS keys, so then we need to make sure the ordering is correct... but this is 
> not true, as it will add them.
> 
> I think this optimization can be done general, as ordering before 
> ordering again is a noop. I have changed the code accordingly, I would like 
> your opinion. Am I missing something? Let's see what QA returns...
> 
> Ashutosh Chauhan wrote:
> yeah.. thats what I think.. this is not specific to SPDO at logical 
> layer. 
> Thing I am not 100% sure about is in FileSink the way SPDO is implemented 
> I *think* it assumes all rows for a particular key come in sorted and in one 
> batch. Now, when we merge and sort by two keys and if there is a case that FS 
> was expecting all rows sorted on second key only, now will get them sorted on 
> 2 keys and it may have to close ORC writer and then reopen again, as oppose 
> to write it once and never opening it again.
> 
> Jesús Camacho Rodríguez wrote:
> I am trying to understand this part... The way the optimization works, we 
> will keep the second RS with its keys, we just remove the first RS... Thus, 
> this should not be a problem? Or maybe I am not understanding it correctly
> 
> Ashutosh Chauhan wrote:
> I think you are correct. Since relevant FS is active in last reducer 
> containing RS which we is not changing, we are not changing any assumptions 
> for FS.
> 
> Ashutosh Chauhan wrote:
> On a second thought I think there is more to it. e.g. we are transforming 
> RS (a) followed by RS (a,b) to RS(a,b) Than this is not valid transformation 
> in general, only if second RS is introduced by SPDO. e.g, if these two RS are 
> part of two GBYs (or a join & GBy), than its invalid because partitioning on 
> (a,b) != partitioning on (a)
> 
> Jesús Camacho Rodríguez wrote:
> But aggressiveDedup is really restrictive: only pRS-SEL*-cRS, thus we 
> would bail out in those specific cases.

Makes sense. Thanks for the explanation!


- Ashutosh


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/reducesink_dedup.q.out 
> b89df52f965385b85894757896eee487b29c52ae 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 723e8192f2735059005fc3c5c96732a2c4be49c1 
> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Jesús Camacho Rodríguez


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?
> 
> Jesús Camacho Rodríguez wrote:
> I was thinking further about this, and indeed I made many assumptions 
> thinking that SPDO will not add the sort columns at the end of the additional 
> RS keys, so then we need to make sure the ordering is correct... but this is 
> not true, as it will add them.
> 
> I think this optimization can be done general, as ordering before 
> ordering again is a noop. I have changed the code accordingly, I would like 
> your opinion. Am I missing something? Let's see what QA returns...
> 
> Ashutosh Chauhan wrote:
> yeah.. thats what I think.. this is not specific to SPDO at logical 
> layer. 
> Thing I am not 100% sure about is in FileSink the way SPDO is implemented 
> I *think* it assumes all rows for a particular key come in sorted and in one 
> batch. Now, when we merge and sort by two keys and if there is a case that FS 
> was expecting all rows sorted on second key only, now will get them sorted on 
> 2 keys and it may have to close ORC writer and then reopen again, as oppose 
> to write it once and never opening it again.
> 
> Jesús Camacho Rodríguez wrote:
> I am trying to understand this part... The way the optimization works, we 
> will keep the second RS with its keys, we just remove the first RS... Thus, 
> this should not be a problem? Or maybe I am not understanding it correctly
> 
> Ashutosh Chauhan wrote:
> I think you are correct. Since relevant FS is active in last reducer 
> containing RS which we is not changing, we are not changing any assumptions 
> for FS.
> 
> Ashutosh Chauhan wrote:
> On a second thought I think there is more to it. e.g. we are transforming 
> RS (a) followed by RS (a,b) to RS(a,b) Than this is not valid transformation 
> in general, only if second RS is introduced by SPDO. e.g, if these two RS are 
> part of two GBYs (or a join & GBy), than its invalid because partitioning on 
> (a,b) != partitioning on (a)

But aggressiveDedup is really restrictive: only pRS-SEL*-cRS, thus we would 
bail out in those specific cases.


- Jesús


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/reducesink_dedup.q.out 
> b89df52f965385b85894757896eee487b29c52ae 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 723e8192f2735059005fc3c5c96732a2c4be49c1 
> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



Re: Becoming a contributor

2016-05-19 Thread Vihang Karajgaonkar
Thanks Alan.

On Wed, May 18, 2016 at 5:07 PM, Alan Gates  wrote:

> You’ve been added, so you should now be able to assign JIRAs to yourself.
>
> Alan.
>
> > On May 18, 2016, at 13:38, Vihang Karajgaonkar 
> wrote:
> >
> > Thanks Alan! My JIRA id is vihangk1
> >
> > -Vihang
> >
> >> On May 18, 2016, at 1:09 PM, Alan Gates  wrote:
> >>
> >> Nope, you’re good.  You can ignore the stuff about review board as we
> generally only use that now for large and complex patches.  If you have a
> JIRA in mind you’d like to work on that no one else has you can assign it
> to yourself* and get started.  If someone else is working on it you should
> coordinate via comments on that JIRA to see if there’s a way you can help.
> >>
> >> Welcome to the team.
> >>
> >> Alan.
> >>
> >> *You probably can’t assign JIRAs to yourself yet, but if you reply to
> this email with your JIRA id I’ll make it so you can.
> >>
> >> Alan.
> >>
> >>> On May 18, 2016, at 11:42, Vihang Karajgaonkar 
> wrote:
> >>>
> >>> Hello Everyone,
> >>>
> >>> I would like to start working on issues reported on Hive in JIRA. I
> followed the steps mentioned in
> https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-BecomingaContributor
> <
> https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-BecomingaContributor>.
> Is there anything else that I need to do to get added as a contributor?
> >>>
> >>> Thanks,
> >>> Vihang
> >>
> >
>
>


Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Ashutosh Chauhan


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?
> 
> Jesús Camacho Rodríguez wrote:
> I was thinking further about this, and indeed I made many assumptions 
> thinking that SPDO will not add the sort columns at the end of the additional 
> RS keys, so then we need to make sure the ordering is correct... but this is 
> not true, as it will add them.
> 
> I think this optimization can be done general, as ordering before 
> ordering again is a noop. I have changed the code accordingly, I would like 
> your opinion. Am I missing something? Let's see what QA returns...
> 
> Ashutosh Chauhan wrote:
> yeah.. thats what I think.. this is not specific to SPDO at logical 
> layer. 
> Thing I am not 100% sure about is in FileSink the way SPDO is implemented 
> I *think* it assumes all rows for a particular key come in sorted and in one 
> batch. Now, when we merge and sort by two keys and if there is a case that FS 
> was expecting all rows sorted on second key only, now will get them sorted on 
> 2 keys and it may have to close ORC writer and then reopen again, as oppose 
> to write it once and never opening it again.
> 
> Jesús Camacho Rodríguez wrote:
> I am trying to understand this part... The way the optimization works, we 
> will keep the second RS with its keys, we just remove the first RS... Thus, 
> this should not be a problem? Or maybe I am not understanding it correctly
> 
> Ashutosh Chauhan wrote:
> I think you are correct. Since relevant FS is active in last reducer 
> containing RS which we is not changing, we are not changing any assumptions 
> for FS.

On second thought I think there is more to it. E.g. we are transforming RS 
(a) followed by RS (a,b) to RS (a,b). This is not a valid transformation in 
general, only when the second RS is introduced by SPDO. E.g. if these two RS are part 
of two GBYs (or a join & GBY), then it's invalid because partitioning on (a,b) 
!= partitioning on (a)


- Ashutosh


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/reducesink_dedup.q.out 
> b89df52f965385b85894757896eee487b29c52ae 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 723e8192f2735059005fc3c5c96732a2c4be49c1 
> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Ashutosh Chauhan


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?
> 
> Jesús Camacho Rodríguez wrote:
> I was thinking further about this, and indeed I made many assumptions 
> thinking that SPDO will not add the sort columns at the end of the additional 
> RS keys, so then we need to make sure the ordering is correct... but this is 
> not true, as it will add them.
> 
> I think this optimization can be done general, as ordering before 
> ordering again is a noop. I have changed the code accordingly, I would like 
> your opinion. Am I missing something? Let's see what QA returns...
> 
> Ashutosh Chauhan wrote:
> yeah.. thats what I think.. this is not specific to SPDO at logical 
> layer. 
> Thing I am not 100% sure about is in FileSink the way SPDO is implemented 
> I *think* it assumes all rows for a particular key come in sorted and in one 
> batch. Now, when we merge and sort by two keys and if there is a case that FS 
> was expecting all rows sorted on second key only, now will get them sorted on 
> 2 keys and it may have to close ORC writer and then reopen again, as oppose 
> to write it once and never opening it again.
> 
> Jesús Camacho Rodríguez wrote:
> I am trying to understand this part... The way the optimization works, we 
> will keep the second RS with its keys, we just remove the first RS... Thus, 
> this should not be a problem? Or maybe I am not understanding it correctly

I think you are correct. Since the relevant FS is active in the last reducer 
containing the RS which we are not changing, we are not changing any assumptions for FS.


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 584
> > 
> >
> > Currently, _buceket_number_ comes in as constant when it really its 
> > not. Will that break assumption about ignoring constant?
> 
> Jesús Camacho Rodríguez wrote:
> No, precisely that is the reason why we ignore constants. Also, because 
> ordering by a constant will not change the order of the records.
> 
> Ashutosh Chauhan wrote:
> Problem is although type of _bucket_number_ is ExprNodeConstantDesc, its 
> *not* a constant. Its created as a placeholder and value of this column is 
> calculated on a per-row basis in ReduceSinkOperator::computeBucketNumber() 
> which means at run time its not a constant. Correct fix for this is create 
> _bucket_number_ as ExprNodeColumnDesc in SPDO.
> 
> Jesús Camacho Rodríguez wrote:
> OK, I understand... This has changed in the new patch, so I do not think 
> this is an issue. Nevertheless, I think the correct way of implementing it is 
> as you just explained (feels kind of a hack to use ExprNodeConstantDesc for 
> that).

In general, we need to handle constants in RS dedup better. HIVE-12866 is open 
for that. This will be a prerequisite for that.


- Ashutosh


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/reducesink_dedup.q.out 
> b89df52f965385b85894757896eee487b29c52ae 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 

Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Jesús Camacho Rodríguez


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?
> 
> Jesús Camacho Rodríguez wrote:
> I was thinking further about this, and indeed I made many assumptions 
> thinking that SPDO will not add the sort columns at the end of the additional 
> RS keys, so then we need to make sure the ordering is correct... but this is 
> not true, as it will add them.
> 
> I think this optimization can be done general, as ordering before 
> ordering again is a noop. I have changed the code accordingly, I would like 
> your opinion. Am I missing something? Let's see what QA returns...
> 
> Ashutosh Chauhan wrote:
> yeah.. thats what I think.. this is not specific to SPDO at logical 
> layer. 
> Thing I am not 100% sure about is in FileSink the way SPDO is implemented 
> I *think* it assumes all rows for a particular key come in sorted and in one 
> batch. Now, when we merge and sort by two keys and if there is a case that FS 
> was expecting all rows sorted on second key only, now will get them sorted on 
> 2 keys and it may have to close ORC writer and then reopen again, as oppose 
> to write it once and never opening it again.

I am trying to understand this part... The way the optimization works, we will 
keep the second RS with its keys, we just remove the first RS... Thus, this 
should not be a problem? Or maybe I am not understanding it correctly


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 584
> > 
> >
> > Currently, _buceket_number_ comes in as constant when it really its 
> > not. Will that break assumption about ignoring constant?
> 
> Jesús Camacho Rodríguez wrote:
> No, precisely that is the reason why we ignore constants. Also, because 
> ordering by a constant will not change the order of the records.
> 
> Ashutosh Chauhan wrote:
> Problem is although type of _bucket_number_ is ExprNodeConstantDesc, its 
> *not* a constant. Its created as a placeholder and value of this column is 
> calculated on a per-row basis in ReduceSinkOperator::computeBucketNumber() 
> which means at run time its not a constant. Correct fix for this is create 
> _bucket_number_ as ExprNodeColumnDesc in SPDO.

OK, I understand... This has changed in the new patch, so I do not think this 
is an issue. Nevertheless, I think the correct way of implementing it is as you 
just explained (feels kind of a hack to use ExprNodeConstantDesc for that).


- Jesús


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/reducesink_dedup.q.out 
> b89df52f965385b85894757896eee487b29c52ae 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 723e8192f2735059005fc3c5c96732a2c4be49c1 
> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Ashutosh Chauhan


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?
> 
> Jesús Camacho Rodríguez wrote:
> I was thinking further about this, and indeed I made many assumptions 
> thinking that SPDO will not add the sort columns at the end of the additional 
> RS keys, so then we need to make sure the ordering is correct... but this is 
> not true, as it will add them.
> 
> I think this optimization can be done general, as ordering before 
> ordering again is a noop. I have changed the code accordingly, I would like 
> your opinion. Am I missing something? Let's see what QA returns...

yeah.. that's what I think.. this is not specific to SPDO at the logical layer. 
The thing I am not 100% sure about is, in FileSink, the way SPDO is implemented I 
*think* it assumes all rows for a particular key come in sorted and in one 
batch. Now, when we merge and sort by two keys, and if there is a case where FS 
was expecting all rows sorted on the second key only, it will now get them sorted on 2 
keys and it may have to close the ORC writer and then reopen it again, as opposed to 
writing it once and never opening it again.


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 584
> > 
> >
> > Currently, _buceket_number_ comes in as constant when it really its 
> > not. Will that break assumption about ignoring constant?
> 
> Jesús Camacho Rodríguez wrote:
> No, precisely that is the reason why we ignore constants. Also, because 
> ordering by a constant will not change the order of the records.

The problem is that although the type of _bucket_number_ is ExprNodeConstantDesc, it's *not* 
a constant. It's created as a placeholder and the value of this column is calculated 
on a per-row basis in ReduceSinkOperator::computeBucketNumber(), which means at 
run time it's not a constant. The correct fix for this is to create _bucket_number_ as 
an ExprNodeColumnDesc in SPDO.


- Ashutosh


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 19, 2016, 10:49 a.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 19, 2016, 10:49 a.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/reducesink_dedup.q.out 
> b89df52f965385b85894757896eee487b29c52ae 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 723e8192f2735059005fc3c5c96732a2c4be49c1 
> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



[jira] [Created] (HIVE-13791) Fix failure Unit Test TestHiveSessionImpl.testLeakOperationHandle

2016-05-19 Thread Nemon Lou (JIRA)
Nemon Lou created HIVE-13791:


 Summary: Fix  failure Unit Test 
TestHiveSessionImpl.testLeakOperationHandle
 Key: HIVE-13791
 URL: https://issues.apache.org/jira/browse/HIVE-13791
 Project: Hive
  Issue Type: Test
  Components: Test
Affects Versions: 2.1.0
Reporter: Nemon Lou
Assignee: Nemon Lou
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Jesús Camacho Rodríguez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/
---

(Updated May 19, 2016, 10:49 a.m.)


Review request for hive and Ashutosh Chauhan.


Bugs: HIVE-13750
https://issues.apache.org/jira/browse/HIVE-13750


Repository: hive-git


Description
---

HIVE-13750


Diffs (updated)
-

  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
 010c89ed978296709b052cc7bc80256a27658e2b 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
 733620b84657a21829248afe72ab16ad9692f37e 
  ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
d03bfe422743d9a5a6b85f9a6198e1e27024f129 
  ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
dec872ab0eef54bd92d5c2bc068e2805cc14e272 
  ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
832580325873dee741ba86239ee571873994a808 
  ql/src/test/results/clientpositive/reducesink_dedup.q.out 
b89df52f965385b85894757896eee487b29c52ae 
  ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
  ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
723e8192f2735059005fc3c5c96732a2c4be49c1 

Diff: https://reviews.apache.org/r/47554/diff/


Testing
---


Thanks,

Jesús Camacho Rodríguez



Re: Review Request 47554: HIVE-13750

2016-05-19 Thread Jesús Camacho Rodríguez


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 519
> > 
> >
> > This is already called, before a call to procesDynP..
> > Do we need it again?

They are different methods:
- _checkValidDynPartSortDedup_ checks whether we have a valid chain of 
operators between pRS and cRS, and
- _checkDynPartSortDedupPossible_ makes all the necessary checks to ensure that 
the transformation can be done.

Nevertheless, looking back at the code, it seems like I should merge both methods. 
I have done that in the new patch and I have restructured the code a bit.


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 538
> > 
> >
> > Can you add comments what are permissible operator sequences here?

Currently the only valid sequence is pRS-SEL*-cRS.


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 584
> > 
> >
> > Currently, _buceket_number_ comes in as constant when it really its 
> > not. Will that break assumption about ignoring constant?

No, precisely that is the reason why we ignore constants. Also, because 
ordering by a constant will not change the order of the records.


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  lines 740-744
> > 
> >
> > Should it be other way round? That we first try normal dedup and then 
> > extended.

Extended is more aggressive; I think we should try being aggressive first.


> On May 19, 2016, 12:06 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java,
> >  line 556
> > 
> >
> > Can you also add comment why this is valid only for RS added by SPDO 
> > and not in general?

I was thinking further about this, and indeed I made many assumptions thinking 
that SPDO will not add the sort columns at the end of the additional RS keys, 
so then we need to make sure the ordering is correct... but this is not true, 
as it will add them.

I think this optimization can be done in general, as ordering before ordering 
again is a noop. I have changed the code accordingly, and I would like your 
opinion. Am I missing something? Let's see what QA returns...


- Jesús


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47554/#review133844
---


On May 18, 2016, 9:42 p.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47554/
> ---
> 
> (Updated May 18, 2016, 9:42 p.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-13750
> https://issues.apache.org/jira/browse/HIVE-13750
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-13750
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
>  010c89ed978296709b052cc7bc80256a27658e2b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
>  733620b84657a21829248afe72ab16ad9692f37e 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 
> d7e404c9946461e20357ed53dd8da468590683c6 
>   ql/src/test/results/clientpositive/dynpart_sort_opt_vectorization.q.out 
> d03bfe422743d9a5a6b85f9a6198e1e27024f129 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization.q.out 
> dec872ab0eef54bd92d5c2bc068e2805cc14e272 
>   ql/src/test/results/clientpositive/dynpart_sort_optimization_acid.q.out 
> 832580325873dee741ba86239ee571873994a808 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_opt_vectorization.q.out 
> a90e3f63b4646cf0ade9785a501ebd1a6b2a3406 
>   ql/src/test/results/clientpositive/tez/dynpart_sort_optimization.q.out 
> 723e8192f2735059005fc3c5c96732a2c4be49c1 
> 
> Diff: https://reviews.apache.org/r/47554/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



Re: using the Hive SQL parser in Spark

2016-05-19 Thread Reynold Xin
I want to give an update since there have been some new developments since
my last email.

We did import Hive's parser into Spark in Feb, but then in April that was
replaced by another ANTLR4 based parser. So the net effect is that this
didn't happen (no release was made with the Hive parser).

Thanks for the support.

On Friday, December 18, 2015, Reynold Xin  wrote:

> (Please use reply-all so I see the replies)
>
> Responses inline.
>
>
> On Fri, Dec 18, 2015 at 1:17 PM, Yin Huai  > wrote:
>
>> Let me add Reynold to the thread.
>>
>> On Fri, Dec 18, 2015 at 12:36 PM, Gopal Vijayaraghavan > > wrote:
>>
>>>
>>> >We have looked into various options, and it looks like the best option
>>> is
>>> >to copy the ANTLR grammar file from Hive into Spark. Because the grammar
>>> >file is tightly coupled with Hive's semantic analysis, we need to
>>> refactor
>>> >some code to use them so it will end up becoming the .g file plus some
>>> >coupled code.
>>>
>>> Is the eventual goal to contribute that fork back into Hive & have Hive
>>> devs maintain a compatible parser for SparkSQL?
>>>
>>> Would that affect Hive's ability to refactor the SQL parser in the future
>>> or is this a one-time only deal?
>>
>>
> I am not sure if it is useful at all to port that back to Hive since it
> has zero user facing benefit, and would require Hive devs to spend a lot of
> time reviewing the changes. Refactoring like this is always risky for an
> established project.
>
>
>>
>>>
>>> >parser. From Hive's perspective this does not provide any immediate
>>> >benefits. From Spark's perspective, we iterate very quickly so having to
>>> >depend on an external component also slow down our development. We also
>>> >have some requirements that simply don't apply in other projects (e.g.
>>> >being able to parse DataFrame expressions).
>>>
>>> From that I assume, this involves some form of cut-paste duplication of
>>> the code into SparkSQL project with that version diverging away from
>>> Hive's.
>>
>>
> That is correct.
>
>
>>
>>>
>>> > Thanks a lot for developing this parser, and we will try our best to
>>> > contribute back as we fix bugs. I will also make sure we have the
>>> proper
>>> > acknowledgment when we do this.
>>>
>>>
>>> Under the Apache license, there's no actual restriction against a hostile
>>> embrace-extend by copying hive's code verbatim as long as the fork
>>> retains
>>> license notices.
>>>
>>> The maintainability concerns are mostly around whether this is intended
>>> as
>>> an ongoing relationship, including any compatibility committments from
>>> hive-dev@.
>>>
>>
> No commitments needed from Hive. You should update/improve the parser as
> you see fit. We do have a pretty comprehensive suite of Hive compatibility
> tests (by using the Hive tests directly) to ensure SQL compatibility with
> Hive. We will continue running those. We will also try our best to
> contribute back bug fixes to the parser.
>
>
>


[jira] [Created] (HIVE-13790) log4j2 syslog appender not taking "LoggerFields" and "KeyValuePair" options

2016-05-19 Thread Alexandre Linte (JIRA)
Alexandre Linte created HIVE-13790:
--

 Summary: log4j2 syslog appender not taking "LoggerFields" and 
"KeyValuePair" options
 Key: HIVE-13790
 URL: https://issues.apache.org/jira/browse/HIVE-13790
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2, Metastore
Affects Versions: 2.0.0
 Environment: Hive 2.0.0, Hadoop 2.7.2, Spark 1.6.1, HBase 1.1.2
Reporter: Alexandre Linte


I'm trying to use the Syslog appender with log4j2 in Hive 2.0.0. The syslog 
appender is configured on my hiveserver2 and my metastore.
With a simple configuration, the logs are written correctly to the logfile with a 
generic pattern layout:
{noformat}
May 19 10:12:16 myhiveserver2.fr Starting HiveServer2
May 19 10:12:18 myhiveserver2.fr Connected to metastore.
May 19 10:12:20 myhiveserver2.fr Service: CLIService is inited.
May 19 10:12:20 myhiveserver2.fr Service: ThriftBinaryCLIService is inited.
{noformat}
I tried to customize this pattern layout by adding the loggerFields parameter 
in my hive-log4j2.properties. In the end, the configuration file is:
{noformat}
status = TRACE
name = HiveLog4j2
packages = org.apache.hadoop.hive.ql.log

property.hive.log.level = INFO
property.hive.root.logger = SYSLOG
property.hive.query.id = hadoop
property.hive.log.dir = /var/log/bigdata
property.hive.log.file = bigdata.log

appenders = console, SYSLOG

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n

appender.SYSLOG.type = Syslog
appender.SYSLOG.name = SYSLOG
appender.SYSLOG.host = 127.0.0.1
appender.SYSLOG.port = 514
appender.SYSLOG.protocol = UDP
appender.SYSLOG.facility = LOCAL1
appender.SYSLOG.layout.type = loggerFields
appender.SYSLOG.layout.sdId = test
appender.SYSLOG.layout.enterpriseId = 18060
appender.SYSLOG.layout.pairs.type = KeyValuePair
appender.SYSLOG.layout.pairs.key = service
appender.SYSLOG.layout.pairs.value = hiveserver2
appender.SYSLOG.layout.pairs.key = loglevel
appender.SYSLOG.layout.pairs.value = %p
appender.SYSLOG.layout.pairs.key = message
appender.SYSLOG.layout.pairs.value = %c%m%n

loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX

logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn
logger.NIOServerCnxn.level = WARN

logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO
logger.ClientCnxnSocketNIO.level = WARN

logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = ERROR

logger.Datastore.name = Datastore
logger.Datastore.level = ERROR

logger.JPOX.name = JPOX
logger.JPOX.level = ERROR

rootLogger.level = ${sys:hive.log.level}
rootLogger.appenderRefs = root
rootLogger.appenderRef.root.ref = ${sys:hive.root.logger}
{noformat}
Unfortunately, the logs are still written in a generic pattern layout. The 
KeyValuePairs are not used. The log4j logs are:
{noformat}
2016-05-19 10:36:14,866 main DEBUG Initializing configuration 
org.apache.logging.log4j.core.config.properties.PropertiesConfiguration@5433a329
2016-05-19 10:36:16,575 main DEBUG Took 1.706004 seconds to load 3 plugins from 
package org.apache.hadoop.hive.ql.log
2016-05-19 10:36:16,575 main DEBUG PluginManager 'Core' found 80 plugins
2016-05-19 10:36:16,576 main DEBUG PluginManager 'Level' found 0 plugins
2016-05-19 10:36:16,578 main DEBUG Building Plugin[name=property, 
class=org.apache.logging.log4j.core.config.Property]. Searching for builder 
factory method...
2016-05-19 10:36:16,583 main DEBUG No builder factory method found in class 
org.apache.logging.log4j.core.config.Property. Going to try finding a factory 
method instead.
2016-05-19 10:36:16,583 main DEBUG Still building Plugin[name=property, 
class=org.apache.logging.log4j.core.config.Property]. Searching for factory 
method...
2016-05-19 10:36:16,584 main DEBUG Found factory method [createProperty]: 
public static org.apache.logging.log4j.core.config.Property 
org.apache.logging.log4j.core.config.Property.createProperty(java.lang.String,java.lang.String).
2016-05-19 10:36:16,611 main DEBUG TypeConverterRegistry initializing.
2016-05-19 10:36:16,611 main DEBUG PluginManager 'TypeConverter' found 21 
plugins
2016-05-19 10:36:16,636 main DEBUG Calling createProperty on class 
org.apache.logging.log4j.core.config.Property for element Property with 
params(name="hive.log.file", value="bigdata.log")
2016-05-19 10:36:16,636 main DEBUG Built Plugin[name=property] OK from factory 
method.
2016-05-19 10:36:16,636 main DEBUG Building Plugin[name=property, 
class=org.apache.logging.log4j.core.config.Property]. Searching for builder 
factory method...
2016-05-19 10:36:16,637 main DEBUG No builder factory method found in class 
org.apache.logging.log4j.core.config.Property. Going to try finding a factory 
method instead.
2016-05-19 10:36:16,6
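
A side note that may explain part of this (an assumption based on 
java.util.Properties semantics, not a confirmed diagnosis of the bug): the 
configuration above repeats the same keys (appender.SYSLOG.layout.pairs.key / 
appender.SYSLOG.layout.pairs.value) for all three pairs, and a properties file 
keeps only the last occurrence of a duplicated key, so at most one KeyValuePair 
can ever reach the appender this way. A minimal demonstration:

{code}
import java.io.StringReader;
import java.util.Properties;

public class DuplicateKeyDemo {
  public static void main(String[] args) throws Exception {
    String cfg =
        "appender.SYSLOG.layout.pairs.key = service\n" +
        "appender.SYSLOG.layout.pairs.key = loglevel\n" +
        "appender.SYSLOG.layout.pairs.key = message\n";
    Properties p = new Properties();
    p.load(new StringReader(cfg));
    // Prints "message": the 'service' and 'loglevel' entries were overwritten.
    System.out.println(p.getProperty("appender.SYSLOG.layout.pairs.key"));
  }
}
{code}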

[jira] [Created] (HIVE-13789) Repeatedly checking configuration in TextRecordWriter/Reader hurts performance

2016-05-19 Thread Rui Li (JIRA)
Rui Li created HIVE-13789:
-

 Summary: Repeatedly checking configuration in 
TextRecordWriter/Reader hurts performance
 Key: HIVE-13789
 URL: https://issues.apache.org/jira/browse/HIVE-13789
 Project: Hive
  Issue Type: Improvement
Reporter: Rui Li
Assignee: Rui Li
Priority: Minor


We check the configuration to decide whether to escape certain characters each 
time we write/read a record for custom scripts.
In our benchmark this becomes a hot-spot method, and fixing it improves the 
execution of the custom script by 7% (3TB TPCx-BB dataset).
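
A sketch of the fix pattern, assuming hive.transform.escape.input is the flag 
being checked (the class shape is illustrative, not the actual patch):

{code}
import org.apache.hadoop.conf.Configuration;

/** Cache the escape flag at initialization instead of checking per record. */
public class TextRecordWriterSketch {
  private boolean escapeEnabled; // read once, not on every write()

  public void initialize(Configuration conf) {
    escapeEnabled = conf.getBoolean("hive.transform.escape.input", false);
  }

  public void write(byte[] record) {
    if (escapeEnabled) {
      // escape special characters before writing
    }
    // ... write the (possibly escaped) record bytes
  }
}
{code}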



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)