[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-10-01 Thread Lefty Leverenz (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lefty Leverenz updated HIVE-1:
--
Labels: TODOC2.2  (was: )

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
>  Labels: TODOC2.2
> Fix For: 2.2.0
>
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch, 
> HIVE-1.3.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-28 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Xuefu for reviewing.

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Fix For: 2.2.0
>
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch, 
> HIVE-1.3.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: Patch Available  (was: In Progress)

Patch-3: the test case passed locally. Not sure why it failed here. Modified 
the test case slightly to see what's going on. 

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch, 
> HIVE-1.3.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: HIVE-1.3.patch

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch, 
> HIVE-1.3.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: In Progress  (was: Patch Available)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: HIVE-1.2.patch

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: (was: HIVE-1.2.patch)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: Patch Available  (was: In Progress)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: HIVE-1.2.patch

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: In Progress  (was: Patch Available)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: (was: HIVE-1.2.patch)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: Patch Available  (was: In Progress)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-26 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: In Progress  (was: Patch Available)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-23 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: HIVE-1.2.patch

Patch-2: added retry logic if the port is used and also changed to throw 
exception if the configuration is incorrect rather than silently uses 0, since 
I feel that that could confuse the users more. 

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch, HIVE-1.2.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-22 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Component/s: Spark

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Spark
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-22 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: HIVE-1.1.patch

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-22 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: (was: HIVE-1.1.patch)

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-22 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Status: Patch Available  (was: Open)

Patch-1: made the change to add a configuration for the port range.

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-12222) Define port range in property for RPCServer

2016-09-22 Thread Aihua Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-1:

Attachment: HIVE-1.1.patch

> Define port range in property for RPCServer
> ---
>
> Key: HIVE-1
> URL: https://issues.apache.org/jira/browse/HIVE-1
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI
>Affects Versions: 1.2.1
> Environment: Apache Hadoop 2.7.0
> Apache Hive 1.2.1
> Apache Spark 1.5.1
>Reporter: Andrew Lee
>Assignee: Aihua Xu
> Attachments: HIVE-1.1.patch
>
>
> Creating this JIRA after discussin with Xuefu on the dev mailing list. Would 
> need some help to review and update the fields in this JIRA ticket, thanks.
> I notice that in 
> ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> The port number is assigned with 0 which means it will be a random port every 
> time when the RPC Server is created to talk to Spark in the same session.
> Because of this, this is causing problems to configure firewall between the 
> HiveCLI RPC Server and Spark due to unpredictable port numbers here. In other 
> word, users need to open all hive ports range 
> from Data Node => HiveCLI (edge node).
> {code}
>  this.channel = new ServerBootstrap()
>   .group(group)
>   .channel(NioServerSocketChannel.class)
>   .childHandler(new ChannelInitializer() {
>   @Override
>   public void initChannel(SocketChannel ch) throws Exception {
> SaslServerHandler saslHandler = new SaslServerHandler(config);
> final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, 
> group);
> saslHandler.rpc = newRpc;
> Runnable cancelTask = new Runnable() {
> @Override
> public void run() {
>   LOG.warn("Timed out waiting for hello from client.");
>   newRpc.close();
> }
> };
> saslHandler.cancelTask = group.schedule(cancelTask,
> RpcServer.this.config.getServerConnectTimeoutMs(),
> TimeUnit.MILLISECONDS);
>   }
>   })
> {code}
> 2 Main reasons.
> - Most users (what I see and encounter) use HiveCLI as a command line tool, 
> and in order to use that, they need to login to the edge node (via SSH). Now, 
> here comes the interesting part.
> Could be true or not, but this is what I observe and encounter from time to 
> time. Most users will abuse the resource on that edge node (increasing 
> HADOOP_HEAPSIZE, dumping output to local disk, running huge python workflow, 
> etc), this may cause the HS2 process to run into OOME, choke and die, etc. 
> various resource issues including others like login, etc.
> - Analyst connects to Hive via HS2 + ODBC. So HS2 needs to be highly 
> available. This makes sense to run it on the gateway node or a service node 
> and separated from the HiveCLI.
> The logs are located in different location, monitoring and auditing is easier 
> to run HS2 with a daemon user account, etc. so we don't want users to run 
> HiveCLI where HS2 is running.
> It's better to isolate the resource this way to avoid any memory, file 
> handlers, disk space, issues.
> From a security standpoint, 
> - Since users can login to edge node (via SSH), the security on the edge node 
> needs to be fortified and enhanced. Therefore, all the FW comes in and 
> auditing.
> - Regulation/compliance for auditing is another requirement to monitor all 
> traffic, specifying ports and locking down the ports makes it easier since we 
> can focus
> on a range to monitor and audit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)