[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.

2019-02-12 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766740#comment-16766740
 ] 

ASF subversion and git services commented on IMPALA-7214:
-

Commit 5b32a0d60110be7c21184819c2dffbb7cbff750f in impala's branch 
refs/heads/master from Alex Rodoni
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5b32a0d ]

IMPALA-7214: [DOCS] More on decoupling impala and DataNodes

Change-Id: I4b6f1c704c1e328af9f0beec73f8b6b61fba992e
Reviewed-on: http://gerrit.cloudera.org:8080/12457
Tested-by: Impala Public Jenkins 
Reviewed-by: Tim Armstrong 


> Update Impala docs to reflect coordinator/executor separation and decoupling 
> from DataNodes.
> 
>
> Key: IMPALA-7214
> URL: https://issues.apache.org/jira/browse/IMPALA-7214
> Project: IMPALA
> Issue Type: Bug
> Components: Docs
> Affects Versions: Impala 2.12.0
> Reporter: Tim Armstrong
> Assignee: Alex Rodoni
> Priority: Major
> Fix For: Impala 3.2.0
>
>
> The docs tend to conflate DataNodes (an HDFS service) and Impala daemons. I 
> think this stems from the original deployment practice of always colocating 
> Impala daemons with HDFS DataNodes so that HDFS data could always be read 
> from a local DataNode. 
> I'm a bit pedantic so the conflation feels wrong to me regardless, but I 
> think this will become increasingly confusing as alternative deployments 
> without colocated HDFS DataNodes become more common (e.g. running against S3, 
> running with a separate HDFS service).
> E.g. picking an example at random:
> {noformat}
> In Impala 1.4.0 and higher, the LIMIT clause is now optional (rather than
> required) for queries that use the ORDER BY clause. Impala automatically uses
> a temporary disk work area to perform the sort if the sort operation would
> otherwise exceed the Impala memory limit for a particular DataNode.
> {noformat}
> This is wrong because the memory limit is for an Impala daemon, which is the 
> process that does the actual sorting. So here I think it should be "Impala 
> daemon" instead of "DataNode".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.

2019-02-12 Thread Alex Rodoni (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766427#comment-16766427
 ] 

Alex Rodoni commented on IMPALA-7214:
-

https://gerrit.cloudera.org/#/c/12457/







[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.

2019-02-11 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765657#comment-16765657
 ] 

ASF subversion and git services commented on IMPALA-7214:
-

Commit 697a15b341186046d8fae3a2139f1ad13d304734 in impala's branch 
refs/heads/master from Alex Rodoni
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=697a15b ]

IMPALA-7214: [DOCS] Update Impala docs to decouple Impala and DataNodes

- Take 1: Let's review these docs before we go clean up many more.

Change-Id: I1c91f7975c09dae9908591eeeac0d55e5355b2d4
Reviewed-on: http://gerrit.cloudera.org:8080/12400
Reviewed-by: Alex Rodoni 
Tested-by: Impala Public Jenkins 








[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.

2019-02-07 Thread Alex Rodoni (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763216#comment-16763216
 ] 

Alex Rodoni commented on IMPALA-7214:
-

https://gerrit.cloudera.org/#/c/12400/







[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.

2018-12-26 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729187#comment-16729187
 ] 

Tim Armstrong commented on IMPALA-7214:
---

I started to have a look at this, and I think it's not just a terminology 
issue: a lot of the docs pervasively assume a particular topology, with a 
coordinator+executor colocated with each HDFS DataNode. One example is this 
sentence: "You can submit a query to the Impala daemon running on any 
DataNode", which is outdated both because of the coordinator/executor 
separation and because of decoupled compute clusters.

So I think we need to first update the docs to reflect the coordinator/executor 
separation. Maybe the two roles need to be defined in 
topics/impala_components.xml. I think that introductory material also needs to 
introduce the traditional colocated HDFS storage/compute model, where each 
executor runs on the same host as a DataNode, but also mention that other 
deployment models are possible (like a compute cluster that reads remotely 
from HDFS, S3, ADLS, etc).

Then there are a bunch of instances of the term "DataNode" that I think are 
really referring to an Impala daemon, or are unnecessary. For example:
* "For information about establishing a connection to a DataNode running the 
impalad daemon.." should probably read "For information about establishing a 
connection to a coordinator Impala Daemon..."
* "Where practical, co-locate the tablet servers on the same hosts as the 
DataNodes, although that is not required." -> "Where practical, co-locate the 
tablet servers on the same hosts as the Impala Daemons, although that is not 
required."
* "If necessary, specify one of the following configuration options when 
starting the impalad daemon on each DataNode:" -> "If necessary, specify one of 
the following configuration options when starting the impalad daemon:"
* In impala_order_by.xml, I think each instance of DataNode actually should be 
"Impala Daemon" or, probably more precisely, "Executor Impala Daemon".
* "For some operations, such as joins and combining intermediate results into a 
final result set, data is transmitted across the network from one DataNode to 
another." -> here "DataNode" should be "Impala Daemon".

There are a bunch of instances of "DataNode" that are referring to the HDFS 
process or service, or the host that is running the HDFS process, which should 
be ok to leave for now (there are some that are a little unclear).

I found one instance where it was referring to the host running the Impala 
Daemon: "Increase the overall memory capacity of each DataNode at the hardware 
level." I'd rephrase that as "Add more memory to the hosts running Impala 
Daemons".

[~arodoni_cloudera] Maybe we can fix up the instances where it's definitely 
referring to the Impala daemon, then do another pass to see how it looks?
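
A pass like the one described above could be scripted roughly as follows. This 
is only a sketch under stated assumptions: it assumes the docs are XML files 
under a single directory tree (the comment mentions 
topics/impala_components.xml and impala_order_by.xml), and the function name 
find_datanode_mentions is hypothetical. It only lists candidate lines; deciding 
whether each one means the HDFS process, the host, or the Impala daemon still 
needs a human.

```python
import re
from pathlib import Path

# Matches "DataNode", "DataNodes", and lowercase variants such as "datanodes".
DATANODE = re.compile(r"\bDataNodes?\b", re.IGNORECASE)


def find_datanode_mentions(root, pattern="*.xml"):
    """Yield (path, line_number, stripped_line) for each line mentioning DataNode.

    `root` is the directory to scan and `pattern` the file glob; both are
    assumptions about how the docs are laid out, not facts from the issue.
    """
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if DATANODE.search(line):
                yield str(path), lineno, line.strip()


def print_report(root):
    """Print one 'path:line: text' entry per mention, for manual triage."""
    for path, lineno, line in find_datanode_mentions(root):
        print(f"{path}:{lineno}: {line}")
```

Calling print_report("topics") from the docs directory would give a list to 
triage by hand, which fits the "fix the definite ones, then do another pass" 
approach.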

[~philip][~joemcdonnell] FYI, just an example of some of the things we might 
need to update in docs for remote reads.

> Update Impala docs to reflect coordinator/executor separation and decoupling 
> from DataNodes.
> 
>
> Key: IMPALA-7214
> URL: https://issues.apache.org/jira/browse/IMPALA-7214
> Project: IMPALA
> Issue Type: Bug
> Components: Docs
> Affects Versions: Impala 2.12.0
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Priority: Major


