[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
[ https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766740#comment-16766740 ]

ASF subversion and git services commented on IMPALA-7214:
---------------------------------------------------------

Commit 5b32a0d60110be7c21184819c2dffbb7cbff750f in impala's branch refs/heads/master from Alex Rodoni
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5b32a0d ]

IMPALA-7214: [DOCS] More on decoupling impala and DataNodes

Change-Id: I4b6f1c704c1e328af9f0beec73f8b6b61fba992e
Reviewed-on: http://gerrit.cloudera.org:8080/12457
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong

> Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7214
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7214
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Docs
>    Affects Versions: Impala 2.12.0
>            Reporter: Tim Armstrong
>            Assignee: Alex Rodoni
>            Priority: Major
>             Fix For: Impala 3.2.0
>
> The docs tend to conflate DataNodes (a HDFS service) and Impala daemons. I think this stems from the original deployment practice of always colocating Impala daemons with HDFS datanodes so that HDFS data could always be read from a local DataNode.
> I'm a bit pedantic so the conflation feels wrong to me regardless, but I think this will become increasingly confusing as alternative deployments without colocated HDFS DataNodes become more common (e.g. running against S3, running with a separate HDFS service).
> E.g. picking an example at random:
> {noformat}
> In Impala 1.4.0 and higher, the LIMIT clause is now optional (rather than required) for queries that use the ORDER BY clause. Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular DataNode.
> {noformat}
> This is wrong because the memory limit is for an Impala daemon, which is the process that does the actual sorting. So here I think it should be "Impala daemon" instead of "DataNode".

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
[ https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766427#comment-16766427 ]

Alex Rodoni commented on IMPALA-7214:
-------------------------------------

https://gerrit.cloudera.org/#/c/12457/
[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
[ https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765657#comment-16765657 ]

ASF subversion and git services commented on IMPALA-7214:
---------------------------------------------------------

Commit 697a15b341186046d8fae3a2139f1ad13d304734 in impala's branch refs/heads/master from Alex Rodoni
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=697a15b ]

IMPALA-7214: [DOCS] Update Impala docs to decouple Impala and DataNodes

- Take 1: Let's review these docs before we go clean up many more.

Change-Id: I1c91f7975c09dae9908591eeeac0d55e5355b2d4
Reviewed-on: http://gerrit.cloudera.org:8080/12400
Reviewed-by: Alex Rodoni
Tested-by: Impala Public Jenkins
[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
[ https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763216#comment-16763216 ]

Alex Rodoni commented on IMPALA-7214:
-------------------------------------

https://gerrit.cloudera.org/#/c/12400/
[jira] [Commented] (IMPALA-7214) Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
[ https://issues.apache.org/jira/browse/IMPALA-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729187#comment-16729187 ]

Tim Armstrong commented on IMPALA-7214:
---------------------------------------

I started to have a look at this, and I think it's not just a terminology issue: there's a pervasive assumption in a lot of the docs of a particular topology, namely a coordinator+executor colocated with each HDFS DataNode. One example is this sentence: "You can submit a query to the Impala daemon running on any DataNode", which is outdated both because of the coordinator/executor separation and because of decoupled compute clusters.

So I think we need to first update the docs to reflect the coordinator/executor separation. Maybe they need to be defined in topics/impala_components.xml. I think that introductory material also needs to introduce the traditional colocated HDFS storage/compute model, where each executor runs on the same host as a DataNode, but also mention that other deployment models are possible (like a compute cluster that reads remotely from HDFS, S3, ADLS, etc).

Then there are a bunch of instances of the terminology that I think are really referring to an Impala daemon, or are unnecessary, e.g.:
* "For information about establishing a connection to a DataNode running the impalad daemon.." should probably read "For information about establishing a connection to a coordinator Impala Daemon..."
* "Where practical, co-locate the tablet servers on the same hosts as the DataNodes, although that is not required." -> "Where practical, co-locate the tablet servers on the same hosts as the Impala Daemons, although that is not required."
* "If necessary, specify one of the following configuration options when starting the impalad daemon on each DataNode:" -> "If necessary, specify one of the following configuration options when starting the impalad daemon:"
* In impala_order_by.xml, I think each instance of DataNode actually should be "Impala Daemon" or, probably more precisely, "Executor Impala Daemon"
* "For some operations, such as joins and combining intermediate results into a final result set, data is transmitted across the network from one DataNode to another." -> this should be "Impala Daemon"

There are a bunch of instances of "DataNode" that refer to the HDFS process or service, or to the host that is running the HDFS process, which should be ok to leave for now (there are some that are a little unclear).

I found one instance where it was referring to the host running the Impala Daemon: "Increase the overall memory capacity of each DataNode at the hardware level." I'd rephrase that as "Add more memory to the hosts running Impala Daemons".

[~arodoni_cloudera] Maybe we can fix up the instances where it's definitely referring to the Impala daemon, then do another pass to see how it looks?

[~philip] [~joemcdonnell] FYI, just an example of some of the things we might need to update in docs for remote reads.

> Update Impala docs to reflect coordinator/executor separation and decoupling from DataNodes.
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7214
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7214
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Docs
>    Affects Versions: Impala 2.12.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
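A minimal sketch of how the first pass described in this comment (finding "DataNode" mentions that may really mean an Impala daemon) might be scripted. The function name, the hint patterns, and the sample text are assumptions made for illustration; they are not part of the Impala docs tooling, and the result only triages lines for a human reviewer rather than rewriting anything.

```python
import re

# Matches "DataNode" / "DataNodes" in any capitalization (the docs also use "datanodes").
DATANODE = re.compile(r"\bDataNodes?\b", re.IGNORECASE)

# Hypothetical hints that a sentence is about the Impala daemon rather than HDFS.
# These patterns are guesses and would need tuning against the real doc sources.
IMPALA_HINTS = re.compile(r"impalad|Impala daemon|memory limit|query|queries", re.IGNORECASE)


def flag_datanode_mentions(text):
    """Return (line_number, line, likely_impala) for every line that mentions DataNode."""
    results = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if DATANODE.search(line):
            likely_impala = bool(IMPALA_HINTS.search(line))
            results.append((lineno, line.strip(), likely_impala))
    return results


# Illustrative sample text; the first sentence is quoted in the comment above.
sample = """\
You can submit a query to the Impala daemon running on any DataNode.
HDFS stores block replicas on several DataNodes.
Data is cached in memory.
"""

for lineno, line, likely_impala in flag_datanode_mentions(sample):
    label = "review: may mean Impala daemon" if likely_impala else "probably HDFS: leave"
    print(f"{lineno}: [{label}] {line}")
```

The heuristic deliberately over-reports: a flagged line is only a candidate, and the unclear cases mentioned in the comment would still need a manual read.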