[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
[ https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-22814:
---------------------------------
    Docs Text: https://github.com/apache/spark/pull/1  (was: https://github.com/apache/spark/pull/1)

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
[ https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-22814:
---------------------------------
    Comment: was deleted

(was: https://github.com/apache/spark/pull/1)
[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
[ https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-22814:
---------------------------------
    Docs Text: https://github.com/apache/spark/pull/1
    External issue URL:   (was: https://github.com/apache/spark/pull/1)
[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
[ https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-22814:
---------------------------------
    External issue URL: https://github.com/apache/spark/pull/1
[jira] [Commented] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
[ https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293680#comment-16293680 ]

Yuechen Chen commented on SPARK-22814:
--------------------------------------
https://github.com/apache/spark/pull/1
[jira] [Created] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
Yuechen Chen created SPARK-22814:
------------------------------------

             Summary: JDBC support date/timestamp type as partitionColumn
                 Key: SPARK-22814
                 URL: https://issues.apache.org/jira/browse/SPARK-22814
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.2.1, 1.6.2
            Reporter: Yuechen Chen

In Spark, you can partition MySQL queries by partitionColumn:

val df = spark.read.jdbc(url = jdbcUrl,
  table = "employees",
  columnName = "emp_no",
  lowerBound = 1L,
  upperBound = 10L,
  numPartitions = 100,
  connectionProperties = connectionProperties)
display(df)

However, partitionColumn must be a numeric column of the table, and there are many tables that have no primary key but do have date/timestamp indexes.
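As a possible workaround until such support exists, the `predicates` overload of `spark.read.jdbc` can emulate date-based partitioning by generating one WHERE clause per date range. This is only a sketch: the `hire_date` column is hypothetical, and `spark`, `jdbcUrl`, and `connectionProperties` are assumed to be set up as in the example above.

```scala
import java.time.LocalDate
import java.util.Properties

// Build one predicate per day between two bounds; each predicate
// becomes one partition of the resulting DataFrame.
val lower = LocalDate.parse("1985-01-01")
val upper = LocalDate.parse("1985-01-10")
val predicates: Array[String] =
  Iterator.iterate(lower)(_.plusDays(1))
    .takeWhile(d => d.isBefore(upper))
    .map(d => s"hire_date >= '$d' AND hire_date < '${d.plusDays(1)}'")
    .toArray

// The predicates overload does not require a numeric partition column:
// jdbc(url: String, table: String, predicates: Array[String],
//      connectionProperties: Properties)
val df = spark.read.jdbc(jdbcUrl, "employees", predicates, connectionProperties)
```

Unlike the numeric `columnName`/`lowerBound`/`upperBound` form, the caller is responsible for making the predicates non-overlapping and exhaustive.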
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007561#comment-16007561 ]

Yuechen Chen commented on SPARK-20608:
--------------------------------------
I tried this solution, but met with some problems. I configured dfs.nameservices in hdfs-site.xml on my test machine, and the Hadoop client works:

hdfs dfs -ls hdfs://mycluster/path

But through spark-submit it failed with the following exception:

17/05/12 10:33:57 INFO Client: Submitting application application_1487208985618_23772 to ResourceManager
17/05/12 10:33:59 INFO Client: Application report for application_1487208985618_23772 (state: FAILED)
17/05/12 10:33:59 INFO Client:
	 client token: N/A
	 diagnostics: Unable to map logical nameservice URI 'hdfs://mycluster' to a NameNode. Local configuration does not have a failover proxy provider configured.
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1

Should the same nameservices also be configured on the YARN side, i.e. should the remote nameservice be configured on the YARN ResourceManager as well? I'm not clear on that. Since putting the nameservice address in the config is the only recommended way to support HDFS HA, could someone solve this problem (if it's a bug) or add some examples to the Spark wiki?

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
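For reference, the "failover proxy provider" the diagnostics message complains about is a standard HDFS HA client setting: resolving a logical nameservice URI like hdfs://mycluster requires it alongside the nameservice definition. A minimal hdfs-site.xml sketch (the nameservice, namenode IDs, hostnames, and port are placeholders):

```xml
<configuration>
  <!-- Logical nameservice and its namenode IDs -->
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>

  <!-- Physical RPC addresses for each namenode -->
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>namenode01:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>namenode02:8020</value></property>

  <!-- Without this, clients cannot map hdfs://mycluster to a NameNode -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

This configuration must be visible to every process that resolves the URI, which is likely why it works for the local hadoop client but not for containers launched by YARN.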
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004778#comment-16004778 ]

Yuechen Chen commented on SPARK-20608:
--------------------------------------
I know what you mean, and that's exactly right. But since Spark provides the "yarn.spark.access.namenodes" config, Spark may recommend two ways to save data to a remote HDFS:
1) as you said, configure the remote namespace mapping in hdfs-site.xml, and submit to Spark without any extra SparkConf (perhaps partly recommended for HA);
2) configure yarn.spark.access.namenodes=remotehdfs (which does not support HA well).
For the second way, if standby namenodes were allowed in yarn.spark.access.namenodes, that would be an easier path to HA, even though the Spark application may still fail if the namenode fails over in the middle of a job that is saving to the remote HDFS.
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002123#comment-16002123 ]

Yuechen Chen commented on SPARK-20608:
--------------------------------------
All said, I think it is unreasonable that a Spark application fails when one of the namenodes in yarn.spark.access.namenodes cannot be accessed.
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002119#comment-16002119 ]

Yuechen Chen commented on SPARK-20608:
--------------------------------------
Thanks [~liuml07], [~vanzin]. My colleague in charge of Hadoop didn't recommend configuring a remote namenode mapping to support HDFS HA, but I will try to evaluate this solution soon.
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999415#comment-15999415 ]

Yuechen Chen commented on SPARK-20608:
--------------------------------------
[~vanzin] To my knowledge, Spark obtains tokens in two scenarios:
1) when the application is submitted to YARN, the Spark client obtains a token for every entry in spark.yarn.access.namenodes;
2) while the application is running on YARN, Spark renews the tokens at regular intervals.
So the concern that Spark would not hold the right token if the standby namenode becomes active may not be a problem (I don't know the renewal frequency).
I have to say, even with both active and standby addresses configured, if the active namenode fails over while a Spark job is already writing data into the remote HDFS, things will fail eventually. But with my config, the user does not need to care which namenode is active when submitting. Without it, the user has to hard-code spark.yarn.access.namenodes=hdfs://activeNamenode, which I think is not graceful.
About "Doesn't it work if you add the namespace (not the NN addresses) in the config instead?" — can you give a concrete example?
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998197#comment-15998197 ]

Yuechen Chen commented on SPARK-20608:
--------------------------------------
[~ste...@apache.org] Your worry is reasonable. In our tests, there are two possible exceptions when yarn.spark.access.namenodes=hdfs://activeNamenode,hdfs://standbyNamenode:
1) Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
2) Caused by: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
Maybe the RemoteException should be caught in a better way.
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-20608:
---------------------------------
    Description:
If a Spark application needs to access remote namenodes, yarn.spark.access.namenodes only needs to be configured in the spark-submit script, and the Spark client (on YARN) fetches HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode and at least one standby namenode.
However, if yarn.spark.access.namenodes includes both active and standby namenodes, the Spark application fails because the standby namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think it won't cause any bad effect to configure standby namenodes in yarn.spark.access.namenodes, and my Spark application would then be able to sustain a failover of the Hadoop namenode.

HA example:
Spark-submit script:
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark application code:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)

  (was: the same text with {...} markup around yarn.spark.access.namenodes)
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-20608:
---------------------------------
    Description: (markup change only: "${yarn.spark.access.namenodes}" → "{yarn.spark.access.namenodes}"; text otherwise unchanged)
[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuechen Chen updated SPARK-20608:
---------------------------------
    Description: (labels in the HA example reworded: "spark-submit script" → "Spark-submit script", "Spark Application" → "Spark Application Codes"; text otherwise unchanged)
[jira] [Created] (SPARK-20608) Standby namenodes should be allowed to be included in yarn.spark.access.namenodes to support HDFS HA
Yuechen Chen created SPARK-20608: Summary: Standby namenodes should be allowed to be included in yarn.spark.access.namenodes to support HDFS HA Key: SPARK-20608 URL: https://issues.apache.org/jira/browse/SPARK-20608 Project: Spark Issue Type: Bug Components: Spark Submit, YARN Affects Versions: 2.1.0, 2.0.1 Reporter: Yuechen Chen If a Spark application needs to access remote namenodes, ${yarn.spark.access.namenodes} only needs to be configured in the spark-submit script, and the Spark client (on YARN) fetches HDFS credentials periodically. If a Hadoop cluster is configured for HA, there is one active namenode and at least one standby namenode. However, if ${yarn.spark.access.namenodes} includes both active and standby namenodes, the Spark application fails, because Spark cannot access the standby namenode and hits org.apache.hadoop.ipc.StandbyException. Configuring standby namenodes in ${yarn.spark.access.namenodes} should cause no harm, and it would let the Spark application survive a Hadoop namenode failover. HA Examples: spark-submit script: yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 Spark Application: dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
[jira] [Commented] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack
[ https://issues.apache.org/jira/browse/SPARK-19894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904621#comment-15904621 ] Yuechen Chen commented on SPARK-19894: -- https://github.com/apache/spark/pull/17238 > Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack > - > > Key: SPARK-19894 > URL: https://issues.apache.org/jira/browse/SPARK-19894 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 2.1.0 > Environment: Yarn-cluster >Reporter: Yuechen Chen > > In YARN-cluster mode, if the driver has no rack information for two different > hosts, both hosts are recognized as "/default-rack", which > can cause bugs. > For example, if the hosts of one executor and one external data source are unknown > to the driver, the two hosts are recognized as being in the same rack > "/default-rack", and then all tasks are assigned to that executor. > This bug would be avoided if getRackForHost("unknown host") in YarnScheduler > returned None instead of Some("/default-rack").
[jira] [Created] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack
Yuechen Chen created SPARK-19894: Summary: Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack Key: SPARK-19894 URL: https://issues.apache.org/jira/browse/SPARK-19894 Project: Spark Issue Type: Bug Components: Scheduler, YARN Affects Versions: 2.1.0 Environment: Yarn-cluster Reporter: Yuechen Chen In YARN-cluster mode, if the driver has no rack information for two different hosts, both hosts are recognized as "/default-rack", which can cause bugs. For example, if the hosts of one executor and one external data source are unknown to the driver, the two hosts are recognized as being in the same rack "/default-rack", and then all tasks are assigned to that executor. This bug would be avoided if getRackForHost("unknown host") in YarnScheduler returned None instead of Some("/default-rack").
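The failure mode can be seen in a minimal model. This is not Spark source code: the resolver functions, `knownRacks` map, and host names below are made up to illustrate why mapping every unknown host to Some("/default-rack") makes two unrelated hosts look rack-local, while returning None avoids the false match:

```scala
// Minimal model (not Spark source) of rack resolution on the driver.
// knownRacks stands in for whatever topology information the driver actually has.
object RackResolution {
  val knownRacks: Map[String, String] = Map("host-a" -> "/rack1")

  // Behaviour reported in the issue: every unknown host collapses to "/default-rack".
  def rackDefault(host: String): Option[String] =
    Some(knownRacks.getOrElse(host, "/default-rack"))

  // Proposed behaviour: an unknown host yields no rack information at all.
  def rackProposed(host: String): Option[String] =
    knownRacks.get(host)

  // Rack locality as a scheduler would judge it: same rack only when both resolve.
  def sameRack(resolve: String => Option[String], h1: String, h2: String): Boolean =
    (resolve(h1), resolve(h2)) match {
      case (Some(r1), Some(r2)) => r1 == r2
      case _                    => false
    }
}

// Two hosts the driver knows nothing about look rack-local under the old scheme...
val falseLocality =
  RackResolution.sameRack(RackResolution.rackDefault, "executor-host", "datasource-host")
// ...but not under the proposed one.
val noFalseLocality =
  RackResolution.sameRack(RackResolution.rackProposed, "executor-host", "datasource-host")
// falseLocality is true, noFalseLocality is false
```

Under the "/default-rack" scheme the scheduler treats the lone known executor as rack-local to the external data source's blocks, so every task's locality preference points at that one executor; with None there is no rack preference and tasks spread normally.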