[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
Docs Text:   (was: https://github.com/apache/spark/pull/1)

https://github.com/apache/spark/pull/1

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by a partitionColumn:
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> However, partitionColumn must be a numeric column from the table, while many tables 
> have no numeric primary key and only have date/timestamp indexes. It would therefore 
> be useful if JDBC partitioning also supported date/timestamp columns as the partitionColumn.






[jira] [Issue Comment Deleted] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/1)

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by a partitionColumn:
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> However, partitionColumn must be a numeric column from the table, while many tables 
> have no numeric primary key and only have date/timestamp indexes. It would therefore 
> be useful if JDBC partitioning also supported date/timestamp columns as the partitionColumn.






[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
 Docs Text: https://github.com/apache/spark/pull/1
External issue URL:   (was: https://github.com/apache/spark/pull/1)

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by a partitionColumn:
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> However, partitionColumn must be a numeric column from the table, while many tables 
> have no numeric primary key and only have date/timestamp indexes. It would therefore 
> be useful if JDBC partitioning also supported date/timestamp columns as the partitionColumn.






[jira] [Updated] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-22814:
-
External issue URL: https://github.com/apache/spark/pull/1

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by a partitionColumn:
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> However, partitionColumn must be a numeric column from the table, while many tables 
> have no numeric primary key and only have date/timestamp indexes. It would therefore 
> be useful if JDBC partitioning also supported date/timestamp columns as the partitionColumn.






[jira] [Commented] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293680#comment-16293680
 ] 

Yuechen Chen commented on SPARK-22814:
--

https://github.com/apache/spark/pull/1


> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark, you can partition MySQL queries by a partitionColumn:
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> However, partitionColumn must be a numeric column from the table, while many tables 
> have no numeric primary key and only have date/timestamp indexes. It would therefore 
> be useful if JDBC partitioning also supported date/timestamp columns as the partitionColumn.






[jira] [Created] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2017-12-15 Thread Yuechen Chen (JIRA)
Yuechen Chen created SPARK-22814:


 Summary: JDBC support date/timestamp type as partitionColumn
 Key: SPARK-22814
 URL: https://issues.apache.org/jira/browse/SPARK-22814
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.2.1, 1.6.2
Reporter: Yuechen Chen


In Spark, you can partition MySQL queries by a partitionColumn:
val df = (spark.read.jdbc(url=jdbcUrl,
table="employees",
columnName="emp_no",
lowerBound=1L,
upperBound=10L,
numPartitions=100,
connectionProperties=connectionProperties))
display(df)

However, partitionColumn must be a numeric column from the table, while many tables have 
no numeric primary key and only have date/timestamp indexes. It would therefore be useful 
if JDBC partitioning also supported date/timestamp columns as the partitionColumn.
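
For reference, the lines below are a minimal sketch of what date/timestamp-based 
partitioning could look like from the user's side. The column name hire_date, the bounds, 
and the connection settings are illustrative assumptions, not values from this issue; 
later Spark releases (2.4+) eventually exposed roughly this behaviour through the 
option-based JDBC reader, but it is not available in the affected versions listed above.

import org.apache.spark.sql.SparkSession

// Assumed example values; requires a MySQL JDBC driver on the classpath.
val spark = SparkSession.builder().appName("jdbc-date-partition-sketch").getOrCreate()
val jdbcUrl = "jdbc:mysql://dbhost:3306/employees"

val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "employees")
  .option("partitionColumn", "hire_date")  // a DATE column instead of a numeric one
  .option("lowerBound", "1985-01-01")      // bounds expressed as date strings
  .option("upperBound", "2000-01-01")
  .option("numPartitions", "12")
  .load()
df.show()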






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-11 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007561#comment-16007561
 ] 

Yuechen Chen commented on SPARK-20608:
--

I tried this solution but ran into some problems.
I configured dfs.nameservices in hdfs-site.xml on my test machine, and the Hadoop 
client works: hdfs dfs -ls hdfs://mycluster/path
But when submitting with spark-submit, it fails with the following exception:
17/05/12 10:33:57 INFO Client: Submitting application 
application_1487208985618_23772 to ResourceManager
17/05/12 10:33:59 INFO Client: Application report for 
application_1487208985618_23772 (state: FAILED)
17/05/12 10:33:59 INFO Client: 
 client token: N/A
 diagnostics: Unable to map logical nameservice URI 'hdfs://mycluster' 
to a NameNode. Local configuration does not have a failover proxy provider 
configured.
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
Does the same nameservice also need to be configured on the YARN side, that is, must the remote 
nameservice also be configured on the ResourceManager in YARN?
I'm not entirely clear about that.
Since using the nameservice address is the only recommended way to support 
HDFS HA, could someone resolve this problem (if it is a bug) or add some examples to 
the Spark wiki?
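
For what it's worth, here is a minimal Scala sketch of the client-side HA settings that 
the "failover proxy provider" diagnostics above is asking for. The nameservice name 
mycluster, the namenode IDs nn1/nn2 and the host:port pairs are assumptions for 
illustration; these keys would normally live in hdfs-site.xml on the submitting machine 
(and on the YARN nodes) rather than be set programmatically.

import org.apache.hadoop.conf.Configuration

// Illustrative values only.
val conf = new Configuration()
conf.set("dfs.nameservices", "mycluster")
conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode01:8020")
conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode02:8020")
// Without this property the client cannot map hdfs://mycluster to a NameNode,
// which matches the diagnostics message above.
conf.set("dfs.client.failover.proxy.provider.mycluster",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")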

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-10 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004778#comment-16004778
 ] 

Yuechen Chen commented on SPARK-20608:
--

I know what you mean, and that's exactly right.
But since Spark provides the "yarn.spark.access.namenodes" config, Spark effectively 
offers two ways to save data to a remote HDFS:
1) as you said, configure the remote nameservice mapping in hdfs-site.xml and 
submit to Spark without any extra SparkConf (partly recommended for HA);
2) configure yarn.spark.access.namenodes=remotehdfs (which does not support HA well).
For the second way, if standby namenodes were allowed in 
yarn.spark.access.namenodes, it would be an easier way to get HA, even though the Spark 
application could still fail if the namenode fails over while a job is writing to the remote 
HDFS.

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-09 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002123#comment-16002123
 ] 

Yuechen Chen commented on SPARK-20608:
--

All that said, I think it is unreasonable for a Spark application to fail when one of the 
namenodes in yarn.spark.access.namenodes cannot be accessed.

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-09 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002119#comment-16002119
 ] 

Yuechen Chen commented on SPARK-20608:
--

Thanks [~liuml07], [~vanzin]. My colleague in charge of Hadoop did not 
recommend configuring the remote namenode mapping to support HDFS HA, but I will 
try to evaluate this solution soon.

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-06 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999415#comment-15999415
 ] 

Yuechen Chen commented on SPARK-20608:
--

[~vanzin] To my knowledge, Spark obtains tokens in two scenarios:
1) when the application is submitted to YARN, the Spark client obtains a token for every 
entry in spark.yarn.access.namenodes;
2) while the Spark application is running on YARN, Spark renews the tokens at 
regular intervals.
So the concern that Spark would not hold the right token once a standby namenode becomes 
active may not actually be a problem (I don't know the renewal frequency).
I have to say that even with both the active and standby addresses configured, if the active 
namenode fails over while a Spark job is already writing data into the 
remote HDFS, things will still fail eventually.
But with my proposed config, users do not need to care which namenode is active when they 
submit a Spark job. Without it, users have to hard-code 
spark.yarn.access.namenodes=hdfs://activeNamenode, which I think is not 
graceful.

About "Doesn't it work if you add the namespace (not the NN addresses) in the 
config instead?": can you give some actual examples?


> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998197#comment-15998197
 ] 

Yuechen Chen commented on SPARK-20608:
--

[~ste...@apache.org] Your concern is reasonable. In our tests, two 
exceptions can occur when 
yarn.spark.access.namenodes=hdfs://activeNamenode,hdfs://standbyNamenode:
1) Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby
2) Caused by: org.apache.hadoop.ipc.StandbyException: Operation category WRITE 
is not supported in state standby
Maybe the RemoteException should be caught in a better way.

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-20608:
-
Description: 
If a Spark application needs to access remote namenodes, 
yarn.spark.access.namenodes only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if yarn.spark.access.namenodes includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
yarn.spark.access.namenodes would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)


  was:
If a Spark application needs to access remote namenodes, 
{yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if {yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
{yarn.spark.access.namenodes} would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-20608:
-
Description: 
If a Spark application needs to access remote namenodes, 
{yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if {yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
{yarn.spark.access.namenodes} would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)


  was:
If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
${yarn.spark.access.namenodes} would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> {yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if {yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> {yarn.spark.access.namenodes} would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Updated] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuechen Chen updated SPARK-20608:
-
Description: 
If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
${yarn.spark.access.namenodes} would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
Spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application Codes:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)


  was:
If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
${yarn.spark.access.namenodes} would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> ${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode.
> However, if ${yarn.spark.access.namenodes} includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> ${yarn.spark.access.namenodes} would cause no harm, and it would let my Spark 
> application survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Created] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-05 Thread Yuechen Chen (JIRA)
Yuechen Chen created SPARK-20608:


 Summary: Standby namenodes should be allowed to included in 
yarn.spark.access.namenodes to support HDFS HA
 Key: SPARK-20608
 URL: https://issues.apache.org/jira/browse/SPARK-20608
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit, YARN
Affects Versions: 2.1.0, 2.0.1
Reporter: Yuechen Chen


If a Spark application needs to access remote namenodes, 
${yarn.spark.access.namenodes} only needs to be configured in the spark-submit 
script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
If a Hadoop cluster is configured for HA, there is one active namenode 
and at least one standby namenode.
However, if ${yarn.spark.access.namenodes} includes both active and standby 
namenodes, the Spark application fails because the standby 
namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
I think configuring standby namenodes in 
${yarn.spark.access.namenodes} would cause no harm, and it would let my Spark application 
survive a failover of the Hadoop namenode.

HA Examples:
spark-submit script: 
yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
Spark Application:
dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)
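
The getActiveNameNode(...) helper referenced above is not defined in the issue. The sketch 
below is one hypothetical way to write it against the HDFS FileSystem API, probing each 
namenode with a write-category operation that a standby namenode rejects with 
StandbyException; the probe path and error handling are arbitrary choices for this example.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, not part of Spark or of this issue.
def getActiveNameNode(nameNodes: Seq[String], conf: Configuration): String =
  nameNodes.find { nn =>
    try {
      val fs = FileSystem.get(new URI(nn), conf)
      fs.mkdirs(new Path("/tmp/.active-nn-probe")) // write op; fails on a standby namenode
    } catch {
      case _: Exception => false
    }
  }.getOrElse(sys.error("no active namenode among: " + nameNodes.mkString(",")))

// Usage matching the example above:
// dataframe.write.parquet(getActiveNameNode(Seq("hdfs://namenode01", "hdfs://namenode02"), conf) + hdfsPath)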







[jira] [Commented] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack

2017-03-09 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904621#comment-15904621
 ] 

Yuechen Chen commented on SPARK-19894:
--

https://github.com/apache/spark/pull/17238

> Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack
> -
>
> Key: SPARK-19894
> URL: https://issues.apache.org/jira/browse/SPARK-19894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.1.0
> Environment: Yarn-cluster
>Reporter: Yuechen Chen
>
> In YARN-cluster mode, if the driver has no rack information for two different 
> hosts, both hosts are recognized as "/default-rack", which may cause 
> bugs.
> For example, if the hosts of one executor and of one external data source are unknown 
> to the driver, the two hosts are recognized as belonging to the same rack 
> "/default-rack", and all tasks are then assigned to that executor.
> This bug would be avoided if getRackForHost("unknown host") in YarnScheduler 
> returned None instead of Some("/default-rack").






[jira] [Created] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack

2017-03-09 Thread Yuechen Chen (JIRA)
Yuechen Chen created SPARK-19894:


 Summary: Tasks entirely assigned to one executor on Yarn-cluster 
mode for default-rack
 Key: SPARK-19894
 URL: https://issues.apache.org/jira/browse/SPARK-19894
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 2.1.0
 Environment: Yarn-cluster
Reporter: Yuechen Chen


In YARN-cluster mode, if the driver has no rack information for two different hosts, 
both hosts are recognized as "/default-rack", which may cause 
bugs.
For example, if the hosts of one executor and of one external data source are unknown 
to the driver, the two hosts are recognized as belonging to the same rack 
"/default-rack", and all tasks are then assigned to that executor.
This bug would be avoided if getRackForHost("unknown host") in YarnScheduler 
returned None instead of Some("/default-rack").
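
To make the proposed behaviour concrete, here is a minimal sketch (not the actual Spark 
patch) of rack resolution that reports None for hosts the resolver can only map to the 
default rack, so two unknown hosts are not treated as rack-local to each other. It assumes 
Hadoop's RackResolver and NetworkTopology APIs.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.NetworkTopology
import org.apache.hadoop.yarn.util.RackResolver

// Sketch of the proposed behaviour, not the merged change: resolve the host's rack,
// but treat the catch-all "/default-rack" as unknown and return None.
def rackForHost(conf: Configuration, host: String): Option[String] =
  Option(RackResolver.resolve(conf, host).getNetworkLocation)
    .filter(_ != NetworkTopology.DEFAULT_RACK)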


