[jira] [Commented] (SPARK-23644) SHS with proxy doesn't show applications

2018-03-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401538#comment-16401538
 ] 

Saisai Shao commented on SPARK-23644:
-

Fixed in https://github.com/apache/spark/pull/20794.

> SHS with proxy doesn't show applications
> 
>
> Key: SPARK-23644
> URL: https://issues.apache.org/jira/browse/SPARK-23644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> The History Server supports being accessed through a proxy via the 
> {{spark.ui.proxyBase}} property. Although it works fine when you access the 
> proxy through a URL ending with "/", it doesn't show any applications when 
> the URL doesn't end with "/": e.g., accessing SHS via 
> {{https://yourproxy.whatever:1234/path/to/historyserver/}} works fine, but 
> accessing it via 
> {{https://yourproxy.whatever:1234/path/to/historyserver}} shows no 
> applications.
> The cause is that the REST API call fetching the list of applications uses a 
> relative path, and relative-URL resolution drops the last path segment of a 
> base URL that has no trailing "/". So in the second case, instead of 
> performing a GET to 
> {{https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications}},
>  the request goes to 
> {{https://yourproxy.whatever:1234/path/to/api/v1/applications}}.
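
A minimal sketch of the underlying behaviour, not Spark code: this is standard 
relative-URL resolution (RFC 3986), illustrated here with R's {{xml2}} package 
(chosen purely for demonstration), which reproduces both resolutions.

{code:r}
library(xml2)

# Base URL without a trailing "/": the last segment ("historyserver") is
# treated as a resource name and replaced by the relative reference.
url_absolute("api/v1/applications",
             "https://yourproxy.whatever:1234/path/to/historyserver")
#> "https://yourproxy.whatever:1234/path/to/api/v1/applications"

# Base URL with a trailing "/": the relative reference is appended under it.
url_absolute("api/v1/applications",
             "https://yourproxy.whatever:1234/path/to/historyserver/")
#> "https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications"
{code}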






[jira] [Resolved] (SPARK-23644) SHS with proxy doesn't show applications

2018-03-16 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23644.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> SHS with proxy doesn't show applications
> 
>
> Key: SPARK-23644
> URL: https://issues.apache.org/jira/browse/SPARK-23644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> The History Server supports being accessed through a proxy via the 
> {{spark.ui.proxyBase}} property. Although it works fine when you access the 
> proxy through a URL ending with "/", it doesn't show any applications when 
> the URL doesn't end with "/": e.g., accessing SHS via 
> {{https://yourproxy.whatever:1234/path/to/historyserver/}} works fine, but 
> accessing it via 
> {{https://yourproxy.whatever:1234/path/to/historyserver}} shows no 
> applications.
> The cause is that the REST API call fetching the list of applications uses a 
> relative path, and relative-URL resolution drops the last path segment of a 
> base URL that has no trailing "/". So in the second case, instead of 
> performing a GET to 
> {{https://yourproxy.whatever:1234/path/to/historyserver/api/v1/applications}},
>  the request goes to 
> {{https://yourproxy.whatever:1234/path/to/api/v1/applications}}.






[jira] [Comment Edited] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401525#comment-16401525
 ] 

Felix Cheung edited comment on SPARK-23650 at 3/16/18 7:16 AM:
---

Do you mean this?

 

RRunner: Times: boot = 0.010 s, init = 0.005 s, broadcast = 1.894 s, read-input 
= 0.001 s, compute = 0.062 s, write-output = 0.002 s, total = 1.974 s

 

Under the covers, it is working with the same R process.

I see

SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

each time, i.e. it is creating a new broadcast which then needs to be 
transferred.

IMO there are a few things to look into:
 # it should detect whether the broadcast is the same (not sure if it does that)
 # if it refers to the same broadcast in use.daemon mode, then perhaps it 
doesn't have to transfer it again (but it would need to keep track of the 
stages executed and the broadcasts sent before, etc.)
 # data transfer can be faster (SPARK-18924)

However, as of now RRunner simply picks up the broadcast that is passed to it 
and sends it.

 


was (Author: felixcheung):
Do you mean this?

 

RRunner: Times: boot = 0.010 s, init = 0.005 s, broadcast = 1.894 s, read-input 
= 0.001 s, compute = 0.062 s, write-output = 0.002 s, total = 1.974 s

 

Under the covers, it is working with the same R process.

I see

SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

each time, i.e. it is creating a new broadcast which then needs to be 
transferred.

IMO there are a few things to look into:
 # it should detect whether the broadcast is the same (not sure if it does that)
 # if it refers to the same broadcast in use.daemon mode, then perhaps it 
doesn't have to transfer it again
 # data transfer can be faster (SPARK-18924)

However, as of now RRunner simply picks up the broadcast that is passed to it 
and sends it.

 

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkR_log2.txt, sparkRlag.txt
>
>
> For example, I am reading streams from Kafka and I want to apply a model 
> built in R to those streams. For this, I am using dapply.
> My code is:
> {code:r}
> library(jsonlite)  # for fromJSON()/toJSON(); must be available on the workers
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source",
>                      kafka.bootstrap.servers = "localhost:9092",
>                      topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   # fetch the broadcast model (the original snippet referenced an undefined
>   # name, randomMatBr; randomBr is the variable created above)
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     x[row, "value"] <- toJSON(y)
>   }
>   x
> }, schema)
> {code}
> Every time Kafka streams are fetched, the dapply method creates a new runner 
> thread and ships the variables again, which causes a huge lag (~2 s for 
> shipping the model) every time. I even tried without broadcast variables, but 
> it takes the same time to ship them. Can other techniques be applied to 
> improve its performance?
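
Since the same R worker process is reused across batches (see the RRunner 
discussion above), one possible workaround, sketched below under that 
assumption, is to cache the deserialized model in the worker's global 
environment so the shipping/deserialization cost is paid once per R process 
rather than once per micro-batch. This is not SparkR's documented API: the 
worker-local path and the cache variable name are hypothetical, and the model 
file would need to be distributed to every worker (e.g. with spark-submit 
--files) for this to work.

{code:r}
df1 <- dapply(lines, function(x) {
  # Load the model from a worker-local file the first time this R process
  # runs the UDF, then reuse the cached copy on subsequent batches.
  if (!exists(".cached_iris_model", envir = .GlobalEnv)) {
    assign(".cached_iris_model",
           readRDS("/path/on/worker/iris_model.rds"),  # hypothetical path
           envir = .GlobalEnv)
  }
  i_model <- get(".cached_iris_model", envir = .GlobalEnv)
  for (row in 1:nrow(x)) {
    y <- jsonlite::fromJSON(as.character(x[row, "value"]))
    y$predict <- predict(i_model, y)
    x[row, "value"] <- jsonlite::toJSON(y)
  }
  x
}, schema)
{code}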






[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401525#comment-16401525
 ] 

Felix Cheung commented on SPARK-23650:
--

Do you mean this?

 

RRunner: Times: boot = 0.010 s, init = 0.005 s, broadcast = 1.894 s, read-input 
= 0.001 s, compute = 0.062 s, write-output = 0.002 s, total = 1.974 s

 

Under the covers, it is working with the same R process.

I see

SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

each time, i.e. it is creating a new broadcast which then needs to be 
transferred.

IMO there are a few things to look into:
 # it should detect whether the broadcast is the same (not sure if it does that)
 # if it refers to the same broadcast in use.daemon mode, then perhaps it 
doesn't have to transfer it again
 # data transfer can be faster (SPARK-18924)

However, as of now RRunner simply picks up the broadcast that is passed to it 
and sends it.

 

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkR_log2.txt, sparkRlag.txt
>
>
> For example, I am reading streams from Kafka and I want to apply a model 
> built in R to those streams. For this, I am using dapply.
> My code is:
> {code:r}
> library(jsonlite)  # for fromJSON()/toJSON(); must be available on the workers
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source",
>                      kafka.bootstrap.servers = "localhost:9092",
>                      topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   # fetch the broadcast model (the original snippet referenced an undefined
>   # name, randomMatBr; randomBr is the variable created above)
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     x[row, "value"] <- toJSON(y)
>   }
>   x
> }, schema)
> {code}
> Every time Kafka streams are fetched, the dapply method creates a new runner 
> thread and ships the variables again, which causes a huge lag (~2 s for 
> shipping the model) every time. I even tried without broadcast variables, but 
> it takes the same time to ship them. Can other techniques be applied to 
> improve its performance?






[jira] [Commented] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir to a path outside the table directory

2018-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401490#comment-16401490
 ] 

Apache Spark commented on SPARK-23694:
--

User 'dyf6372' has created a pull request for this issue:
https://github.com/apache/spark/pull/20843

> The staging directory should be under hive.exec.stagingdir if we set 
> hive.exec.stagingdir to a path outside the table directory 
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Priority: Major
>
> When we set hive.exec.stagingdir to a path outside the table directory, for 
> example /tmp/hive-staging, I think the staging directory should be created 
> under /tmp/hive-staging, not beside it under /tmp/ like /tmp/hive-staging_xxx.
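
A minimal sketch of the reported behaviour using SparkR's SQL interface; the 
table names are hypothetical, and the observed/expected paths restate the 
report above rather than verified output.

{code:r}
library(SparkR)
sparkR.session(enableHiveSupport = TRUE)

# Configure a staging directory that is *outside* the table directory.
sql("SET hive.exec.stagingdir=/tmp/hive-staging")

# Any write that goes through a Hive staging directory, e.g.:
sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table")

# Reported:  staging files are created beside the configured directory,
#            e.g. /tmp/hive-staging_xxx
# Expected:  staging files should be created inside the configured directory,
#            i.e. under /tmp/hive-staging/
{code}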






[jira] [Updated] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir to a path outside the table directory

2018-03-16 Thread Yifeng Dong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifeng Dong updated SPARK-23694:

External issue URL: https://github.com/apache/spark/pull/20843

> The staging directory should be under hive.exec.stagingdir if we set 
> hive.exec.stagingdir to a path outside the table directory 
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Priority: Major
>
> When we set hive.exec.stagingdir to a path outside the table directory, for 
> example /tmp/hive-staging, I think the staging directory should be created 
> under /tmp/hive-staging, not beside it under /tmp/ like /tmp/hive-staging_xxx.






[jira] [Assigned] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir to a path outside the table directory

2018-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23694:


Assignee: (was: Apache Spark)

> The staging directory should be under hive.exec.stagingdir if we set 
> hive.exec.stagingdir to a path outside the table directory 
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Priority: Major
>
> When we set hive.exec.stagingdir to a path outside the table directory, for 
> example /tmp/hive-staging, I think the staging directory should be created 
> under /tmp/hive-staging, not beside it under /tmp/ like /tmp/hive-staging_xxx.






[jira] [Assigned] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir to a path outside the table directory

2018-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23694:


Assignee: Apache Spark

> The staging directory should be under hive.exec.stagingdir if we set 
> hive.exec.stagingdir to a path outside the table directory 
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Assignee: Apache Spark
>Priority: Major
>
> When we set hive.exec.stagingdir to a path outside the table directory, for 
> example /tmp/hive-staging, I think the staging directory should be created 
> under /tmp/hive-staging, not beside it under /tmp/ like /tmp/hive-staging_xxx.


