[jira] [Comment Edited] (SPARK-10912) Improve Spark metrics executor.filesystem

2016-11-05 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15638836#comment-15638836
 ] 

Yongjia Wang edited comment on SPARK-10912 at 11/5/16 7:09 AM:
---

s3a and hdfs are different "schemes" in Hadoop's FileSystem.Statistics.
I think it is Spark's responsibility to choose what to report, and currently 
only "hdfs" and "file" are reported.
I have been building Spark with the attached s3a_metrics.patch in order to get 
the s3a metrics reported. I'm not sure whether there is a way to report s3a 
metrics through configuration alone (without changing the Spark source as the 
attached patch does).
Now I need to add GoogleHadoopFileSystem's "gs" metrics as well; please advise 
on the best approach.
Thank you.
[~srowen]
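
For reference, a minimal sketch (not the attached patch itself; the method 
name is illustrative) of how per-scheme gauges can be registered from Hadoop's 
FileSystem statistics, in the style ExecutorSource already uses:

{noformat}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Register read/write byte gauges for one extra scheme, e.g. "s3a" or "gs".
def registerFileSystemStats(registry: MetricRegistry, scheme: String): Unit = {
  // FileSystem.getAllStatistics holds one Statistics entry per loaded scheme.
  def stats = FileSystem.getAllStatistics.asScala.find(_.getScheme == scheme)
  registry.register(MetricRegistry.name("filesystem", scheme, "read_bytes"),
    new Gauge[Long] { override def getValue: Long = stats.map(_.getBytesRead).getOrElse(0L) })
  registry.register(MetricRegistry.name("filesystem", scheme, "write_bytes"),
    new Gauge[Long] { override def getValue: Long = stats.map(_.getBytesWritten).getOrElse(0L) })
}
{noformat}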


was (Author: yongjiaw):
s3a and hdfs are different "schemes" in Hadoop's FileSystem.Statistics.
I think it is Spark's responsibility to choose what to report, and currently 
only "hdfs" and "file" are reported.
I have been building Spark with the attached s3a_metrics.patch in order to get 
the s3a metrics reported. I'm not sure whether there is a way to report s3a 
metrics through configuration alone (without changing the Spark source as the 
attached patch does).
Now I need to add GoogleHadoopFileSystem's "gs" metrics as well; please advise 
on the best approach.
Thank you.

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: "hdfs" 
> and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files. It 
> would also be nice to report shuffle read/write metrics, since they help 
> with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.






[jira] [Reopened] (SPARK-10912) Improve Spark metrics executor.filesystem

2016-11-05 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang reopened SPARK-10912:
--

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: "hdfs" 
> and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files. It 
> would also be nice to report shuffle read/write metrics, since they help 
> with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.






[jira] [Commented] (SPARK-10912) Improve Spark metrics executor.filesystem

2016-11-05 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15638836#comment-15638836
 ] 

Yongjia Wang commented on SPARK-10912:
--

s3a and hdfs are different "schemes" in Hadoop's FileSystem.Statistics.
I think it is Spark's responsibility to choose what to report, and currently 
only "hdfs" and "file" are reported.
I have been building Spark with the attached s3a_metrics.patch in order to get 
the s3a metrics reported. I'm not sure whether there is a way to report s3a 
metrics through configuration alone (without changing the Spark source as the 
attached patch does).
Now I need to add GoogleHadoopFileSystem's "gs" metrics as well; please advise 
on the best approach.
Thank you.

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: "hdfs" 
> and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files. It 
> would also be nice to report shuffle read/write metrics, since they help 
> with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.






[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog

2016-08-16 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422887#comment-15422887
 ] 

Yongjia Wang commented on SPARK-16484:
--

Here is my solution using a Spark UDAF and UDT:
https://github.com/yongjiaw/Spark_HLL
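
For anyone who doesn't want to follow the link, the heart of it is a UDAF that 
merges serialized HLL buffers; a rough sketch along these lines (assumptions: 
stream-lib's HyperLogLogPlus and the Spark 1.5/1.6 UDAF API; the class name is 
illustrative):

{noformat}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Merges a binary column of serialized HLL++ counters into one counter.
class MergeHLL extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("hll", BinaryType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("merged", BinaryType) :: Nil)
  def dataType: DataType = BinaryType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = new HyperLogLogPlus(14, 0).getBytes  // empty counter, precision p=14
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = merged(buffer.getAs[Array[Byte]](0), input.getAs[Array[Byte]](0))
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = merged(buffer1.getAs[Array[Byte]](0), buffer2.getAs[Array[Byte]](0))
  def evaluate(buffer: Row): Any = buffer.getAs[Array[Byte]](0)
  private def merged(a: Array[Byte], b: Array[Byte]): Array[Byte] = {
    val hll = HyperLogLogPlus.Builder.build(a)
    hll.addAll(HyperLogLogPlus.Builder.build(b))
    hll.getBytes
  }
}
{noformat}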

> Incremental Cardinality estimation operations with Hyperloglog
> --
>
> Key: SPARK-16484
> URL: https://issues.apache.org/jira/browse/SPARK-16484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yongjia Wang
>
> Efficient cardinality estimation is very important, and SparkSQL has had 
> approxCountDistinct based on HyperLogLog for quite some time. However, there 
> isn't a way to do incremental estimation. For example, if we want to keep an 
> updated distinct count over the last 90 days, we need to redo the aggregation 
> for the entire window over and over again. The more efficient way involves 
> serializing the counter for smaller time windows (such as hourly) so the 
> counts can be efficiently updated in an incremental fashion for any time 
> window.
> With the support for custom UDAFs, the Binary DataType and the 
> HyperLogLogPlusPlus implementation in the current Spark version, it's easy 
> enough to extend the functionality to include incremental counting, and even 
> other general set operations such as intersection and set difference. The 
> Spark API is already as elegant as it can be, but it still takes quite some 
> effort to do a custom implementation of the aforementioned operations, which 
> are presumably in high demand. I have been searching but failed to find a 
> usable existing solution or any ongoing effort for this. The closest I got 
> is the following, but it does not work with Spark 1.6 due to API changes: 
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into SparkSQL. The 
> only problem I see is that it depends on the serialization of a specific HLL 
> implementation and introduces compatibility issues. But as long as the user 
> is aware of this issue, it should be fine.






[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog

2016-07-11 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371709#comment-15371709
 ] 

Yongjia Wang commented on SPARK-16484:
--

Yes, I agree all the building blocks are there and it is easy enough to put 
together a solution now. I guess what I did is the second approach you 
mentioned: saving the hll++ "buffer" as a byte array column, with a custom 
UDAF to merge them using a SQL expression. 
I was trying to ask whether it is worth extending SparkSQL to include those 
extra UDAFs, making them more accessible for regular Spark users. Also, 
intersection of multiple sets can be tricky; wouldn't it be nice to have it as 
part of SparkSQL's standard set of functions?

> Incremental Cardinality estimation operations with Hyperloglog
> --
>
> Key: SPARK-16484
> URL: https://issues.apache.org/jira/browse/SPARK-16484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yongjia Wang
>
> Efficient cardinality estimation is very important, and SparkSQL has had 
> approxCountDistinct based on HyperLogLog for quite some time. However, there 
> isn't a way to do incremental estimation. For example, if we want to keep an 
> updated distinct count over the last 90 days, we need to redo the aggregation 
> for the entire window over and over again. The more efficient way involves 
> serializing the counter for smaller time windows (such as hourly) so the 
> counts can be efficiently updated in an incremental fashion for any time 
> window.
> With the support for custom UDAFs, the Binary DataType and the 
> HyperLogLogPlusPlus implementation in the current Spark version, it's easy 
> enough to extend the functionality to include incremental counting, and even 
> other general set operations such as intersection and set difference. The 
> Spark API is already as elegant as it can be, but it still takes quite some 
> effort to do a custom implementation of the aforementioned operations, which 
> are presumably in high demand. I have been searching but failed to find a 
> usable existing solution or any ongoing effort for this. The closest I got 
> is the following, but it does not work with Spark 1.6 due to API changes: 
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into SparkSQL. The 
> only problem I see is that it depends on the serialization of a specific HLL 
> implementation and introduces compatibility issues. But as long as the user 
> is aware of this issue, it should be fine.






[jira] [Created] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog

2016-07-11 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-16484:


 Summary: Incremental Cardinality estimation operations with 
Hyperloglog
 Key: SPARK-16484
 URL: https://issues.apache.org/jira/browse/SPARK-16484
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yongjia Wang


Efficient cardinality estimation is very important, and SparkSQL has had 
approxCountDistinct based on HyperLogLog for quite some time. However, there 
isn't a way to do incremental estimation. For example, if we want to keep an 
updated distinct count over the last 90 days, we need to redo the aggregation 
for the entire window over and over again. The more efficient way involves 
serializing the counter for smaller time windows (such as hourly) so the 
counts can be efficiently updated in an incremental fashion for any time 
window.
With the support for custom UDAFs, the Binary DataType and the 
HyperLogLogPlusPlus implementation in the current Spark version, it's easy 
enough to extend the functionality to include incremental counting, and even 
other general set operations such as intersection and set difference. The 
Spark API is already as elegant as it can be, but it still takes quite some 
effort to do a custom implementation of the aforementioned operations, which 
are presumably in high demand. I have been searching but failed to find a 
usable existing solution or any ongoing effort for this. The closest I got is 
the following, but it does not work with Spark 1.6 due to API changes: 
https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala

I wonder whether it is worth integrating such operations into SparkSQL. The 
only problem I see is that it depends on the serialization of a specific HLL 
implementation and introduces compatibility issues. But as long as the user is 
aware of this issue, it should be fine.
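
A hypothetical usage sketch of the incremental pattern (the table, column and 
function names are made up; assumes a merge UDAF like the MergeHLL sketch 
shown earlier): keep one serialized counter per hour, then merge whatever 
window is needed instead of re-aggregating the raw events:

{noformat}
sqlContext.udf.register("merge_hll", new MergeHLL)

// hourly_hll(hour timestamp, hll binary): one pre-aggregated counter per hour.
// Merging 90 days of hourly counters avoids re-scanning the raw events.
val merged90d = sqlContext.sql(
  """SELECT merge_hll(hll) AS merged
     FROM hourly_hll
     WHERE hour >= date_sub(current_date(), 90)""")
{noformat}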






[jira] [Commented] (SPARK-11824) WebUI does not render descriptions with 'bad' HTML, throws console error

2015-11-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032602#comment-15032602
 ] 

Yongjia Wang commented on SPARK-11824:
--

Looks like this is the right escaper:
StringEscapeUtils.escapeXml
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#escapeXml(java.lang.String)
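
For illustration, a quick sketch of what that escaper does to the query from 
the report (hypothetical snippet, commons-lang 2.6):

{noformat}
import org.apache.commons.lang.StringEscapeUtils

val desc = "select count(1) from table1 where date >= '2015-11-01'"
// escapeXml replaces the five basic XML entities, so the bare '>' (and the
// quotes) no longer break the parse in UIUtils.makeDescription:
val safe = StringEscapeUtils.escapeXml(desc)
// safe: select count(1) from table1 where date &gt;= &apos;2015-11-01&apos;
{noformat}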

> WebUI does not render descriptions with 'bad' HTML, throws console error
> 
>
> Key: SPARK-11824
> URL: https://issues.apache.org/jira/browse/SPARK-11824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.5.2
> Environment: RHEL 6, Java 1.7, Mesos 0.25.0
>Reporter: Andy Robb
>Priority: Minor
>  Labels: starter
>
> When a query containing less-than or greater-than symbols is run through the 
> SparkSQL CLI, viewing the Web UI throws the following console warning. (The 
> table and column names have been changed from the actual query.)
> This occurs across CLI invocations. The warning is thrown each time the UI is 
> refreshed, both during query execution and after the query is complete. 
> {noformat}
> 15/11/18 10:45:31 WARN ui.UIUtils: Invalid job description: select count(1) 
> from table1 where date >= '2015-11-01' and date <= '2015-11-15' 
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 114; The content 
> of elements must consist of well-formed character data or markup.
>   at 
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
>   at 
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.startOfMarkup(XMLDocumentFragmentScannerImpl.java:2636)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2734)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
>   at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:40)
>   at scala.xml.XML$.loadXML(XML.scala:57)
>   at scala.xml.factory.XMLLoader$class.loadString(XMLLoader.scala:59)
>   at scala.xml.XML$.loadString(XML.scala:57)
>   at org.apache.spark.ui.UIUtils$.makeDescription(UIUtils.scala:417)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5$$anonfun$apply$1.apply(StageTable.scala:118)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5$$anonfun$apply$1.apply(StageTable.scala:116)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5.apply(StageTable.scala:116)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5.apply(StageTable.scala:115)
>   at scala.Option.flatMap(Option.scala:170)
>   at 
> org.apache.spark.ui.jobs.StageTableBase.makeDescription(StageTable.scala:115)
>   at 
> org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:177)
>   at 
> org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:195)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:60)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:60)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:69)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:69)
>   at 

[jira] [Commented] (SPARK-11824) WebUI does not render descriptions with 'bad' HTML, throws console error

2015-11-29 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031199#comment-15031199
 ] 

Yongjia Wang commented on SPARK-11824:
--

Yes, this is annoying, and not just for the CLI. 
But you can silence the message like this:
LogManager.getLogger("org.apache.spark.ui").setLevel(Level.ERROR)
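
Or, equivalently, in a custom log4j.properties (assuming the default log4j 1.x 
setup Spark ships with):

{noformat}
# silence the "Invalid job description" warnings from the UI code
log4j.logger.org.apache.spark.ui=ERROR
{noformat}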

> WebUI does not render descriptions with 'bad' HTML, throws console error
> 
>
> Key: SPARK-11824
> URL: https://issues.apache.org/jira/browse/SPARK-11824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.5.2
> Environment: RHEL 6, Java 1.7, Mesos 0.25.0
>Reporter: Andy Robb
>Priority: Minor
>
> When a query containing less-than or greater-than symbols is run through the 
> SparkSQL CLI, viewing the Web UI throws the following console warning. (The 
> table and column names have been changed from the actual query.)
> This occurs across CLI invocations. The warning is thrown each time the UI is 
> refreshed, both during query execution and after the query is complete. 
> {noformat}
> 15/11/18 10:45:31 WARN ui.UIUtils: Invalid job description: select count(1) 
> from table1 where date >= '2015-11-01' and date <= '2015-11-15' 
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 114; The content 
> of elements must consist of well-formed character data or markup.
>   at 
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
>   at 
> com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.startOfMarkup(XMLDocumentFragmentScannerImpl.java:2636)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2734)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
>   at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:40)
>   at scala.xml.XML$.loadXML(XML.scala:57)
>   at scala.xml.factory.XMLLoader$class.loadString(XMLLoader.scala:59)
>   at scala.xml.XML$.loadString(XML.scala:57)
>   at org.apache.spark.ui.UIUtils$.makeDescription(UIUtils.scala:417)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5$$anonfun$apply$1.apply(StageTable.scala:118)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5$$anonfun$apply$1.apply(StageTable.scala:116)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5.apply(StageTable.scala:116)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$5.apply(StageTable.scala:115)
>   at scala.Option.flatMap(Option.scala:170)
>   at 
> org.apache.spark.ui.jobs.StageTableBase.makeDescription(StageTable.scala:115)
>   at 
> org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:177)
>   at 
> org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:195)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:60)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:60)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:69)
>   at 
> org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:69)
>   at scala.collection.immutable.Stream.map(Stream.scala:376)
>   at 
> 

[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982662#comment-14982662
 ] 

Yongjia Wang commented on SPARK-11413:
--

Yea, the fix is to update the joda.version number from 2.5 to 2.8.1, or to the 
newest 2.9, in Spark's pom.xml.
This PR would just be a 1-line change in pom.xml.
It's only an issue when compiling with Java 1.8u60, due to the changes in time 
zone formatting, and the s3 request relies on the time format in its header in 
some way.
The links in my comment should explain it better. It's not a new dependency, 
but it can be considered a Java 8 + s3 combo bug.
So this only affects the combination of Java 8u60 and s3a, which should become 
quite common and hit more people.
If I understand correctly, building with Java 7 or Java 8 before u60 should be 
fine, even when running with a JRE after Java 8u60.
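
For anyone rebuilding in the meantime, the override is a one-flag change on 
the command line (sketch; the other flags are just the usual build options):

{noformat}
# build Spark against a newer joda-time without editing pom.xml
build/mvn -DskipTests -Djoda.version=2.9 clean package
{noformat}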



> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Comment Edited] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982755#comment-14982755
 ] 

Yongjia Wang edited comment on SPARK-11413 at 10/30/15 4:02 PM:


My last statement was wrong.
It's a problem when running on JRE 8u60 or above with joda-time 2.8.0 or below 
loaded at run time.
The following error was reproduced when I used the official Spark release and 
ran with JRE 1.8.0_65 (Oracle Corporation).
Note that the aws-java-sdk-s3 library already depends on joda-time-2.8.1; it 
is the Spark assembly's inclusion of joda-time-2.5, which the JVM loaded, that 
caused this problem.

com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires 
a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error 
Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 
8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at 
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at 
com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)


was (Author: yongjiaw):
My last statement was wrong.
It's a problem when running on JRE 8u60 or above with joda-time 2.8.0 or below 
loaded at run time.
The following error was reproduced when I used the official Spark release with 
JRE 1.8.0_65 (Oracle Corporation).
Note that the aws-java-sdk-s3 library already depends on joda-time-2.8.1; it 
is the Spark assembly's inclusion of joda-time-2.5, which the JVM loaded, that 
caused this problem.

com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires 
a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error 
Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 
8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at 
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at 
com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)

> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982802#comment-14982802
 ] 

Yongjia Wang commented on SPARK-11413:
--

I see.
I don't know; is it safe to assume joda-time is backward compatible as far as 
Spark is concerned?

> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982755#comment-14982755
 ] 

Yongjia Wang commented on SPARK-11413:
--

My last statement was wrong.
It's a problem when running on JRE 8u60 or above with joda-time 2.8.0 or below 
loaded at run time by the s3 client.
The following error was reproduced when I used the official Spark release with 
JRE 1.8.0_65 (Oracle Corporation).
Note that the aws-java-sdk-s3 library already depends on joda-time-2.8.1; it 
is the Spark assembly's inclusion of joda-time-2.5, which the JVM loaded, that 
caused this problem.

com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires 
a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error 
Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 
8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at 
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at 
com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)

> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Comment Edited] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982755#comment-14982755
 ] 

Yongjia Wang edited comment on SPARK-11413 at 10/30/15 4:02 PM:


My last statement was wrong.
It's a problem when running on JRE 8u60 or above with joda-time 2.8.0 or below 
loaded at run time.
The following error was reproduced when I used the official Spark release with 
JRE 1.8.0_65 (Oracle Corporation).
Note that the aws-java-sdk-s3 library already depends on joda-time-2.8.1; it 
is the Spark assembly's inclusion of joda-time-2.5, which the JVM loaded, that 
caused this problem.

com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires 
a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error 
Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 
8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at 
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at 
com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)


was (Author: yongjiaw):
My last statement was wrong.
It's a problem when running on JRE 8u60 or above with joda-time 2.8.0 or below 
loaded at run time by the s3 client.
The following error was reproduced when I used the official Spark release with 
JRE 1.8.0_65 (Oracle Corporation).
Note that the aws-java-sdk-s3 library already depends on joda-time-2.8.1; it 
is the Spark assembly's inclusion of joda-time-2.5, which the JVM loaded, that 
caused this problem.

com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires 
a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error 
Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 
8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at 
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at 
com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)

> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982874#comment-14982874
 ] 

Yongjia Wang commented on SPARK-11413:
--

I can follow up with a PR first. The latest joda-time release is 2.9. Do you 
suggest just using 2.8.1 as the fix, with minimal changes?


> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982909#comment-14982909
 ] 

Yongjia Wang commented on SPARK-11413:
--

Joda-time has no transitive dependencies of its own. So as long as it passes 
the tests, it should be good?

> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes s3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Created] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-29 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-11413:


 Summary: Java 8 build has problem with joda-time and s3 request, 
should bump joda-time version
 Key: SPARK-11413
 URL: https://issues.apache.org/jira/browse/SPARK-11413
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Yongjia Wang
Priority: Minor


Joda-time has problems with formatting time zones starting with Java 1.8u60, 
and this causes s3 requests to fail. It is said to have been fixed in 
joda-time 2.8.1.
Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
-Djoda.version=2.8.1 or above should be set.
I was hit by this problem, and -Djoda.version=2.9 worked.
I don't see any reason not to bump up the joda-time version in pom.xml.
Should I create a pull request for this? It is trivial.

https://github.com/aws/aws-sdk-java/issues/484 
https://github.com/aws/aws-sdk-java/issues/444
http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header






[jira] [Updated] (SPARK-11354) Expose custom log4j to executor page in Spark standalone cluster

2015-10-27 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11354:
-
Attachment: custom log4j on executor page.png

> Expose custom log4j to executor page in Spark standalone cluster 
> -
>
> Key: SPARK-11354
> URL: https://issues.apache.org/jira/browse/SPARK-11354
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Yongjia Wang
> Attachments: custom log4j on executor page.png
>
>
> Spark uses log4j, which is very flexible. However, on the executor page in a 
> standalone cluster, only stdout and stderr are shown in the UI. In the 
> default log4j profile, all messages are forwarded to System.err, which is in 
> turn written to the stderr file in the executor directory. Similarly, stdout 
> is written to the stdout file in the executor directory. 
> It would be very useful to show all the file appenders configured in a 
> custom log4j profile. Right now, these file appenders write to the executor 
> directory, but they are not exposed in the UI.






[jira] [Created] (SPARK-11354) Expose custom log4j to executor page in Spark standalone cluster

2015-10-27 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-11354:


 Summary: Expose custom log4j to executor page in Spark standalone 
cluster 
 Key: SPARK-11354
 URL: https://issues.apache.org/jira/browse/SPARK-11354
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Yongjia Wang


Spark uses log4j, which is very flexible. However, on the executor page in a 
standalone cluster, only stdout and stderr are shown in the UI. In the default 
log4j profile, all messages are forwarded to System.err, which is in turn 
written to the stderr file in the executor directory. Similarly, stdout is 
written to the stdout file in the executor directory. 
It would be very useful to show all the file appenders configured in a custom 
log4j profile. Right now, these file appenders write to the executor 
directory, but they are not exposed in the UI.
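
For example, a custom executor log4j profile along these lines (the appender 
and file names are illustrative) writes app.log into the executor work 
directory, yet the UI never links to it:

{noformat}
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=app.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{noformat}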







[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
-
Description: 
Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs would be better run concurrently to take maximal 
advantage of the available resources. 
I went through a few hacks (hack #1 is sketched below):
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up the streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. This can further affect resuming from a checkpoint, since all 
batches are marked complete right away even though the actual threaded jobs 
may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing this messes up closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the 
current batch concurrently, and to mark the batch complete when all the jobs 
complete.
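
For concreteness, hack #1 looks roughly like this (process1/process2 are 
hypothetical per-partition actions); the foreachRDD body returns immediately, 
which is exactly why the batch is marked complete while the real work is 
still running:

{noformat}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Each stream's action is fired off on a thread-pool thread, so the
// streaming scheduler sees the batch as done right away.
stream1.foreachRDD { rdd => Future { rdd.foreachPartition(process1) } }
stream2.foreachRDD { rdd => Future { rdd.foreachPartition(process2) } }
{noformat}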

  was:
Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs would be better run concurrently to take maximal 
advantage of the available resources. 
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up the streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. This can further affect resuming from a checkpoint, since all 
batches are marked complete right away even though the actual threaded jobs 
may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing this messes up closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the 
current batch concurrently, and to mark the batch complete when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (such 
> as from different Kafka topics), which results in multiple independent jobs 
> for each batch. These jobs would be better run concurrently to take maximal 
> advantage of the available resources. 
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this messes up the streaming statistics, since the batch 
> finishes immediately even though the jobs it launched are still running in 
> another thread. This can further affect resuming from a checkpoint, since all 
> batches are marked complete right away even though the actual threaded jobs 
> may fail, and the checkpoint only resumes the last batch.
> 2. It's possible, using just foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing this messes up closure 
> serialization and makes checkpointing unusable.
> Therefore, I would propose making the default behavior to run all jobs of the 
> current batch concurrently, and to mark the batch complete when all the jobs 
> complete.






[jira] [Created] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-11175:


 Summary: Concurrent execution of JobSet within a batch in Spark 
streaming
 Key: SPARK-11175
 URL: https://issues.apache.org/jira/browse/SPARK-11175
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Yongjia Wang


Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs would be better run concurrently to take maximal 
advantage of the available resources. 
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up the streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. This can further affect resuming from a checkpoint, since all 
batches are marked complete right away even though the actual threaded jobs 
may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing this messes up closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the 
current batch concurrently, and to mark the batch complete when all the jobs 
complete.






[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11152:
-
Priority: Major  (was: Minor)

> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint 
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Yongjia Wang
>
> Say a streaming job is resumed from a checkpoint at batch time x, and the 
> current time when we resume it is x+10. In this scenario, since Spark 
> schedules the missing batches from x+1 to x+10 without any metadata, the 
> behavior is to pack all the backlogged inputs into batch x+1, then assign 
> any new inputs to x+2 through x+10 immediately, without waiting. This 
> results in tiny batches that capture inputs only during the back-to-back 
> scheduling intervals. This behavior is very reasonable. However, the 
> streaming UI does not correctly show the input sizes for these makeup 
> batches: they are all 0 from batch x to x+10. Fixing this would be very 
> helpful. This happens when I use Kafka direct streaming; I assume it would 
> happen for all other streaming sources as well.






[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
-
Description: 
Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs would be better run concurrently to take maximal 
advantage of the available resources. The current behavior is that these jobs 
end up in an invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up the streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. This can further affect resuming from a checkpoint, since all 
batches are marked complete right away even though the actual threaded jobs 
may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing this messes up closure 
serialization and makes checkpointing unusable.
3. Instead of running multiple DStreams in one streaming context, run them in 
separate streaming contexts (separate Spark applications). Putting aside the 
extra deployment overhead, when working with a Spark standalone cluster, which 
only has a FIFO scheduler across applications, the resources have to be set in 
advance and won't automatically adjust when the cluster is resized.

Therefore, I think there is a good use case for making the default behavior to 
run all jobs of the current batch concurrently, and to mark the batch complete 
when all the jobs complete.

  was:
Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs would be better run concurrently to take maximal 
advantage of the available resources. The current behavior is that these jobs 
end up in an invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up the streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. This can further affect resuming from a checkpoint, since all 
batches are marked complete right away even though the actual threaded jobs 
may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing this messes up closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the 
current batch concurrently, and to mark the batch complete when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (such 
> as from different Kafka topics), which results in multiple independent jobs 
> for each batch. These jobs would be better run concurrently to take maximal 
> advantage of the available resources. The current behavior is that these jobs 
> end up in an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this messes up the streaming statistics, since the batch 
> finishes immediately even though the jobs it launched are still running in 
> another thread. This can further affect resuming from a checkpoint, since all 
> batches are marked complete right away even though the actual threaded jobs 
> may fail, and the checkpoint only resumes the last batch.
> 2. It's possible, using just foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing this messes up closure 
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them in 
> separate streaming contexts (separate Spark applications). Putting aside the 
> extra deployment overhead, when working with a Spark standalone cluster, 
> which only has a FIFO scheduler across applications, the resources have to be 
> set in advance and won't automatically adjust when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior 
> to run all jobs of the current batch concurrently, and mark batch completion 
> 

[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
-
Description: 
Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs are better run concurrently to take maximum 
advantage of the available resources. The current behavior is that these jobs 
end up in an invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. It can further break resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet until all threads join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently and marking the batch complete when all the jobs 
complete.

  was:
Spark StreamingContext can register multiple independent input DStreams (such 
as from different Kafka topics), which results in multiple independent jobs 
for each batch. These jobs are better run concurrently to take maximum 
advantage of the available resources. 
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this messes up streaming statistics, since the batch 
finishes immediately even though the jobs it launched are still running in 
another thread. It can further break resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet until all threads join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently and marking the batch complete when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams 
> (such as from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs are better run concurrently to take maximum 
> advantage of the available resources. The current behavior is that these 
> jobs end up in an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this messes up streaming statistics, since the batch 
> finishes immediately even though the jobs it launched are still running in 
> another thread. It can further break resuming from a checkpoint, since all 
> batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block 
> the JobSet until all threads join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> Therefore, I would propose making the default behavior run all jobs of the 
> current batch concurrently and marking the batch complete when all the jobs 
> complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962774#comment-14962774
 ] 

Yongjia Wang commented on SPARK-11175:
--

Nice, I should have found this. Thank you.

> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams 
> (such as from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs are better run concurrently to take maximum 
> advantage of the available resources. The current behavior is that these 
> jobs end up in an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this messes up streaming statistics, since the batch 
> finishes immediately even though the jobs it launched are still running in 
> another thread. It can further break resuming from a checkpoint, since all 
> batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block 
> the JobSet until all threads join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them 
> in separate streaming contexts (separate Spark applications). Putting aside 
> the extra deployment overhead, on a Spark standalone cluster, which only 
> offers FIFO scheduling across applications, resources have to be allocated 
> in advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior 
> run all jobs of the current batch concurrently and marking the batch 
> complete when all the jobs complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang closed SPARK-11175.

Resolution: Not A Problem

> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams 
> (such as from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs are better run concurrently to take maximum 
> advantage of the available resources. The current behavior is that these 
> jobs end up in an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this messes up streaming statistics, since the batch 
> finishes immediately even though the jobs it launched are still running in 
> another thread. It can further break resuming from a checkpoint, since all 
> batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block 
> the JobSet until all threads join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them 
> in separate streaming contexts (separate Spark applications). Putting aside 
> the extra deployment overhead, on a Spark standalone cluster, which only 
> offers FIFO scheduling across applications, resources have to be allocated 
> in advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior 
> run all jobs of the current batch concurrently and marking the batch 
> complete when all the jobs complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-16 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11152:
-
Description: When a streaming job is resumed from a checkpoint at batch time 
x, suppose the current time when we resume it is x+10. In this scenario, 
since Spark schedules the missing batches from x+1 to x+10 without any 
metadata, the behavior is to pack all the backlogged input into batch x+1 and 
then assign any new input to x+2 through x+10 immediately, without waiting. 
This results in tiny batches that capture input only during the back-to-back 
scheduling intervals. This behavior is very reasonable. However, the 
streaming UI does not correctly show the input sizes for all these makeup 
batches - they are all 0 from batch x to x+10. Fixing this would be very 
helpful. This happens when I use Kafka direct streaming; I assume it would 
happen with all other streaming sources as well.  (was: When a streaming job 
starts from a checkpoint at batch time x, suppose the current time when we 
resume it is x+10. In this scenario, since Spark schedules the missing 
batches from x+1 to x+10 without any metadata, the behavior is to pack all 
the backlogged input into batch x+1 and then assign any new input to x+2 
through x+10 immediately, without waiting. This results in tiny batches that 
capture input only during the back-to-back scheduling intervals. This 
behavior is very reasonable. However, the streaming UI does not correctly 
show the input sizes for all these makeup batches - they are all 0 from batch 
x to x+10. Fixing this would be very helpful. This happens when I use Kafka 
direct streaming; I assume it would happen with all other streaming sources 
as well.)

> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint 
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Yongjia Wang
>Priority: Minor
>
> When a streaming job is resumed from a checkpoint at batch time x, suppose 
> the current time when we resume it is x+10. In this scenario, since Spark 
> schedules the missing batches from x+1 to x+10 without any metadata, the 
> behavior is to pack all the backlogged input into batch x+1 and then assign 
> any new input to x+2 through x+10 immediately, without waiting. This results 
> in tiny batches that capture input only during the back-to-back scheduling 
> intervals. This behavior is very reasonable. However, the streaming UI does 
> not correctly show the input sizes for all these makeup batches - they are 
> all 0 from batch x to x+10. Fixing this would be very helpful. This happens 
> when I use Kafka direct streaming; I assume it would happen with all other 
> streaming sources as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-16 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-11152:


 Summary: Streaming UI: Input sizes are 0 for makeup batches 
started from a checkpoint 
 Key: SPARK-11152
 URL: https://issues.apache.org/jira/browse/SPARK-11152
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Reporter: Yongjia Wang
Priority: Minor


When a streaming job starts from a checkpoint at batch time x, suppose the 
current time when we resume it is x+10. In this scenario, since Spark 
schedules the missing batches from x+1 to x+10 without any metadata, the 
behavior is to pack all the backlogged input into batch x+1 and then assign 
any new input to x+2 through x+10 immediately, without waiting. This results 
in tiny batches that capture input only during the back-to-back scheduling 
intervals. This behavior is very reasonable. However, the streaming UI does 
not correctly show the input sizes for all these makeup batches - they are 
all 0 from batch x to x+10. Fixing this would be very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-16 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11152:
-
Description: When a streaming job starts from a checkpoint at batch time x, 
suppose the current time when we resume it is x+10. In this scenario, since 
Spark schedules the missing batches from x+1 to x+10 without any metadata, 
the behavior is to pack all the backlogged input into batch x+1 and then 
assign any new input to x+2 through x+10 immediately, without waiting. This 
results in tiny batches that capture input only during the back-to-back 
scheduling intervals. This behavior is very reasonable. However, the 
streaming UI does not correctly show the input sizes for all these makeup 
batches - they are all 0 from batch x to x+10. Fixing this would be very 
helpful. This happens when I use Kafka direct streaming; I assume it would 
happen with all other streaming sources as well.  (was: When a streaming job 
starts from a checkpoint at batch time x, suppose the current time when we 
resume it is x+10. In this scenario, since Spark schedules the missing 
batches from x+1 to x+10 without any metadata, the behavior is to pack all 
the backlogged input into batch x+1 and then assign any new input to x+2 
through x+10 immediately, without waiting. This results in tiny batches that 
capture input only during the back-to-back scheduling intervals. This 
behavior is very reasonable. However, the streaming UI does not correctly 
show the input sizes for all these makeup batches - they are all 0 from batch 
x to x+10. Fixing this would be very helpful.)

> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint 
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Yongjia Wang
>Priority: Minor
>
> When a streaming job starts from a checkpoint at batch time x, suppose the 
> current time when we resume it is x+10. In this scenario, since Spark 
> schedules the missing batches from x+1 to x+10 without any metadata, the 
> behavior is to pack all the backlogged input into batch x+1 and then assign 
> any new input to x+2 through x+10 immediately, without waiting. This results 
> in tiny batches that capture input only during the back-to-back scheduling 
> intervals. This behavior is very reasonable. However, the streaming UI does 
> not correctly show the input sizes for all these makeup batches - they are 
> all 0 from batch x to x+10. Fixing this would be very helpful. This happens 
> when I use Kafka direct streaming; I assume it would happen with all other 
> streaming sources as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem

2015-10-05 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-10912:
-
Attachment: s3a_metrics.patch

Adding s3a is fairly straightforward. I guess the reason it is not included 
is that s3a support (via hadoop-aws.jar) is not part of the default Hadoop 
distribution due to licensing issues. I created a patch to enable s3a 
metrics, both on the executors and on the driver. Reporting shuffle 
statistics requires more thought, although all the numbers are already 
collected in TaskMetrics.scala (input, output, shuffle, local, remote, spill, 
records, bytes, etc.). I think it would make sense to report the aggregated 
metrics per executor across all tasks, so it is easy to get an overall sense 
of disk I/O and network traffic. A sketch of the kind of change involved 
follows.
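
An illustrative, self-contained sketch of that kind of per-scheme gauge 
registration - not the attached patch itself. The FsMetrics class name is 
made up; the Hadoop and Codahale calls are the standard ones:
{code}
import scala.collection.JavaConverters._

import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem

class FsMetrics(registry: MetricRegistry) {
  // FileSystem keeps one Statistics object per scheme ("hdfs", "file", ...).
  private def stats(scheme: String): Option[FileSystem.Statistics] =
    FileSystem.getAllStatistics.asScala.find(_.getScheme == scheme)

  private def gauge[T](scheme: String, name: String,
                       f: FileSystem.Statistics => T, default: T): Unit =
    registry.register(MetricRegistry.name("filesystem", scheme, name),
      new Gauge[T] {
        override def getValue: T = stats(scheme).map(f).getOrElse(default)
      })

  // "s3a" only shows up in getAllStatistics once hadoop-aws is on the
  // classpath and an s3a path has actually been touched.
  for (scheme <- Seq("hdfs", "file", "s3a")) {
    gauge(scheme, "read_bytes", _.getBytesRead, 0L)
    gauge(scheme, "write_bytes", _.getBytesWritten, 0L)
    gauge(scheme, "read_ops", _.getReadOps, 0)
    gauge(scheme, "write_ops", _.getWriteOps, 0)
  }
}
{code}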

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> In org.apache.spark.executor.ExecutorSource there are 2 filesystem metrics: 
> "hdfs" and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files; it 
> would be nice to also report shuffle read/write metrics, since they can help 
> with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem

2015-10-02 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-10912:
-
Description: 
In org.apache.spark.executor.ExecutorSource there are 2 filesystem metrics: 
"hdfs" and "file". I started using s3 as the persistent storage with a Spark 
standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
The 'file' metric appears to cover only the driver reading local files; it 
would be nice to also report shuffle read/write metrics, since they can help 
with optimization.
I think these 2 things (s3 and shuffle) are very useful and cover all the 
missing information about Spark IO, especially for an s3 setup.

  was:
In org.apache.spark.executor.ExecutorSource there are 2 filesystem metrics: 
"hdfs" and "file". I started using s3 as the persistent storage with a Spark 
standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
The 'file' metric appears to cover only the driver reading local files; it 
would be nice to also report shuffle read/write metrics, since they can help 
show, for example, whether a Spark job becomes IO bound.
I think these 2 things (s3 and shuffle) are very useful and cover all the 
missing information about Spark IO, especially for an s3 setup.


> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>
> In org.apache.spark.executor.ExecutorSource there are 2 filesystem metrics: 
> "hdfs" and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files; it 
> would be nice to also report shuffle read/write metrics, since they can help 
> with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10912) Improve Spark metrics executor.filesystem

2015-10-02 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-10912:


 Summary: Improve Spark metrics executor.filesystem
 Key: SPARK-10912
 URL: https://issues.apache.org/jira/browse/SPARK-10912
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Yongjia Wang


In org.apache.spark.executor.ExecutorSource there are 2 filesystem metrics: 
"hdfs" and "file". I started using s3 as the persistent storage with a Spark 
standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
The 'file' metric appears to cover only the driver reading local files; it 
would be nice to also report shuffle read/write metrics, since they can help 
show, for example, whether a Spark job becomes IO bound.
I think these 2 things (s3 and shuffle) are very useful and cover all the 
missing information about Spark IO, especially for an s3 setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-10-01 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940702#comment-14940702
 ] 

Yongjia Wang commented on SPARK-5874:
-

The ability to force save/load of all pipeline components is very important. 
The design doc says this would be done in 1.4 for the new 
Transformer/Estimator framework under the .ml package. We are at 1.5.0 now 
and nothing has happened on that path. I wonder whether there were major 
conceptual changes or just a workload/resource issue.


> How to improve the current ML pipeline API?
> ---
>
> Key: SPARK-5874
> URL: https://issues.apache.org/jira/browse/SPARK-5874
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> I created this JIRA to collect feedback about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.
> Design doc (WIP): 
> https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2015-09-22 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902883#comment-14902883
 ] 

Yongjia Wang edited comment on SPARK-5152 at 9/22/15 6:00 PM:
--

I voted for this. 
This would enable configuring the metrics or log4j properties of all the 
workers from one place when submitting the job. Without it, you have to set 
them up on each of the workers. If hdfs:// can be supported, I assume s3n:// 
and s3a:// would be supported as well, since they go through the same 
interface (see the sketch below).
Alternatively, it would probably be even better if there were a way, 
specified through "conf" spark properties on the spark-submit command line, 
to upload custom files to the Spark executor's working directory before the 
executor process starts. The "spark.files" option uploads files lazily, when 
the first task starts, which is too late for configuration.
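
To illustrate why the schemes would come along together - a hypothetical 
sketch, not Spark's actual MetricsConfig code: if the properties file were 
resolved through Hadoop's FileSystem API, the URI scheme would pick the 
filesystem, so hdfs://, s3n:// and s3a:// would all work the same way:
{code}
import java.util.Properties

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object RemoteProperties {
  // Load a properties file from any Hadoop-supported filesystem; the
  // filesystem implementation is chosen from the URI scheme.
  def load(uri: String, conf: Configuration = new Configuration()): Properties = {
    val path = new Path(uri)
    val in = path.getFileSystem(conf).open(path)
    val props = new Properties()
    try props.load(in) finally in.close()
    props
  }
}
{code}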


was (Author: yongjiaw):
I voted for this. 
It enables configuring the metrics or log4j properties of all the workers 
from the driver. Without it, you have to set them up on each of the workers.
Alternatively, it would probably be even better if there were a way, 
specified through "conf" spark properties on the spark-submit command line, 
to upload custom files to the Spark executor's working directory before the 
executor process starts. The "spark.files" option uploads files lazily, when 
the first task starts, which is too late for configuration.

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at java.io.FileInputStream.<init>(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> location on HDFS would be an improvement, right?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2015-09-22 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902883#comment-14902883
 ] 

Yongjia Wang commented on SPARK-5152:
-

I voted for this. 
It enables configuring the metrics or log4j properties of all the workers 
from the driver. Without it, you have to set them up on each of the workers.
Alternatively, it would probably be even better if there were a way, 
specified through "conf" spark properties on the spark-submit command line, 
to upload custom files to the Spark executor's working directory before the 
executor process starts. The "spark.files" option uploads files lazily, when 
the first task starts, which is too late for configuration.

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at java.io.FileInputStream.<init>(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> location on HDFS would be an improvement, right?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-3512:
---

 Summary: yarn-client through socks proxy
 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang


I believe it is a common scenario that the YARN cluster runs behind a 
firewall while people want to run the Spark driver locally for the best 
interactive experience: for example, using an ipython notebook, or fancier 
IDEs. A potential solution is to set up a SOCKS proxy on the local machine 
outside the firewall through ssh tunneling into some workstation inside the 
firewall. Then the client only needs to talk through this proxy. A sketch of 
the Hadoop client side of such a setup follows.
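
For what it's worth, Hadoop ships a SOCKS-aware socket factory, so the 
client-side wiring can be sketched as below. Whether this covers everything 
the Spark yarn-client needs is exactly the open question here; treat it as a 
hypothetical starting point (the SocksClientConf name and the port are made 
up):
{code}
import org.apache.hadoop.conf.Configuration

object SocksClientConf {
  // Hadoop configuration pointing RPC sockets at the SOCKS proxy opened
  // with, e.g., "ssh -D 1080 user@workstation-inside-firewall".
  def apply(): Configuration = {
    val conf = new Configuration()
    conf.set("hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory")
    conf.set("hadoop.socks.server", "localhost:1080")
    conf
  }
}
{code}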



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe it is a common scenario that the YARN cluster runs 
behind a firewall while people want to run the Spark driver locally for the 
best interactive experience: for example, using an ipython notebook, or 
fancier IDEs. A potential solution is to set up a SOCKS proxy on the local 
machine outside the firewall through ssh tunneling into some workstation 
inside the firewall. Then the Spark yarn-client only needs to talk through 
this proxy.  (was: I believe it is a common scenario that the YARN cluster 
runs behind a firewall while people want to run the Spark driver locally for 
the best interactive experience: for example, using an ipython notebook, or 
fancier IDEs. A potential solution is to set up a SOCKS proxy on the local 
machine outside the firewall through ssh tunneling into some workstation 
inside the firewall. Then the client only needs to talk through this proxy.)

 yarn-client through socks proxy
 ---

 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang

 I believe it is a common scenario that the YARN cluster runs behind a 
 firewall while people want to run the Spark driver locally for the best 
 interactive experience: for example, using an ipython notebook, or fancier 
 IDEs. A potential solution is to set up a SOCKS proxy on the local machine 
 outside the firewall through ssh tunneling into some workstation inside the 
 firewall. Then the Spark yarn-client only needs to talk through this proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe it is a common scenario that the YARN cluster runs 
behind a firewall while people want to run the Spark driver locally for the 
best interactive experience: for example, using an ipython notebook, or 
fancier IDEs. A potential solution is to set up a SOCKS proxy on the local 
machine outside the firewall through ssh tunneling into some workstation 
inside the firewall. Then the Spark yarn-client only needs to talk to the 
cluster through this proxy without changing any configuration.  (was: I 
believe it is a common scenario that the YARN cluster runs behind a firewall 
while people want to run the Spark driver locally for the best interactive 
experience: for example, using an ipython notebook, or fancier IDEs. A 
potential solution is to set up a SOCKS proxy on the local machine outside 
the firewall through ssh tunneling into some workstation inside the firewall. 
Then the Spark yarn-client only needs to talk through this proxy.)

 yarn-client through socks proxy
 ---

 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang

 I believe it is a common scenario that the YARN cluster runs behind a 
 firewall while people want to run the Spark driver locally for the best 
 interactive experience: for example, using an ipython notebook, or fancier 
 IDEs. A potential solution is to set up a SOCKS proxy on the local machine 
 outside the firewall through ssh tunneling into some workstation inside the 
 firewall. Then the Spark yarn-client only needs to talk to the cluster 
 through this proxy without changing any configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe it is a common scenario that the YARN cluster runs 
behind a firewall while people want to run the Spark driver locally for the 
best interactive experience. You would have full control of the local 
resources accessible to the client, as opposed to being limited to the 
spark-shell when you take the conventional route of ssh-ing to the remote 
host inside the firewall: for example, using an ipython notebook, or fancier 
IDEs. Installing anything you want on the remote host is usually not an 
option. A potential solution is to set up a SOCKS proxy on the local machine 
outside the firewall through ssh tunneling (ssh -D port user@remote-host) 
into some workstation inside the firewall. Then the Spark yarn-client only 
needs to talk to the cluster through this proxy without changing any 
configuration. Does this sound feasible?  (was: I believe it is a common 
scenario that the YARN cluster runs behind a firewall while people want to 
run the Spark driver locally for the best interactive experience: for 
example, using an ipython notebook, or fancier IDEs. A potential solution is 
to set up a SOCKS proxy on the local machine outside the firewall through ssh 
tunneling into some workstation inside the firewall. Then the Spark 
yarn-client only needs to talk to the cluster through this proxy without 
changing any configuration.)

 yarn-client through socks proxy
 ---

 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang

 I believe it is a common scenario that the YARN cluster runs behind a 
 firewall while people want to run the Spark driver locally for the best 
 interactive experience. You would have full control of the local resources 
 accessible to the client, as opposed to being limited to the spark-shell 
 when you take the conventional route of ssh-ing to the remote host inside 
 the firewall: for example, using an ipython notebook, or fancier IDEs. 
 Installing anything you want on the remote host is usually not an option. A 
 potential solution is to set up a SOCKS proxy on the local machine outside 
 the firewall through ssh tunneling (ssh -D port user@remote-host) into some 
 workstation inside the firewall. Then the Spark yarn-client only needs to 
 talk to the cluster through this proxy without changing any configuration. 
 Does this sound feasible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe it is a common scenario that the YARN cluster runs 
behind a firewall while people want to run the Spark driver locally for the 
best interactive experience. You would have full control of the local 
resources accessible to the client, as opposed to being limited to the 
spark-shell when you take the conventional route of ssh-ing to the remote 
host inside the firewall: for example, using an ipython notebook, or fancier 
IDEs. Installing anything you want on the remote host is usually not an 
option. A potential solution is to set up a SOCKS proxy on your local machine 
outside the firewall through ssh tunneling (ssh -D local-proxy-port 
user@remote-host) into some workstation inside the firewall. Then the Spark 
yarn-client only needs to talk to the cluster through this proxy without 
needing to change any configuration. Does this sound feasible?  (was: I 
believe it is a common scenario that the YARN cluster runs behind a firewall 
while people want to run the Spark driver locally for the best interactive 
experience. You would have full control of the local resources accessible to 
the client, as opposed to being limited to the spark-shell when you take the 
conventional route of ssh-ing to the remote host inside the firewall: for 
example, using an ipython notebook, or fancier IDEs. Installing anything you 
want on the remote host is usually not an option. A potential solution is to 
set up a SOCKS proxy on the local machine outside the firewall through ssh 
tunneling (ssh -D port user@remote-host) into some workstation inside the 
firewall. Then the Spark yarn-client only needs to talk to the cluster 
through this proxy without changing any configuration. Does this sound 
feasible?)

 yarn-client through socks proxy
 ---

 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang

 I believe it is a common scenario that the YARN cluster runs behind a 
 firewall while people want to run the Spark driver locally for the best 
 interactive experience. You would have full control of the local resources 
 accessible to the client, as opposed to being limited to the spark-shell 
 when you take the conventional route of ssh-ing to the remote host inside 
 the firewall: for example, using an ipython notebook, or fancier IDEs. 
 Installing anything you want on the remote host is usually not an option. A 
 potential solution is to set up a SOCKS proxy on your local machine outside 
 the firewall through ssh tunneling (ssh -D local-proxy-port user@remote-host) 
 into some workstation inside the firewall. Then the Spark yarn-client only 
 needs to talk to the cluster through this proxy without needing to change 
 any configuration. Does this sound feasible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe it is a common scenario that the YARN cluster runs 
behind a firewall while people want to run the Spark driver locally for the 
best interactive experience. You would have full control of the local 
resources accessible to the client, as opposed to being limited to the 
spark-shell when you take the conventional route of ssh-ing to the remote 
host inside the firewall: for example, using an ipython notebook, or fancier 
IDEs. Installing anything you want on the remote host is usually not an 
option. A potential solution is to set up a SOCKS proxy on your local machine 
outside the firewall through ssh tunneling (ssh -D local-proxy-port 
user@remote-host) into some workstation inside the firewall. Then the Spark 
yarn-client only needs to talk to the cluster through this proxy without 
needing to change any configuration. Does this sound feasible? Maybe VPN is 
the right solution?  (was: I believe it is a common scenario that the YARN 
cluster runs behind a firewall while people want to run the Spark driver 
locally for the best interactive experience. You would have full control of 
the local resources accessible to the client, as opposed to being limited to 
the spark-shell when you take the conventional route of ssh-ing to the remote 
host inside the firewall: for example, using an ipython notebook, or fancier 
IDEs. Installing anything you want on the remote host is usually not an 
option. A potential solution is to set up a SOCKS proxy on your local machine 
outside the firewall through ssh tunneling (ssh -D local-proxy-port 
user@remote-host) into some workstation inside the firewall. Then the Spark 
yarn-client only needs to talk to the cluster through this proxy without 
needing to change any configuration. Does this sound feasible?)

 yarn-client through socks proxy
 ---

 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang

 I believe it is a common scenario that the YARN cluster runs behind a 
 firewall while people want to run the Spark driver locally for the best 
 interactive experience. You would have full control of the local resources 
 accessible to the client, as opposed to being limited to the spark-shell 
 when you take the conventional route of ssh-ing to the remote host inside 
 the firewall: for example, using an ipython notebook, or fancier IDEs. 
 Installing anything you want on the remote host is usually not an option. A 
 potential solution is to set up a SOCKS proxy on your local machine outside 
 the firewall through ssh tunneling (ssh -D local-proxy-port user@remote-host) 
 into some workstation inside the firewall. Then the Spark yarn-client only 
 needs to talk to the cluster through this proxy without needing to change 
 any configuration. Does this sound feasible? Maybe VPN is the right solution?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org