[jira] [Comment Edited] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638836#comment-15638836 ]

Yongjia Wang edited comment on SPARK-10912 at 11/5/16 7:09 AM:
---

s3a and hdfs are different "schemes" in Spark's FileSystem.Statistics. I think it is Spark's responsibility to choose what to report, and currently only "hdfs" and "file" are reported. I have been building Spark with the attached s3a_metrics.patch in order to get the s3a metrics reported. I'm not sure whether there is a way to report s3a metrics through configuration alone (without changing Spark source, as was done in the attached patch file). Now I need to add GoogleHadoopFileSystem's "gs" metrics as well; please advise on the best approach. Thank you. [~srowen]

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 1.5.0
> Reporter: Yongjia Wang
> Priority: Minor
> Attachments: s3a_metrics.patch
>
> In org.apache.spark.executor.ExecutorSource there are 2 filesystem metrics: "hdfs" and "file". I started using s3 as the persistent storage with a Spark standalone cluster in EC2, and s3 read/write metrics do not appear anywhere.
> The 'file' metric appears to cover only the driver reading local files. It would be nice to also report shuffle read/write metrics, since that can help with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the missing information about Spark IO, especially for an s3 setup.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yongjia Wang reopened SPARK-10912:
--
[jira] [Commented] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638836#comment-15638836 ]

Yongjia Wang commented on SPARK-10912:
--

s3a and hdfs are different "schemes" in Spark's FileSystem.Statistics. I think it is Spark's responsibility to choose what to report, and currently only "hdfs" and "file" are reported. I have been building Spark with the attached s3a_metrics.patch in order to get the s3a metrics reported. I'm not sure whether there is a way to report s3a metrics through configuration alone (without changing Spark source, as was done in the attached patch file). Now I need to add GoogleHadoopFileSystem's "gs" metrics as well; please advise on the best approach. Thank you.
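For illustration, here is a minimal sketch of the scheme-whitelist behavior described above. This is plain Python with hypothetical names, not Spark's actual ExecutorSource code: the point is that per-scheme counters (as Hadoop's FileSystem.Statistics exposes them) exist for s3a and gs, but only schemes on a hardcoded list are ever surfaced as metrics.

```python
# Hypothetical stand-in for Spark's ExecutorSource, which registers
# filesystem gauges per scheme. All names here are illustrative.

def collect_filesystem_metrics(all_statistics, reported_schemes):
    """Keep only the statistics whose scheme is in the configured whitelist."""
    metrics = {}
    for scheme, stats in all_statistics.items():
        if scheme not in reported_schemes:
            continue  # e.g. "s3a" and "gs" counters are silently dropped
        for name, value in stats.items():
            metrics[f"filesystem.{scheme}.{name}"] = value
    return metrics

# Per-scheme counters, as the Hadoop statistics layer would expose them.
stats = {
    "hdfs": {"read_bytes": 1024, "write_bytes": 512},
    "file": {"read_bytes": 64, "write_bytes": 0},
    "s3a":  {"read_bytes": 4096, "write_bytes": 2048},
}

# Spark currently reports only "hdfs" and "file"; adding "s3a" (or "gs")
# to the whitelist is what the attached patch effectively does.
print(collect_filesystem_metrics(stats, {"hdfs", "file"}))
print(collect_filesystem_metrics(stats, {"hdfs", "file", "s3a"}))
```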
[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422887#comment-15422887 ]

Yongjia Wang commented on SPARK-16484:
--

Here is my solution using a Spark UDAF and UDT: https://github.com/yongjiaw/Spark_HLL

> Incremental Cardinality estimation operations with Hyperloglog
> --
>
> Key: SPARK-16484
> URL: https://issues.apache.org/jira/browse/SPARK-16484
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Yongjia Wang
>
> Efficient cardinality estimation is very important, and SparkSQL has had approxCountDistinct based on Hyperloglog for quite some time. However, there isn't a way to do incremental estimation. For example, if we want to get updated distinct counts of the last 90 days, we need to do the aggregation for the entire window over and over again. The more efficient way involves serializing the counter for smaller time windows (such as hourly) so the counts can be efficiently updated in an incremental fashion for any time window.
> With the support of custom UDAFs, the Binary DataType, and the HyperloglogPlusPlus implementation in the current Spark version, it's easy enough to extend the functionality to include incremental counting, and even other general set operations such as intersection and set difference. The Spark API is already as elegant as it can be, but it still takes quite some effort to do a custom implementation of the aforementioned operations, which are supposed to be in high demand. I have been searching but failed to find a usable existing solution or any ongoing effort for this. The closest I got is the following, but it does not work with Spark 1.6 due to API changes.
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into SparkSQL. The only problem I see is that it depends on the serialization of a specific HLL implementation and introduces compatibility issues. But as long as the user is aware of this, it should be fine.
[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371709#comment-15371709 ]

Yongjia Wang commented on SPARK-16484:
--

Yes, I agree all the building blocks are there, and it is easy enough to put together a solution now. I guess what I did is the second approach you mentioned: saving the hll++ "buffer" as a byte-array column, with a custom UDAF to merge them using a SQL expression. I was trying to ask whether it is worth extending SparkSQL to include those extra UDAFs, making them more accessible to regular Spark users. Also, doing the intersection of multiple sets can be tricky; wouldn't it be nice to have that as part of SparkSQL's standard set of functions?
[jira] [Created] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
Yongjia Wang created SPARK-16484:

Summary: Incremental Cardinality estimation operations with Hyperloglog
Key: SPARK-16484
URL: https://issues.apache.org/jira/browse/SPARK-16484
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Yongjia Wang

Efficient cardinality estimation is very important, and SparkSQL has had approxCountDistinct based on Hyperloglog for quite some time. However, there isn't a way to do incremental estimation. For example, if we want to get updated distinct counts of the last 90 days, we need to do the aggregation for the entire window over and over again. The more efficient way involves serializing the counter for smaller time windows (such as hourly) so the counts can be efficiently updated in an incremental fashion for any time window.

With the support of custom UDAFs, the Binary DataType, and the HyperloglogPlusPlus implementation in the current Spark version, it's easy enough to extend the functionality to include incremental counting, and even other general set operations such as intersection and set difference. The Spark API is already as elegant as it can be, but it still takes quite some effort to do a custom implementation of the aforementioned operations, which are supposed to be in high demand. I have been searching but failed to find a usable existing solution or any ongoing effort for this. The closest I got is the following, but it does not work with Spark 1.6 due to API changes.

https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala

I wonder whether it is worth integrating such operations into SparkSQL. The only problem I see is that it depends on the serialization of a specific HLL implementation and introduces compatibility issues. But as long as the user is aware of this, it should be fine.
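To make the incremental-counting idea concrete, here is a minimal HyperLogLog sketch in plain Python. It is illustrative only, not Spark's HyperloglogPlusPlus: the register array is what would be serialized into a BinaryType column per hour, and any larger time window is answered by merging sketches (element-wise max of registers), which is exactly the union operation described above.

```python
import hashlib
import math

P = 12                       # 2^12 = 4096 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard HLL bias constant for large M

def empty():
    # This byte array is what an hourly sketch would serialize to.
    return bytearray(M)

def add(registers, value):
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    idx = h & (M - 1)                       # low P bits pick the register
    rest = h >> P                           # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1 # leading-zero run length + 1
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    # Incremental update: merging sketches == union of the underlying sets.
    return bytearray(max(x, y) for x, y in zip(a, b))

def estimate(registers):
    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:            # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw

day1, day2 = empty(), empty()
for i in range(1000):
    add(day1, f"user{i}")
for i in range(500, 1500):
    add(day2, f"user{i}")           # 500 users overlap with day1
both = merge(day1, day2)
print(round(estimate(both)))         # approximately 1500 distinct users
```

Storing one such byte array per hour and merging on demand avoids re-aggregating the raw data for every 90-day query, which is the efficiency argument made in the issue description.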
[jira] [Commented] (SPARK-11824) WebUI does not render descriptions with 'bad' HTML, throws console error
[ https://issues.apache.org/jira/browse/SPARK-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032602#comment-15032602 ]

Yongjia Wang commented on SPARK-11824:
--

Looks like this is the right escaper: StringEscapeUtils.escapeXml
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html#escapeXml(java.lang.String)

> WebUI does not render descriptions with 'bad' HTML, throws console error
> 
>
> Key: SPARK-11824
> URL: https://issues.apache.org/jira/browse/SPARK-11824
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Affects Versions: 1.5.2
> Environment: RHEL 6, Java 1.7, Mesos 0.25.0
> Reporter: Andy Robb
> Priority: Minor
> Labels: starter
>
> When using the SparkSQL CLI and running a query with less-than or greater-than symbols in it, viewing the Web UI will throw the following console warning. (The table and column names have been changed from the actual query.) This occurs across CLI invocations. The warning is thrown each time the UI is refreshed, both during query execution and after the query is complete.
> {noformat}
> 15/11/18 10:45:31 WARN ui.UIUtils: Invalid job description: select count(1) from table1 where date >= '2015-11-01' and date <= '2015-11-15'
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 114; The content of elements must consist of well-formed character data or markup.
> at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
> at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
> at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
> at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
> at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.startOfMarkup(XMLDocumentFragmentScannerImpl.java:2636)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2734)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:648)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
> at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:40)
> at scala.xml.XML$.loadXML(XML.scala:57)
> at scala.xml.factory.XMLLoader$class.loadString(XMLLoader.scala:59)
> at scala.xml.XML$.loadString(XML.scala:57)
> at org.apache.spark.ui.UIUtils$.makeDescription(UIUtils.scala:417)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$5$$anonfun$apply$1.apply(StageTable.scala:118)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$5$$anonfun$apply$1.apply(StageTable.scala:116)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$5.apply(StageTable.scala:116)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$5.apply(StageTable.scala:115)
> at scala.Option.flatMap(Option.scala:170)
> at org.apache.spark.ui.jobs.StageTableBase.makeDescription(StageTable.scala:115)
> at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:177)
> at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:195)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:60)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:60)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:69)
> at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:69)
> at
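The failure mode above can be reproduced outside Spark: a SQL description containing `<=` embedded in markup is not well-formed XML, so the parser throws exactly as in the stack trace. The sketch below uses Python's stdlib escaping as a stand-in for the Java StringEscapeUtils.escapeXml call suggested in the comment; it is illustrative, not the actual UIUtils.makeDescription fix.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# The job description from the warning in the issue: '<' followed by '='
# starts what the XML parser takes to be a malformed tag.
desc = ("select count(1) from table1 where "
        "date >= '2015-11-01' and date <= '2015-11-15'")

unescaped_fails = False
try:
    ET.fromstring(f"<div>{desc}</div>")     # raw text: not well-formed XML
except ET.ParseError:
    unescaped_fails = True

# Escaping first (analogous to StringEscapeUtils.escapeXml in commons-lang)
# turns '<' into '&lt;' and '>' into '&gt;', so the fragment parses, and the
# parsed text round-trips back to the original query.
node = ET.fromstring(f"<div>{escape(desc)}</div>")
print(unescaped_fails)
print(node.text == desc)
```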
[jira] [Commented] (SPARK-11824) WebUI does not render descriptions with 'bad' HTML, throws console error
[ https://issues.apache.org/jira/browse/SPARK-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15031199#comment-15031199 ]

Yongjia Wang commented on SPARK-11824:
--

Yes, this is annoying, and not just for the CLI. But you can turn off the annoying message like this:
{noformat}
LogManager.getLogger("org.apache.spark.ui").setLevel(Level.ERROR)
{noformat}
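The same silencing can presumably be done declaratively instead of programmatically. A sketch of the equivalent entry for a log4j 1.x properties file (assuming Spark's stock log4j setup of that era; the logger name is taken from the `ui.UIUtils` warning above):

```properties
# Suppress the "Invalid job description" WARN from the Web UI renderer
log4j.logger.org.apache.spark.ui=ERROR
```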
[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982662#comment-14982662 ]

Yongjia Wang commented on SPARK-11413:
--

Yeah, the fix is to update the joda.version number from 2.5 to 2.8.1, or to the newest 2.9, in Spark's pom.xml. The PR would be just a one-line change in pom.xml. It's only an issue when compiling with Java 1.8u60 or later, due to the changes in the time format; the s3 request in some way relies on that time format in its header. The links in my comment should explain it better. It adds no new dependency, but can be considered a Java 8 + s3 combo bug. So this only affects the combination of Java 8u60 or later with s3a, which should become quite common and hit more people. If I understand correctly, building with Java 7, or with Java 8 before u60, should be fine, even when running with a JRE after Java 8u60.

> Java 8 build has problem with joda-time and s3 request, should bump joda-time version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Reporter: Yongjia Wang
> Priority: Minor
>
> Joda-time has problems with formatting time zones starting with Java 1.8u60, and this will cause s3 requests to fail. It is said to have been fixed in joda-time 2.8.1.
> Spark is still using joda-time 2.5 by default; if Java 8 is used to build Spark, one should set -Djoda.version=2.8.1 or above.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump up the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header
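A sketch of what that one-line change would look like. The property name matches the -Djoda.version override mentioned in the issue; the exact location within Spark's parent pom.xml is assumed here:

```xml
<!-- In the <properties> section of Spark's parent pom.xml:
     bump joda-time past 2.8.1 (the release said to fix the
     Java 8u60 time-zone formatting problem) -->
<joda.version>2.8.1</joda.version>
```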
[jira] [Comment Edited] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982755#comment-14982755 ]

Yongjia Wang edited comment on SPARK-11413 at 10/30/15 4:02 PM:

My last statement was wrong. It's a problem when using JRE 8u60 or above, and when joda-time 2.8.0 or below was loaded at run time. The following error was reproduced when I used the official Spark release and ran with JRE 1.8.0_65 (Oracle Corporation). Note that the aws-java-sdk-s3 library already depends on joda-time-2.8.1; it is the Spark assembly's inclusion of joda-time-2.5, which was loaded by the JVM, that caused this problem.

{noformat}
com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)
{noformat}
[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982802#comment-14982802 ]

Yongjia Wang commented on SPARK-11413:
--

I see. I don't know; is it safe to assume joda-time is backward compatible, as far as Spark is concerned?
[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982755#comment-14982755 ] Yongjia Wang commented on SPARK-11413:
--
My last statement was wrong. It's a problem when using JRE 8u60 or above, and when joda-time 2.8.0 or below is loaded at run time by the s3 client. The following error was reproduced when I used the official Spark release with JRE 1.8.0_65 (Oracle Corporation). Note that the aws-java-sdk-s3 library already depends on joda-time 2.8.1; the problem is caused by the Spark assembly's inclusion of joda-time 2.5, which is the version the JVM actually loads.
com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)
> Java 8 build has problem with joda-time and s3 request, should bump joda-time
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Reporter: Yongjia Wang
> Priority: Minor
>
> Joda-time has problems formatting time zones starting with Java 1.8u60, and
> this causes s3 requests to fail. It is said to have been fixed in joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark,
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
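The fix described in the report amounts to overriding Maven's joda.version property when building Spark. A minimal sketch, assuming Spark's build/mvn wrapper and a standard package build (only the -Djoda.version override comes from this report; the other flags are the usual illustrative build options):

```
# Override the joda.version property declared in Spark's pom.xml so the
# assembly bundles a joda-time release (2.8.1 or later) whose date
# formatting works with s3 request signing on Java 8u60+.
# 2.9 is the version the reporter verified.
./build/mvn -DskipTests -Djoda.version=2.9 clean package
```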
[jira] [Comment Edited] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982755#comment-14982755 ] Yongjia Wang edited comment on SPARK-11413 at 10/30/15 4:02 PM:
My last statement was wrong. It's a problem when using JRE 8u60 or above, and when joda-time 2.8.0 or below is loaded at run time. The following error was reproduced when I used the official Spark release with JRE 1.8.0_65 (Oracle Corporation). Note that the aws-java-sdk-s3 library already depends on joda-time 2.8.1; the problem is caused by the Spark assembly's inclusion of joda-time 2.5, which is the version the JVM actually loads.
com.amazonaws.services.s3.model.AmazonS3Exception: AWS authentication requires a valid Date or x-amz-date header (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: E0D714AF923221DD), S3 Extended Request ID: 8OJlBMYKVzwrEXHANpBTsiLRmaFn2lHQ4sAvkU/HF66wWGVj8VR2/Jh4Wl8QuFUt
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)
[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982874#comment-14982874 ] Yongjia Wang commented on SPARK-11413:
--
I can follow up with a PR first. The latest joda-time release is 2.9. Do you suggest just using 2.8.1 as the fix with minimum changes?
[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
[ https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982909#comment-14982909 ] Yongjia Wang commented on SPARK-11413:
--
There are no further transitive dependencies from joda-time. So as long as it passes the tests, it should be good?
[jira] [Created] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version
Yongjia Wang created SPARK-11413:
Summary: Java 8 build has problem with joda-time and s3 request, should bump joda-time version
Key: SPARK-11413
URL: https://issues.apache.org/jira/browse/SPARK-11413
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Yongjia Wang
Priority: Minor
Joda-time has problems formatting time zones starting with Java 1.8u60, and this causes s3 requests to fail. It is said to have been fixed in joda-time 2.8.1.
Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, -Djoda.version=2.8.1 or above should be set.
I was hit by this problem, and -Djoda.version=2.9 worked.
I don't see any reason not to bump the joda-time version in pom.xml.
Should I create a pull request for this? It is trivial.
https://github.com/aws/aws-sdk-java/issues/484
https://github.com/aws/aws-sdk-java/issues/444
http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header
[jira] [Updated] (SPARK-11354) Expose custom log4j to executor page in Spark standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-11354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11354:
- Attachment: custom log4j on executor page.png
> Expose custom log4j to executor page in Spark standalone cluster
> -
>
> Key: SPARK-11354
> URL: https://issues.apache.org/jira/browse/SPARK-11354
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Reporter: Yongjia Wang
> Attachments: custom log4j on executor page.png
>
>
> Spark uses log4j, which is very flexible. However, on the executor page in a
> standalone cluster, only stdout and stderr are shown in the UI. In the
> default log4j profile, all messages are forwarded to System.err, which in turn
> is written to the stderr file in the executor directory. Similarly, stdout is
> written to the stdout file in the executor directory.
> It would be very useful to show all the file appenders configured in a custom
> log4j profile. Right now, these file appenders are written to the executor
> directory, but they are not exposed to the UI.
[jira] [Created] (SPARK-11354) Expose custom log4j to executor page in Spark standalone cluster
Yongjia Wang created SPARK-11354:
Summary: Expose custom log4j to executor page in Spark standalone cluster
Key: SPARK-11354
URL: https://issues.apache.org/jira/browse/SPARK-11354
Project: Spark
Issue Type: Improvement
Components: Web UI
Reporter: Yongjia Wang
Spark uses log4j, which is very flexible. However, on the executor page in a standalone cluster, only stdout and stderr are shown in the UI. In the default log4j profile, all messages are forwarded to System.err, which in turn is written to the stderr file in the executor directory. Similarly, stdout is written to the stdout file in the executor directory.
It would be very useful to show all the file appenders configured in a custom log4j profile. Right now, these file appenders are written to the executor directory, but they are not exposed to the UI.
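For context, a hedged sketch of the kind of custom log4j profile this issue has in mind (the appender name "appFile" and the file name "app.log" are hypothetical, not from the report): an extra file appender whose output lands in the executor working directory next to stdout and stderr, but which the executor page never links.

```
# Hypothetical log4j.properties fragment: in addition to the console output
# (which Spark redirects to the stderr file), route logs to a separate file
# appender. With a relative path, the file is created in the executor's
# working directory, yet the executor page only exposes stdout/stderr.
log4j.rootLogger=INFO, console, appFile

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Extra file appender written to the executor directory (relative path)
log4j.appender.appFile=org.apache.log4j.FileAppender
log4j.appender.appFile.File=app.log
log4j.appender.appFile.layout=org.apache.log4j.PatternLayout
log4j.appender.appFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
```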
[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11175:
- Description:
Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics), which results in multiple independent jobs for each batch. These jobs would be better run concurrently to take maximal advantage of available resources.
I went through a few hacks:
1. Launch the rdd action in a new thread from the function passed to foreachRDD. However, this messes up the streaming statistics, since the batch finishes immediately even though the jobs it launched are still running in another thread. This can further affect resuming from a checkpoint, since all batches complete right away even though the actual threaded jobs may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the JobSet to wait for all threads to join, but doing this messes up closure serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.
[jira] [Created] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming
Yongjia Wang created SPARK-11175:
Summary: Concurrent execution of JobSet within a batch in Spark streaming
Key: SPARK-11175
URL: https://issues.apache.org/jira/browse/SPARK-11175
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Yongjia Wang
Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics), which results in multiple independent jobs for each batch. These jobs would be better run concurrently to take maximal advantage of available resources.
I went through a few hacks:
1. Launch the rdd action in a new thread from the function passed to foreachRDD. However, this messes up the streaming statistics, since the batch finishes immediately even though the jobs it launched are still running in another thread. This can further affect resuming from a checkpoint, since all batches complete right away even though the actual threaded jobs may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the JobSet to wait for all threads to join, but doing this messes up closure serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.
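The proposed scheduling behavior can be illustrated outside Spark. Below is a minimal Python sketch of the pattern, not Spark API: run_batch_concurrently and the lambda jobs are hypothetical stand-ins for the independent per-DStream output operations of one batch. The point is that the batch is marked complete only after every job finishes, which is exactly what the thread-launching hack #1 fails to guarantee.

```python
# Illustrative sketch (not Spark code): run all jobs of one batch in
# parallel and treat the batch as complete only when every job is done.
from concurrent.futures import ThreadPoolExecutor

def run_batch_concurrently(jobs):
    """Run all jobs of one batch in parallel; return their results in order.

    Blocking on every future before returning avoids the problem described
    in the issue, where launching threads from foreachRDD lets the batch
    'finish' while its jobs are still running.
    """
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(job) for job in jobs]
        # Block until every job completes -- only then is the batch done.
        return [f.result() for f in futures]

# Stand-ins for two independent per-topic jobs within the same batch.
results = run_batch_concurrently([lambda: "topic-a done", lambda: "topic-b done"])
```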
[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11152:
- Priority: Major (was: Minor)
> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
> Issue Type: Bug
> Components: Streaming, Web UI
> Reporter: Yongjia Wang
>
> When a streaming job is resumed from a checkpoint at batch time x, suppose the
> current time when we resume it is x+10. In this scenario, since Spark will
> schedule the missing batches from x+1 to x+10 without any metadata, the
> behavior is to pack all the backlogged input into batch x+1, then assign any
> new input to x+2 through x+10 immediately without waiting. This results in
> tiny batches that capture input only during the back-to-back scheduling
> intervals. This behavior is very reasonable. However, the streaming UI does
> not correctly show the input sizes for all these makeup batches - they are
> all 0 from batch x to x+10. Fixing this would be very helpful. This happens
> when I use Kafka direct streaming; I assume it would happen for all other
> streaming sources as well.
[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11175:
- Description:
Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics), which results in multiple independent jobs for each batch. These jobs would be better run concurrently to take maximal advantage of available resources. The current behavior is that these jobs end up in an invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the rdd action in a new thread from the function passed to foreachRDD. However, this messes up the streaming statistics, since the batch finishes immediately even though the jobs it launched are still running in another thread. This can further affect resuming from a checkpoint, since all batches complete right away even though the actual threaded jobs may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the JobSet to wait for all threads to join, but doing this messes up closure serialization and makes checkpointing unusable.
3. Instead of running multiple DStreams in one streaming context, run them in separate streaming contexts (separate Spark applications). Putting aside the extra deployment overhead, when working with a Spark standalone cluster, which only has a FIFO scheduler across applications, the resources have to be set in advance and won't automatically adjust when the cluster is resized.
Therefore, I think there is a good use case for making the default behavior to run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.
[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11175:
- Description:
Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics), which results in multiple independent jobs for each batch. These jobs would be better run concurrently to take maximal advantage of available resources. The current behavior is that these jobs end up in an invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the rdd action in a new thread from the function passed to foreachRDD. However, this messes up the streaming statistics, since the batch finishes immediately even though the jobs it launched are still running in another thread. This can further affect resuming from a checkpoint, since all batches complete right away even though the actual threaded jobs may fail, and the checkpoint only resumes the last batch.
2. It's possible, using just foreachRDD and the available APIs, to block the JobSet to wait for all threads to join, but doing this messes up closure serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior to run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.
[jira] [Commented] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962774#comment-14962774 ] Yongjia Wang commented on SPARK-11175:
--
nice. should have found this. Thank you
[jira] [Closed] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang closed SPARK-11175. Resolution: Not A Problem > Concurrent execution of JobSet within a batch in Spark streaming > > > Key: SPARK-11175 > URL: https://issues.apache.org/jira/browse/SPARK-11175 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Yongjia Wang > > Spark StreamingContext can register multiple independent Input DStreams (such > as from different Kafka topics) that results in multiple independent jobs for > each batch. These jobs should better be run concurrently to maximally take > advantage of available resources. The current behavior is that these jobs end > up in an invisible job queue to be submitted one by one. > I went through a few hacks: > 1. launch the rdd action into a new thread from the function passed to > foreachRDD. However, it will mess up with streaming statistics since the > batch will finish immediately even the jobs it launched are still running in > another thread. This can further affect resuming from checkpoint, since all > batches are completed right away even the actual threaded jobs may fail and > checkpoint only resume the last batch. > 2. It's possible by just using foreachRDD and the available APIs to block the > Jobset to wait for all threads to join, but doing this would mess up with > closure serialization, and make checkpoint not usable. > 3. Instead of running multiple Dstreams in one streaming context, just run > them in separate streaming context (separate Spark applications). Putting > aside the extra deployment overhead, when working with Spark standalone > cluster which only has FIFO scheduler across applications, the resource has > to be set in advance and it won't automatically adjust with resizing the > cluster. 
> Therefore, I think there is a good use case to make the default behavior just > run all jobs of the current batch concurrently, and mark batch completion > when all the jobs complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
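For anyone landing here with the same queueing behavior: one workaround that existed at the time is the undocumented `spark.streaming.concurrentJobs` setting, which sizes the thread pool the JobScheduler uses to run a batch's jobs. A minimal sketch (hedged: the setting is undocumented, experimental, and can weaken per-batch ordering guarantees):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The JobScheduler submits each batch's jobs to an internal executor pool.
// Its size is read from spark.streaming.concurrentJobs (default 1, which
// produces the one-by-one "invisible job queue" behavior described above).
val conf = new SparkConf()
  .setAppName("concurrent-jobsets")
  .set("spark.streaming.concurrentJobs", "4")

val ssc = new StreamingContext(conf, Seconds(10))
// Register multiple independent input DStreams here; with the setting above,
// their jobs within one batch can be scheduled concurrently.
```

Use with care: once several jobs of the same batch (or adjacent batches) can overlap, any output operation that assumes serial execution needs its own synchronization.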
[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11152: - Description: Suppose a streaming job is resumed from a checkpoint at batch time x, and the current time when we resume it is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack all the backlogged inputs into batch x+1, then assign any new inputs to x+2 through x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back-to-back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not correctly show the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens when I use Kafka direct streaming; I assume it would happen with all other streaming sources as well. (was: Suppose a streaming job starts from a checkpoint at batch time x, and the current time when we resume it is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack all the backlogged inputs into batch x+1, then assign any new inputs to x+2 through x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back-to-back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not correctly show the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens when I use Kafka direct streaming; I assume it would happen with all other streaming sources as well.) 
> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint > - > > Key: SPARK-11152 > URL: https://issues.apache.org/jira/browse/SPARK-11152 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Reporter: Yongjia Wang >Priority: Minor > > Suppose a streaming job is resumed from a checkpoint at batch time x, and the > current time when we resume it is x+10. In this scenario, since Spark will > schedule the missing batches from x+1 to x+10 without any metadata, the > behavior is to pack all the backlogged inputs into batch x+1, then assign any > new inputs to x+2 through x+10 immediately without waiting. This results in > tiny batches that capture inputs only during the back-to-back scheduling > intervals. This behavior is very reasonable. However, the streaming UI does > not correctly show the input sizes for all these makeup batches - they are > all 0 from batch x to x+10. Fixing this would be very helpful. This happens > when I use Kafka direct streaming; I assume it would happen with all other > streaming sources as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
Yongjia Wang created SPARK-11152: Summary: Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint Key: SPARK-11152 URL: https://issues.apache.org/jira/browse/SPARK-11152 Project: Spark Issue Type: Bug Components: Streaming, Web UI Reporter: Yongjia Wang Priority: Minor Suppose a streaming job starts from a checkpoint at batch time x, and the current time when we resume it is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack all the backlogged inputs into batch x+1, then assign any new inputs to x+2 through x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back-to-back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not correctly show the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11152: - Description: Suppose a streaming job starts from a checkpoint at batch time x, and the current time when we resume it is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack all the backlogged inputs into batch x+1, then assign any new inputs to x+2 through x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back-to-back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not correctly show the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens when I use Kafka direct streaming; I assume it would happen with all other streaming sources as well. (was: Suppose a streaming job starts from a checkpoint at batch time x, and the current time when we resume it is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack all the backlogged inputs into batch x+1, then assign any new inputs to x+2 through x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back-to-back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not correctly show the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful.) 
> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint > - > > Key: SPARK-11152 > URL: https://issues.apache.org/jira/browse/SPARK-11152 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Reporter: Yongjia Wang >Priority: Minor > > Suppose a streaming job starts from a checkpoint at batch time x, and the > current time when we resume it is x+10. In this scenario, since Spark will > schedule the missing batches from x+1 to x+10 without any metadata, the > behavior is to pack all the backlogged inputs into batch x+1, then assign any > new inputs to x+2 through x+10 immediately without waiting. This results in > tiny batches that capture inputs only during the back-to-back scheduling > intervals. This behavior is very reasonable. However, the streaming UI does > not correctly show the input sizes for all these makeup batches - they are > all 0 from batch x to x+10. Fixing this would be very helpful. This happens > when I use Kafka direct streaming; I assume it would happen with all other > streaming sources as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-10912: - Attachment: s3a_metrics.patch Adding s3a is fairly straightforward. I guess the reason it's not included is that s3a support (via hadoop-aws.jar) is not part of the default Hadoop distribution due to licensing issues. I created a patch to enable s3a metrics, both on the executors and on the driver. Reporting shuffle statistics requires more thought, although all the numbers are already collected in TaskMetrics.scala (input, output, shuffle, local, remote, spill, records, bytes, etc). I think it would make sense to report the aggregated metrics per executor across all tasks, so it is easy to get an overall sense of disk I/O and network traffic. > Improve Spark metrics executor.filesystem > - > > Key: SPARK-10912 > URL: https://issues.apache.org/jira/browse/SPARK-10912 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.5.0 >Reporter: Yongjia Wang >Priority: Minor > Attachments: s3a_metrics.patch > > > In org.apache.spark.executor.ExecutorSource it has 2 filesystem metrics: > "hdfs" and "file". I started using s3 as the persistent storage with Spark > standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. > The 'file' metric appears to be only for driver reading local file, it would > be nice to also report shuffle read/write metrics, so it can help with > optimization. > I think these 2 things (s3 and shuffle) are very useful and cover all the > missing information about Spark IO especially for s3 setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
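For context, the scheme whitelist lives in org.apache.spark.executor.ExecutorSource. A sketch of the kind of one-line change such a patch makes (hedged: the attached s3a_metrics.patch itself is not reproduced here, and the helper names are taken from the Spark 1.x source tree):

```scala
// Inside org.apache.spark.executor.ExecutorSource (Spark 1.x), filesystem
// gauges are registered only for a hard-coded list of schemes. Each gauge
// reads org.apache.hadoop.fs.FileSystem.Statistics for its scheme.
for (scheme <- Array("hdfs", "file")) {   // patch idea: add "s3a" (or "gs") here
  registerFileSystemStat(scheme, "read_bytes", _.getBytesRead, 0L)
  registerFileSystemStat(scheme, "write_bytes", _.getBytesWritten, 0L)
  registerFileSystemStat(scheme, "read_ops", _.getReadOps, 0)
  registerFileSystemStat(scheme, "largeRead_ops", _.getLargeReadOps, 0)
  registerFileSystemStat(scheme, "write_ops", _.getWriteOps, 0)
}
```

The statistics objects exist once the corresponding Hadoop filesystem implementation (hadoop-aws for s3a, the GCS connector for gs) is on the executor classpath, so extending the array should be sufficient to surface them.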
[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-10912: - Description: In org.apache.spark.executor.ExecutorSource it has 2 filesystem metrics: "hdfs" and "file". I started using s3 as the persistent storage with Spark standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. The 'file' metric appears to be only for driver reading local file, it would be nice to also report shuffle read/write metrics, so it can help with optimization. I think these 2 things (s3 and shuffle) are very useful and cover all the missing information about Spark IO especially for s3 setup. was: In org.apache.spark.executor.ExecutorSource it has 2 filesystem metrics: "hdfs" and "file". I started using s3 as the persistent storage with Spark standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. The 'file' metric appears to be only for driver reading local file, it would be nice to also report shuffle read/write metrics, so it can help understand things like if a Spark job becomes IO bound. I think these 2 things (s3 and shuffle) are very useful and cover all the missing information about Spark IO especially for s3 setup. > Improve Spark metrics executor.filesystem > - > > Key: SPARK-10912 > URL: https://issues.apache.org/jira/browse/SPARK-10912 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.5.0 >Reporter: Yongjia Wang > > In org.apache.spark.executor.ExecutorSource it has 2 filesystem metrics: > "hdfs" and "file". I started using s3 as the persistent storage with Spark > standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. > The 'file' metric appears to be only for driver reading local file, it would > be nice to also report shuffle read/write metrics, so it can help with > optimization. 
> I think these 2 things (s3 and shuffle) are very useful and cover all the > missing information about Spark IO especially for s3 setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10912) Improve Spark metrics executor.filesystem
Yongjia Wang created SPARK-10912: Summary: Improve Spark metrics executor.filesystem Key: SPARK-10912 URL: https://issues.apache.org/jira/browse/SPARK-10912 Project: Spark Issue Type: Improvement Affects Versions: 1.5.0 Reporter: Yongjia Wang In org.apache.spark.executor.ExecutorSource it has 2 filesystem metrics: "hdfs" and "file". I started using s3 as the persistent storage with Spark standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. The 'file' metric appears to be only for driver reading local file, it would be nice to also report shuffle read/write metrics, so it can help understand things like if a Spark job becomes IO bound. I think these 2 things (s3 and shuffle) are very useful and cover all the missing information about Spark IO especially for s3 setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940702#comment-14940702 ] Yongjia Wang commented on SPARK-5874: - The functionality to save/load all pipeline components is very important. The design doc says to do this in 1.4 for the new Transformer/Estimator framework under the .ml package. We are at 1.5.0 right now, and nothing has happened on that path. I wonder if there were major conceptual changes, or if it was just a workload/resource issue. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedback about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5152) Let metrics.properties file take an hdfs:// path
[ https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902883#comment-14902883 ] Yongjia Wang edited comment on SPARK-5152 at 9/22/15 6:00 PM: -- I voted for this. It would enable configuring the metrics or log4j properties of all the workers from one place when submitting the job. Without it, you have to set them up on each of the workers. If hdfs:// can be supported, I assume s3n:// and s3a:// would all be supported, since they go through the same interface. Alternatively, it would probably be even better if there were a way, specified through "conf" spark properties on the spark-submit command line, to upload custom files to the Spark executor's working directory before the executor process starts. The "spark.files" option uploads the files lazily when the first task starts, which is too late for configuration. was (Author: yongjiaw): I voted for this. It enables configuring the metrics or log4j properties of all the workers from just the driver. Without it, you have to set them up on each of the workers. Alternatively, it would probably be even better if there were a way, specified through "conf" spark properties on the spark-submit command line, to upload custom files to the Spark executor's working directory before the executor process starts. The "spark.files" option uploads the files lazily when the first task starts, which is too late for configuration. > Let metrics.properties file take an hdfs:// path > > > Key: SPARK-5152 > URL: https://issues.apache.org/jira/browse/SPARK-5152 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams > > From my reading of [the > code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53], > the {{spark.metrics.conf}} property must be a path that is resolvable on the > local filesystem of each executor. 
> Running a Spark job with {{--conf > spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs > many errors (~1 per executor, presumably?) like: > {code} > 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file > java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties > (No such file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.<init>(FileInputStream.java:146) > at java.io.FileInputStream.<init>(FileInputStream.java:101) > at > org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53) > at > org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92) > at > org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329) > at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > {code} > which seems consistent with the idea that it's looking on the local > filesystem and not parsing the "scheme" portion of the URL. 
> Letting all executors get their {{metrics.properties}} files from one > location on HDFS would be an improvement, right? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
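One way such support could look, sketched under the assumption that the config loader keeps its current shape (the helper name below is hypothetical, not actual Spark code): route the configured path through the Hadoop FileSystem API, which dispatches on the URI scheme, instead of java.io.FileInputStream.

```scala
import java.io.InputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: open spark.metrics.conf via the Hadoop FileSystem API
// so the value may carry a scheme (hdfs://, s3a://, file://, ...).
def openMetricsConfig(configuredPath: String): InputStream = {
  val path = new Path(configuredPath)
  // getFileSystem picks the implementation from the URI scheme; a bare local
  // path resolves to the default filesystem, preserving today's behavior.
  val fs = path.getFileSystem(new Configuration())
  fs.open(path) // returns FSDataInputStream, a java.io.InputStream
}
```

This would also cover s3n://, s3a://, and gs:// for free, since they all plug into the same FileSystem interface.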
[jira] [Commented] (SPARK-5152) Let metrics.properties file take an hdfs:// path
[ https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902883#comment-14902883 ] Yongjia Wang commented on SPARK-5152: - I voted for this. It enables configuring the metrics or log4j properties of all the workers from just the driver. Without it, you have to set them up on each of the workers. Alternatively, it would probably be even better if there were a way, specified through "conf" spark properties on the spark-submit command line, to upload custom files to the Spark executor's working directory before the executor process starts. The "spark.files" option uploads the files lazily when the first task starts, which is too late for configuration. > Let metrics.properties file take an hdfs:// path > > > Key: SPARK-5152 > URL: https://issues.apache.org/jira/browse/SPARK-5152 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Ryan Williams > > From my reading of [the > code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53], > the {{spark.metrics.conf}} property must be a path that is resolvable on the > local filesystem of each executor. > Running a Spark job with {{--conf > spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs > many errors (~1 per executor, presumably?) 
like: > {code} > 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file > java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties > (No such file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.<init>(FileInputStream.java:146) > at java.io.FileInputStream.<init>(FileInputStream.java:101) > at > org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53) > at > org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92) > at > org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329) > at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > {code} > which seems consistent with the idea that it's looking on the local > filesystem and not parsing the "scheme" portion of the URL. > Letting all executors get their {{metrics.properties}} files from one > location on HDFS would be an improvement, right? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3512) yarn-client through socks proxy
Yongjia Wang created SPARK-3512: --- Summary: yarn-client through socks proxy Key: SPARK-3512 URL: https://issues.apache.org/jira/browse/SPARK-3512 Project: Spark Issue Type: Wish Components: YARN Reporter: Yongjia Wang I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the client only needs to talk through this proxy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3512) yarn-client through socks proxy
[ https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-3512: Description: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the spark yarn-client only needs to talk through this proxy. (was: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the client only needs to talk through this proxy.) yarn-client through socks proxy --- Key: SPARK-3512 URL: https://issues.apache.org/jira/browse/SPARK-3512 Project: Spark Issue Type: Wish Components: YARN Reporter: Yongjia Wang I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the spark yarn-client only needs to talk through this proxy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3512) yarn-client through socks proxy
[ https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-3512: Description: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without changing any configurations. (was: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the spark yarn-client only needs to talk through this proxy.) yarn-client through socks proxy --- Key: SPARK-3512 URL: https://issues.apache.org/jira/browse/SPARK-3512 Project: Spark Issue Type: Wish Components: YARN Reporter: Yongjia Wang I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without changing any configurations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3512) yarn-client through socks proxy
[ https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-3512: Description: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience. You would have full control of the local resources the client can access, as opposed to being limited to the spark-shell if you ssh to the remote host inside the firewall in the conventional way; for example, you could use an IPython notebook or fancier IDEs. Installing anything you want on the remote host is usually not an option. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling (ssh -D port user@remote-host) into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without needing to change any configuration. Does this sound feasible? (was: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience, for example using an IPython notebook or fancier IDEs. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without changing any configurations.) yarn-client through socks proxy --- Key: SPARK-3512 URL: https://issues.apache.org/jira/browse/SPARK-3512 Project: Spark Issue Type: Wish Components: YARN Reporter: Yongjia Wang I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience. 
You would have full control of the local resources the client can access, as opposed to being limited to the spark-shell if you ssh to the remote host inside the firewall in the conventional way; for example, you could use an IPython notebook or fancier IDEs. Installing anything you want on the remote host is usually not an option. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling (ssh -D port user@remote-host) into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without needing to change any configuration. Does this sound feasible? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3512) yarn-client through socks proxy
[ https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-3512: Description: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience. You would have full control of the local resources the client can access, as opposed to being limited to the spark-shell if you ssh to the remote host inside the firewall in the conventional way; for example, you could use an IPython notebook or fancier IDEs. Installing anything you want on the remote host is usually not an option. A potential solution is to set up a SOCKS proxy on your local machine outside of the firewall through ssh tunneling (ssh -D local-proxy-port user@remote-host) into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without needing to change any configuration. Does this sound feasible? (was: I believe it is a common scenario that the YARN cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience. You would have full control of the local resources the client can access, as opposed to being limited to the spark-shell if you ssh to the remote host inside the firewall in the conventional way; for example, you could use an IPython notebook or fancier IDEs. Installing anything you want on the remote host is usually not an option. A potential solution is to set up a SOCKS proxy on the local machine outside of the firewall through ssh tunneling (ssh -D port user@remote-host) into some workstation inside the firewall. Then the spark yarn-client only needs to talk to the cluster through this proxy without needing to change any configuration. Does this sound feasible?) 
yarn-client through socks proxy --- Key: SPARK-3512 URL: https://issues.apache.org/jira/browse/SPARK-3512 Project: Spark Issue Type: Wish Components: YARN Reporter: Yongjia Wang I believe this would be a common scenario: the yarn cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience. You would have full control of local resources that can be accessed by the client, as opposed to being limited to the spark-shell if you did the conventional thing and ssh'd to the remote host inside the firewall. For example, using an IPython notebook, or fancier IDEs, etc. Installing anything you want on the remote host is usually not an option. A potential solution is to set up a SOCKS proxy on your local machine outside the firewall through ssh tunneling (ssh -D local-proxy-port user@remote-host) into some workstation inside the firewall. Then the Spark yarn-client only needs to talk to the cluster through this proxy without needing to change any configurations. Does this sound feasible?
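The setup described in the ticket could be sketched as follows. This is a speculative illustration, not documented Spark behavior: the host names and port are placeholders, and whether Spark's driver-side networking actually honors the standard JVM SOCKS properties is exactly the open question being asked.

```shell
# Open a SOCKS proxy on local port 1080, tunneled through a workstation
# inside the firewall (host names below are placeholders).
# -N: do not run a remote command; -D: dynamic application-level forwarding.
ssh -N -D 1080 user@gateway.example.com &

# Hypothetical: point the driver JVM at the local proxy using the standard
# Java networking properties. Spark does not document SOCKS support, so
# this may not route all of the yarn-client's traffic.
spark-shell --master yarn-client \
  --driver-java-options "-DsocksProxyHost=localhost -DsocksProxyPort=1080"
```

Even if the JVM-level proxy settings work for the YARN client's RPC calls, traffic initiated by the cluster back toward the driver (e.g. executors connecting to the driver) would not traverse the tunnel, which is one reason a VPN may be the more robust option.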
[jira] [Updated] (SPARK-3512) yarn-client through socks proxy
[ https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-3512: Description: I believe this would be a common scenario: the yarn cluster runs behind a firewall, while people want to run the Spark driver locally for the best interactive experience. You would have full control of local resources that can be accessed by the client, as opposed to being limited to the spark-shell if you did the conventional thing and ssh'd to the remote host inside the firewall. For example, using an IPython notebook, or fancier IDEs, etc. Installing anything you want on the remote host is usually not an option. A potential solution is to set up a SOCKS proxy on your local machine outside the firewall through ssh tunneling (ssh -D local-proxy-port user@remote-host) into some workstation inside the firewall. Then the Spark yarn-client only needs to talk to the cluster through this proxy without needing to change any configurations. Does this sound feasible? Maybe VPN is the right solution?