[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495431#comment-15495431 ] Hyukjin Kwon commented on SPARK-17557: -- Would you mind sharing a simple file so that I can reproduce this error? If that is not possible, could you please let me know some steps to reproduce it? I would like to reproduce and test this. If you are going to submit a PR soon, then I will just refer to your PR. > SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
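The {{UnsupportedOperationException}} above is characteristic of a schema mismatch: in parquet-mr, the {{Dictionary}} base class throws for every {{decodeToXxx}} method that a concrete dictionary does not override, so asking a {{PlainBinaryDictionary}} (string data) for ints fails at {{OnHeapColumnVector.getInt}}. A minimal Python sketch of that dispatch pattern (illustrative only, not the actual Spark/Parquet code; class and method names mirror the stack trace but the bodies are assumptions):

```python
class Dictionary:
    """Base class: every decoder is unsupported unless a subclass overrides it,
    mirroring parquet-mr's Dictionary.decodeToInt throwing
    UnsupportedOperationException."""

    def decode_to_int(self, dict_id: int) -> int:
        raise NotImplementedError(type(self).__name__)

    def decode_to_binary(self, dict_id: int) -> bytes:
        raise NotImplementedError(type(self).__name__)


class PlainBinaryDictionary(Dictionary):
    """Dictionary over binary (string) values: only binary decoding is overridden."""

    def __init__(self, values):
        self._values = list(values)

    def decode_to_binary(self, dict_id: int) -> bytes:
        return self._values[dict_id]


d = PlainBinaryDictionary([b"2016", b"2015"])
print(d.decode_to_binary(0))  # works: the column really holds binary data

try:
    # Mirrors the failing getInt call: the reader expects an int column,
    # but the file's dictionary holds binary values.
    d.decode_to_int(0)
except NotImplementedError as e:
    print("unsupported:", e)
```

This is why the error typically shows up when the table schema (for example, a partition column declared as int) disagrees with the physical type actually written into the parquet files.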
[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
[ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495413#comment-15495413 ] Hyukjin Kwon commented on SPARK-17545: -- FYI - this is related to https://github.com/apache/spark/pull/14279 if you meant only 2.0 and not the master branch. > Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset > --- > > Key: SPARK-17545 > URL: https://issues.apache.org/jira/browse/SPARK-17545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nathan Beyer > > When parsing a CSV with a date/time column that contains a variant ISO 8601 > that doesn't include a colon in the offset, casting to Timestamp fails. > Here's a simple, example CSV content. > {quote} > time > "2015-07-20T15:09:23.736-0500" > "2015-07-20T15:10:51.687-0500" > "2015-11-21T23:15:01.499-0600" > {quote} > Here's the stack trace that results from processing this data. > {quote} > 16/09/14 15:22:59 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) > {quote} > Somewhat related, I believe Python standard libraries can produce this form 
> of zone offset. The system I got the data from is written in Python. > https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
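As the reporter notes, Python's standard library does emit this colon-less variant: the strftime {{%z}} directive formats a UTC offset as (+|-)HHMM, which is exactly the form in the reported CSV. A quick self-contained check:

```python
from datetime import datetime, timedelta, timezone

# Python's %z directive emits the offset without a colon, which is exactly
# the variant in the reported CSV ("2015-07-20T15:09:23.736-0500").
cdt = timezone(timedelta(hours=-5))
stamp = datetime(2015, 7, 20, 15, 9, 23, 736000, tzinfo=cdt)
formatted = stamp.strftime("%Y-%m-%dT%H:%M:%S.%f%z")
print(formatted)  # 2015-07-20T15:09:23.736000-0500 (no colon in the offset)

# strptime with %z parses the colon-less form back without trouble.
parsed = datetime.strptime(formatted, "%Y-%m-%dT%H:%M:%S.%f%z")
print(parsed == stamp)  # True
```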
[jira] [Comment Edited] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
[ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495408#comment-15495408 ] Hyukjin Kwon edited comment on SPARK-17545 at 9/16/16 5:20 AM: --- Hi [~nbeyer], the basic ISO format currently follows https://www.w3.org/TR/NOTE-datetime, which says {quote} 1997-07-16T19:20:30.45+01:00 {quote} is the right ISO format, where the timezone is {quote} TZD = time zone designator (Z or +hh:mm or -hh:mm) {quote} To make sure, I double-checked the full ISO 8601:2004 specification at http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf, which says {quote} ... the expression shall either be completely in basic format, in which case the minimum number of separators necessary for the required expression is used, or completely in extended format, in which case additional separators shall be used ... {quote} where the basic format is {{20160707T211822+0300}} and the extended format is {{2016-07-07T21:18:22+03:00}}. In addition, the basic format even seems to be discouraged in plain text: {quote} NOTE : The basic format should be avoided in plain text. {quote} Therefore, {{2016-07-07T21:18:22+03:00}} is the right ISO 8601:2004 form, whereas {{2016-07-07T21:18:22+0300}} is not, because the zone designator may not be in the basic format when the date and time of day are in the extended format. was (Author: hyukjin.kwon): Hi [~nbeyer], the basic ISO format currently follows https://www.w3.org/TR/NOTE-datetime That says {quote} 1997-07-16T19:20:30.45+01:00 {quote} is the right ISO format where timezone is {quote} TZD = time zone designator (Z or +hh:mm or -hh:mm) {quote} To make sure, I double-checked the ISO 8601 - 2004 full specification in http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf That says, {quote} ... 
the expression shall either be completely in basic format, in which case the minimum number of separators necessary for the required expression is used, or completely in extended format, in which case additional separators shall be used ... {quote} where the basic format is {{20160707T211822+0300 }} whereas the extended format is {{2016-07-07T21:18:22+03:00}}. In addition, basic format seems even discouraged in text format {quote} NOTE : The basic format should be avoided in plain text. {quote} Therefore, {{2016-07-07T21:18:22+03:00}} Is the right ISO 8601:2004. whereas {{2016-07-07T21:18:22+0300}} Is not because the zone designator may not be in the basic format when the date and time of day is in the extended format. > Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset > --- > > Key: SPARK-17545 > URL: https://issues.apache.org/jira/browse/SPARK-17545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nathan Beyer > > When parsing a CSV with a date/time column that contains a variant ISO 8601 > that doesn't include a colon in the offset, casting to Timestamp fails. > Here's a simple, example CSV content. > {quote} > time > "2015-07-20T15:09:23.736-0500" > "2015-07-20T15:10:51.687-0500" > "2015-11-21T23:15:01.499-0600" > {quote} > Here's the stack trace that results from processing this data. 
> {quote} > 16/09/14 15:22:59 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) > {quote} > Somewhat related, I believe Python standard libraries can produce this form > of zone offset. The system I got the data from is written in Python. > https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
[ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495410#comment-15495410 ] Hyukjin Kwon commented on SPARK-17545: -- Therefore, IMHO, this is not an issue, as we can work around it via the `timestampFormat` and `dateFormat` options in the current master branch. > Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset > --- > > Key: SPARK-17545 > URL: https://issues.apache.org/jira/browse/SPARK-17545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nathan Beyer > > When parsing a CSV with a date/time column that contains a variant ISO 8601 > that doesn't include a colon in the offset, casting to Timestamp fails. > Here's a simple, example CSV content. > {quote} > time > "2015-07-20T15:09:23.736-0500" > "2015-07-20T15:10:51.687-0500" > "2015-11-21T23:15:01.499-0600" > {quote} > Here's the stack trace that results from processing this data. > {quote} > 16/09/14 15:22:59 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) > {quote} > Somewhat related, I believe Python standard libraries can 
produce this form > of zone offset. The system I got the data from is written in Python. > https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
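The workaround idea is to bypass the default ISO parser and supply an explicit pattern whose zone field matches the colon-less offset, the way the CSV reader's `timestampFormat` option accepts a pattern such as {{yyyy-MM-dd'T'HH:mm:ss.SSSZ}}. A Python stand-in for the same idea (illustrative, not Spark itself), applied to the reported rows:

```python
from datetime import datetime

# Sketch of the workaround: parse with an explicit pattern whose zone
# directive (%z) matches the colon-less "-0500"/"-0600" offsets, instead of
# relying on a strict ISO parser that rejects them.
rows = [
    "2015-07-20T15:09:23.736-0500",
    "2015-07-20T15:10:51.687-0500",
    "2015-11-21T23:15:01.499-0600",
]
parsed = [datetime.strptime(r, "%Y-%m-%dT%H:%M:%S.%f%z") for r in rows]
for ts in parsed:
    print(ts.isoformat())
```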
[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
[ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495408#comment-15495408 ] Hyukjin Kwon commented on SPARK-17545: -- Hi [~nbeyer], the basic ISO format currently follows https://www.w3.org/TR/NOTE-datetime, which says {quote} 1997-07-16T19:20:30.45+01:00 {quote} is the right ISO format, where the timezone is {quote} TZD = time zone designator (Z or +hh:mm or -hh:mm) {quote} To make sure, I double-checked the full ISO 8601:2004 specification at http://www.uai.cl/images/sitio/biblioteca/citas/ISO_8601_2004en.pdf, which says {quote} ... the expression shall either be completely in basic format, in which case the minimum number of separators necessary for the required expression is used, or completely in extended format, in which case additional separators shall be used ... {quote} where the basic format is {{20160707T211822+0300}} and the extended format is {{2016-07-07T21:18:22+03:00}}. In addition, the basic format even seems to be discouraged in plain text: {quote} NOTE : The basic format should be avoided in plain text. {quote} Therefore, {{2016-07-07T21:18:22+03:00}} is the right ISO 8601:2004 form, whereas {{2016-07-07T21:18:22+0300}} is not, because the zone designator may not be in the basic format when the date and time of day are in the extended format. > Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset > --- > > Key: SPARK-17545 > URL: https://issues.apache.org/jira/browse/SPARK-17545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nathan Beyer > > When parsing a CSV with a date/time column that contains a variant ISO 8601 > that doesn't include a colon in the offset, casting to Timestamp fails. > Here's a simple, example CSV content. > {quote} > time > "2015-07-20T15:09:23.736-0500" > "2015-07-20T15:10:51.687-0500" > "2015-11-21T23:15:01.499-0600" > {quote} > Here's the stack trace that results from processing this data. 
> {quote} > 16/09/14 15:22:59 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) > {quote} > Somewhat related, I believe Python standard libraries can produce this form > of zone offset. The system I got the data from is written in Python. > https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
[jira] [Assigned] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases
[ https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17559: Assignee: (was: Apache Spark) > PeriodicGraphCheckpointer did not persist edges as expected in some cases > > > Key: SPARK-17559 > URL: https://issues.apache.org/jira/browse/SPARK-17559 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: ding >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges > are not persisted, because currently the graph is persisted only when the > vertices' storage level is none. However, there is a chance that the vertices' > storage level is not none while the edges' is none, e.g. a graph created by an > outerJoinVertices operation, whose vertices are automatically cached while its > edges are not. In this way, the edges will not be persisted when we use > PeriodicGraphCheckpointer to persist the graph. > See the minimal example below: >val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], > Int](2, sc) > val users = sc.textFile("data/graphx/users.txt") > .map(line => line.split(",")).map(parts => (parts.head.toLong, > parts.tail)) > val followerGraph = GraphLoader.edgeListFile(sc, > "data/graphx/followers.txt") > val graph = followerGraph.outerJoinVertices(users) { > case (uid, deg, Some(attrList)) => attrList > case (uid, deg, None) => Array.empty[String] > } > graphCheckpointer.update(graph)
[jira] [Commented] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases
[ https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495370#comment-15495370 ] Apache Spark commented on SPARK-17559: -- User 'dding3' has created a pull request for this issue: https://github.com/apache/spark/pull/15116 > PeriodicGraphCheckpointer did not persist edges as expected in some cases > > > Key: SPARK-17559 > URL: https://issues.apache.org/jira/browse/SPARK-17559 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: ding >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges > are not persisted, because currently the graph is persisted only when the > vertices' storage level is none. However, there is a chance that the vertices' > storage level is not none while the edges' is none, e.g. a graph created by an > outerJoinVertices operation, whose vertices are automatically cached while its > edges are not. In this way, the edges will not be persisted when we use > PeriodicGraphCheckpointer to persist the graph. > See the minimal example below: >val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], > Int](2, sc) > val users = sc.textFile("data/graphx/users.txt") > .map(line => line.split(",")).map(parts => (parts.head.toLong, > parts.tail)) > val followerGraph = GraphLoader.edgeListFile(sc, > "data/graphx/followers.txt") > val graph = followerGraph.outerJoinVertices(users) { > case (uid, deg, Some(attrList)) => attrList > case (uid, deg, None) => Array.empty[String] > } > graphCheckpointer.update(graph)
[jira] [Assigned] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases
[ https://issues.apache.org/jira/browse/SPARK-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17559: Assignee: Apache Spark > PeriodicGraphCheckpointer did not persist edges as expected in some cases > > > Key: SPARK-17559 > URL: https://issues.apache.org/jira/browse/SPARK-17559 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: ding >Assignee: Apache Spark >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges > are not persisted, because currently the graph is persisted only when the > vertices' storage level is none. However, there is a chance that the vertices' > storage level is not none while the edges' is none, e.g. a graph created by an > outerJoinVertices operation, whose vertices are automatically cached while its > edges are not. In this way, the edges will not be persisted when we use > PeriodicGraphCheckpointer to persist the graph. > See the minimal example below: >val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], > Int](2, sc) > val users = sc.textFile("data/graphx/users.txt") > .map(line => line.split(",")).map(parts => (parts.head.toLong, > parts.tail)) > val followerGraph = GraphLoader.edgeListFile(sc, > "data/graphx/followers.txt") > val graph = followerGraph.outerJoinVertices(users) { > case (uid, deg, Some(attrList)) => attrList > case (uid, deg, None) => Array.empty[String] > } > graphCheckpointer.update(graph)
[jira] [Created] (SPARK-17559) PeriodicGraphCheckpointer did not persist edges as expected in some cases
ding created SPARK-17559: Summary: PeriodicGraphCheckpointer did not persist edges as expected in some cases Key: SPARK-17559 URL: https://issues.apache.org/jira/browse/SPARK-17559 Project: Spark Issue Type: Bug Components: MLlib Reporter: ding Priority: Minor When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges are not persisted, because currently the graph is persisted only when the vertices' storage level is none. However, there is a chance that the vertices' storage level is not none while the edges' is none, e.g. a graph created by an outerJoinVertices operation, whose vertices are automatically cached while its edges are not. In this way, the edges will not be persisted when we use PeriodicGraphCheckpointer to persist the graph. See the minimal example below: val graphCheckpointer = new PeriodicGraphCheckpointer[Array[String], Int](2, sc) val users = sc.textFile("data/graphx/users.txt") .map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)) val followerGraph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt") val graph = followerGraph.outerJoinVertices(users) { case (uid, deg, Some(attrList)) => attrList case (uid, deg, None) => Array.empty[String] } graphCheckpointer.update(graph)
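The bug and its fix can be modeled in a few lines. The following is a hypothetical Python stand-in (the real code is Spark's Scala PeriodicGraphCheckpointer; all names here are illustrative): the buggy version persists the graph only when the vertices are unpersisted, so a graph whose vertices are already cached (as after outerJoinVertices) never gets its edges persisted.

```python
NONE = "NONE"
MEMORY = "MEMORY_ONLY"


class Part:
    """Stand-in for an RDD with a storage level."""
    def __init__(self, level=NONE):
        self.level = level

    def persist(self):
        if self.level == NONE:
            self.level = MEMORY


class Graph:
    """Stand-in for a GraphX graph: vertices and edges persist independently."""
    def __init__(self, v_level=NONE, e_level=NONE):
        self.vertices = Part(v_level)
        self.edges = Part(e_level)


def persist_buggy(g):
    # Mirrors the reported behavior: edges are only touched when the
    # vertices' storage level is NONE.
    if g.vertices.level == NONE:
        g.vertices.persist()
        g.edges.persist()


def persist_fixed(g):
    # The fix: check each part independently.
    if g.vertices.level == NONE:
        g.vertices.persist()
    if g.edges.level == NONE:
        g.edges.persist()


# Vertices already cached (e.g. by outerJoinVertices), edges not.
g = Graph(v_level=MEMORY)
persist_buggy(g)
print(g.edges.level)   # NONE: edges were skipped

g2 = Graph(v_level=MEMORY)
persist_fixed(g2)
print(g2.edges.level)  # MEMORY_ONLY
```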
[jira] [Comment Edited] (SPARK-17522) [MESOS] More even distribution of executors on Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495243#comment-15495243 ] Saisai Shao edited comment on SPARK-17522 at 9/16/16 3:19 AM: -- [~sunrui] I think the performance depends on the workload. For example, if your workloads are mainly ETL-like, spreading out will better leverage the network and IO bandwidth. But in some other cases such as ML, in which CPU plays a dominant role while the input data is not large, it is better to put executors together for fast data exchange and iteration. You could refer to Slider for affinity and anti-affinity resource allocation; this could be done either in the cluster manager or in upstream frameworks. was (Author: jerryshao): [~sunrui] I think the performance is depended on different workloads. For example if your workloads are mainly ETL like workloads, spreading out will better leverage the network and IO bandwidth. But in some other cases like ML, in which CPU plays a dominant role while input data is not large, it is better to put executors together for fast data exchange and iteration. You could refer to Slider for affinity and anti-affinity resource allocation, should could either be done in cluster manager or upstream frameworks. > [MESOS] More even distribution of executors on Mesos cluster > > > Key: SPARK-17522 > URL: https://issues.apache.org/jira/browse/SPARK-17522 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Sun Rui > > The MesosCoarseGrainedSchedulerBackend launches executors in a round-robin way > among accepted offers that are received at once, but it is observed that > typically executors are launched on a small number of slaves. 
> It is found that MesosCoarseGrainedSchedulerBackend mostly receives only > one offer at a time on a cluster composed of many nodes, so the round-robin > assignment of executors among offers does not have the expected result: > executors end up on fewer slave nodes than expected, which hurts data > locality. > A small experimental change to > MesosCoarseGrainedSchedulerBackend::buildMesosTasks() shows better executor > distribution among nodes: > {code} > while (launchTasks) { > launchTasks = false > for (offer <- offers) { > ... >} > + if (conf.getBoolean("spark.deploy.spreadOut", true)) { > +launchTasks = false > + } > } > tasks.toMap > {code} > One of my spark programs runs 30% faster with this change because of > better data locality.
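The effect of the proposed change can be sketched in plain Python (an illustrative model, heavily simplified from the Scala scheduler): with one executor launched per offer per pass, looping over passes packs everything onto whichever offers arrived, while stopping after a single pass when spreadOut is set leaves at most one new executor per offer, so executors spread across nodes as further offers arrive.

```python
def assign(offers_cores, executor_cores, wanted, spread_out):
    """Return executors placed per offer index (toy model of buildMesosTasks)."""
    placed = [0] * len(offers_cores)
    remaining = list(offers_cores)
    launch = True
    while launch and wanted > 0:
        launch = False
        # Round-robin: at most one executor per offer per pass.
        for i in range(len(remaining)):
            if wanted > 0 and remaining[i] >= executor_cores:
                remaining[i] -= executor_cores
                placed[i] += 1
                wanted -= 1
                launch = True
        if spread_out:
            # The proposed change: one pass only, so remaining executors
            # wait for offers from other nodes instead of packing this one.
            launch = False
    return placed


# A single large offer: without spreadOut everything lands on that one node.
print(assign([16], 4, 4, spread_out=False))  # [4]
print(assign([16], 4, 4, spread_out=True))   # [1]
```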
[jira] [Commented] (SPARK-17522) [MESOS] More even distribution of executors on Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495243#comment-15495243 ] Saisai Shao commented on SPARK-17522: - [~sunrui] I think the performance depends on the workload. For example, if your workloads are mainly ETL-like, spreading out will better leverage the network and IO bandwidth. But in some other cases such as ML, in which CPU plays a dominant role while the input data is not large, it is better to put executors together for fast data exchange and iteration. You could refer to Slider for affinity and anti-affinity resource allocation; this could be done either in the cluster manager or in upstream frameworks. > [MESOS] More even distribution of executors on Mesos cluster > > > Key: SPARK-17522 > URL: https://issues.apache.org/jira/browse/SPARK-17522 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Sun Rui > > The MesosCoarseGrainedSchedulerBackend launches executors in a round-robin way > among accepted offers that are received at once, but it is observed that > typically executors are launched on a small number of slaves. > It is found that MesosCoarseGrainedSchedulerBackend mostly receives only > one offer at a time on a cluster composed of many nodes, so the round-robin > assignment of executors among offers does not have the expected result: > executors end up on fewer slave nodes than expected, which hurts data > locality. > A small experimental change to > MesosCoarseGrainedSchedulerBackend::buildMesosTasks() shows better executor > distribution among nodes: > {code} > while (launchTasks) { > launchTasks = false > for (offer <- offers) { > ... >} > + if (conf.getBoolean("spark.deploy.spreadOut", true)) { > +launchTasks = false > + } > } > tasks.toMap > {code} > One of my spark programs runs 30% faster with this change because of > better data locality. 
[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495131#comment-15495131 ] ding commented on SPARK-5484: - I will work on this issue if nobody has taken it. > Pregel should checkpoint periodically to avoid StackOverflowError > - > > Key: SPARK-5484 > URL: https://issues.apache.org/jira/browse/SPARK-5484 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > Pregel-based iterative algorithms with more than ~50 iterations begin to slow > down and eventually fail with a StackOverflowError due to Spark's lack of > support for long lineage chains. Instead, Pregel should checkpoint the graph > periodically.
[jira] [Commented] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495125#comment-15495125 ] Apache Spark commented on SPARK-17558: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15115 > Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 > --- > > Key: SPARK-17558 > URL: https://issues.apache.org/jira/browse/SPARK-17558 > Project: Spark > Issue Type: New Feature > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > > The new patch release fixes some bugs.
[jira] [Assigned] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17558: Assignee: Apache Spark (was: Reynold Xin) > Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 > --- > > Key: SPARK-17558 > URL: https://issues.apache.org/jira/browse/SPARK-17558 > Project: Spark > Issue Type: New Feature > Components: Build >Reporter: Reynold Xin >Assignee: Apache Spark > > The new patch release fixes some bugs.
[jira] [Assigned] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-17558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17558: Assignee: Reynold Xin (was: Apache Spark) > Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 > --- > > Key: SPARK-17558 > URL: https://issues.apache.org/jira/browse/SPARK-17558 > Project: Spark > Issue Type: New Feature > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > > The new patch release fixes some bugs.
[jira] [Created] (SPARK-17558) Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
Reynold Xin created SPARK-17558: --- Summary: Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 Key: SPARK-17558 URL: https://issues.apache.org/jira/browse/SPARK-17558 Project: Spark Issue Type: New Feature Components: Build Reporter: Reynold Xin Assignee: Reynold Xin The new patch release fixes some bugs.
[jira] [Commented] (SPARK-10816) API design: window and session specification
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495032#comment-15495032 ] Tathagata Das commented on SPARK-10816: --- Yeah, it's yet to be designed. > API design: window and session specification > > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Reynold Xin >
[jira] [Commented] (SPARK-15472) Add support for writing in `csv`, `json`, `text` formats in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495015#comment-15495015 ] Reynold Xin commented on SPARK-15472: - This is done in 2.0, isn't it? cc [~zsxwing] > Add support for writing in `csv`, `json`, `text` formats in Structured > Streaming > > > Key: SPARK-15472 > URL: https://issues.apache.org/jira/browse/SPARK-15472 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin > > Support for partitioned `parquet` format in FileStreamSink was added in > SPARK-14716, now let's add support for partitioned `csv`, `json`, `text` > formats.
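For context, a sketch of what the requested API would look like once `csv`/`json`/`text` are wired into FileStreamSink, mirroring the partitioned `parquet` support from SPARK-14716. This is an illustration only; the schema, paths, and partition column are placeholders, not part of the ticket:

```scala
// Sketch, assuming the Spark 2.x DataStreamWriter API; all paths are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("partitioned-json-sink").getOrCreate()
val schema = new StructType().add("id", LongType).add("date", StringType)

spark.readStream
  .schema(schema)
  .json("/tmp/in")                           // streaming file source
  .writeStream
  .format("json")                            // the ticket asks for "csv" and "text" as well
  .partitionBy("date")                       // same partitioning contract as the parquet sink
  .option("path", "/tmp/out")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
```

The call pattern is deliberately identical to the existing parquet sink; only the `format` string changes.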
[jira] [Updated] (SPARK-10815) API design: data sources and sinks
[ https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10815: Description: The existing source/sink interface for structured streaming depends on RDDs. This dependency has two issues: 1. The RDD interface is wide and difficult to stabilize across versions. This is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. Ideally, a source/sink implementation created for Spark 2.x should work in Spark 10.x, assuming the JVM is still around. 2. It is difficult to swap in/out a different execution engine. The purpose of this ticket is to create a stable interface that addresses the above two. > API design: data sources and sinks > -- > > Key: SPARK-10815 > URL: https://issues.apache.org/jira/browse/SPARK-10815 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Reynold Xin > > The existing source/sink interface for structured streaming depends on RDDs. > This dependency has two issues: > 1. The RDD interface is wide and difficult to stabilize across versions. This > is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. > Ideally, a source/sink implementation created for Spark 2.x should work in > Spark 10.x, assuming the JVM is still around. > 2. It is difficult to swap in/out a different execution engine. > The purpose of this ticket is to create a stable interface that addresses the > above two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10815) API design: data sources and sinks
[ https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10815: Description: The existing (in 2.0) source/sink interface for structured streaming depends on RDDs. This dependency has two issues: 1. The RDD interface is wide and difficult to stabilize across versions. This is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. Ideally, a source/sink implementation created for Spark 2.x should work in Spark 10.x, assuming the JVM is still around. 2. It is difficult to swap in/out a different execution engine. The purpose of this ticket is to create a stable interface that addresses the above two. was: The existing source/sink interface for structured streaming depends on RDDs. This dependency has two issues: 1. The RDD interface is wide and difficult to stabilize across versions. This is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. Ideally, a source/sink implementation created for Spark 2.x should work in Spark 10.x, assuming the JVM is still around. 2. It is difficult to swap in/out a different execution engine. The purpose of this ticket is to create a stable interface that addresses the above two. > API design: data sources and sinks > -- > > Key: SPARK-10815 > URL: https://issues.apache.org/jira/browse/SPARK-10815 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Reynold Xin > > The existing (in 2.0) source/sink interface for structured streaming depends > on RDDs. This dependency has two issues: > 1. The RDD interface is wide and difficult to stabilize across versions. This > is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. > Ideally, a source/sink implementation created for Spark 2.x should work in > Spark 10.x, assuming the JVM is still around. > 2. It is difficult to swap in/out a different execution engine. 
> The purpose of this ticket is to create a stable interface that addresses the > above two.
[jira] [Commented] (SPARK-10816) API design: window and session specification
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495011#comment-15495011 ] Reynold Xin commented on SPARK-10816: - I guess window specification was done, but session remains to be done? cc [~marmbrus] [~tdas] [~zsxwing] > API design: window and session specification > > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Reynold Xin >
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15495006#comment-15495006 ] Reynold Xin commented on SPARK-16407: - The source/sink interface currently depends on RDDs, doesn't it? In that case, it has two issues: 1. The RDD interface is wide and difficult to stabilize across versions. This is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. Ideally, a source/sink implementation created for Spark 2.x should work in Spark 10.x, assuming the JVM is still around. 2. It is difficult to swap in/out a different execution engine. Actually I'm going to move the above into SPARK-10815 and just continue the discussion there. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
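For reference, this is roughly what a user-supplied sink looks like against the current (Spark 2.0, internal and unstable, per the discussion above) interface; the class and package names below are made up:

```scala
// Sketch of a custom sink under the Spark 2.0 StreamSinkProvider API.
// Today it can only be referenced by class name via .format(), which is
// exactly the limitation this ticket raises.
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class ConsoleCountSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit =
    println(s"batch $batchId: ${data.count()} rows")
}

class ConsoleCountSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new ConsoleCountSink
}

// Usage today: df.writeStream.format("com.example.ConsoleCountSinkProvider").start()
// The request is to also accept a provider *instance*, so sinks can take
// non-string parameters (e.g. a user function, as ForeachSink does).
```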
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15689: Target Version/s: 2.2.0 (was: 2.1.0) > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494955#comment-15494955 ] Reynold Xin commented on SPARK-16534: - [~maver1ck] thanks for the comment. That's a great point. That said, there is a big difference in whether we can implement a feature and allow users to use it for demo/learning, versus whether we realistically do a good job so it is good for use in production 24/7. It is very easy to promise the former and just ignore the latter. In this context, it would be a lot simpler to have the structured streaming architecture working in production 24/7 in the long term than the dstream architecture, **for non JVM languages**. To clarify, I wasn't suggesting streaming in Python should be killed, but rather dstream in Python isn't a good architecture for running 24/7 streaming jobs. > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494795#comment-15494795 ] Frederick Reiss commented on SPARK-16407: - With respect, I'm not seeing a whole lot of flux in those APIs, or in Structured Streaming in general. SPARK-8360, the JIRA for source and sink API design, has no description, no comments, and no changes since November of 2015. The Structured Streaming design doc hasn't changed appreciably since May. Even short-term tasks in Structured Streaming are very thin on the ground, and PR traffic is minimal. Is something going on behind closed doors that other members of the community are not aware of? > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494715#comment-15494715 ] Josh Rosen commented on SPARK-17544: 3.0.1 release fixing this should be available now: https://github.com/databricks/spark-avro/releases/tag/v3.0.1 > Timeout waiting for connection from pool, DataFrame Reader's not closing S3 > connections? > > > Key: SPARK-17544 > URL: https://issues.apache.org/jira/browse/SPARK-17544 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Amazon EMR, S3, Scala >Reporter: Brady Auen > Labels: newbie > > I have an application that loops through a text file to find files in S3 and > then reads them in, performs some ETL processes, and then writes them out. > This works for around 80 loops until I get this: > {noformat} > 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening > 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' > for reading > 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: > Timeout waiting for connection from pool > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: > Timeout waiting for connection from pool > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) > at > 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown > Source) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) > at > com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) > at > org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) > at >
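(The stack trace above is truncated in the archive.) One mitigation sometimes applied for EMRFS connection-pool exhaustion, offered here as an assumption rather than anything confirmed in this thread, is enlarging the S3 connection pool; note this can only mask a connection leak, not fix one:

```scala
// Assumption: EMRFS reads its HTTP pool size from fs.s3.maxConnections.
// Raising it buys time if connections are being leaked, but the leak remains.
spark.sparkContext.hadoopConfiguration.setInt("fs.s3.maxConnections", 200)
```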
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494681#comment-15494681 ] Gang Wu commented on SPARK-17477: - Just confirmed that this also doesn't work with the vectorized reader. What I did is as follows: 1. Created a flat hive table with schema "name: String, id: Long". But the parquet file which contains 100 rows is using "name: String, id: Int". 2. Then just did a query "select * from table" and showed the result. It works fine with DataFrame.count() and DataFrame.printSchema(), but got the following exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562) at org.apache.spark.sql.Dataset.head(Dataset.scala:1924) at org.apache.spark.sql.Dataset.take(Dataset.scala:2139) at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) at org.apache.spark.sql.Dataset.show(Dataset.scala:526) at org.apache.spark.sql.Dataset.show(Dataset.scala:486) at org.apache.spark.sql.Dataset.show(Dataset.scala:495) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int
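A minimal way to reproduce the mismatch described above (a sketch; the table name and path are placeholders, and the failure comments restate the report rather than independently verified behavior):

```scala
// Parquet data written with id: Int, while the metastore table declares id: BIGINT.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("int-vs-bigint").enableHiveSupport().getOrCreate()

spark.range(100)
  .selectExpr("cast(id as int) as id", "cast(id as string) as name")
  .write.parquet("/tmp/t_int")

spark.sql("CREATE EXTERNAL TABLE t (name STRING, id BIGINT) STORED AS PARQUET LOCATION '/tmp/t_int'")

spark.sql("select * from t").count()   // fine per the report: count() never decodes id's values
spark.sql("select * from t").show()    // per the report: NPE in OnHeapColumnVector.getLong
```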
[jira] [Commented] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
[ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494666#comment-15494666 ] Nathan Beyer commented on SPARK-17545: -- As a workaround, the following format can be set as an option for dataframe reads: {code}spark.read.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXX").csv(path){code} > Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset > --- > > Key: SPARK-17545 > URL: https://issues.apache.org/jira/browse/SPARK-17545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nathan Beyer > > When parsing a CSV with a date/time column that contains a variant ISO 8601 > that doesn't include a colon in the offset, casting to Timestamp fails. > Here's a simple example CSV content. > {quote} > time > "2015-07-20T15:09:23.736-0500" > "2015-07-20T15:10:51.687-0500" > "2015-11-21T23:15:01.499-0600" > {quote} > Here's the stack trace that results from processing this data. > {quote} > 16/09/14 15:22:59 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) > {quote} > Somewhat related, I believe 
Python standard libraries can produce this form > of zone offset. The system I got the data from is written in Python. > https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
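To illustrate the two offset spellings in plain Scala (a sketch: the XML-schema parser that `DateTimeUtils.stringToTime` delegates to accepts only the colon form "-06:00", while Python's %z emits "-0600", which java.time can parse directly):

```scala
// The colon-less offset from the CSV above, as Python's %z produces it.
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

val noColon = "2015-11-21T23:15:01.499-0600"

// javax.xml.bind.DatatypeConverter.parseDateTime(noColon) throws
// IllegalArgumentException, as in the stack trace above.

// java.time's pattern letter Z matches the "+HHMM" (no-colon) form:
val parsed = OffsetDateTime.parse(noColon,
  DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
```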
[jira] [Updated] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
[ https://issues.apache.org/jira/browse/SPARK-17545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Beyer updated SPARK-17545: - Summary: Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset (was: Spark SQL Catalyst doesn't handle ISO 8601 date with colon in offset) > Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset > --- > > Key: SPARK-17545 > URL: https://issues.apache.org/jira/browse/SPARK-17545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nathan Beyer > > When parsing a CSV with a date/time column that contains a variant ISO 8601 > that doesn't include a colon in the offset, casting to Timestamp fails. > Here's a simple, example CSV content. > {quote} > time > "2015-07-20T15:09:23.736-0500" > "2015-07-20T15:10:51.687-0500" > "2015-11-21T23:15:01.499-0600" > {quote} > Here's the stack trace that results from processing this data. > {quote} > 16/09/14 15:22:59 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) > {quote} > Somewhat related, I believe Python standard libraries can produce this 
form > of zone offset. The system I got the data from is written in Python. > https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17458: -- Assignee: (was: Herman van Hovell) > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli > Fix For: 2.1.0 > > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
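For quick reference, the reported behavior boils down to the following (column names are taken from the JIRA description; the toy rows are made up for illustration):

```scala
// Pivot with aliased aggregates; the alias is expected to become the column suffix.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

val spark = SparkSession.builder.appName("pivot-alias").master("local").getOrCreate()
import spark.implicits._

val df = Seq(
  ("foo", "one", "small", 1.0),
  ("foo", "two", "small", 2.3),
  ("bar", "two", "large", 5.5)
).toDF("A", "B", "C", "D")

val pivoted = df.groupBy("C").pivot("A")
  .agg(avg("D").as("COLD"), max("B").as("COLB"))

// Reported in 2.0.0: columns come out as e.g. "bar_avg(`D`) AS `COLD`"
// instead of the expected "bar_COLD" — the alias text is appended verbatim
// (via aggregate.sql) rather than used as the suffix.
pivoted.columns.foreach(println)
```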
[jira] [Assigned] (SPARK-17473) jdbc docker tests are failing with java.lang.AbstractMethodError:
[ https://issues.apache.org/jira/browse/SPARK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17473: Assignee: Apache Spark > jdbc docker tests are failing with java.lang.AbstractMethodError: > - > > Key: SPARK-17473 > URL: https://issues.apache.org/jira/browse/SPARK-17473 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Suresh Thalamati >Assignee: Apache Spark > > build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive-thriftserver > -Phive -DskipTests clean install > build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 > compile test > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; > support was removed in 8.0 > Discovery starting. > Discovery completed in 200 milliseconds. > Run starting. Expected test count is: 10 > MySQLIntegrationSuite: > Error: > 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, 9.31.117.25, 51868) > *** RUN ABORTED *** > java.lang.AbstractMethodError: > at > org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622) > at > org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357) > at > org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392) > at > org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117) > at > org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340) > at > org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726) > at > org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285) > at > 
org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126) > ... > 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook > 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17473) jdbc docker tests are failing with java.lang.AbstractMethodError:
[ https://issues.apache.org/jira/browse/SPARK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17473: Assignee: (was: Apache Spark) > jdbc docker tests are failing with java.lang.AbstractMethodError: > - > > Key: SPARK-17473 > URL: https://issues.apache.org/jira/browse/SPARK-17473 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Suresh Thalamati > > build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive-thriftserver > -Phive -DskipTests clean install > build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 > compile test > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; > support was removed in 8.0 > Discovery starting. > Discovery completed in 200 milliseconds. > Run starting. Expected test count is: 10 > MySQLIntegrationSuite: > Error: > 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, 9.31.117.25, 51868) > *** RUN ABORTED *** > java.lang.AbstractMethodError: > at > org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622) > at > org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357) > at > org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392) > at > org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117) > at > org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340) > at > org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726) > at > org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285) > at > 
org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126) > ... > 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook > 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17473) jdbc docker tests are failing with java.lang.AbstractMethodError:
[ https://issues.apache.org/jira/browse/SPARK-17473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494597#comment-15494597 ] Apache Spark commented on SPARK-17473: -- User 'sureshthalamati' has created a pull request for this issue: https://github.com/apache/spark/pull/15114 > jdbc docker tests are failing with java.lang.AbstractMethodError: > - > > Key: SPARK-17473 > URL: https://issues.apache.org/jira/browse/SPARK-17473 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Suresh Thalamati > > build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive-thriftserver > -Phive -DskipTests clean install > build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 > compile test > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; > support was removed in 8.0 > Discovery starting. > Discovery completed in 200 milliseconds. > Run starting. Expected test count is: 10 > MySQLIntegrationSuite: > Error: > 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, 9.31.117.25, 51868) > *** RUN ABORTED *** > java.lang.AbstractMethodError: > at > org.glassfish.jersey.model.internal.CommonConfig.configureAutoDiscoverableProviders(CommonConfig.java:622) > at > org.glassfish.jersey.client.ClientConfig$State.configureAutoDiscoverableProviders(ClientConfig.java:357) > at > org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:392) > at > org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117) > at > org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340) > at > org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726) > at > 
org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285) > at > org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126) > ... > 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook > 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494591#comment-15494591 ] Andrew Ray commented on SPARK-17458: [~hvanhovell]: My JIRA username is a1ray. > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
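The naming rule proposed in the issue above can be sketched outside Spark as plain logic. This is a minimal pure-Python sketch with hypothetical stand-in classes for Catalyst's `Expression`/`NamedExpression` (not Spark's actual API); it only illustrates the proposed behavior of preferring a user-supplied alias over the aggregate's SQL text.

```python
# Sketch of the proposed outputName fix (hypothetical stand-ins, not Spark code).

class Expression:
    def __init__(self, sql):
        self.sql = sql

class NamedExpression(Expression):
    def __init__(self, sql, name):
        super().__init__(sql)
        self.name = name

def output_name(value, aggregate, single_agg=False):
    # Proposed behavior: use the user-supplied alias when one exists,
    # fall back to the aggregate's SQL text otherwise.
    suffix = aggregate.name if isinstance(aggregate, NamedExpression) else aggregate.sql
    return str(value) if single_agg else f"{value}_{suffix}"

# With an alias, the pivot column becomes "bar_COLD"
# instead of "bar_avg(`D`) AS `COLD`".
aliased = NamedExpression("avg(`D`) AS `COLD`", "COLD")
print(output_name("bar", aliased))
```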
[jira] [Issue Comment Deleted] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ray updated SPARK-17458: --- Comment: was deleted (was: [~hvanhovell] It's a1ray) > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16264) Allow the user to use operators on the received DataFrame
[ https://issues.apache.org/jira/browse/SPARK-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494587#comment-15494587 ] Jakob Odersky commented on SPARK-16264: --- I just came across this issue through a comment in the ForeachSink. I understand why Sinks would be better off by not knowing about the type of QueryExecution, however I'm not quite sure what you mean by "having something similar to foreachwriter". Is the idea to have only a single foreach sink and expose all custom user sinks as foreach writers? > Allow the user to use operators on the received DataFrame > - > > Key: SPARK-16264 > URL: https://issues.apache.org/jira/browse/SPARK-16264 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu > > Currently Sink cannot apply any operators on the given DataFrame because new > DataFrame created by the operator will use QueryExecution rather than > IncrementalExecution. > There are two options to fix this one: > 1. Merge IncrementalExecution into QueryExecution so that QueryExecution can > also deal with streaming operators. > 2. Make Dataset operators inherit the QueryExecution (IncrementalExecution is > just a subclass of QueryExecution) from its parent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
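Option 2 from the issue above can be sketched as a toy inheritance pattern. The classes below are hypothetical stand-ins (not Catalyst): the point is only that a derived Dataset reuses its parent's execution class, so a streaming (incremental) plan stays incremental after an operator is applied.

```python
# Toy sketch of "inherit the execution class from the parent"
# (assumed names, not Spark's actual classes).

class QueryExecution: ...
class IncrementalExecution(QueryExecution): ...

class Dataset:
    def __init__(self, execution_cls=QueryExecution):
        self.execution_cls = execution_cls

    def select(self):
        # Inherit the parent's execution class instead of
        # defaulting back to QueryExecution.
        return Dataset(execution_cls=self.execution_cls)

streaming = Dataset(execution_cls=IncrementalExecution)
print(streaming.select().execution_cls.__name__)  # IncrementalExecution
```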
[jira] [Resolved] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17544. Resolution: Invalid This is caused by a bug in the third-party {{spark-avro}} data source. I'm going to close this JIRA ticket, so let's continue discussion at https://github.com/databricks/spark-avro/issues/156. I have submitted a pull request (https://github.com/databricks/spark-avro/pull/174) to fix this issue and will package a new release of the Avro connector library as soon as my patch is reviewed and merged. > Timeout waiting for connection from pool, DataFrame Reader's not closing S3 > connections? > > > Key: SPARK-17544 > URL: https://issues.apache.org/jira/browse/SPARK-17544 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Amazon EMR, S3, Scala >Reporter: Brady Auen > Labels: newbie > > I have an application that loops through a text file to find files in S3 and > then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this: > {noformat} > 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening > 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' > for reading > 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: > Timeout waiting for connection from pool > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: > Timeout waiting for connection from pool > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown > Source) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) > at > 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) > at > com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) > at
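The leak pattern behind this report can be sketched with a toy bounded pool (pure Python with assumed names only; not the AWS SDK or the Hadoop S3 client): readers that are opened but never closed keep their connection leased, and once the pool is exhausted the next request times out much like the trace above.

```python
# Toy bounded connection pool illustrating pool exhaustion from leaked
# connections, and the close-on-exit fix (hypothetical sketch only).
import contextlib

class Pool:
    def __init__(self, size):
        self.free = size

    def lease(self):
        if self.free == 0:
            raise TimeoutError("Timeout waiting for connection from pool")
        self.free -= 1
        return self

    def release(self):
        self.free += 1

    @contextlib.contextmanager
    def connection(self):
        conn = self.lease()
        try:
            yield conn
        finally:
            conn.release()  # always returned, even on error

pool = Pool(size=2)
pool.lease(); pool.lease()  # two "reads" that are never closed
try:
    pool.lease()
except TimeoutError as e:
    print(e)  # Timeout waiting for connection from pool

# With the context manager, the lease is returned after every use:
fixed = Pool(size=1)
for _ in range(3):
    with fixed.connection():
        pass  # connection released at the end of each iteration
```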
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494475#comment-15494475 ] Michael Armbrust commented on SPARK-16407: -- Sure, but the bar for compatibility is different for things in explicit user facing APIs (DataFrame/StreamReader/Writer) than for stuff that involves implementing interfaces defined in packages that are documented as internal (org.apache.spark.sql.catalyst.*, org.apache.spark.sql.execution.*) > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
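The API shape under discussion can be sketched as follows. All names here are hypothetical (this is not Spark's DataStreamWriter): the idea is simply that a format/sink argument could accept either a class-name string (today's lookup path) or a provider instance directly, which is what a sink with non-string parameters needs.

```python
# Hedged sketch: resolve a sink from either a registered name or an instance
# (assumed classes, not Spark's actual StreamSinkProvider API).

class ForeachSinkProvider:
    def __init__(self, fn):
        self.fn = fn

    def create_sink(self):
        return self.fn

def resolve_sink(fmt, registry):
    if isinstance(fmt, str):  # today's path: look the provider up by name
        return registry[fmt]()
    return fmt                # proposed path: instance passed in directly

registry = {"console": lambda: ForeachSinkProvider(print)}
provider = resolve_sink("console", registry)
custom = resolve_sink(ForeachSinkProvider(lambda row: row), registry)
print(type(custom).__name__)  # ForeachSinkProvider
```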
[jira] [Comment Edited] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494475#comment-15494475 ] Michael Armbrust edited comment on SPARK-16407 at 9/15/16 8:42 PM: --- Sure, but the bar for compatibility is different for things in explicit user facing APIs (DataFrame/StreamReader/Writer) than for stuff that involves implementing interfaces defined in packages that are documented as internal ({{org.apache.spark.sql.catalyst.\*}}, {{org.apache.spark.sql.execution.\*}}) was (Author: marmbrus): Sure, but the bar for compatibility is different for things in explicit user facing APIs (DataFrame/StreamReader/Writer) than for stuff that involves implementing interfaces defined in packages that are documented as internal (org.apache.spark.sql.catalyst.*, org.apache.spark.sql.execution.*) > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17508: Assignee: Apache Spark > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Assignee: Apache Spark >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: 
An error occurred while calling o38.fit. > : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
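The expected semantics described in the report (weightCol=None means "unset", i.e. every weight is 1.0) can be sketched with the usual None-guard pattern in a plain-Python wrapper. This is a hypothetical sketch of the pattern, not PySpark's actual implementation: a parameter is only forwarded to the backend when the user actually set it.

```python
# Minimal sketch of treating None as "parameter not set"
# (hypothetical Estimator, not PySpark's wrapper code).

class Estimator:
    def __init__(self, weight_col=None):
        self._params = {}
        if weight_col is not None:  # None means "unset": all weights default to 1.0
            self._params["weightCol"] = weight_col

    def effective_weight(self, row):
        col = self._params.get("weightCol")
        return row[col] if col is not None else 1.0

row = {"label": 1.0, "weight": 2.5}
print(Estimator().effective_weight(row))                      # 1.0
print(Estimator(weight_col="weight").effective_weight(row))   # 2.5
```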
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494418#comment-15494418 ] Apache Spark commented on SPARK-17508: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/15113 > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17508: Assignee: (was: Apache Spark) > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). > However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error 
occurred while calling o38.fit. > : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361 ] Andrew Ray edited comment on SPARK-17458 at 9/15/16 8:09 PM: - [~hvanhovell] It's a1ray was (Author: a1ray): It's a1ray > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17458: -- Assignee: Herman van Hovell > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361 ] Andrew Ray commented on SPARK-17458: It's a1ray > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli > Fix For: 2.1.0 > > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17458. --- Resolution: Fixed Fix Version/s: 2.1.0 I cannot find the JIRA username of the assignee (Andrew Ray). It would be great if someone could provide this.
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
Egor Pahomov created SPARK-17557: Summary: SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary Key: SPARK-17557 URL: https://issues.apache.org/jira/browse/SPARK-17557 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov Working on 1.6.2, broken on 2.0 {code} select * from logs.a where year=2016 and month=9 and day=14 limit 100 {code} {code} java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
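The failure shape can be modeled without Parquet on the classpath. Assuming, as the stack trace suggests, that the dictionary base class rejects any decode method its concrete subclass does not override, an int-typed read against a binary (string) dictionary fails like this (these are simplified stand-ins, not Parquet's real classes):

```scala
// Simplified model of the failure: the abstract dictionary rejects every
// decode method, and a concrete dictionary overrides only the one matching
// its physical type.
abstract class Dictionary {
  def decodeToInt(id: Int): Int =
    throw new UnsupportedOperationException(getClass.getName)
  def decodeToBinary(id: Int): String =
    throw new UnsupportedOperationException(getClass.getName)
}

class PlainBinaryDictionary(values: Array[String]) extends Dictionary {
  override def decodeToBinary(id: Int): String = values(id)
}

// A reader that expects the column to be INT32 calls decodeToInt and fails,
// matching the shape of the stack trace above:
val dict: Dictionary = new PlainBinaryDictionary(Array("2016", "9", "14"))
val intDecodeFails =
  try { dict.decodeToInt(0); false }
  catch { case _: UnsupportedOperationException => true }
```

This is consistent with a table schema that declares a column (for example a partition column such as year) as int while the underlying Parquet file stores it as binary.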
[jira] [Created] (SPARK-17556) Executor side broadcast for broadcast joins
Reynold Xin created SPARK-17556: --- Summary: Executor side broadcast for broadcast joins Key: SPARK-17556 URL: https://issues.apache.org/jira/browse/SPARK-17556 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Reporter: Reynold Xin Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17364) Can not query hive table starting with number
[ https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17364. --- Resolution: Fixed Assignee: Sean Zhong Fix Version/s: 2.1.0 2.0.1 > Can not query hive table starting with number > - > > Key: SPARK-17364 > URL: https://issues.apache.org/jira/browse/SPARK-17364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Egor Pahomov >Assignee: Sean Zhong > Fix For: 2.0.1, 2.1.0 > > > I can do it with spark-1.6.2 > {code} > SELECT * from temp.20160826_ip_list limit 100 > {code} > {code} > Error: org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', > 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', > 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19) > == SQL == > SELECT * from temp.20160826_ip_list limit 100 > ---^^^ > SQLState: null > ErrorCode: 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
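The parser reads temp.20160826_ip_list as the identifier temp followed by a decimal literal, hence the error at the dot. The standard Spark SQL escape is to backtick-quote the identifier. A small, hypothetical helper (not part of Spark) illustrates the quoting rule:

```scala
// Hypothetical helper illustrating the workaround: backtick-quote an
// identifier that would otherwise be tokenized as a number.
def quoted(ident: String): String =
  if (ident.nonEmpty && ident.head.isDigit) s"`$ident`" else ident

val sql = s"SELECT * FROM temp.${quoted("20160826_ip_list")} LIMIT 100"
// sql == "SELECT * FROM temp.`20160826_ip_list` LIMIT 100"
```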
[jira] [Resolved] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures
[ https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17484. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15085 [https://github.com/apache/spark/pull/15085] > Race condition when cancelling a job during a cache write can lead to block > fetch failures > -- > > Key: SPARK-17484 > URL: https://issues.apache.org/jira/browse/SPARK-17484 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.1, 2.1.0 > > > On a production cluster, I observed the following weird behavior where a > block manager cached a block, the store failed due to a task being killed / > cancelled, and then a subsequent task incorrectly attempted to read the > cached block from the machine where the write failed, eventually leading to a > complete job failure. > Here's the executor log snippet from the machine performing the failed cache: > {code} > 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory > (estimated size 976.8 MB, free 9.8 GB) > 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed > 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID > 127) > {code} > Here's the exception from the reader in the failed job: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 > (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: > Failed to fetch block after 1 fetch failures. 
Most recent failure cause: > at > org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565) > at > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661) > {code} > I believe that there's a race condition in how we handle cleanup after failed > cache stores. Here's an excerpt from {{BlockManager.doPut()}} > {code} > var blockWasSuccessfullyStored: Boolean = false > val result: Option[T] = try { > val res = putBody(putBlockInfo) > blockWasSuccessfullyStored = res.isEmpty > res > } finally { > if (blockWasSuccessfullyStored) { > if (keepReadLock) { > blockInfoManager.downgradeLock(blockId) > } else { > blockInfoManager.unlock(blockId) > } > } else { > blockInfoManager.removeBlock(blockId) > logWarning(s"Putting block $blockId failed") > } > } > {code} > The only way that I think this "successfully stored followed by immediately > failed" case could appear in our logs is if the local memory store write > succeeds and then an exception (perhaps InterruptedException) causes us to > enter the {{finally}} block's error-cleanup path. The problem is that the > {{finally}} block only cleans up the block's metadata rather than performing > the full cleanup path which would also notify the master that the block is no > longer available at this host. > The fact that the Spark task was not resilient in the face of remote block > fetches is a separate issue which I'll report and fix separately. The scope > of this JIRA, however, is the fact that Spark still attempted reads from a > machine which was missing the block. > In order to fix this problem, I think that the {{finally}} block should > perform more thorough cleanup and should send a "block removed" status update > to the master following any error during the write. 
This is necessary because > the body of {{doPut()}} may have already notified the master of block > availability. In addition, we can extend the block serving code path to > automatically update the master with "block deleted" statuses whenever the > block manager receives invalid requests for blocks that it doesn't have. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
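The fuller cleanup described above can be sketched with a toy block manager (names are illustrative, not Spark's real BlockManager API): a failed put must remove the block locally and tell the master, otherwise other executors keep trying to fetch it from this host.

```scala
import scala.collection.mutable

// Toy model of the doPut cleanup path.
class ToyBlockManager(masterIndex: mutable.Set[String]) {
  private val local = mutable.Map.empty[String, Array[Byte]]

  def doPut(blockId: String)(body: => Array[Byte]): Boolean = {
    var stored = false
    try {
      masterIndex += blockId        // the master may learn of the block early
      local(blockId) = body
      stored = true
      true
    } finally {
      if (!stored) {
        local.remove(blockId)       // what the buggy path already did
        masterIndex -= blockId      // the missing "block removed" update
      }
    }
  }
}

val master = mutable.Set.empty[String]
val bm = new ToyBlockManager(master)
bm.doPut("rdd_25_0") { Array[Byte](1) }   // successful put
// a put whose body is interrupted mid-write, as when a task is killed:
try bm.doPut("rdd_25_1") { throw new InterruptedException("task killed") }
catch { case _: InterruptedException => () }
```

After the failed put the master no longer lists rdd_25_1, so no other executor will be directed to this host for it.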
[jira] [Updated] (SPARK-17483) Minor refactoring and cleanup in BlockManager block status reporting and block removal
[ https://issues.apache.org/jira/browse/SPARK-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17483: --- Fix Version/s: 2.0.1 > Minor refactoring and cleanup in BlockManager block status reporting and > block removal > -- > > Key: SPARK-17483 > URL: https://issues.apache.org/jira/browse/SPARK-17483 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.1, 2.1.0 > > > As a precursor to fixing a block fetch bug, I'd like to split a few small > refactorings in BlockManager into their own patch (hence this JIRA). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17429) spark sql length(1) return error
[ https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17429. --- Resolution: Fixed Assignee: cen yuhai Fix Version/s: 2.1.0 > spark sql length(1) return error > > > Key: SPARK-17429 > URL: https://issues.apache.org/jira/browse/SPARK-17429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 2.1.0 > > > select length(11); > select length(2.0); > these sql will return errors, but hive is ok. > Error in query: cannot resolve 'length(11)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '11' is of int type.; > line 1 pos 14 > Error in query: cannot resolve 'length(2.0)' due to data type mismatch: > argument 1 requires (string or binary) type, however, '2.0' is of double > type.; line 1 pos 14 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
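Hive accepts these queries because it implicitly casts numeric arguments to string; the contrast can be modeled in a few lines (an illustration, not Spark's actual TypeCoercion code):

```scala
// Strict typing rejects non-string arguments to length, while Hive-style
// coercion implicitly stringifies them.
def strictLength(v: Any): Either[String, Int] = v match {
  case s: String => Right(s.length)
  case other     => Left(s"cannot resolve 'length($other)': argument must be string or binary")
}

def hiveLength(v: Any): Int = v match {
  case s: String => s.length
  case other     => other.toString.length   // implicit cast to string, as Hive does
}
```

Under the coercing behavior, length(11) is 2 and length(2.0) is 3, matching Hive.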
[jira] [Resolved] (SPARK-17114) Adding a 'GROUP BY 1' where first column is literal results in wrong answer
[ https://issues.apache.org/jira/browse/SPARK-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17114. --- Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.1.0 2.0.1 > Adding a 'GROUP BY 1' where first column is literal results in wrong answer > --- > > Key: SPARK-17114 > URL: https://issues.apache.org/jira/browse/SPARK-17114 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0 >Reporter: Josh Rosen >Assignee: Herman van Hovell > Labels: correctness > Fix For: 2.0.1, 2.1.0 > > > Consider the following example: > {code} > sc.parallelize(Seq(128, 256)).toDF("int_col").registerTempTable("mytable") > // The following query should return an empty result set because the `IN` > filter condition is always false for this single-row table. > val withoutGroupBy = sqlContext.sql(""" > SELECT 'foo' > FROM mytable > WHERE int_col == 0 > """) > assert(withoutGroupBy.collect().isEmpty, "original query returned wrong > answer") > // After adding a 'GROUP BY 1' the query result should still be empty because > we'd be grouping an empty table: > val withGroupBy = sqlContext.sql(""" > SELECT 'foo' > FROM mytable > WHERE int_col == 0 > GROUP BY 1 > """) > assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong > answer") > {code} > Here, this fails the second assertion by returning a single row. It appears > that running {{group by 1}} where column 1 is a constant causes filter > conditions to be ignored. > Both PostgreSQL and SQLite return empty result sets for the query containing > the {{GROUP BY}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
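The semantics both assertions in the report expect can be written against plain Scala collections: GROUP BY over a constant key still operates on the filtered rows, so an empty filtered input must yield an empty result.

```scala
val table = Seq(128, 256)                       // mytable with int_col

// SELECT 'foo' FROM mytable WHERE int_col == 0
val withoutGroupBy = table.filter(_ == 0).map(_ => "foo")

// ... GROUP BY 1, where column 1 is the constant 'foo'
val withGroupBy = table.filter(_ == 0)
  .groupBy(_ => "foo")                          // grouping an empty input
  .keys
```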
[jira] [Resolved] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write
[ https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17547. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 1.6.3 Issue resolved by pull request 15104 [https://github.com/apache/spark/pull/15104] > Temporary shuffle data files may be leaked following exception in write > --- > > Key: SPARK-17547 > URL: https://issues.apache.org/jira/browse/SPARK-17547 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.5.3, 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > SPARK-8029 modified shuffle writers to first stage their data to a temporary > file in the same directory as the final destination file and then to > atomically rename the file at the end of the write job. However, this change > introduced the potential for the temporary output file to be leaked if an > exception occurs during the write because the shuffle writers' existing error > cleanup code doesn't handle this new temp file. > This is easy to fix: we just need to add a {{finally}} block to ensure that > the temporary file is guaranteed to be either moved or deleted before > existing the shuffle write method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
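The fix described above reduces to a standard commit-or-cleanup pattern; here is a sketch in plain java.nio (function and file names are illustrative, not the shuffle writers' real code):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// The temp file is guaranteed to be either renamed into place or deleted,
// even when the write body throws.
def writeViaTemp(tmp: Path, dest: Path)(write: Path => Unit): Unit = {
  var committed = false
  try {
    write(tmp)
    Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE)
    committed = true
  } finally {
    if (!committed) Files.deleteIfExists(tmp)   // the missing cleanup
  }
}

// Simulate a task that dies mid-write:
val dir  = Files.createTempDirectory("shuffle-test")
val tmp  = dir.resolve("shuffle_0_0_0.data.tmp")
val dest = dir.resolve("shuffle_0_0_0.data")
try writeViaTemp(tmp, dest) { p =>
  Files.write(p, Array[Byte](1, 2, 3))
  throw new RuntimeException("simulated task failure")
} catch { case _: RuntimeException => () }
```

After the simulated failure, neither the temp file nor the destination file remains, so nothing is leaked.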
[jira] [Resolved] (SPARK-17379) Upgrade netty-all to 4.0.41.Final (4.1.5-Final not compatible)
[ https://issues.apache.org/jira/browse/SPARK-17379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17379. -- Resolution: Fixed Fix Version/s: 2.1.0 > Upgrade netty-all to 4.0.41.Final (4.1.5-Final not compatible) > -- > > Key: SPARK-17379 > URL: https://issues.apache.org/jira/browse/SPARK-17379 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.6.2, 2.0.0 >Reporter: Adam Roberts >Assignee: Adam Roberts >Priority: Trivial > Fix For: 2.1.0 > > > We should use the newest version of netty based on info here: > http://netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html, especially > interested in the static initialiser deadlock fix: > https://github.com/netty/netty/pull/5730 > Lots more fixes mentioned so will create the pull request - again a case of > updating the pom and then the dependency files -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17451) CoarseGrainedExecutorBackend should inform driver before self-kill
[ https://issues.apache.org/jira/browse/SPARK-17451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17451. -- Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.1.0 > CoarseGrainedExecutorBackend should inform driver before self-kill > -- > > Key: SPARK-17451 > URL: https://issues.apache.org/jira/browse/SPARK-17451 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > `CoarseGrainedExecutorBackend` in some failure cases exits the JVM. While > this does not have any issue, from the driver UI there is no specific reason > captured for this. `exitExecutor` [0] already has the reason which could be > passed on to the driver. > [0] : > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L151 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
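A minimal sketch of the proposed behavior, with the driver notification and the JVM halt injected as functions so nothing actually exits (illustrative, not the real CoarseGrainedExecutorBackend API; the reason string is made up):

```scala
// The exit reason is delivered to the driver before the JVM is torn down,
// so the driver UI can show why the executor died.
def exitExecutor(code: Int, reason: String)
                (notifyDriver: String => Unit, halt: Int => Unit): Unit = {
  notifyDriver(s"Executor self-exiting due to: $reason")
  halt(code)   // only after the driver has been told
}

val sent = scala.collection.mutable.Buffer.empty[String]
var exitCode = 0
exitExecutor(1, "Unable to register with driver")(sent += _, c => exitCode = c)
```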
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493936#comment-15493936 ] holdenk commented on SPARK-16407: - I think it's important to keep in mind that these APIs are already exposed - you can pass in a custom sink by full class name as a string. So while the current Sink API does indeed expose a micro-batch API, that seems somewhat orthogonal to allowing people to specify Sources/Sinks programmatically - the Source/Sink APIs/interface itself needs to be improved, but even then it still makes sense to allow people to specify their own Sources/Sinks programmatically. It might make sense to go back in and mark them as experimental - but if anything that is a reason to keep improving them. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for a user equivalent of ForeachSink or another sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
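A toy model of the two ways a sink could be supplied (not Spark's real DataStreamWriter API): resolution from a format string, which is roughly what works today, versus handing over an instance directly, which is the only way to carry non-string state such as a callback.

```scala
trait StreamSinkProvider {
  def createSink(options: Map[String, String]): String
}

class ConsoleSinkProvider extends StreamSinkProvider {
  def createSink(options: Map[String, String]): String = "console"
}

// String-based resolution, roughly what format("com.example.MySink") does today:
val registry = Map("console" -> new ConsoleSinkProvider)
def resolve(format: String): StreamSinkProvider =
  registry.getOrElse(format, sys.error(s"unknown sink format: $format"))

// A sink parameterized by a function cannot be described by a string at all,
// which is the motivation for accepting provider instances directly:
class ForeachSinkProvider(f: String => Unit) extends StreamSinkProvider {
  def createSink(options: Map[String, String]): String = { f("open"); "foreach" }
}

var opened = ""
val custom = new ForeachSinkProvider(s => opened = s)
val sinkName = custom.createSink(Map.empty)
```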
[jira] [Assigned] (SPARK-17549) InMemoryRelation doesn't scale to large tables
[ https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17549: Assignee: (was: Apache Spark) > InMemoryRelation doesn't scale to large tables > -- > > Key: SPARK-17549 > URL: https://issues.apache.org/jira/browse/SPARK-17549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin > Attachments: create_parquet.scala, example_1.6_post_patch.png, > example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch > > > An {{InMemoryRelation}} is created when you cache a table; but if the table > is large, defined by either having a really large amount of columns, or a > really large amount of partitions (in the file split sense, not the "table > partition" sense), or both, it causes an immense amount of memory to be used > in the driver. > The reason is that it uses an accumulator to collect statistics about each > partition, and instead of summarizing the data in the driver, it keeps *all* > entries in memory. > I'm attaching a script I used to create a parquet file with 20,000 columns > and a single row, which I then copied 500 times so I'd have 500 partitions. > When doing the following: > {code} > sqlContext.read.parquet(...).count() > {code} > Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the > settings I used, but it works.) > I ran spark-shell like this: > {code} > ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g > --conf spark.executor.memory=2g > {code} > And ran: > {code} > sqlContext.read.parquet(...).cache().count() > {code} > You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 > partitions were processed, there were 40 GenericInternalRow objects with > 100,000 items each (5 stat info fields * 20,000 columns). 
So, memory usage > was: > {code} > 40 * 100,000 * (4 * 20 + 24) = 416,000,000 bytes =~ 400MB > {code} > (Note: Integer = 20 bytes, Long = 24 bytes.) > If I waited until the end, there would be 500 partitions, so ~ 5GB of memory > to hold the stats. > I'm also attaching a patch I made on top of 1.6 that uses just a long > accumulator to capture the table size; with that patch memory usage on the > driver doesn't keep growing. Also note in the patch that I'm multiplying the > column size by the row count, which I think is a different bug in the > existing code (those stats should be for the whole batch, not just a single > row, right?). I also added {{example_1.6_post_patch.png}} to show the > {{InMemoryRelation}} with the patch. > I also applied a very similar patch on top of Spark 2.0. But there things > blow up even more spectacularly when I try to run the count on the cached > table. It starts with this error: > {noformat} > 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, > vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: > Index: 63235, Size: 1 > (lots of generated code here...) 
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556) > at > org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572) > at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513) > at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644) > at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623) > at org.codehaus.janino.util.ClassFile.(ClassFile.java:280) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883) > ... 54 more > {noformat} > And basically a lot of that going on making the output unreadable, so I just > killed the
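The driver-memory estimate in the description can be checked with a few lines, taking the figures from the note (boxed Integer = 20 bytes, Long = 24 bytes, 5 stat fields per column, 20,000 columns, and 4 ints plus 1 long per stat entry):

```scala
// Worked version of the memory estimate from the description.
val entriesPerPartition = 5 * 20000             // 100,000 stat entries
val bytesPerEntry       = 4 * 20 + 24           // 104 bytes

val after40Partitions  = 40L  * entriesPerPartition * bytesPerEntry
val after500Partitions = 500L * entriesPerPartition * bytesPerEntry
// after40Partitions  = 416,000,000 bytes, roughly the ~400MB observed
// after500Partitions = 5,200,000,000 bytes, roughly the projected ~5GB
```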
[jira] [Commented] (SPARK-17549) InMemoryRelation doesn't scale to large tables
[ https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493929#comment-15493929 ] Apache Spark commented on SPARK-17549: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/15112
[jira] [Assigned] (SPARK-17549) InMemoryRelation doesn't scale to large tables
[ https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17549: Assignee: Apache Spark > InMemoryRelation doesn't scale to large tables > -- > > Key: SPARK-17549 > URL: https://issues.apache.org/jira/browse/SPARK-17549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > Attachments: create_parquet.scala, example_1.6_post_patch.png, > example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch > > > An {{InMemoryRelation}} is created when you cache a table; but if the table > is large, defined by either having a really large amount of columns, or a > really large amount of partitions (in the file split sense, not the "table > partition" sense), or both, it causes an immense amount of memory to be used > in the driver. > The reason is that it uses an accumulator to collect statistics about each > partition, and instead of summarizing the data in the driver, it keeps *all* > entries in memory. > I'm attaching a script I used to create a parquet file with 20,000 columns > and a single row, which I then copied 500 times so I'd have 500 partitions. > When doing the following: > {code} > sqlContext.read.parquet(...).count() > {code} > Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the > settings I used, but it works.) > I ran spark-shell like this: > {code} > ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g > --conf spark.executor.memory=2g > {code} > And ran: > {code} > sqlContext.read.parquet(...).cache().count() > {code} > You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 > partitions were processed, there were 40 GenericInternalRow objects with > 100,000 items each (5 stat info fields * 20,000 columns). 
So, memory usage > was: > {code} > 40 * 10^5 * (4 * 20 + 24) = 41600 * 10^4 =~ 400MB > {code} > (Note: Integer = 20 bytes, Long = 24 bytes.) > If I waited until the end, there would be 500 partitions, so ~ 5GB of memory > to hold the stats. > I'm also attaching a patch I made on top of 1.6 that uses just a long > accumulator to capture the table size; with that patch memory usage on the > driver doesn't keep growing. Also note in the patch that I'm multiplying the > column size by the row count, which I think is a different bug in the > existing code (those stats should be for the whole batch, not just a single > row, right?). I also added {{example_1.6_post_patch.png}} to show the > {{InMemoryRelation}} with the patch. > I also applied a very similar patch on top of Spark 2.0. But there things > blow up even more spectacularly when I try to run the count on the cached > table. It starts with this error: > {noformat} > 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, > vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: > Index: 63235, Size: 1 > (lots of generated code here...) 
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556) > at > org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572) > at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513) > at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644) > at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623) > at org.codehaus.janino.util.ClassFile.(ClassFile.java:280) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883) > ... 54 more > {noformat} > And basically a lot of that going on making the output unreadable,
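A quick back-of-the-envelope check of the driver-memory figures in the description above (plain Python, not Spark code; the ~104-byte per-entry cost is the figure implied by the report's own totals, not a measured value):

```python
# Sketch of the accumulator-growth arithmetic from the report above.
# Each cached partition contributes one stats row holding 5 boxed stat
# fields per column, and the driver keeps every row instead of folding
# them into a running summary.
FIELDS_PER_COLUMN = 5
BYTES_PER_ENTRY = 4 * 20 + 24  # boxed Integer ~20 B, boxed Long ~24 B

def driver_stats_bytes(partitions, columns):
    entries = partitions * columns * FIELDS_PER_COLUMN
    return entries * BYTES_PER_ENTRY

print(driver_stats_bytes(40, 20_000))    # 416000000, i.e. ~400 MB
print(driver_stats_bytes(500, 20_000))   # 5200000000, i.e. ~5 GB
```

Growth is linear in partitions * columns, which is why a single long accumulator (as in the attached patch) keeps driver memory flat instead.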
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493777#comment-15493777 ] Apache Spark commented on SPARK-17458: -- User 'aray' has created a pull request for this issue: https://github.com/apache/spark/pull/15111 > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli > > When using pivot and multiple aggregations we need to alias to avoid special > characters, but alias does not help because > df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show > ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) > AS `COLD` || foo_max(`B`) AS `COLB` || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > Expected Output > ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB || > |small| 5.5| two|2.3335| > two| > |large| 5.5| two| 2.0| > one| > One approach you can fix this issue is to change the class > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala > and change the outputName method in > {code} > object ResolvePivot extends Rule[LogicalPlan] { > def apply(plan: LogicalPlan): LogicalPlan = plan transform { > {code} > {code} > def outputName(value: Literal, aggregate: Expression): String = { > val suffix = aggregate match { > case n: NamedExpression => > aggregate.asInstanceOf[NamedExpression].name > case _ => aggregate.sql >} > if (singleAgg) value.toString else value + "_" + suffix > } > {code} > Version : 2.0.0 > {code} > def outputName(value: Literal, aggregate: Expression): String = { > if (singleAgg) value.toString else value + "_" + aggregate.sql > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
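To make the proposed change concrete, here is a plain-Python sketch of the naming logic (illustrative only; the real code is the Scala outputName shown above, and `alias` stands in for the NamedExpression case):

```python
# "before" mirrors the 2.0.0 behavior: pivot value + "_" + the aggregate's
# full SQL text, so the user's alias leaks into the column name.
def output_name_before(value, agg_sql, single_agg=False):
    return str(value) if single_agg else f"{value}_{agg_sql}"

# "after" mirrors the proposed fix: prefer the alias when the aggregate
# is a named expression, fall back to its SQL text otherwise.
def output_name_after(value, agg_sql, alias=None, single_agg=False):
    suffix = alias if alias is not None else agg_sql
    return str(value) if single_agg else f"{value}_{suffix}"

print(output_name_before("bar", "avg(`D`) AS `COLD`"))          # bar_avg(`D`) AS `COLD`
print(output_name_after("bar", "avg(`D`) AS `COLD`", "COLD"))   # bar_COLD
```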
[jira] [Assigned] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17458: Assignee: (was: Apache Spark) > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17458: Assignee: Apache Spark > Alias specified for aggregates in a pivot are not honored > - > > Key: SPARK-17458 > URL: https://issues.apache.org/jira/browse/SPARK-17458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ravi Somepalli >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493395#comment-15493395 ] Jonathan Taws edited comment on SPARK-15917 at 9/15/16 4:05 PM: Hi Andrew, Your 2 suggestions make a lot of sense, I'll work on it straight away. Give me a few days to try them out. By the way, do you have any pointers to where I could look to produce a meaningful warning when there aren't enough resources to comply with all the parameters? was (Author: jonathantaws): Hi Andrew, Your 2 suggestions make a lot of sense, I'll work on it straight away. Give me a few days to try them out. > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. > Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. 
> From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
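The allocation arithmetic behind the 8-core/30GB example above can be sketched in plain Python (assumption: standalone keeps launching executors on a worker until either cores or memory run out, whichever comes first):

```python
# Sketch of how a standalone worker ends up with more executors than
# intended: the master packs executors onto the worker until either
# cores or memory are exhausted.
def executors_spawned(worker_cores, worker_mem_gb, exec_cores, exec_mem_gb):
    by_cores = worker_cores // exec_cores   # executors the cores allow
    by_mem = worker_mem_gb // exec_mem_gb   # executors the memory allows
    return min(by_cores, by_mem)

# 8 cores / 30 GB worker, 2 cores + 10 GB per executor:
print(executors_spawned(8, 30, 2, 10))  # 3, not the 2 the reporter wanted

# The spark.cores.max workaround caps total cores instead (4 cores -> 2 executors):
print(min(executors_spawned(8, 30, 2, 10), 4 // 2))  # 2
```

This is why the memory limit, not the requested count, decides the outcome, and why a direct --num-executors-style cap would be less surprising.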
[jira] [Updated] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17544: -- Description: I have an application that loops through a text file to find files in S3 and then reads them in, performs some ETL processes, and then writes them out. This works for around 80 loops until I get this: {noformat} 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' for reading 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown Source) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:452) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:458) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:451) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:26) at
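One mitigation that comes up for this class of pool-exhaustion error is enlarging the S3 client's connection pool via Spark's Hadoop-conf passthrough (spark.hadoop.* properties are copied into the Hadoop Configuration). This is a hypothetical sketch, not a confirmed fix for this issue, and the property name fs.s3.maxConnections is an assumption about the shaded EMRFS client that should be verified against your EMR release:

```shell
# Hypothetical mitigation sketch, NOT a confirmed fix: pass a larger S3
# connection pool size through to the EMRFS client. Verify the property
# name against your EMR release before relying on it.
spark-submit \
  --conf spark.hadoop.fs.s3.maxConnections=200 \
  ... # rest of the job's arguments unchanged
```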
[jira] [Created] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon
Brad Willard created SPARK-17555: Summary: ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon Key: SPARK-17555 URL: https://issues.apache.org/jira/browse/SPARK-17555 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.5.2 Environment: Mesos using docker with external shuffle service running on Marathon. Running code from pyspark shell in client mode. Reporter: Brad Willard External Shuffle Service throws these errors about 90% of the time. It seems to die between stages and work inconsistently with this style of error about missing files. I've tested this same behavior with all the spark versions listed on the jira using the pre-built hadoop 2.6 distributions from the apache spark download page. I also want to mention everything works successfully with dynamic resource allocation turned off. I have read other related bugs and have tried some of the workarounds/suggestions. Seems like some people have blamed the switch from akka to netty which got me testing this in the 1.5* range with no luck. I'm currently running these config options (informed by reading other bugs on jira that seemed related to my problem). These settings have helped it work sometimes instead of never. 
{code} spark.shuffle.service.port 7338 spark.shuffle.io.numConnectionsPerPeer 4 spark.shuffle.io.connectionTimeout 18000s spark.shuffle.service.enabled true spark.dynamicAllocation.enabled true {code} on the driver for pyspark submit I'm sending along this config {code} --conf spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1 \ --conf spark.shuffle.service.enabled=true \ --conf spark.dynamicAllocation.enabled=true \ --conf spark.mesos.coarse=true \ --conf spark.cores.max=100 \ --conf spark.executor.uri=$SPARK_EXECUTOR_URI \ --conf spark.shuffle.service.port=7338 \ --executor-memory 15g {code} Under Marathon I'm pinning each external shuffle service to an agent and starting the service like this. {code} $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService* {code} On startup it seems like all is well {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.6.1 /_/ Using Python version 3.5.2 (default, Jul 2 2016 17:52:12) SparkContext available as sc, HiveContext available as sqlContext. >>> 16/09/15 11:35:53 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now >>> TASK_RUNNING 16/09/15 11:35:53 INFO MesosExternalShuffleClient: Successfully registered app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service. 16/09/15 11:35:55 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now TASK_RUNNING 16/09/15 11:35:55 INFO MesosExternalShuffleClient: Successfully registered app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service. 
16/09/15 11:35:56 INFO CoarseMesosSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (mesos-agent002.[redacted]:61281) with ID 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 16/09/15 11:35:56 INFO ExecutorAllocationManager: New executor 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 has registered (new total is 1) 16/09/15 11:35:56 INFO BlockManagerMasterEndpoint: Registering block manager mesos-agent002.[redacted]:46247 with 10.6 GB RAM, BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1, mesos-agent002.[redacted], 46247) 16/09/15 11:35:59 INFO CoarseMesosSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (mesos-agent004.[redacted]:42738) with ID 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0 16/09/15 11:35:59 INFO ExecutorAllocationManager: New executor 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0 has registered (new total is 2) 16/09/15 11:35:59 INFO BlockManagerMasterEndpoint: Registering block manager mesos-agent004.[redacted]:11262 with 10.6 GB RAM, BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0, mesos-agent004.[redacted], 11262) {code} I'm running a simple sort command on a spark data frame loaded from hdfs from a parquet file. These are the errors. They happen shortly after startup as agents are being allocated by mesos. {code} 16/09/15 14:49:10 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from mesos-agent004:7338 java.lang.RuntimeException: java.lang.RuntimeException: Failed to open file: /tmp/blockmgr-29f93bea-9a87-41de-9046-535401e6d4fd/30/shuffle_0_0_0.index at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:286) at
[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493555#comment-15493555 ] Brady Auen commented on SPARK-17544: That looks like my issue too, thanks Josh! > Timeout waiting for connection from pool, DataFrame Reader's not closing S3 > connections? > > > Key: SPARK-17544 > URL: https://issues.apache.org/jira/browse/SPARK-17544 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Amazon EMR, S3, Scala >Reporter: Brady Auen > Labels: newbie > > I have an application that loops through a text file to find files in S3 and > then reads them in, performs some ETL processes, and then writes them out. > This works for around 80 loops until I get this: > > 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening > 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' > for reading > 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: > Timeout waiting for connection from pool > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: > Timeout waiting for connection from pool > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown > Source) > at > 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) > at > com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) > at > org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324) > at >
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493399#comment-15493399 ] Jonathan Taws commented on SPARK-15917: --- I didn't create a specific pull request for the moment as I was waiting for your insight to clarify the expected behavior, but I can create a PR now and update it with the latest modifications as I work on it. > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. > Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. 
> This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493395#comment-15493395 ] Jonathan Taws commented on SPARK-15917: --- Hi Andrew, Your 2 suggestions make a lot of sense, I'll work on it straight away. Give me a few days to try them out. > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. > Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493075#comment-15493075 ] Srinivas Rishindra Pothireddi commented on SPARK-17538: --- Hi [~srowen], I updated the description of the jira as per your suggestion. Can you please let me know if you need anymore information > sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0 > - > > Key: SPARK-17538 > URL: https://issues.apache.org/jira/browse/SPARK-17538 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: os - linux > cluster -> yarn and local >Reporter: Srinivas Rishindra Pothireddi > > I have a production job in spark 1.6.2 that registers several dataframes as > tables. > After testing the job in spark 2.0.0, I found that one of the dataframes is > not getting registered as a table. > Line 353 of my code --> self.sqlContext.registerDataFrameAsTable(anonymousDF, > "anonymousTable") > line 354 of my code --> df = self.sqlContext.sql("select AnonymousFiled1, > AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable") > my stacktrace > File "anonymousFile.py", line 354, in anonymousMethod > df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( > AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable") > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", > line 350, in sql > return self.sparkSession.sql(sqlQuery) > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", > line 541, in sql > return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 69, in deco > raise 
AnalysisException(s.split(': ', 1)[1], stackTrace) > AnalysisException: u'Table or view not found: anonymousTable; line 1 pos 61' > The same code is working perfectly fine in spark-1.6.2 > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
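The `raise AnalysisException(s.split(': ', 1)[1], stackTrace)` frame in the trace above is pyspark's `deco` wrapper in pyspark/sql/utils.py converting the Java-side error into a Python exception; a minimal sketch of just that split (the message string below is illustrative, reconstructed from the reported error, not taken from a live session):

```python
# Sketch of how pyspark/sql/utils.py extracts the AnalysisException payload:
# the Java message arrives as "<java class>: <message>", and the wrapper drops
# the class-name prefix by splitting on the first ': '.
java_message = (
    "org.apache.spark.sql.AnalysisException: "
    "Table or view not found: anonymousTable; line 1 pos 61"
)

payload = java_message.split(': ', 1)[1]
print(payload)  # Table or view not found: anonymousTable; line 1 pos 61
```

Note that `split(': ', 1)` splits only once, so the colons inside the message itself ("not found: anonymousTable") survive intact.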
[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-17538: -- Description: I have a production job in spark 1.6.2 that registers several dataframes as tables. After testing the job in spark 2.0.0, I found that one of the dataframes is not getting registered as a table. Line 353 of my code --> self.sqlContext.registerDataFrameAsTable(anonymousDF, "anonymousTable") line 354 of my code --> df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable") my stacktrace File "anonymousFile.py", line 354, in anonymousMethod df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable") File "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 350, in sql return self.sparkSession.sql(sqlQuery) File "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 541, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) AnalysisException: u'Table or view not found: anonymousTable; line 1 pos 61' The same code is working perfectly fine in spark-1.6.2 was: I have a production job in spark 1.6.2 that registers four dataframes as tables. After testing the job in spark 2.0.0 one of the dataframes is not getting registered as a table. 
output of sqlContext.tableNames() just after registering the fourth dataframe in spark 1.6.2 is temp1,temp2,temp3,temp4 output of sqlContext.tableNames() just after registering the fourth dataframe in spark 2.0.0 is temp1,temp2,temp3 so when the table 'temp4' is used by the job at a later stage an AnalysisException is raised in spark 2.0.0 There are no changes in the code whatsoever. > sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0 > - > > Key: SPARK-17538 > URL: https://issues.apache.org/jira/browse/SPARK-17538 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: os - linux > cluster -> yarn and local >Reporter: Srinivas Rishindra Pothireddi > > I have a production job in spark 1.6.2 that registers several dataframes as > tables. > After testing the job in spark 2.0.0, I found that one of the dataframes is > not getting registered as a table. > Line 353 of my code --> self.sqlContext.registerDataFrameAsTable(anonymousDF, > "anonymousTable") > line 354 of my code --> df = self.sqlContext.sql("select AnonymousFiled1, > AnonymousUDF( AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable") > my stacktrace > File "anonymousFile.py", line 354, in anonymousMethod > df = self.sqlContext.sql("select AnonymousFiled1, AnonymousUDF( > AnonymousFiled1 ) as AnonymousFiled3 from anonymousTable") > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", > line 350, in sql > return self.sparkSession.sql(sqlQuery) > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", > line 541, in sql > return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) > File > "/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > 
"/home/anonymousUser/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 69, in deco > raise AnalysisException(s.split(': ', 1)[1], stackTrace) > AnalysisException: u'Table or view not found: anonymousTable; line 1 pos 61' > The same code is working perfectly fine in spark-1.6.2 >
[jira] [Commented] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect
[ https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492994#comment-15492994 ] Oussama Mekni commented on SPARK-16297: --- Any updates on this? > Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc > dialect > > > Key: SPARK-16297 > URL: https://issues.apache.org/jira/browse/SPARK-16297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Oussama Mekni >Priority: Minor > Original Estimate: 72h > Remaining Estimate: 72h > > Tested with SQLServer 2012 and SQLServer Express: > - Fix mapping of StringType to NVARCHAR(MAX) > - Fix mapping of BooleanType to BIT
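The proposed fix boils down to overriding two type mappings in the SQL Server JDBC dialect. A rough, hypothetical sketch of the mapping as plain data (this is not Spark's actual `JdbcDialect` API, only an illustration of the two overrides named in the issue):

```python
# Illustrative overrides the issue proposes for the SQL Server JDBC dialect.
# The Catalyst type names (StringType, BooleanType) are real; the dict and
# helper below are a sketch, not Spark code.
sqlserver_type_overrides = {
    "StringType": "NVARCHAR(MAX)",
    "BooleanType": "BIT",
}

def target_type(catalyst_type: str, generic_default: str) -> str:
    """Return the SQL Server column type for a Catalyst type, falling back
    to the generic JDBC default when there is no dialect-specific override."""
    return sqlserver_type_overrides.get(catalyst_type, generic_default)

print(target_type("BooleanType", "BOOLEAN"))  # BIT
print(target_type("IntegerType", "INTEGER"))  # INTEGER
```

In Spark itself this corresponds to a dialect returning a dialect-specific JDBC type for these two Catalyst types instead of the generic default.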
[jira] [Commented] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492937#comment-15492937 ] Srinivas Rishindra Pothireddi commented on SPARK-17538: --- Hi [~srowen], I will fix this as soon as possible as you suggested. > sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0 > - > > Key: SPARK-17538 > URL: https://issues.apache.org/jira/browse/SPARK-17538 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: os - linux > cluster -> yarn and local >Reporter: Srinivas Rishindra Pothireddi > > I have a production job in spark 1.6.2 that registers four dataframes as > tables. After testing the job in spark 2.0.0 one of the dataframes is not > getting registered as a table. > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 1.6.2 is > temp1,temp2,temp3,temp4 > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 2.0.0 is > temp1,temp2,temp3 > so when the table 'temp4' is used by the job at a later stage an > AnalysisException is raised in spark 2.0.0 > There are no changes in the code whatsoever. > >
[jira] [Commented] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression
[ https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492931#comment-15492931 ] Weichen Xu commented on SPARK-17281: Because AFTSurvivalRegression currently uses `treeAggregate`, and the `depth` default is 2. But in many cases AFTSurvivalRegression only trains on low-dimensional data, and in such cases setting `treeAggregateDepth` to 1 gives better performance. So I think adding this parameter is useful. Thanks! > Add treeAggregateDepth parameter for AFTSurvivalRegression > -- > > Key: SPARK-17281 > URL: https://issues.apache.org/jira/browse/SPARK-17281 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > Add treeAggregateDepth parameter for AFTSurvivalRegression.
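For readers unfamiliar with `treeAggregate`: per-partition results are merged through intermediate combine rounds controlled by `depth`. A pure-Python toy model (not Spark's implementation) showing that the result is identical at any depth, while depth only changes how many intermediate merge rounds happen before the final reduce reaches the driver:

```python
from functools import reduce

def tree_aggregate(partitions, seq_op, comb_op, zero, depth=2, scale=2):
    """Toy model of RDD.treeAggregate: aggregate inside each partition with
    seq_op, then merge the partial results in pairwise rounds with comb_op.
    depth=1 behaves like a flat aggregate (one merge pass at the driver)."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    rounds = 0
    while depth > 1 and len(partials) > 1 and rounds < depth - 1:
        # One intermediate round: merge neighbouring partials in groups.
        partials = [reduce(comb_op, partials[i:i + scale])
                    for i in range(0, len(partials), scale)]
        rounds += 1
    return reduce(comb_op, partials)

data = [[1, 2], [3, 4], [5, 6], [7, 8]]
add = lambda a, b: a + b
print(tree_aggregate(data, add, add, 0, depth=1))  # 36
print(tree_aggregate(data, add, add, 0, depth=2))  # 36
```

With few partitions of low-dimensional partials, the intermediate rounds are pure overhead, which is the performance argument made in the comment for allowing depth 1.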
[jira] [Commented] (SPARK-17406) Event Timeline will be very slow when there are too many executor events
[ https://issues.apache.org/jira/browse/SPARK-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492919#comment-15492919 ] Apache Spark commented on SPARK-17406: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/15110 > Event Timeline will be very slow when there are too many executor events > > > Key: SPARK-17406 > URL: https://issues.apache.org/jira/browse/SPARK-17406 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.2, 2.0.0 >Reporter: cen yuhai >Assignee: cen yuhai >Priority: Minor > Fix For: 2.1.0 > > Attachments: timeline1.png, timeline2.png > > > The job page will be too slow to open when there are thousands of executor > events (added or removed). I found that in the ExecutorsTab file, executorIdToData > never removes elements, so it grows over time. Before this pr, it > looks like timeline1.png. After this pr, it looks like timeline2.png (we can > set how many events will be displayed)
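The fix described above amounts to bounding how many executor events the timeline keeps instead of letting the collection grow for the lifetime of the application. The same idea can be sketched with a fixed-size buffer (the field name and limit here are illustrative, not the actual ExecutorsTab code):

```python
from collections import deque

# Illustrative: retain only the most recent N executor added/removed events.
# Older events are evicted automatically, so the job page renders a bounded
# timeline no matter how long the application churns executors.
MAX_TIMELINE_EVENTS = 250  # hypothetical display limit

executor_events = deque(maxlen=MAX_TIMELINE_EVENTS)
for i in range(10_000):  # simulate heavy executor churn
    executor_events.append(("executor-added", i))

print(len(executor_events))  # 250
```

A `deque` with `maxlen` discards from the opposite end on append, which gives the "keep the latest N" behaviour in O(1) per event.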
[jira] [Assigned] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17501: Assignee: (was: Apache Spark) > Re-register BlockManager again and again > > > Key: SPARK-17501 > URL: https://issues.apache.org/jira/browse/SPARK-17501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: cen yuhai >Priority: Minor > > After many times re-register, executor will exit because of timeout > exception > {code} > 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master. 
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot > register with driver: > spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168 > org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 > seconds. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Failure.recover(Try.scala:185) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at >
[jira] [Assigned] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17501: Assignee: Apache Spark > Re-register BlockManager again and again > > > Key: SPARK-17501 > URL: https://issues.apache.org/jira/browse/SPARK-17501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: cen yuhai >Assignee: Apache Spark >Priority: Minor > > After many times re-register, executor will exit because of timeout > exception > {code} > 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master. 
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot > register with driver: > spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168 > org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 > seconds. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Failure.recover(Try.scala:185) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at >
[jira] [Commented] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492860#comment-15492860 ] Apache Spark commented on SPARK-17501: -- User 'cenyuhai' has created a pull request for this issue: https://github.com/apache/spark/pull/15109 > Re-register BlockManager again and again > > > Key: SPARK-17501 > URL: https://issues.apache.org/jira/browse/SPARK-17501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: cen yuhai >Priority: Minor > > After many times re-register, executor will exit because of timeout > exception > {code} > 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master. 
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot > register with driver: > spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168 > org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 > seconds. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Failure.recover(Try.scala:185) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at >
[jira] [Updated] (SPARK-17281) Add treeAggregateDepth parameter for AFTSurvivalRegression
[ https://issues.apache.org/jira/browse/SPARK-17281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17281: -- Priority: Minor (was: Major) What's the use case, just consistency? Why not on more jobs than just these? > Add treeAggregateDepth parameter for AFTSurvivalRegression > -- > > Key: SPARK-17281 > URL: https://issues.apache.org/jira/browse/SPARK-17281 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > Add treeAggregateDepth parameter for AFTSurvivalRegression.
[jira] [Comment Edited] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484185#comment-15484185 ] cen yuhai edited comment on SPARK-17501 at 9/15/16 9:19 AM: I can hardly reproduce this error, but I may have found the root cause. In HeartbeatReceiver, the executor is recorded in executorLastSeen, but the BlockManager is recorded in blockManagerInfo in BlockManagerMasterEndpoint. It should not re-register the BlockManager; I think just putting it back into executorLastSeen would resolve this problem. was (Author: cenyuhai): I can't hardly reproduce this error. But maybe I found the root cause. In HeatbeatReceiver, executor is record by executorLastSeen. But Blockmanager is record by blockManagerInfo in BlockManagerMasterEndpoint.It should not register BlockManager,Executor need to send RegisterExecutor. > Re-register BlockManager again and again > > > Key: SPARK-17501 > URL: https://issues.apache.org/jira/browse/SPARK-17501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: cen yuhai >Priority: Minor > > After many times re-register, executor will exit because of timeout > exception > {code} > 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master. 
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master. 
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot > register with driver: > spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168 > org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 > seconds. This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at >
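The suspected root cause can be modelled in a few lines: the driver checks executorLastSeen on every heartbeat, but the executor's response only re-registers its BlockManager, so an executor that was dropped from executorLastSeen is told to re-register again and again, exactly as the log shows. A toy sketch (class and field names are illustrative, not Spark's actual code):

```python
class ToyHeartbeatReceiver:
    """Toy model of the SPARK-17501 loop: heartbeats are validated against
    executor_last_seen, but a re-registration only touches the BlockManager
    registry, so a forgotten executor can never get back into the set it is
    actually checked against."""

    def __init__(self):
        self.executor_last_seen = set()   # what each heartbeat is checked against
        self.block_managers = set()       # what re-registration updates

    def heartbeat(self, executor_id):
        if executor_id in self.executor_last_seen:
            return "ok"
        # Driver tells the executor to re-register; only the BlockManager
        # registry changes, so the next heartbeat fails the same check.
        self.block_managers.add(executor_id)
        return "re-register"

recv = ToyHeartbeatReceiver()
# The executor was expired from executor_last_seen (e.g. after a long pause):
print([recv.heartbeat("exec-1") for _ in range(3)])
# ['re-register', 're-register', 're-register']
```

The fix direction proposed in the comment is visible in the model: re-registration must restore the executor in `executor_last_seen` (or the executor must fully re-register), not merely refresh the BlockManager entry.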
[jira] [Updated] (SPARK-17523) Cannot get Spark build info from spark-core package which built in Windows
[ https://issues.apache.org/jira/browse/SPARK-17523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17523: -- Fix Version/s: (was: 2.0.1) Component/s: (was: Spark Core) > Cannot get Spark build info from spark-core package which built in Windows > -- > > Key: SPARK-17523 > URL: https://issues.apache.org/jira/browse/SPARK-17523 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Yun Tang > Labels: windows > > Currently, if we build Spark, it will generate a > 'spark-version-info.properties' and merged into spark-core_2.11-*.jar. > However, the script 'build/spark-build-info' which generates this file can > only be executed with bash environment. > Without this file, errors like below will happen when submitting Spark > application, which break the whole submitting phrase at beginning. > {code:java} > ERROR ApplicationMaster: Uncaught exception: > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:394) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:247) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:759) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:757) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) > Caused by: java.util.concurrent.ExecutionException: Boxed Error > at scala.concurrent.impl.Promise$.resolver(Promise.scala:55) > at > scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244) > at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) > at > scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:648) > Caused by: java.lang.ExceptionInInitializerError > at org.apache.spark.package$.(package.scala:91) > at org.apache.spark.package$.(package.scala) > at > org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:187) > at > org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:187) > at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:76) > at org.apache.spark.SparkContext.(SparkContext.scala:187) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2287) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:822) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:814) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:630) > Caused by: org.apache.spark.SparkException: Error while locating file > spark-version-info.properties > at > org.apache.spark.package$SparkBuildInfo$.liftedTree1$1(package.scala:75) > at org.apache.spark.package$SparkBuildInfo$.(package.scala:61) > at org.apache.spark.package$SparkBuildInfo$.(package.scala) > ... 19 more > Caused by: java.lang.NullPointerException > at java.util.Properties$LineReader.readLine(Properties.java:434) > at java.util.Properties.load0(Properties.java:353) > at
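The failure above boils down to spark-version-info.properties being absent from the jar when the build runs without bash. As a rough illustration of a cross-platform workaround (not the project's actual fix), the file can be generated without bash; every value below is a placeholder, not real build metadata:

```python
# Hedged sketch: a stand-in for build/spark-build-info (which requires bash)
# that writes a minimal spark-version-info.properties so SparkBuildInfo can
# load it. All values are placeholders, not real build metadata.
import time

def write_build_info(path, version="2.0.0"):
    props = [
        ("version", version),
        ("user", "builder"),       # placeholder
        ("revision", "unknown"),   # normally the output of `git rev-parse HEAD`
        ("branch", "unknown"),
        ("date", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())),
        ("url", "unknown"),
    ]
    with open(path, "w") as f:
        for key, value in props:
            f.write("%s=%s\n" % (key, value))

write_build_info("spark-version-info.properties")
```

The generated file would still need to be packaged into spark-core's jar for the NullPointerException above to go away.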
[jira] [Updated] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-17501: -- Priority: Minor (was: Major) > Re-register BlockManager again and again > > > Key: SPARK-17501 > URL: https://issues.apache.org/jira/browse/SPARK-17501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: cen yuhai >Priority: Minor > > After many times re-register, executor will exit because of timeout > exception > {code} > 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master. 
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat > 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with > master > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager > 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master. > 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot > register with driver: > spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168 > org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 > seconds. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) > at scala.util.Try$.apply(Try.scala:161) > at scala.util.Failure.recover(Try.scala:185) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) >
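Since the final failure is the 120-second ask timeout named in the log, one mitigation (it addresses the symptom, not the repeated re-registration itself) is to raise the relevant timeouts. The values below are illustrative only:

```properties
# spark-defaults.conf -- illustrative values, not recommendations
spark.rpc.askTimeout              240s
spark.executor.heartbeatInterval  20s
```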
[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17538: -- Shepherd: (was: Matei Zaharia) Flags: (was: Important) Affects Version/s: (was: 2.0.1) (was: 2.1.0) Target Version/s: (was: 2.0.1, 2.1.0) Labels: (was: pyspark) Priority: Major (was: Critical) Fix Version/s: (was: 2.0.1) (was: 2.1.0) [~sririshindra] please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark There's a lot wrong with how you filled this out. > sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0 > - > > Key: SPARK-17538 > URL: https://issues.apache.org/jira/browse/SPARK-17538 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: os - linux > cluster -> yarn and local >Reporter: Srinivas Rishindra Pothireddi > > I have a production job in spark 1.6.2 that registers four dataframes as > tables. After testing the job in spark 2.0.0 one of the dataframes is not > getting registered as a table. > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 1.6.2 is > temp1,temp2,temp3,temp4 > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 2.0.0 is > temp1,temp2,temp3 > so when the table 'temp4' is used by the job at a later stage an > AnalysisException is raised in spark 2.0.0 > There are no changes in the code whatsoever. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
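A minimal PySpark scaffold for reproducing the report might look like the following; it requires a running Spark installation, and the table names temp1..temp4 mirror the description while the data is made up:

```python
# Hedged repro sketch for SPARK-17538 (requires a Spark installation).
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.master("local[2]").getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

for name in ["temp1", "temp2", "temp3", "temp4"]:
    df = spark.createDataFrame([(1, name)], ["id", "src"])
    sqlContext.registerDataFrameAsTable(df, name)

# Per the report: 1.6.2 lists all four names here, 2.0.0 omits temp4.
print(sorted(sqlContext.tableNames()))
```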
[jira] [Resolved] (SPARK-17554) spark.executor.memory option not working
[ https://issues.apache.org/jira/browse/SPARK-17554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17554. --- Resolution: Invalid Questions should go to user@. Without seeing how you're running the job or what you are looking at specifically in the UI, it's hard to say. The parameter works correctly in all of my usages. > spark.executor.memory option not working > > > Key: SPARK-17554 > URL: https://issues.apache.org/jira/browse/SPARK-17554 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Sankar Mittapally > > Hi, > I am new to Spark. I have a Spark cluster with 5 slaves (each with 2 cores > and 4 GB RAM). The Spark cluster dashboard shows 1 GB of memory per node. I > tried to increase it to 2 GB by setting spark.executor.memory 2g > in defaults.conf, but it didn't work. I want to increase the memory. Please > let me know how to do that.
[jira] [Resolved] (SPARK-17505) Add setBins for BinaryClassificationMetrics in mllib/evaluation
[ https://issues.apache.org/jira/browse/SPARK-17505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17505. --- Resolution: Won't Fix > Add setBins for BinaryClassificationMetrics in mllib/evaluation > --- > > Key: SPARK-17505 > URL: https://issues.apache.org/jira/browse/SPARK-17505 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Peng Meng >Priority: Minor > Original Estimate: 6h > Remaining Estimate: 6h > > Add a setBins method for BinaryClassificationMetrics. > BinaryClassificationMetrics is a class in mllib/evaluation. numBins is a key > attribute of it. If numBins is greater than 0, then the curves (ROC curve, PR > curve) computed internally will be down-sampled to this many "bins". > It is useful to let the user set the numBins.
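To make the numBins behavior described above concrete, here is a rough pure-Python illustration of what down-sampling a curve into bins means; this is NOT Spark's implementation, and as the comment notes the resulting point count is only approximately numBins:

```python
# Rough illustration of curve down-sampling, the idea behind numBins in
# BinaryClassificationMetrics. Not Spark's implementation: consecutive points
# are grouped and averaged, so the output size is only roughly num_bins.
def downsample(points, num_bins):
    if num_bins <= 0 or len(points) <= num_bins:
        return list(points)          # numBins == 0 means no down-sampling
    size = len(points) // num_bins   # grouping factor
    out = []
    for i in range(0, len(points), size):
        chunk = points[i:i + size]
        out.append((sum(p[0] for p in chunk) / len(chunk),
                    sum(p[1] for p in chunk) / len(chunk)))
    return out

curve = [(i / 10.0, (i / 10.0) ** 2) for i in range(11)]
coarse = downsample(curve, 5)
```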
[jira] [Created] (SPARK-17554) spark.executor.memory option not working
Sankar Mittapally created SPARK-17554: - Summary: spark.executor.memory option not working Key: SPARK-17554 URL: https://issues.apache.org/jira/browse/SPARK-17554 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Sankar Mittapally Hi, I am new to Spark. I have a Spark cluster with 5 slaves (each with 2 cores and 4 GB RAM). The Spark cluster dashboard shows 1 GB of memory per node. I tried to increase it to 2 GB by setting spark.executor.memory 2g in defaults.conf, but it didn't work. I want to increase the memory. Please let me know how to do that.
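For reference, two standard ways to request larger executors are shown below; the application class and jar name are made up, and note that the setting belongs in conf/spark-defaults.conf (not "defaults.conf"):

```shell
# Illustrative only -- com.example.MyApp / myapp.jar are placeholders
spark-submit --executor-memory 2g --class com.example.MyApp myapp.jar

# or in conf/spark-defaults.conf:
#   spark.executor.memory  2g
# A standalone worker also caps what it advertises via SPARK_WORKER_MEMORY,
# so that may need raising as well.
```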
[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-17538: -- Labels: pyspark (was: ) > sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0 > - > > Key: SPARK-17538 > URL: https://issues.apache.org/jira/browse/SPARK-17538 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.1.0 > Environment: os - linux > cluster -> yarn and local >Reporter: Srinivas Rishindra Pothireddi >Priority: Critical > Labels: pyspark > Fix For: 2.0.1, 2.1.0 > > > I have a production job in spark 1.6.2 that registers four dataframes as > tables. After testing the job in spark 2.0.0 one of the dataframes is not > getting registered as a table. > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 1.6.2 is > temp1,temp2,temp3,temp4 > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 2.0.0 is > temp1,temp2,temp3 > so when the table 'temp4' is used by the job at a later stage an > AnalysisException is raised in spark 2.0.0 > There are no changes in the code whatsoever. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17553) On master change, running apps go to waiting state even though resources are available
vimal dinakaran created SPARK-17553: --- Summary: On master change, running apps go to waiting state even though resources are available Key: SPARK-17553 URL: https://issues.apache.org/jira/browse/SPARK-17553 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.6.2 Reporter: vimal dinakaran Priority: Minor When a ZooKeeper node goes down, a new master is elected and all running apps are transferred to it. This works fine, but in the Spark UI the app status changes to WAITING. The apps are actually still running, as the streaming UI shows, yet the master UI page reports them as waiting. This is misleading.
[jira] [Updated] (SPARK-17534) Increase timeouts for DirectKafkaStreamSuite tests
[ https://issues.apache.org/jira/browse/SPARK-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17534: -- Assignee: Adam Roberts > Increase timeouts for DirectKafkaStreamSuite tests > -- > > Key: SPARK-17534 > URL: https://issues.apache.org/jira/browse/SPARK-17534 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Assignee: Adam Roberts >Priority: Trivial > > The current tests in DirectKafkaStreamSuite rely on a certain number of > messages being received within a given timeframe > On machines with four+ cores and better clock speeds, this doesn't pose a > problem, but on a two core x86 box I regularly see timeouts within two tests > within this suite. > To avoid other users hitting the same problem and needlessly doing their own > investigations, let's increase the timeouts. By using 1 sec instead of 0.2 > sec batch durations and increasing the timeout to be 100 seconds not 20, we > consistently see the tests passing even on less powerful hardware > I see the problem consistently using a machine with 2x > Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD > Pull request to follow -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17534) Increase timeouts for DirectKafkaStreamSuite tests
[ https://issues.apache.org/jira/browse/SPARK-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17534: -- Priority: Trivial (was: Minor) > Increase timeouts for DirectKafkaStreamSuite tests > -- > > Key: SPARK-17534 > URL: https://issues.apache.org/jira/browse/SPARK-17534 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Priority: Trivial > > The current tests in DirectKafkaStreamSuite rely on a certain number of > messages being received within a given timeframe > On machines with four+ cores and better clock speeds, this doesn't pose a > problem, but on a two core x86 box I regularly see timeouts within two tests > within this suite. > To avoid other users hitting the same problem and needlessly doing their own > investigations, let's increase the timeouts. By using 1 sec instead of 0.2 > sec batch durations and increasing the timeout to be 100 seconds not 20, we > consistently see the tests passing even on less powerful hardware > I see the problem consistently using a machine with 2x > Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD > Pull request to follow -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17536) Minor performance improvement to JDBC batch inserts
[ https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17536. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15098 [https://github.com/apache/spark/pull/15098] > Minor performance improvement to JDBC batch inserts > --- > > Key: SPARK-17536 > URL: https://issues.apache.org/jira/browse/SPARK-17536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: John Muller >Priority: Trivial > Labels: perfomance > Fix For: 2.1.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > JDBC batch inserts currently are set to repeatedly retrieve the number of > fields inside the row iterator: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598 > val numFields = rddSchema.fields.length > This value does not change and can be set prior to the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
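The improvement above is the classic hoisting of a loop-invariant computation. A small sketch of the same idea (names are illustrative, not Spark's actual JdbcUtils internals):

```python
# Sketch of the hoisting described above: compute the loop-invariant field
# count once, before the per-row loop, instead of re-reading it every row.
def batch_rows(rows, schema_fields):
    num_fields = len(schema_fields)  # hoisted: invariant across rows
    batch = []
    for row in rows:
        if len(row) != num_fields:
            raise ValueError("row arity %d != schema arity %d"
                             % (len(row), num_fields))
        batch.append(tuple(row))
    return batch

batched = batch_rows([[1, "a"], [2, "b"]], ["id", "name"])
```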
[jira] [Updated] (SPARK-17536) Minor performance improvement to JDBC batch inserts
[ https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17536: -- Assignee: John Muller > Minor performance improvement to JDBC batch inserts > --- > > Key: SPARK-17536 > URL: https://issues.apache.org/jira/browse/SPARK-17536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: John Muller >Assignee: John Muller >Priority: Trivial > Labels: perfomance > Fix For: 2.1.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > JDBC batch inserts currently are set to repeatedly retrieve the number of > fields inside the row iterator: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598 > val numFields = rddSchema.fields.length > This value does not change and can be set prior to the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17406) Event Timeline will be very slow when there are too many executor events
[ https://issues.apache.org/jira/browse/SPARK-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17406: -- Assignee: cen yuhai > Event Timeline will be very slow when there are too many executor events > > > Key: SPARK-17406 > URL: https://issues.apache.org/jira/browse/SPARK-17406 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.2, 2.0.0 >Reporter: cen yuhai >Assignee: cen yuhai >Priority: Minor > Fix For: 2.1.0 > > Attachments: timeline1.png, timeline2.png > > > The job page will be too slow to open when there are thousands of executor > events(added or removed). I found that in ExecutorsTab file, executorIdToData > will not remove elements, it will increase all the time. Before this pr, it > looks like timeline1.png. After this pr, it looks like timeline2.png(we can > set how many events will be displayed) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17406) Event Timeline will be very slow when there are too many executor events
[ https://issues.apache.org/jira/browse/SPARK-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17406. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14969 [https://github.com/apache/spark/pull/14969] > Event Timeline will be very slow when there are too many executor events > > > Key: SPARK-17406 > URL: https://issues.apache.org/jira/browse/SPARK-17406 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.2, 2.0.0 >Reporter: cen yuhai >Priority: Minor > Fix For: 2.1.0 > > Attachments: timeline1.png, timeline2.png > > > The job page will be too slow to open when there are thousands of executor > events(added or removed). I found that in ExecutorsTab file, executorIdToData > will not remove elements, it will increase all the time. Before this pr, it > looks like timeline1.png. After this pr, it looks like timeline2.png(we can > set how many events will be displayed) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
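The root cause above is an ever-growing executorIdToData collection. The idea behind the fix, retaining only a bounded number of recent events, can be sketched with a bounded deque; the cap name below is made up, and Spark exposes its own setting for the number of displayed events:

```python
# Illustration of bounding retained events instead of letting the collection
# grow without limit. MAX_RETAINED_EVENTS is an illustrative cap only.
from collections import deque

MAX_RETAINED_EVENTS = 3
events = deque(maxlen=MAX_RETAINED_EVENTS)  # old entries are evicted
for i in range(10):
    events.append("executor-%d added" % i)

retained = list(events)  # only the 3 most recent events survive
```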
[jira] [Updated] (SPARK-17521) Error when I use sparkContext.makeRDD(Seq())
[ https://issues.apache.org/jira/browse/SPARK-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17521: -- Assignee: Jianfei Wang > Error when I use sparkContext.makeRDD(Seq()) > > > Key: SPARK-17521 > URL: https://issues.apache.org/jira/browse/SPARK-17521 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jianfei Wang >Assignee: Jianfei Wang >Priority: Trivial > Labels: easyfix > Fix For: 2.0.1, 2.1.0 > > > when i use sc.makeRDD below > ``` > val data3 = sc.makeRDD(Seq()) > println(data3.partitions.length) > ``` > I got an error: > Exception in thread "main" java.lang.IllegalArgumentException: Positive > number of slices required > We can fix this bug just modify the last line ,do a check of seq.size > > def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope { > assertNotStopped() > val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap > new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs) > } > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17521) Error when I use sparkContext.makeRDD(Seq())
[ https://issues.apache.org/jira/browse/SPARK-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17521. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15077 [https://github.com/apache/spark/pull/15077] > Error when I use sparkContext.makeRDD(Seq()) > > > Key: SPARK-17521 > URL: https://issues.apache.org/jira/browse/SPARK-17521 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jianfei Wang >Priority: Trivial > Labels: easyfix > Fix For: 2.0.1, 2.1.0 > > > when i use sc.makeRDD below > ``` > val data3 = sc.makeRDD(Seq()) > println(data3.partitions.length) > ``` > I got an error: > Exception in thread "main" java.lang.IllegalArgumentException: Positive > number of slices required > We can fix this bug just modify the last line ,do a check of seq.size > > def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope { > assertNotStopped() > val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap > new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs) > } > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 MB page size
[ https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17524: -- Assignee: Adam Roberts Priority: Trivial (was: Minor) > RowBasedKeyValueBatchSuite always uses 64 mb page size > -- > > Key: SPARK-17524 > URL: https://issues.apache.org/jira/browse/SPARK-17524 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Assignee: Adam Roberts >Priority: Trivial > Fix For: 2.1.0 > > > The appendRowUntilExceedingPageSize test at > sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java > always uses the default page size which is 64 MB for running the test > Users with less powerful machines (e.g. those with two cores) may opt to > choose a smaller spark.buffer.pageSize value in order to prevent problems > acquiring memory > If this size is reduced, let's say to 1 MB, this test will fail, here is the > problem scenario > We run with 1,048,576 page size (1 mb) > Default is 67,108,864 size (64 mb) > Test fails: java.lang.AssertionError: expected:<14563> but was:<932067> > 932,067 is 64x bigger than 14,563 and the default page size is 64x bigger > than 1 MB (which is when we see the failure) > The failure is at > Assert.assertEquals(batch.numRows(), numRows); > This minor improvement has the test use whatever the user has specified > (looks for spark.buffer.pageSize) to prevent this problem from occurring for > anyone testing Apache Spark on a box with a reduced page size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 MB page size
[ https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17524. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15079 [https://github.com/apache/spark/pull/15079] > RowBasedKeyValueBatchSuite always uses 64 mb page size > -- > > Key: SPARK-17524 > URL: https://issues.apache.org/jira/browse/SPARK-17524 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Priority: Minor > Fix For: 2.1.0 > > > The appendRowUntilExceedingPageSize test at > sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java > always uses the default page size which is 64 MB for running the test > Users with less powerful machines (e.g. those with two cores) may opt to > choose a smaller spark.buffer.pageSize value in order to prevent problems > acquiring memory > If this size is reduced, let's say to 1 MB, this test will fail, here is the > problem scenario > We run with 1,048,576 page size (1 mb) > Default is 67,108,864 size (64 mb) > Test fails: java.lang.AssertionError: expected:<14563> but was:<932067> > 932,067 is 64x bigger than 14,563 and the default page size is 64x bigger > than 1 MB (which is when we see the failure) > The failure is at > Assert.assertEquals(batch.numRows(), numRows); > This minor improvement has the test use whatever the user has specified > (looks for spark.buffer.pageSize) to prevent this problem from occurring for > anyone testing Apache Spark on a box with a reduced page size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
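A quick arithmetic check of the claim in the report above: the mismatched row counts differ by the same factor as the two page sizes (64 MB default vs 1 MB configured), which supports the diagnosis that the test ignored the configured page size:

```python
# Verify the 64x ratio stated in the report.
default_page_size = 67108864   # 64 MB, the hard-coded default
small_page_size = 1048576      # 1 MB, the user-configured value
page_ratio = default_page_size // small_page_size
row_ratio = round(932067 / 14563.0)  # observed vs expected row counts
```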
[jira] [Resolved] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17507. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15060 [https://github.com/apache/spark/pull/15060] > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the weight vector (specified by the user) in ANN, > and throw an exception promptly if it is wrong.
[jira] [Updated] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17507: -- Assignee: Weichen Xu Priority: Trivial (was: Major) > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Trivial > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the weight vector (specified by the user) in ANN, > and throw an exception promptly if it is wrong.
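A hedged sketch of the validation described above (not MLlib's actual code): fail fast with an informative error when a user-supplied initial weight vector does not match the size implied by the network layers.

```python
# Fail-fast size validation, the pattern this issue asks for. Names and
# message wording are illustrative, not MLlib's actual implementation.
def check_weights(weights, expected_size):
    if len(weights) != expected_size:
        raise ValueError("provided initial weights vector has length %d, "
                         "but the network expects %d"
                         % (len(weights), expected_size))
    return weights

ok = check_weights([0.1, 0.2, 0.3], 3)  # matching size passes through
```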