[jira] [Commented] (SPARK-16609) Single function for parsing timestamps/dates
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410501#comment-15410501 ]

Sandeep Singh commented on SPARK-16609:
---------------------------------------

I can work on this.

> Single function for parsing timestamps/dates
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Michael Armbrust
>
> Today, if you want to parse a date or timestamp, you have to use the unix
> time function and then cast to a timestamp. It's a little odd there isn't a
> single function that does both. I propose we add
> {code}
> to_date(, )/to_timestamp(, ).
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(, )}}. See:
> https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(, )}}. See:
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast( as timestamp format '')}}
> MySql: {{STR_TO_DATE(, )}}. This returns a datetime when you
> define both date and time parts. See:
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
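The single-function semantics proposed above can be sketched in plain Java (an illustration only, not Spark's implementation; the method name and argument labels are assumptions, since the original placeholders were lost in transit): parse a string directly into a timestamp value from a user-supplied pattern, with no intermediate unix-time step.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of what a single to_timestamp(expr, fmt) would do, using java.time.
// Hypothetical helper for illustration, not Spark code.
public class ToTimestampSketch {
    public static LocalDateTime toTimestamp(String value, String pattern) {
        // One call: pattern-driven parse straight to a timestamp value.
        return LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        System.out.println(toTimestamp("2016-08-06 03:55:00", "yyyy-MM-dd HH:mm:ss"));
    }
}
```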
[jira] [Commented] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API
[ https://issues.apache.org/jira/browse/SPARK-16932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410488#comment-15410488 ]

Apache Spark commented on SPARK-16932:
--------------------------------------

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/14516

> Programming-guide Accumulator section should be more clear w.r.t new API
>
> Key: SPARK-16932
> URL: https://issues.apache.org/jira/browse/SPARK-16932
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Bryan Cutler
> Priority: Trivial
>
> The programming-guide section on Accumulators starts off describing the old
> API, which is deprecated now, and then shows examples with the new API and
> ends with another code snippet of the old API. For Scala, at least, there
> should only be mention of the new API to be clear to the user what is
> recommended.
[jira] [Assigned] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API
[ https://issues.apache.org/jira/browse/SPARK-16932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16932:
------------------------------------

Assignee: Apache Spark

> Programming-guide Accumulator section should be more clear w.r.t new API
>
> Key: SPARK-16932
> URL: https://issues.apache.org/jira/browse/SPARK-16932
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Bryan Cutler
> Assignee: Apache Spark
> Priority: Trivial
>
> The programming-guide section on Accumulators starts off describing the old
> API, which is deprecated now, and then shows examples with the new API and
> ends with another code snippet of the old API. For Scala, at least, there
> should only be mention of the new API to be clear to the user what is
> recommended.
[jira] [Assigned] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API
[ https://issues.apache.org/jira/browse/SPARK-16932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16932:
------------------------------------

Assignee: (was: Apache Spark)

> Programming-guide Accumulator section should be more clear w.r.t new API
>
> Key: SPARK-16932
> URL: https://issues.apache.org/jira/browse/SPARK-16932
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Bryan Cutler
> Priority: Trivial
>
> The programming-guide section on Accumulators starts off describing the old
> API, which is deprecated now, and then shows examples with the new API and
> ends with another code snippet of the old API. For Scala, at least, there
> should only be mention of the new API to be clear to the user what is
> recommended.
[jira] [Created] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API
Bryan Cutler created SPARK-16932:
---------------------------------

Summary: Programming-guide Accumulator section should be more clear w.r.t new API
Key: SPARK-16932
URL: https://issues.apache.org/jira/browse/SPARK-16932
Project: Spark
Issue Type: Improvement
Components: Documentation
Reporter: Bryan Cutler
Priority: Trivial

The programming-guide section on Accumulators starts off describing the old
API, which is deprecated now, and then shows examples with the new API and
ends with another code snippet of the old API. For Scala, at least, there
should only be mention of the new API to be clear to the user what is
recommended.
[jira] [Closed] (SPARK-15702) Update document programming-guide accumulator section
[ https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler closed SPARK-15702.
--------------------------------

Resolution: Fixed

> Update document programming-guide accumulator section
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, Spark Core
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Fix For: 2.0.0
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> Update document programming-guide accumulator section
[jira] [Commented] (SPARK-16889) Add formatMessage Column expression for formatting strings in java.text.MessageFormat style in Scala API
[ https://issues.apache.org/jira/browse/SPARK-16889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410484#comment-15410484 ]

Sandeep Singh commented on SPARK-16889:
---------------------------------------

Why not use something like
{code}
"Argument '%s' shall not be negative. The given value was %f.".format("12", 2.0)
{code}
Scala supports string interpolation.

> Add formatMessage Column expression for formatting strings in
> java.text.MessageFormat style in Scala API
>
> Key: SPARK-16889
> URL: https://issues.apache.org/jira/browse/SPARK-16889
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Kapil Singh
>
> format_string formats its arguments printf-style and has the following major
> cons compared to the proposed java.text.MessageFormat-based function:
> 1. MessageFormat syntax is more readable since it is more explicit.
> java.util.Formatter syntax: "Argument '%s' shall not be negative. The given
> value was %f."
> java.text.MessageFormat syntax: "Argument '{0}' shall not be negative. The
> given value was {1}."
> 2. Formatter forces the user to declare the argument type (e.g. "%s" or "%f"),
> while MessageFormat infers it from the object type. For example, if the
> argument could be a string or a number, then Formatter forces us to use the
> "%s" type (passing a string to "%f" causes an exception). However, a number
> formatted with "%s" is formatted using Number.toString(), which produces an
> unlocalized value. By contrast, MessageFormat produces localized values
> dynamically for all recognized types.
> To address these drawbacks, a MessageFormat function should be added.
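To make the comparison concrete, here is a plain-JVM sketch of the two styles side by side (no Spark involved; Locale.ROOT is pinned in the printf call so its output is deterministic). Note one caveat the description glosses over: literal single quotes would need doubling in a MessageFormat pattern, so the quoted form is dropped here.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Side-by-side sketch of the printf style vs. the MessageFormat style
// discussed in the report. Plain JVM illustration, not Spark code.
public class FormatComparison {
    public static String printfStyle() {
        // Types must be declared up front: %s for the string, %f for the number.
        return String.format(Locale.ROOT,
            "Argument %s shall not be negative. The given value was %f.", "12", 2.0);
    }

    public static String messageStyle() {
        // MessageFormat infers the format from the argument's runtime type.
        // (Single quotes are escape characters in MessageFormat patterns.)
        return MessageFormat.format(
            "Argument {0} shall not be negative. The given value was {1}.", "12", 2.0);
    }

    public static void main(String[] args) {
        System.out.println(printfStyle());
        System.out.println(messageStyle());
    }
}
```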
[jira] [Commented] (SPARK-16852) RejectedExecutionException when exit at some times
[ https://issues.apache.org/jira/browse/SPARK-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410481#comment-15410481 ]

Sean Owen commented on SPARK-16852:
-----------------------------------

Does it cause any problem?

> RejectedExecutionException when exit at some times
>
> Key: SPARK-16852
> URL: https://issues.apache.org/jira/browse/SPARK-16852
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Weizhong
> Priority: Minor
>
> If we run a huge job, sometimes on exit it will print a
> RejectedExecutionException:
> {noformat}
> 16/05/27 08:30:40 ERROR client.TransportResponseHandler: Still have 3 requests outstanding when connection from HGH117808/10.184.66.104:41980 is closed
> java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.CallbackRunnable@6b66dba rejected from java.util.concurrent.ThreadPoolExecutor@60725736[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 269]
>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>     at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>     at scala.concurrent.Promise$class.complete(Promise.scala:55)
>     at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>     at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>     at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>     at scala.concurrent.Promise$class.complete(Promise.scala:55)
>     at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>     at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>     at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>     at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>     at scala.concurrent.Promise$class.tryFailure(Promise.scala:115)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
>     at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:192)
>     at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$1.apply(NettyRpcEnv.scala:214)
>     at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$1.apply(NettyRpcEnv.scala:214)
>     at org.apache.spark.rpc.netty.RpcOutboxMessage.onFailure(Outbox.scala:74)
>     at org.apache.spark.network.client.TransportResponseHandler.failOutstandingRequests(TransportResponseHandler.java:90)
>     at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>     at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
>     at
[jira] [Commented] (SPARK-16326) Evaluate sparklyr package from RStudio
[ https://issues.apache.org/jira/browse/SPARK-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410480#comment-15410480 ]

Sean Owen commented on SPARK-16326:
-----------------------------------

Is there an action for the Spark code here? I'm inclined to close this as I
don't know what outcome there would be that would reflect in the Spark code.

> Evaluate sparklyr package from RStudio
>
> Key: SPARK-16326
> URL: https://issues.apache.org/jira/browse/SPARK-16326
> Project: Spark
> Issue Type: Brainstorming
> Components: SparkR
> Reporter: Sun Rui
>
> RStudio has developed sparklyr (https://github.com/rstudio/sparklyr),
> connecting the R community to Spark. A rough review shows that sparklyr
> provides a dplyr backend and a new API for MLlib and for calling Spark from
> R. Of course, sparklyr internally uses the low-level mechanism in SparkR.
> We can discuss how to position SparkR with respect to sparklyr.
[jira] [Updated] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection
[ https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-16717:
------------------------------

Priority: Minor (was: Major)

> Dataframe (jdbc) is missing a way to link an external function to get a
> connection
>
> Key: SPARK-16717
> URL: https://issues.apache.org/jira/browse/SPARK-16717
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.6.2
> Reporter: Marco Colombo
> Priority: Minor
>
> In JdbcRDD it was possible to use a function to get a JDBC connection. This
> allowed external handling of the connections, but it is no longer possible
> with dataframes.
> Please consider an addition to DataFrames for using an externally provided
> connection factory (such as a connection pool) in order to make data loading
> more efficient, avoiding connection close/recreation. Connections should be
> taken from the provided function and returned to a second function once no
> longer used by the RDD. This will make JDBC handling more efficient.
> I.e. extending the DataFrame class with a method like
> jdbc(Function0 getConnection, Function0
> releaseConnection(java.sql.Connection))
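The borrow/release shape the description proposes can be sketched generically (a hypothetical helper, not an existing Spark or JDBC API; `withResource` and its parameter names are invented for illustration). A connection pool would sit behind both functions, so nothing is closed per-partition.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Generic borrow/release sketch: the caller supplies a factory (borrow) and a
// return function (release), e.g. backed by a connection pool. Hypothetical
// helper, not a Spark API.
public class ResourceLoan {
    public static <R, T> T withResource(Supplier<R> borrow, Consumer<R> release,
                                        Function<R, T> body) {
        R resource = borrow.get();
        try {
            return body.apply(resource);
        } finally {
            release.accept(resource); // returned to the pool, never closed here
        }
    }
}
```

With `R = java.sql.Connection` this is the jdbc(getConnection, releaseConnection) pairing the report asks for; the generic form just makes the design choice, loan-pattern ownership, explicit.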
[jira] [Commented] (SPARK-16666) Kryo encoder for custom complex classes
[ https://issues.apache.org/jira/browse/SPARK-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410454#comment-15410454 ]

Sam commented on SPARK-16666:
-----------------------------

[~clockfly] in your code sample, there is a case class for Point, not esri's Point class.

> Kryo encoder for custom complex classes
>
> Key: SPARK-16666
> URL: https://issues.apache.org/jira/browse/SPARK-16666
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: Sam
>
> I'm trying to create a dataset with some geo data using spark and esri. If
> `Foo` only has a `Point` field, it'll work, but if I add other fields beyond
> the `Point`, I get an ArrayIndexOutOfBoundsException.
> {code:scala}
> import com.esri.core.geometry.Point
> import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
>
> object Main {
>
>   case class Foo(position: Point, name: String)
>
>   object MyEncoders {
>     implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
>
>     implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
>   }
>
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
>     val sqlContext = new SQLContext(sc)
>     import MyEncoders.{FooEncoder, PointEncoder}
>     import sqlContext.implicits._
>     Seq(new Foo(new Point(0, 0), "bar")).toDS.show
>   }
> }
> {code}
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
>     at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
>     at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>     at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
>     at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
>     at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
>     at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
>     at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.showString(Dataset.scala:263)
>     at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
>     at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
>     at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
>     at Main$.main(Main.scala:24)
>     at Main.main(Main.scala)
> {noformat}
[jira] [Comment Edited] (SPARK-16929) Bad synchronization with regard to speculation
[ https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410441#comment-15410441 ]

Sean Owen edited comment on SPARK-16929 at 8/6/16 3:55 AM:
-----------------------------------------------------------

At least, one easy optimization we could make is to let checkSpeculatableTasks
short-circuit. It actually loops through all tasks even if the first one
returns true. I wonder if in practice that would solve this problem?

EDIT: not sure that's valid actually. It seems like it wants to check for all
possible speculative tasks, not just any. I agree, I don't see the purpose of
that lock. It is not protecting rootPool since no other access to it is
synchronized. I think you can make a PR for this.

was (Author: srowen):
At least, one easy optimization we could make is to let checkSpeculatableTasks
short-circuit. It actually loops through all tasks even if the first one
returns true. I wonder if in practice that would solve this problem?

> Bad synchronization with regard to speculation
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working. I looked
> into it and noticed that stderr was saying some tasks were taking almost an
> hour to run, even though in the application logs on the nodes that task only
> took a minute or so to run. Digging into the thread dump for the master node,
> I noticed a number of threads are blocked, apparently by the speculation
> thread. At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler
> while it looks through the tasks to see what needs to be rerun. Unfortunately
> that code loops through each of the tasks, so when you have even just a
> couple hundred thousand tasks to run, that can be prohibitively slow to run
> inside of a synchronized block. Once I disabled speculation, the job went
> back to having acceptable performance.
> There are no comments around that lock indicating why it was added, and the
> git history seems to have a couple refactorings, so it's hard to find where
> it was added. I'm tempted to believe it is the result of someone assuming
> that an extra synchronized block never hurt anyone (in reality I've seen
> probably just as many bugs caused by over-synchronization as too little), as
> it looks too broad to be actually guarding any potential concurrency issue.
> But, since concurrency issues can be tricky to reproduce (and yes, I
> understand that's an extreme understatement), I'm not sure just blindly
> removing it without being familiar with the history is necessarily safe.
> Can someone look into this? Or at least make a note in the documentation
> that speculation should not be used with large clusters?
[jira] [Commented] (SPARK-16929) Bad synchronization with regard to speculation
[ https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410441#comment-15410441 ]

Sean Owen commented on SPARK-16929:
-----------------------------------

At least, one easy optimization we could make is to let checkSpeculatableTasks
short-circuit. It actually loops through all tasks even if the first one
returns true. I wonder if in practice that would solve this problem?

> Bad synchronization with regard to speculation
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working. I looked
> into it and noticed that stderr was saying some tasks were taking almost an
> hour to run, even though in the application logs on the nodes that task only
> took a minute or so to run. Digging into the thread dump for the master node,
> I noticed a number of threads are blocked, apparently by the speculation
> thread. At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler
> while it looks through the tasks to see what needs to be rerun. Unfortunately
> that code loops through each of the tasks, so when you have even just a
> couple hundred thousand tasks to run, that can be prohibitively slow to run
> inside of a synchronized block. Once I disabled speculation, the job went
> back to having acceptable performance.
> There are no comments around that lock indicating why it was added, and the
> git history seems to have a couple refactorings, so it's hard to find where
> it was added. I'm tempted to believe it is the result of someone assuming
> that an extra synchronized block never hurt anyone (in reality I've seen
> probably just as many bugs caused by over-synchronization as too little), as
> it looks too broad to be actually guarding any potential concurrency issue.
> But, since concurrency issues can be tricky to reproduce (and yes, I
> understand that's an extreme understatement), I'm not sure just blindly
> removing it without being familiar with the history is necessarily safe.
> Can someone look into this? Or at least make a note in the documentation
> that speculation should not be used with large clusters?
[jira] [Updated] (SPARK-15899) file scheme should be used correctly
[ https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-15899:
------------------------------

Assignee: Alexander Ulanov
Priority: Major (was: Minor)
Issue Type: Bug (was: Improvement)

> file scheme should be used correctly
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Kazuaki Ishizaki
> Assignee: Alexander Ulanov
>
> [RFC 1738|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as
> {{file://host/}} or {{file:///}}. See also
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme].
> [Some code
> stuffs|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
> use a different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add the
> {{file://host}} or {{file://}} prefix.
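As a sketch of such a utility (the name `toFileUri` is hypothetical, not the helper the issue ended up with), Java's NIO `Path.toUri` already produces the well-formed `file:///` form with an empty authority for local paths:

```java
import java.nio.file.Paths;

// Hypothetical sketch of the requested utility: normalize a local path into a
// well-formed file URI rather than hand-concatenating a "file:" prefix.
public class FileUriUtil {
    public static String toFileUri(String path) {
        // Path.toUri emits the RFC-style file:///... form on the default provider.
        return Paths.get(path).toUri().toString();
    }

    public static void main(String[] args) {
        System.out.println(toFileUri("/tmp/spark-file-uri-check"));
    }
}
```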
[jira] [Updated] (SPARK-16847) Prevent to potentially read corrupt statistics on binary in Parquet via VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-16847:
------------------------------

Assignee: Hyukjin Kwon

> Prevent to potentially read corrupt statistics on binary in Parquet via
> VectorizedReader
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 2.1.0
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we disabled filter pushdown on
> binary columns in Spark before.
> We enabled this after upgrading Parquet, but it seems there are potential
> incompatibilities for Parquet files written by lower Spark versions.
[jira] [Resolved] (SPARK-16847) Prevent to potentially read corrupt statistics on binary in Parquet via VectorizedReader
[ https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-16847.
-------------------------------

Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14450
[https://github.com/apache/spark/pull/14450]

> Prevent to potentially read corrupt statistics on binary in Parquet via
> VectorizedReader
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
> Fix For: 2.1.0
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we disabled filter pushdown on
> binary columns in Spark before.
> We enabled this after upgrading Parquet, but it seems there are potential
> incompatibilities for Parquet files written by lower Spark versions.
[jira] [Updated] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining
[ https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-16928:
------------------------------

Priority: Minor (was: Major)

Description:
In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt()
with the following code pattern:
{code}
public int getInt(int rowId) {
  if (dictionary == null) {
    return intData[rowId];
  } else {
    return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
  }
}
{code}
As dictionaryIds is also a ColumnVector, this results in a recursive call of
getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.

was: the same text without the {code} markers.

> Recursive call of ColumnVector::getInt() breaks JIT inlining
>
> Key: SPARK-16928
> URL: https://issues.apache.org/jira/browse/SPARK-16928
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 2.0.0
> Reporter: Qifan Pu
> Priority: Minor
>
> In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt()
> with the following code pattern:
> {code}
> public int getInt(int rowId) {
>   if (dictionary == null) {
>     return intData[rowId];
>   } else {
>     return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
>   }
> }
> {code}
> As dictionaryIds is also a ColumnVector, this results in a recursive call of
> getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.
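A minimal sketch of the fix idea (an assumption for illustration with simplified fields, not the real OnHeapColumnVector): read the dictionary ids through a concrete `int[]` so the dictionary branch is a direct array access rather than a virtual call back into ColumnVector.getInt(), which is what defeats JIT inlining.

```java
// Simplified stand-in for an on-heap int column vector. Here intData holds
// either raw values or dictionary ids, and the dictionary is a plain int[]
// rather than another ColumnVector, so there is no recursive virtual call.
public class OnHeapIntVector {
    private final int[] intData;    // raw values, or dictionary ids when encoded
    private final int[] dictionary; // null when the column is not dictionary-encoded

    public OnHeapIntVector(int[] intData, int[] dictionary) {
        this.intData = intData;
        this.dictionary = dictionary;
    }

    public int getInt(int rowId) {
        if (dictionary == null) {
            return intData[rowId];
        }
        return dictionary[intData[rowId]]; // direct array read, trivially inlinable
    }
}
```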
[jira] [Commented] (SPARK-15702) Update document programming-guide accumulator section
[ https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410431#comment-15410431 ]

Sean Owen commented on SPARK-15702:
-----------------------------------

I don't think this should be reopened to add a new and somewhat distinct
change. You're further changing the same section for a related but different
purpose. Just make another JIRA.

> Update document programming-guide accumulator section
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, Spark Core
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Fix For: 2.0.0
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> Update document programming-guide accumulator section
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410430#comment-15410430 ]

Sean Owen commented on SPARK-16864:
-----------------------------------

What would an app do with that info at runtime -- you'd really branch logic
based on a git commit? If you're grabbing binaries, you still have the
version or date/time to distinguish the build.

> Comprehensive version info
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
> Issue Type: Improvement
> Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on
> startup, but otherwise, there is no programmatic/reliable way to get version
> information.
> Also there is no git commit id, etc., so precise version checking isn't
> possible.
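One common pattern for exposing this kind of info (an assumption for illustration, not what Spark shipped at the time of this report): generate a properties file at build time carrying the version, git revision, and build date, then load it at runtime.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Sketch: build tooling writes key=value pairs (version, revision, ...) into a
// resource file; the app loads them for programmatic version checks.
// Hypothetical helper, not a Spark API.
public class BuildInfo {
    public static Properties load(InputStream in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for a generated build-info resource (values are made up).
        String generated = "version=2.0.0\nrevision=abc1234\n";
        Properties info = load(new java.io.ByteArrayInputStream(generated.getBytes()));
        System.out.println(info.getProperty("revision")); // prints abc1234
    }
}
```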
[jira] [Assigned] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16610: Assignee: Apache Spark > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Apache Spark > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16610: Assignee: (was: Apache Spark) > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410393#comment-15410393 ] Apache Spark commented on SPARK-16610: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14518 > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
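The precedence the ticket asks for can be sketched independently of Spark: an explicit {{compression}} writer option wins, otherwise a user-set {{orc.compress}} is respected, and only then does the default codec apply. A minimal illustration in plain Python (function and constant names are hypothetical, not Spark's internals):

```python
# Hypothetical sketch of the desired option precedence for the ORC writer.
DEFAULT_CODEC = "snappy"

def resolve_orc_codec(options):
    """Pick the ORC codec from writer options.

    'compression' (the Spark-level option) takes priority; 'orc.compress'
    (the ORC-native conf) is respected when 'compression' is absent, instead
    of being silently overridden by the default.
    """
    if "compression" in options:
        return options["compression"]
    if "orc.compress" in options:
        return options["orc.compress"]
    return DEFAULT_CODEC
```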
[jira] [Updated] (SPARK-16856) Link the application's executor page to the master's UI
[ https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Lin updated SPARK-16856: Summary: Link the application's executor page to the master's UI (was: Link application summary page and detail page to the master page) > Link the application's executor page to the master's UI > --- > > Key: SPARK-16856 > URL: https://issues.apache.org/jira/browse/SPARK-16856 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Reporter: Tao Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16931) PySpark access to data-frame bucketing api
[ https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16931: Assignee: (was: Apache Spark) > PySpark access to data-frame bucketing api > -- > > Key: SPARK-16931 > URL: https://issues.apache.org/jira/browse/SPARK-16931 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Greg Bowyer > > Attached is a patch that enables bucketing for pyspark using the dataframe > API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16931) PySpark access to data-frame bucketing api
[ https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410351#comment-15410351 ] Apache Spark commented on SPARK-16931: -- User 'GregBowyer' has created a pull request for this issue: https://github.com/apache/spark/pull/14517 > PySpark access to data-frame bucketing api > -- > > Key: SPARK-16931 > URL: https://issues.apache.org/jira/browse/SPARK-16931 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Greg Bowyer > > Attached is a patch that enables bucketing for pyspark using the dataframe > API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16931) PySpark access to data-frame bucketing api
[ https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16931: Assignee: Apache Spark > PySpark access to data-frame bucketing api > -- > > Key: SPARK-16931 > URL: https://issues.apache.org/jira/browse/SPARK-16931 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Greg Bowyer >Assignee: Apache Spark > > Attached is a patch that enables bucketing for pyspark using the dataframe > API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16931) PySpark access to data-frame bucketing api
Greg Bowyer created SPARK-16931: --- Summary: PySpark access to data-frame bucketing api Key: SPARK-16931 URL: https://issues.apache.org/jira/browse/SPARK-16931 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.0.0 Reporter: Greg Bowyer Attached is a patch that enables bucketing for pyspark using the dataframe API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410296#comment-15410296 ] Junyang Qian commented on SPARK-16508: -- It seems that there are still some warnings in my local check, e.g. undocumented arguments in as.data.frame "row.names", "optional". I was wondering if I missed something or if we should deal with those? > Fix documentation warnings found by R CMD check > --- > > Key: SPARK-16508 > URL: https://issues.apache.org/jira/browse/SPARK-16508 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > A full list of warnings after the fixes in SPARK-16507 is at > https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone
Marcelo Vanzin created SPARK-16930: -- Summary: ApplicationMaster's code that waits for SparkContext is race-prone Key: SPARK-16930 URL: https://issues.apache.org/jira/browse/SPARK-16930 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Priority: Minor While taking a look at SPARK-15937 and checking if there's something wrong with the code, I noticed two races that explain the behavior. Because they're really narrow races, I'm a little wary of declaring them the cause of that bug. Also because the logs posted there don't really explain what went wrong (and don't really look like a SparkContext was run at all). The races I found are: - it's possible, but very unlikely, for an application to instantiate a SparkContext and stop it before the AM enters the loop where it checks for the instance. - it's possible, but very unlikely, for an application to stop the SparkContext after the AM is already waiting for one, has been notified of its creation, but hasn't yet stored the SparkContext reference in a local variable. I'll fix those and clean up the code a bit in the process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
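Both of the windows described close if the notification and the stored reference live under the same monitor and the waiter re-checks its predicate in a loop. A minimal sketch of that pattern (class and method names are hypothetical; this is not the ApplicationMaster code, just the locking idea):

```python
import threading

class ContextHolder:
    """Stores a (hypothetical) context reference under one condition variable,
    so a waiter can never observe the notification without also seeing the
    stored value, and a set-then-stop before the wait is still detected."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ctx = None
        self._stopped = False

    def set(self, ctx):
        with self._cond:
            self._ctx = ctx
            self._cond.notify_all()

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify_all()

    def wait_for_context(self, timeout):
        with self._cond:
            # Re-checking the predicate handles a context that was set (and
            # even stopped) before we started waiting, plus spurious wakeups.
            self._cond.wait_for(
                lambda: self._ctx is not None or self._stopped,
                timeout=timeout)
            return self._ctx
```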
[jira] [Created] (SPARK-16929) Bad synchronization with regard to speculation
Nicholas Brown created SPARK-16929: -- Summary: Bad synchronization with regard to speculation Key: SPARK-16929 URL: https://issues.apache.org/jira/browse/SPARK-16929 Project: Spark Issue Type: Bug Components: Scheduler Reporter: Nicholas Brown Our cluster has been running slowly since I got speculation working. I looked into it and noticed that stderr was saying some tasks were taking almost an hour to run, even though the application logs on the nodes showed that task only took a minute or so to run. Digging into the thread dump for the master node, I noticed a number of threads are blocked, apparently by the speculation thread. At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while it looks through the tasks to see what needs to be rerun. Unfortunately that code loops through each of the tasks, so when you have even just a couple hundred thousand tasks to run, that can be prohibitively slow to run inside of a synchronized block. Once I disabled speculation, the job went back to having acceptable performance. There are no comments around that lock indicating why it was added, and the git history seems to have a couple refactorings, so it's hard to find where it was added. I'm tempted to believe it is the result of someone assuming that an extra synchronized block never hurt anyone (in reality I've probably seen just as many bugs caused by over-synchronization as by too little), as it looks too broad to be actually guarding any potential concurrency issue. But, since concurrency issues can be tricky to reproduce (and yes, I understand that's an extreme understatement), I'm not sure just blindly removing it without being familiar with the history is necessarily safe. Can someone look into this? Or at least make a note in the documentation that speculation should not be used with large clusters?
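One common fix for this kind of problem is to shrink the critical section: take a cheap snapshot of the task list while holding the lock, then do the O(n) scan for slow tasks outside it. A minimal sketch of that pattern (all names hypothetical; this is not the actual TaskSchedulerImpl code, and whether the scan can safely run on a snapshot would need to be verified against the real invariants):

```python
import threading

class SpeculationChecker:
    """Sketch: snapshot under the lock, scan outside it, so other scheduler
    operations are not blocked for the whole speculation pass."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = {}  # task_id -> elapsed seconds (illustrative shape)

    def add_task(self, task_id, elapsed):
        with self._lock:
            self._running[task_id] = elapsed

    def speculatable_tasks(self, threshold):
        with self._lock:
            snapshot = list(self._running.items())  # cheap copy under the lock
        # The expensive per-task scan runs without holding the scheduler lock.
        return [tid for tid, elapsed in snapshot if elapsed > threshold]
```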
[jira] [Commented] (SPARK-15354) Topology aware block replication strategies
[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410269#comment-15410269 ] Davies Liu commented on SPARK-15354: The strategy used in HDFS is meant to balance write traffic (for performance) against the durability (or availability) of blocks. But blocks in Spark are quite different: they can be lost and recovered by re-computing, so we usually keep only one copy, and rarely two. Even with two copies, it's actually better to place them randomly for better compute balance. Overall, implementing this strategy for Spark may not be that useful. Could you share more information on your use case? > Topology aware block replication strategies > --- > > Key: SPARK-15354 > URL: https://issues.apache.org/jira/browse/SPARK-15354 > Project: Spark > Issue Type: Sub-task > Components: Mesos, Spark Core, YARN >Reporter: Shubham Chopra > > Implementations of strategies for resilient block replication for different > resource managers that replicate the 3-replica strategy used by HDFS, where > the first replica is on an executor, the second replica within the same rack > as the executor and a third replica on a different rack. > The implementation involves providing two pluggable classes, one running in > the driver that provides topology information for every host at cluster start > and the second prioritizing a list of peer BlockManagerIds.
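The peer-prioritization half of the proposal can be sketched simply: given each candidate peer's rack, prefer a same-rack peer for the cheap second replica and an off-rack peer for the fault-tolerant third. An illustrative sketch (names are hypothetical, not the BlockManager API, and real placement would also consider load and locality):

```python
def prioritize_peers(local_rack, peers):
    """Order candidate peers HDFS-style: same-rack peers first (low-cost
    second replica), then off-rack peers (rack-failure-tolerant third).

    `peers` is a list of (peer_id, rack) tuples; the shape is illustrative.
    """
    same_rack = [p for p, rack in peers if rack == local_rack]
    off_rack = [p for p, rack in peers if rack != local_rack]
    return same_rack + off_rack
```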
[jira] [Comment Edited] (SPARK-11638) Run Spark on Mesos with bridge networking
[ https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410261#comment-15410261 ] Stavros Kontopoulos edited comment on SPARK-11638 at 8/5/16 11:08 PM: -- [~radekg] [~mgummelt] what do you think? was (Author: skonto): [~radekg][~mgummelt] what do you think? > Run Spark on Mesos with bridge networking > - > > Key: SPARK-11638 > URL: https://issues.apache.org/jira/browse/SPARK-11638 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: Radoslaw Gruchalski > Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, > 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch > > > h4. Summary > Provides {{spark.driver.advertisedPort}}, > {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and > {{spark.replClassServer.advertisedPort}} settings to enable running Spark in > Mesos on Docker with Bridge networking. Provides patches for Akka Remote to > enable Spark driver advertisement using alternative host and port. > With these settings, it is possible to run Spark Master in a Docker container > and have the executors running on Mesos talk back correctly to such Master. > The problem is discussed on the Mesos mailing list here: > https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E > h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door > In order for the framework to receive orders in the bridged container, Mesos > in the container has to register for offers using the IP address of the > Agent. Offers are sent by Mesos Master to the Docker container running on a > different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} > would advertise itself using the IP address of the container, something like > {{172.x.x.x}}. 
Obviously, Mesos Master can't reach that address, it's a > different host, it's a different machine. Mesos 0.24.0 introduced two new > properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and > {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's > address to register for offers. This was provided mainly for running Mesos in > Docker on Mesos. > h4. Spark - how does the above relate and what is being addressed here? > Similar to Mesos, out of the box, Spark does not allow to advertise its > services on ports different than bind ports. Consider following scenario: > Spark is running inside a Docker container on Mesos, it's a bridge networking > mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for > the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and > {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to > Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the > container ports. Starting the executors from such container results in > executors not being able to communicate back to the Spark Master. > This happens because of 2 things: > Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} > transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port > different to what it bound to. The settings discussed are here: > https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376. > These do not exist in Akka {{2.3.x}}. Spark driver will always advertise > port {{}} as this is the one {{akka-remote}} is bound to. > Any URIs the executors contact the Spark Master on, are prepared by Spark > Master and handed over to executors. These always contain the port number > used by the Master to find the service on. 
The services are: > - {{spark.broadcast.port}} > - {{spark.fileserver.port}} > - {{spark.replClassServer.port}} > all above ports are by default {{0}} (random assignment) but can be specified > using Spark configuration ( {{-Dspark...port}} ). However, they are limited > in the same way as the {{spark.driver.port}}; in the above example, an > executor should not contact the file server on port {{6677}} but rather on > the respective 31xxx assigned by Mesos. > Spark currently does not allow any of that. > h4. Taking on the problem, step 1: Spark Driver > As mentioned above, Spark Driver is based on {{akka-remote}}. In order to > take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and > {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile > with Akka 2.4.x yet. > What we want is the back port of mentioned {{akka-remote}} settings to > {{2.3.x}} versions. These patches are attached to this ticket - >
[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking
[ https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410261#comment-15410261 ] Stavros Kontopoulos commented on SPARK-11638: - [~radekg][~mgummelt] what do you think? > Run Spark on Mesos with bridge networking > - > > Key: SPARK-11638 > URL: https://issues.apache.org/jira/browse/SPARK-11638 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: Radoslaw Gruchalski > Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, > 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch > > > h4. Summary > Provides {{spark.driver.advertisedPort}}, > {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and > {{spark.replClassServer.advertisedPort}} settings to enable running Spark in > Mesos on Docker with Bridge networking. Provides patches for Akka Remote to > enable Spark driver advertisement using alternative host and port. > With these settings, it is possible to run Spark Master in a Docker container > and have the executors running on Mesos talk back correctly to such Master. > The problem is discussed on the Mesos mailing list here: > https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E > h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door > In order for the framework to receive orders in the bridged container, Mesos > in the container has to register for offers using the IP address of the > Agent. Offers are sent by Mesos Master to the Docker container running on a > different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} > would advertise itself using the IP address of the container, something like > {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a > different host, it's a different machine. 
Mesos 0.24.0 introduced two new > properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and > {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's > address to register for offers. This was provided mainly for running Mesos in > Docker on Mesos. > h4. Spark - how does the above relate and what is being addressed here? > Similar to Mesos, out of the box, Spark does not allow to advertise its > services on ports different than bind ports. Consider following scenario: > Spark is running inside a Docker container on Mesos, it's a bridge networking > mode. Assuming a port {{}} for the {{spark.driver.port}}, {{6677}} for > the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and > {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to > Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the > container ports. Starting the executors from such container results in > executors not being able to communicate back to the Spark Master. > This happens because of 2 things: > Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} > transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port > different to what it bound to. The settings discussed are here: > https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376. > These do not exist in Akka {{2.3.x}}. Spark driver will always advertise > port {{}} as this is the one {{akka-remote}} is bound to. > Any URIs the executors contact the Spark Master on, are prepared by Spark > Master and handed over to executors. These always contain the port number > used by the Master to find the service on. The services are: > - {{spark.broadcast.port}} > - {{spark.fileserver.port}} > - {{spark.replClassServer.port}} > all above ports are by default {{0}} (random assignment) but can be specified > using Spark configuration ( {{-Dspark...port}} ). 
However, they are limited > in the same way as the {{spark.driver.port}}; in the above example, an > executor should not contact the file server on port {{6677}} but rather on > the respective 31xxx assigned by Mesos. > Spark currently does not allow any of that. > h4. Taking on the problem, step 1: Spark Driver > As mentioned above, Spark Driver is based on {{akka-remote}}. In order to > take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and > {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile > with Akka 2.4.x yet. > What we want is the back port of mentioned {{akka-remote}} settings to > {{2.3.x}} versions. These patches are attached to this ticket - > {{2.3.4.patch}} and {{2.3.11.patch}} files provide patches for respective > akka versions. These add mentioned settings and
[jira] [Resolved] (SPARK-16901) Hive settings in hive-site.xml may be overridden by Hive's default values
[ https://issues.apache.org/jira/browse/SPARK-16901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16901. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14497 [https://github.com/apache/spark/pull/14497] > Hive settings in hive-site.xml may be overridden by Hive's default values > - > > Key: SPARK-16901 > URL: https://issues.apache.org/jira/browse/SPARK-16901 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.1, 2.1.0 > > > When we create the HiveConf for metastore client, we use a Hadoop Conf as the > base, which may contain Hive settings in hive-site.xml > (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49). > However, HiveConf's initialize function basically ignores the base Hadoop > Conf and always its default values (i.e. settings with non-null default > values) as the base > (https://github.com/apache/hive/blob/release-1.2.1/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2687). > So, even a user put {{javax.jdo.option.ConnectionURL}} in hive-site.xml, it > is not used and Hive will use its default, which is > {{jdbc:derby:;databaseName=metastore_db;create=true}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
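The fix amounts to re-applying the base Hadoop conf's entries after HiveConf has loaded its built-in defaults, so that hive-site.xml values win over defaults rather than being clobbered by them. Sketched here with plain dicts (the function name is hypothetical; the real code works on HiveConf objects):

```python
def build_hive_conf(hive_defaults, base_hadoop_conf):
    """Start from Hive's default values, then overlay every entry from the
    base Hadoop conf so user settings from hive-site.xml take precedence."""
    conf = dict(hive_defaults)
    conf.update(base_hadoop_conf)
    return conf
```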
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410238#comment-15410238 ] Jan Gorecki edited comment on SPARK-16864 at 8/5/16 10:52 PM: -- Hi, git commit is relevant to applications at runtime as long as the subject for an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source, they can eventually put that metadata in plaintext file (still an overhead). The bigger problem is for those who grab binaries and for example just want to track performance in their cluster over spark git history. Git commit hash is a natural key for a source code of a project, you won't find better field to reference source code. Referencing release version is a different thing. was (Author: jangorecki): Hi, git commit is relevant to applications at runtime as long as the subject for an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source, they can eventually put that metadata in plaintext file (still an overhead). The bigger problem is for those who grab binaries and for example just want to track performance in their cluster over spark git history. Git commit hash is a natural key for a source code a project, you won't find better field to references the source code. Referencing release versions is simply a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isnt > possible. 
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410238#comment-15410238 ] Jan Gorecki edited comment on SPARK-16864 at 8/5/16 10:51 PM: -- Hi, git commit is relevant to applications at runtime as long as the subject for an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source, they can eventually put that metadata in plaintext file (still an overhead). The bigger problem is for those who grab binaries and for example just want to track performance in their cluster over spark git history. Git commit hash is a natural key for a source code a project, you won't find better field to references the source code. Referencing release versions is simply a different thing. was (Author: jangorecki): Hi, git commit is relevant to applications at runtime as long as the subject for an application (in any dimension) is spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source, they can eventually put that metadata in plaintext file (still an overhead). The bigger problem is for those who just grab binaries and for example just want to track performance in their cluster over spark git history. Git commit hash is a natural key for a source code a project, you won't find better field to references the source code. Referencing release versions is simply a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isnt > possible. 
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410238#comment-15410238 ] Jan Gorecki commented on SPARK-16864: - Hi, the git commit is relevant to applications at runtime whenever the application's subject (in any dimension) is Spark itself. I don't understand why that info would not be included. This may not be a problem for people who build from source, since they can put that metadata in a plaintext file themselves (still an overhead). The bigger problem is for those who just grab binaries and, for example, want to track performance in their cluster over Spark's git history. A git commit hash is the natural key for a project's source code; you won't find a better field to reference the source code. Referencing release versions is simply a different thing. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
[jira] [Assigned] (SPARK-15702) Update document programming-guide accumulator section
[ https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15702: Assignee: Weichen Xu (was: Apache Spark) > Update document programming-guide accumulator section > - > > Key: SPARK-15702 > URL: https://issues.apache.org/jira/browse/SPARK-15702 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu > Fix For: 2.0.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > Update document programming-guide accumulator section -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15702) Update document programming-guide accumulator section
[ https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410239#comment-15410239 ] Apache Spark commented on SPARK-15702: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/14516 > Update document programming-guide accumulator section > - > > Key: SPARK-15702 > URL: https://issues.apache.org/jira/browse/SPARK-15702 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu > Fix For: 2.0.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > Update document programming-guide accumulator section -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15702) Update document programming-guide accumulator section
[ https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15702: Assignee: Apache Spark (was: Weichen Xu) > Update document programming-guide accumulator section > - > > Key: SPARK-15702 > URL: https://issues.apache.org/jira/browse/SPARK-15702 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Spark Core >Reporter: Weichen Xu >Assignee: Apache Spark > Fix For: 2.0.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > Update document programming-guide accumulator section -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15702) Update document programming-guide accumulator section
[ https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reopened SPARK-15702: -- I'm reopening this because I think the current programming guide accumulator section is confusing. It mixes both the old and new APIs. > Update document programming-guide accumulator section > - > > Key: SPARK-15702 > URL: https://issues.apache.org/jira/browse/SPARK-15702 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu > Fix For: 2.0.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > Update document programming-guide accumulator section -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
[ https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-16924: --- Issue Type: Improvement (was: Bug) > DataStreamReader can not support option("inferSchema", true/false) for csv > and json file source > --- > > Key: SPARK-16924 > URL: https://issues.apache.org/jira/browse/SPARK-16924 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > Currently DataStreamReader can not support option("inferSchema", true|false) > for csv and json file source. It only takes SQLConf setting > "spark.sql.streaming.schemaInference", which needs to be set at session > level. > For example: > {code} > scala> val in = spark.readStream.format("json").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/json/t1") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > scala> val in = spark.readStream.format("csv").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/csv") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. 
If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > {code} > In the example, even though the user specifies option("inferSchema", true), it > is ignored. For batch data, however, DataFrameReader does honor it: > {code} > scala> val in = spark.read.format("csv").option("header", > true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") > in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
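Until the per-reader option is honored, the only knob the report identifies is the session-level SQLConf setting. A minimal sketch of enabling it globally, assuming it is placed in a default configuration file such as spark-defaults.conf:

```properties
# Session-level workaround named in the report: enable streaming schema
# inference for the whole session instead of per DataStreamReader.
spark.sql.streaming.schemaInference  true
```

The reported improvement would let `option("inferSchema", true)` on the reader take effect without this global setting.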
[jira] [Commented] (SPARK-16926) Partition columns are present in columns metadata for partition but not table
[ https://issues.apache.org/jira/browse/SPARK-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410202#comment-15410202 ] Apache Spark commented on SPARK-16926: -- User 'dafrista' has created a pull request for this issue: https://github.com/apache/spark/pull/14515 > Partition columns are present in columns metadata for partition but not table > - > > Key: SPARK-16926 > URL: https://issues.apache.org/jira/browse/SPARK-16926 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Brian Cho > > A change introduced in SPARK-14388 removes partition columns from the column > metadata of tables, but not for partitions. This causes TableReader to > believe that the schema is different and create an unnecessary conversion > object inspector, taking the else codepath in TableReader below: > {code} > val soi = if > (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) { > rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector] > } else { > ObjectInspectorConverters.getConvertedOI( > rawDeser.getObjectInspector, > tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector] > } > {code} > Printing the properties as debug output confirms the difference for the Hive > table. > Table properties (tableDesc.getProperties): > {code} > 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, > string:bigint:string:bigint:bigint:array > {code} > Partition properties (partProps): > {code} > 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, > string:bigint:string:bigint:bigint:array:string:string:string > {code} > Where the final three string columns are partition columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
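The mismatch in the debug output above boils down to the partition's columns.types string carrying the partition columns as a trailing suffix. As an illustration only (this is not Spark's actual fix, and the helper names are invented), stripping that suffix before comparing shows the data schemas actually agree:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: compare a table's columns.types against a partition's, ignoring
// the partition columns appended at the end of the partition's type string.
class SchemaCompare {
    static List<String> types(String columnsTypes) {
        return new ArrayList<>(Arrays.asList(columnsTypes.split(":")));
    }

    static boolean sameDataSchema(String tableTypes, String partTypes, int numPartCols) {
        List<String> part = types(partTypes);
        // Drop the trailing partition columns before comparing.
        List<String> dataOnly = part.subList(0, part.size() - numPartCols);
        return types(tableTypes).equals(dataOnly);
    }
}
```

With the two strings from the debug output and three partition columns, the data portions match, which is why the extra conversion object inspector is unnecessary.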
[jira] [Assigned] (SPARK-16926) Partition columns are present in columns metadata for partition but not table
[ https://issues.apache.org/jira/browse/SPARK-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16926: Assignee: Apache Spark > Partition columns are present in columns metadata for partition but not table > - > > Key: SPARK-16926 > URL: https://issues.apache.org/jira/browse/SPARK-16926 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Brian Cho >Assignee: Apache Spark > > A change introduced in SPARK-14388 removes partition columns from the column > metadata of tables, but not for partitions. This causes TableReader to > believe that the schema is different and create an unnecessary conversion > object inspector, taking the else codepath in TableReader below: > {code} > val soi = if > (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) { > rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector] > } else { > ObjectInspectorConverters.getConvertedOI( > rawDeser.getObjectInspector, > tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector] > } > {code} > Printing the properties as debug output confirms the difference for the Hive > table. > Table properties (tableDesc.getProperties): > {code} > 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, > string:bigint:string:bigint:bigint:array > {code} > Partition properties (partProps): > {code} > 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, > string:bigint:string:bigint:bigint:array:string:string:string > {code} > Where the final three string columns are partition columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16926) Partition columns are present in columns metadata for partition but not table
[ https://issues.apache.org/jira/browse/SPARK-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16926: Assignee: (was: Apache Spark) > Partition columns are present in columns metadata for partition but not table > - > > Key: SPARK-16926 > URL: https://issues.apache.org/jira/browse/SPARK-16926 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Brian Cho > > A change introduced in SPARK-14388 removes partition columns from the column > metadata of tables, but not for partitions. This causes TableReader to > believe that the schema is different and create an unnecessary conversion > object inspector, taking the else codepath in TableReader below: > {code} > val soi = if > (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) { > rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector] > } else { > ObjectInspectorConverters.getConvertedOI( > rawDeser.getObjectInspector, > tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector] > } > {code} > Printing the properties as debug output confirms the difference for the Hive > table. > Table properties (tableDesc.getProperties): > {code} > 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, > string:bigint:string:bigint:bigint:array > {code} > Partition properties (partProps): > {code} > 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, > string:bigint:string:bigint:bigint:array:string:string:string > {code} > Where the final three string columns are partition columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining
[ https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16928: Assignee: Apache Spark > Recursive call of ColumnVector::getInt() breaks JIT inlining > > > Key: SPARK-16928 > URL: https://issues.apache.org/jira/browse/SPARK-16928 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Apache Spark > > In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() > with the following code pattern: > public int getInt(int rowId) { > if (dictionary == null) { > return intData[rowId]; > } else { > return dictionary.decodeToInt(dictionaryIds.getInt(rowId)); > } > } > As dictionaryIds is also a ColumnVector, this results in a recursive call of > getInt() and breaks JIT inlining. As a result, getInt() will not get inlined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining
[ https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410173#comment-15410173 ] Apache Spark commented on SPARK-16928: -- User 'ooq' has created a pull request for this issue: https://github.com/apache/spark/pull/14513 > Recursive call of ColumnVector::getInt() breaks JIT inlining > > > Key: SPARK-16928 > URL: https://issues.apache.org/jira/browse/SPARK-16928 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Qifan Pu > > In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() > with the following code pattern: > public int getInt(int rowId) { > if (dictionary == null) { > return intData[rowId]; > } else { > return dictionary.decodeToInt(dictionaryIds.getInt(rowId)); > } > } > As dictionaryIds is also a ColumnVector, this results in a recursive call of > getInt() and breaks JIT inlining. As a result, getInt() will not get inlined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining
[ https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16928: Assignee: (was: Apache Spark) > Recursive call of ColumnVector::getInt() breaks JIT inlining > > > Key: SPARK-16928 > URL: https://issues.apache.org/jira/browse/SPARK-16928 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Qifan Pu > > In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() > with the following code pattern: > public int getInt(int rowId) { > if (dictionary == null) { > return intData[rowId]; > } else { > return dictionary.decodeToInt(dictionaryIds.getInt(rowId)); > } > } > As dictionaryIds is also a ColumnVector, this results in a recursive call of > getInt() and breaks JIT inlining. As a result, getInt() will not get inlined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining
Qifan Pu created SPARK-16928: Summary: Recursive call of ColumnVector::getInt() breaks JIT inlining Key: SPARK-16928 URL: https://issues.apache.org/jira/browse/SPARK-16928 Project: Spark Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Qifan Pu In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() with the following code pattern: public int getInt(int rowId) { if (dictionary == null) { return intData[rowId]; } else { return dictionary.decodeToInt(dictionaryIds.getInt(rowId)); } } As dictionaryIds is also a ColumnVector, this results in a recursive call of getInt() and breaks JIT inlining. As a result, getInt() will not get inlined. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
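The pattern the issue describes can be shown with a toy vector class (names simplified; this is not Spark's actual ColumnVector code). The second accessor illustrates one conceivable rework, reading the ids array directly so the hot path contains no self-recursive call, purely as an inlining-friendliness illustration rather than the fix adopted for this issue:

```java
// Toy model of the reported pattern: getInt() calls getInt() on another
// vector of the same class, which the report says defeats JIT inlining.
class IntVector {
    final int[] data;
    final int[] dict;        // null when no dictionary encoding is used
    final IntVector dictIds; // ids into dict, itself an IntVector

    IntVector(int[] data, int[] dict, IntVector dictIds) {
        this.data = data;
        this.dict = dict;
        this.dictIds = dictIds;
    }

    // Recursive pattern from the report.
    int getInt(int rowId) {
        if (dict == null) {
            return data[rowId];
        } else {
            return dict[dictIds.getInt(rowId)]; // recursive call to getInt()
        }
    }

    // Hypothetical rework: access the ids storage directly, leaving no
    // self-recursive call on the hot path.
    int getIntFlat(int rowId) {
        return dict == null ? data[rowId] : dict[dictIds.data[rowId]];
    }
}
```

Both accessors return the same values; only the call shape differs, which is what matters to the JIT's inlining heuristics.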
[jira] [Assigned] (SPARK-16927) Mesos Cluster Dispatcher default properties
[ https://issues.apache.org/jira/browse/SPARK-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16927: Assignee: Apache Spark > Mesos Cluster Dispatcher default properties > --- > > Key: SPARK-16927 > URL: https://issues.apache.org/jira/browse/SPARK-16927 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt >Assignee: Apache Spark > > Add the capability to set default driver properties for all jobs submitted > through the dispatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16927) Mesos Cluster Dispatcher default properties
[ https://issues.apache.org/jira/browse/SPARK-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16927: Assignee: (was: Apache Spark) > Mesos Cluster Dispatcher default properties > --- > > Key: SPARK-16927 > URL: https://issues.apache.org/jira/browse/SPARK-16927 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > Add the capability to set default driver properties for all jobs submitted > through the dispatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf
[ https://issues.apache.org/jira/browse/SPARK-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16923: Assignee: Apache Spark > Mesos cluster scheduler duplicates config vars by setting them in the > environment and as --conf > --- > > Key: SPARK-16923 > URL: https://issues.apache.org/jira/browse/SPARK-16923 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt >Assignee: Apache Spark > > I don't think this introduces any bugs, but we should fix it nonetheless -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16927) Mesos Cluster Dispatcher default properties
[ https://issues.apache.org/jira/browse/SPARK-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410129#comment-15410129 ] Apache Spark commented on SPARK-16927: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/14511 > Mesos Cluster Dispatcher default properties > --- > > Key: SPARK-16927 > URL: https://issues.apache.org/jira/browse/SPARK-16927 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > Add the capability to set default driver properties for all jobs submitted > through the dispatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf
[ https://issues.apache.org/jira/browse/SPARK-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16923: Assignee: (was: Apache Spark) > Mesos cluster scheduler duplicates config vars by setting them in the > environment and as --conf > --- > > Key: SPARK-16923 > URL: https://issues.apache.org/jira/browse/SPARK-16923 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I don't think this introduces any bugs, but we should fix it nonetheless -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf
[ https://issues.apache.org/jira/browse/SPARK-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410130#comment-15410130 ] Apache Spark commented on SPARK-16923: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/14511 > Mesos cluster scheduler duplicates config vars by setting them in the > environment and as --conf > --- > > Key: SPARK-16923 > URL: https://issues.apache.org/jira/browse/SPARK-16923 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I don't think this introduces any bugs, but we should fix it nonetheless -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16927) Mesos Cluster Dispatcher default properties
Michael Gummelt created SPARK-16927: --- Summary: Mesos Cluster Dispatcher default properties Key: SPARK-16927 URL: https://issues.apache.org/jira/browse/SPARK-16927 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 2.0.0 Reporter: Michael Gummelt Add the capability to set default driver properties for all jobs submitted through the dispatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16260) ML Example Improvements and Cleanup
[ https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-16260: - Description: This parent task is to track a few possible improvements and cleanup for PySpark ML examples I noticed during 2.0 QA. These include: * Parity between different API examples * Ensure input format is documented in example * Ensure results of example are clear and demonstrate functionality * Cleanup unused imports * Fix minor issues was: This parent task is to track a few possible improvements and cleanup for PySpark ML examples I noticed during 2.0 QA. These include: * Parity with Scala ML examples * Ensure input format is documented in example * Ensure results of example are clear and demonstrate functionality * Cleanup unused imports * Fix minor issues > ML Example Improvements and Cleanup > --- > > Key: SPARK-16260 > URL: https://issues.apache.org/jira/browse/SPARK-16260 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > This parent task is to track a few possible improvements and cleanup for > PySpark ML examples I noticed during 2.0 QA. These include: > * Parity between different API examples > * Ensure input format is documented in example > * Ensure results of example are clear and demonstrate functionality > * Cleanup unused imports > * Fix minor issues -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16260) ML Example Improvements and Cleanup
[ https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-16260: - Summary: ML Example Improvements and Cleanup (was: PySpark ML Example Improvements and Cleanup) > ML Example Improvements and Cleanup > --- > > Key: SPARK-16260 > URL: https://issues.apache.org/jira/browse/SPARK-16260 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > This parent task is to track a few possible improvements and cleanup for > PySpark ML examples I noticed during 2.0 QA. These include: > * Parity with Scala ML examples > * Ensure input format is documented in example > * Ensure results of example are clear and demonstrate functionality > * Cleanup unused imports > * Fix minor issues -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410105#comment-15410105 ] Hossein Falaki commented on SPARK-16883: Thanks [~shivaram]! This may require changing the serialization. Who thought this might happen! ;) > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > To reproduce run following code. As you can see "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16817) Enable storing of shuffle data in Alluxio
[ https://issues.apache.org/jira/browse/SPARK-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16817. --- Resolution: Won't Fix We removed the Tachyon dependency a while ago, although its purpose was somewhat different. I don't know that this is a good use for Alluxio. You indeed really just want fast local storage and not something distributed or fault tolerant. > Enable storing of shuffle data in Alluxio > - > > Key: SPARK-16817 > URL: https://issues.apache.org/jira/browse/SPARK-16817 > Project: Spark > Issue Type: New Feature >Reporter: Tim Bisson > > If one is using Alluxio for storage, it would also be useful if Spark can > store shuffle spill data in Alluxio. For example: > spark.local.dir="alluxio://host:port/path" > Several users on the Alluxio mailing list have asked for this feature: > https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/90pRZWRVi0s/mgLWLS5aAgAJ > https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/s9H93PnDebw/v_1_FMjR7vEJ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16784. --- Resolution: Not A Problem Reopen if that's not what you meant > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410078#comment-15410078 ] Apache Spark commented on SPARK-16925: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/14510 > Spark tasks which cause JVM to exit with a zero exit code may cause app to > hang in Standalone mode > -- > > Key: SPARK-16925 > URL: https://issues.apache.org/jira/browse/SPARK-16925 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > If you have a Spark standalone cluster which runs a single application and > you have a Spark task which repeatedly fails by causing the executor JVM to > exit with a _zero_ exit code then this may temporarily freeze / hang the > Spark application. > For example, running > {code} > sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } > {code} > on a cluster will cause all executors to die but those executors won't be > replaced unless another Spark application or worker joins or leaves the > cluster. This is caused by a bug in the standalone Master where > {{schedule()}} is only called on executor exit when the exit code is > non-zero, whereas I think that we should always call {{schedule()}} even on a > "clean" executor shutdown since {{schedule()}} should always be safe to call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-16925: --- Description: If you have a Spark standalone cluster which runs a single application and you have a Spark task which repeatedly fails by causing the executor JVM to exit with a _zero_ exit code then this may temporarily freeze / hang the Spark application. For example, running {code} sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } {code} on a cluster will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster. This is caused by a bug in the standalone Master where {{schedule()}} is only called on executor exit when the exit code is non-zero, whereas I think that we should always call {{schedule()}} even on a "clean" executor shutdown since {{schedule()}} should always be safe to call. was: If you have a Spark standalone cluster which runs a single application and you have a Spark task which repeatedly fails by causing the executor JVM to exit with a _zero_ exit code then this may temporarily freeze / hang the Spark application. For example, running {code} sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } {code} on a cluster will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster. This is caused by a bug in the standalone Master where {{schedule()}} is only called on executor exit when the exit code is non-zero, whereas I think that we should always call {{schedule()}} even on a "clean" executor shutdown since {{schedule()}} should be idempotent. 
> Spark tasks which cause JVM to exit with a zero exit code may cause app to > hang in Standalone mode > -- > > Key: SPARK-16925 > URL: https://issues.apache.org/jira/browse/SPARK-16925 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > If you have a Spark standalone cluster which runs a single application and > you have a Spark task which repeatedly fails by causing the executor JVM to > exit with a _zero_ exit code then this may temporarily freeze / hang the > Spark application. > For example, running > {code} > sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } > {code} > on a cluster will cause all executors to die but those executors won't be > replaced unless another Spark application or worker joins or leaves the > cluster. This is caused by a bug in the standalone Master where > {{schedule()}} is only called on executor exit when the exit code is > non-zero, whereas I think that we should always call {{schedule()}} even on a > "clean" executor shutdown since {{schedule()}} should always be safe to call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
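The scheduling bug described above can be modeled with a toy handler (this is a simplification, not Spark's Master code): gating schedule() on a non-zero exit code means a clean executor exit is never backfilled, while calling it unconditionally is safe because schedule() is safe to call:

```java
// Toy model of the described Standalone Master bug. liveExecutors and
// schedule() stand in for the real executor bookkeeping and replacement
// launching; the point is only the exit-code gating.
class Master {
    int liveExecutors = 1;

    void schedule() {
        liveExecutors = 1; // stand-in for launching replacement executors
    }

    // Buggy handler: skips schedule() when the executor exited cleanly,
    // so the app is left with no executors.
    void onExecutorExitBuggy(int exitCode) {
        liveExecutors--;
        if (exitCode != 0) schedule();
    }

    // Fixed handler: always reschedule, since schedule() is safe to call.
    void onExecutorExitFixed(int exitCode) {
        liveExecutors--;
        schedule();
    }
}
```

With exit code 0, the buggy handler leaves the app hung at zero executors until some unrelated cluster event triggers schedule(); the fixed handler backfills immediately.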
[jira] [Resolved] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16709. --- Resolution: Duplicate > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throws an > exception at SparkHadoopMapRedUtil.performCommit(), the task can retry > infinitely. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83
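A minimal sketch of the fix direction, in Python for illustration (the exception class and failure-budget names are hypothetical, not Spark's API): a commit-denied failure should count against a bounded failure budget rather than being retried forever.

```python
# Hypothetical sketch: treat a commit failure as an ordinary task failure
# that consumes the task's bounded retry budget, instead of looping forever.
class CommitDeniedError(Exception):
    pass

def run_task(task, max_failures=4):
    failures = 0
    while True:
        try:
            return task()
        except CommitDeniedError:
            failures += 1
            if failures >= max_failures:
                raise  # give up instead of retrying infinitely

attempts = []
def always_denied():
    attempts.append(1)
    raise CommitDeniedError()

try:
    run_task(always_denied)
except CommitDeniedError:
    pass
assert len(attempts) == 4   # bounded, not infinite
```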
[jira] [Resolved] (SPARK-16499) Improve applyInPlace function for matrix in ANN code
[ https://issues.apache.org/jira/browse/SPARK-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16499. --- Resolution: Won't Fix > Improve applyInPlace function for matrix in ANN code > > > Key: SPARK-16499 > URL: https://issues.apache.org/jira/browse/SPARK-16499 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The ApplyInPlace function in the ANN code: > {code} > def apply(x: BDM[Double], y: BDM[Double], func: Double => Double): Unit > {code} > is implemented using a simple while loop. > It should be replaced with a Breeze BDM matrix operation, so that it > can take advantage of the Breeze library's optimizations.
[jira] [Resolved] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16455. --- Resolution: Won't Fix > Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling > new tasks when cluster is restarting > > > Key: SPARK-16455 > URL: https://issues.apache.org/jira/browse/SPARK-16455 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: YangyangLiu >Priority: Minor > > In our case, we are implementing a new mechanism which will let the driver > survive while the cluster is temporarily down and restarting. So when the service > provided by the cluster is not available, the scheduler should stop scheduling new > tasks. I added a hook inside the CoarseGrainedSchedulerBackend class to > avoid scheduling new tasks when necessary.
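The proposed hook pattern can be sketched as follows (illustrative Python, not the actual CoarseGrainedSchedulerBackend API): the backend consults an overridable predicate before offering resources to pending tasks.

```python
# Illustrative sketch of a scheduling veto hook. Subclasses override
# can_schedule() to hold tasks back while the cluster is restarting.
class SchedulerBackend:
    def __init__(self):
        self.cluster_available = True

    def can_schedule(self):
        # hook: subclasses override this to veto scheduling
        return self.cluster_available

    def make_offers(self, pending_tasks):
        if not self.can_schedule():
            return []                  # cluster restarting: schedule nothing
        return list(pending_tasks)

backend = SchedulerBackend()
assert backend.make_offers(["t1", "t2"]) == ["t1", "t2"]
backend.cluster_available = False
assert backend.make_offers(["t1", "t2"]) == []
```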
[jira] [Updated] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-16925: --- Summary: Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode (was: Spark tasks which cause JVM to exit with a zero exit code may cause app to hang) > Spark tasks which cause JVM to exit with a zero exit code may cause app to > hang in Standalone mode > -- > > Key: SPARK-16925 > URL: https://issues.apache.org/jira/browse/SPARK-16925 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > If you have a Spark standalone cluster which runs a single application and > you have a Spark task which repeatedly fails by causing the executor JVM to > exit with a _zero_ exit code then this may temporarily freeze / hang the > Spark application. > For example, running > {code} > sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } > {code} > on a cluster will cause all executors to die but those executors won't be > replaced unless another Spark application or worker joins or leaves the > cluster. This is caused by a bug in the standalone Master where > {{schedule()}} is only called on executor exit when the exit code is > non-zero, whereas I think that we should always call {{schedule()}} even on a > "clean" executor shutdown since {{schedule()}} should be idempotent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang
Josh Rosen created SPARK-16925: -- Summary: Spark tasks which cause JVM to exit with a zero exit code may cause app to hang Key: SPARK-16925 URL: https://issues.apache.org/jira/browse/SPARK-16925 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.0.0, 1.6.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical If you have a Spark standalone cluster which runs a single application and you have a Spark task which repeatedly fails by causing the executor JVM to exit with a _zero_ exit code then this may temporarily freeze / hang the Spark application. For example, running {code} sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) } {code} on a cluster will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster. This is caused by a bug in the standalone Master where {{schedule()}} is only called on executor exit when the exit code is non-zero, whereas I think that we should always call {{schedule()}} even on a "clean" executor shutdown since {{schedule()}} should be idempotent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16926) Partition columns are present in columns metadata for partition but not table
Brian Cho created SPARK-16926: - Summary: Partition columns are present in columns metadata for partition but not table Key: SPARK-16926 URL: https://issues.apache.org/jira/browse/SPARK-16926 Project: Spark Issue Type: Bug Components: SQL Reporter: Brian Cho A change introduced in SPARK-14388 removes partition columns from the column metadata of tables, but not for partitions. This causes TableReader to believe that the schema is different and create an unnecessary conversion object inspector, taking the else codepath in TableReader below: {code} val soi = if (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) { rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector] } else { ObjectInspectorConverters.getConvertedOI( rawDeser.getObjectInspector, tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector] } {code} Printing the properties as debug output confirms the difference for the Hive table. Table properties (tableDesc.getProperties): {code} 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, string:bigint:string:bigint:bigint:array {code} Partition properties (partProps): {code} 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, string:bigint:string:bigint:bigint:array:string:string:string {code} Where the final three string columns are partition columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
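The mismatch can be seen directly by splitting the two {{columns.types}} strings from the debug output above: the partition schema carries three extra trailing columns (the partition columns), so the object-inspector equality check fails and the conversion code path is taken.

```python
# The two columns.types values from the debug output above, compared the
# way the reader effectively compares schemas.
table_types = "string:bigint:string:bigint:bigint:array".split(":")
part_types = "string:bigint:string:bigint:bigint:array:string:string:string".split(":")

# The table schema is a strict prefix of the partition schema...
assert table_types == part_types[:len(table_types)]
# ...and the three extra trailing columns are the partition columns.
extra = part_types[len(table_types):]
assert extra == ["string", "string", "string"]
```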
[jira] [Assigned] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
[ https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16924: Assignee: Apache Spark > DataStreamReader can not support option("inferSchema", true/false) for csv > and json file source > --- > > Key: SPARK-16924 > URL: https://issues.apache.org/jira/browse/SPARK-16924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Assignee: Apache Spark > > Currently DataStreamReader can not support option("inferSchema", true|false) > for csv and json file source. It only takes SQLConf setting > "spark.sql.streaming.schemaInference", which needs to be set at session > level. > For example: > {code} > scala> val in = spark.readStream.format("json").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/json/t1") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > scala> val in = spark.readStream.format("csv").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/csv") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. 
If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > {code} > In the example, even though users specify the option("inferSchema", true), it > does not take it. But for batch data, DataFrameReader can take it: > {code} > scala> val in = spark.read.format("csv").option("header", > true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") > in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
[ https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16924: Assignee: (was: Apache Spark) > DataStreamReader can not support option("inferSchema", true/false) for csv > and json file source > --- > > Key: SPARK-16924 > URL: https://issues.apache.org/jira/browse/SPARK-16924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > Currently DataStreamReader can not support option("inferSchema", true|false) > for csv and json file source. It only takes SQLConf setting > "spark.sql.streaming.schemaInference", which needs to be set at session > level. > For example: > {code} > scala> val in = spark.readStream.format("json").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/json/t1") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > scala> val in = spark.readStream.format("csv").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/csv") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. 
If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > {code} > In the example, even though users specify the option("inferSchema", true), it > does not take it. But for batch data, DataFrameReader can take it: > {code} > scala> val in = spark.read.format("csv").option("header", > true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") > in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
[ https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410059#comment-15410059 ] Apache Spark commented on SPARK-16924: -- User 'xwu0226' has created a pull request for this issue: https://github.com/apache/spark/pull/14509 > DataStreamReader can not support option("inferSchema", true/false) for csv > and json file source > --- > > Key: SPARK-16924 > URL: https://issues.apache.org/jira/browse/SPARK-16924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > > Currently DataStreamReader can not support option("inferSchema", true|false) > for csv and json file source. It only takes SQLConf setting > "spark.sql.streaming.schemaInference", which needs to be set at session > level. > For example: > {code} > scala> val in = spark.readStream.format("json").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/json/t1") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 
48 elided > scala> val in = spark.readStream.format("csv").option("inferSchema", > true).load("/Users/xinwu/spark-test/data/csv") > java.lang.IllegalArgumentException: Schema must be specified when creating a > streaming source DataFrame. If some files already exist in the directory, > then depending on the file format you may be able to create a static > DataFrame on that directory with 'spark.read.load(directory)' and infer > schema from it. > at > org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) > at > org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) > at > org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) > at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) > ... 48 elided > {code} > In the example, even though users specify the option("inferSchema", true), it > does not take it. But for batch data, DataFrameReader can take it: > {code} > scala> val in = spark.read.format("csv").option("header", > true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") > in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source
Xin Wu created SPARK-16924: -- Summary: DataStreamReader can not support option("inferSchema", true/false) for csv and json file source Key: SPARK-16924 URL: https://issues.apache.org/jira/browse/SPARK-16924 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Currently DataStreamReader can not support option("inferSchema", true|false) for csv and json file source. It only takes SQLConf setting "spark.sql.streaming.schemaInference", which needs to be set at session level. For example: {code} scala> val in = spark.readStream.format("json").option("inferSchema", true).load("/Users/xinwu/spark-test/data/json/t1") java.lang.IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static DataFrame on that directory with 'spark.read.load(directory)' and infer schema from it. at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) ... 48 elided scala> val in = spark.readStream.format("csv").option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv") java.lang.IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static DataFrame on that directory with 'spark.read.load(directory)' and infer schema from it. 
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80) at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153) ... 48 elided {code} In the example, even though users specify the option("inferSchema", true), it does not take it. But for batch data, DataFrameReader can take it: {code} scala> val in = spark.read.format("csv").option("header", true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1") in: org.apache.spark.sql.DataFrame = [signal: string, flash: int] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
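The precedence the report asks for can be sketched as follows (the function and dictionary shapes are illustrative, not Spark's API): a per-reader {{inferSchema}} option should take effect, falling back to the session-level {{spark.sql.streaming.schemaInference}} conf only when the option is absent.

```python
# Hypothetical sketch of option-vs-conf precedence for streaming schema
# inference: the reader option, when present, overrides the session conf.
def schema_inference_enabled(options, session_conf):
    if "inferSchema" in options:                 # per-reader option wins
        return str(options["inferSchema"]).lower() == "true"
    return session_conf.get("spark.sql.streaming.schemaInference", "false") == "true"

assert schema_inference_enabled({"inferSchema": True}, {}) is True
assert schema_inference_enabled({}, {"spark.sql.streaming.schemaInference": "true"}) is True
assert schema_inference_enabled({}, {}) is False
```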
[jira] [Created] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf
Michael Gummelt created SPARK-16923: --- Summary: Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf Key: SPARK-16923 URL: https://issues.apache.org/jira/browse/SPARK-16923 Project: Spark Issue Type: Task Components: Mesos Affects Versions: 2.0.0 Reporter: Michael Gummelt I don't think this introduces any bugs, but we should fix it nonetheless.
[jira] [Resolved] (SPARK-13238) Add ganglia dmax parameter
[ https://issues.apache.org/jira/browse/SPARK-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13238. Resolution: Fixed Assignee: Ekasit Kijsipongse Fix Version/s: 2.1.0 > Add ganglia dmax parameter > -- > > Key: SPARK-13238 > URL: https://issues.apache.org/jira/browse/SPARK-13238 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ekasit Kijsipongse >Assignee: Ekasit Kijsipongse >Priority: Minor > Fix For: 2.1.0 > > > The current ganglia reporter doesn't set metric expiration time (dmax). The > metrics of all finished applications are indefinitely left displayed in > ganglia web. The dmax parameter allows user to set the lifetime of the > metrics. The default value is 0 for compatibility with previous versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
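For reference, Ganglia sink options live in {{conf/metrics.properties}}. A hedged sketch of how the new {{dmax}} parameter would be set (the host and port values are placeholders, and the exact key name is an assumption based on the description above; 0 keeps the old never-expire behavior):

```properties
# Hypothetical conf/metrics.properties fragment (key name assumed)
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=ganglia.example.com
*.sink.ganglia.port=8649
# Expire metrics of finished applications after one hour:
*.sink.ganglia.dmax=3600
```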
[jira] [Commented] (SPARK-16586) spark-class crash with "[: too many arguments" instead of displaying the correct error message
[ https://issues.apache.org/jira/browse/SPARK-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409984#comment-15409984 ] Apache Spark commented on SPARK-16586: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/14508 > spark-class crash with "[: too many arguments" instead of displaying the > correct error message > -- > > Key: SPARK-16586 > URL: https://issues.apache.org/jira/browse/SPARK-16586 > Project: Spark > Issue Type: Bug >Reporter: Xiang Gao >Priority: Minor > > When trying to run spark on a machine that cannot provide enough memory for > java to use, instead of printing the correct error message, spark-class will > crash with {{spark-class: line 83: [: too many arguments}} > Simple shell commands to trigger this problem are: > {code} > ulimit -v 10 > ./sbin/start-master.sh > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16260) PySpark ML Example Improvements and Cleanup
[ https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16260. --- Resolution: Done Assignee: Bryan Cutler > PySpark ML Example Improvements and Cleanup > --- > > Key: SPARK-16260 > URL: https://issues.apache.org/jira/browse/SPARK-16260 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > This parent task is to track a few possible improvements and cleanup for > PySpark ML examples I noticed during 2.0 QA. These include: > * Parity with Scala ML examples > * Ensure input format is documented in example > * Ensure results of example are clear and demonstrate functionality > * Cleanup unused imports > * Fix minor issues -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16826: -- Assignee: Sylvain Zimmer > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Sylvain Zimmer > Fix For: 2.1.0 > > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
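For contrast, a lock-free way to extract a host, shown in Python for illustration: {{urllib.parse}} does pure string parsing and touches no shared synchronized cache, unlike {{java.net.URL}}, whose stream-handler lookup goes through the synchronized {{java.util.Hashtable}} seen at the top of the thread dump above.

```python
# Lock-free host extraction: plain string parsing, no shared mutable state,
# so many threads can run this concurrently without blocking each other.
from urllib.parse import urlparse

def parse_url_host(url):
    return urlparse(url).hostname

assert parse_url_host("https://spark.apache.org/docs/latest/") == "spark.apache.org"
```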
[jira] [Updated] (SPARK-16421) Improve output from ML examples
[ https://issues.apache.org/jira/browse/SPARK-16421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16421: -- Assignee: Bryan Cutler Priority: Major (was: Trivial) > Improve output from ML examples > --- > > Key: SPARK-16421 > URL: https://issues.apache.org/jira/browse/SPARK-16421 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.1.0 > > > In many ML examples, the output is useless. Sometimes {{show()}} is called > and any pertinent results are hidden. For example, here is the output of > max_abs_scaler > {noformat} > $ bin/spark-submit examples/src/main/python/ml/max_abs_scaler_example.py > +-+++ > |label|features| scaledFeatures| > +-+++ > | 0.0|(692,[127,128,129...|(692,[127,128,129...| > | 1.0|(692,[158,159,160...|(692,[158,159,160...| > | 1.0|(692,[124,125,126...|(692,[124,125,126...| > {noformat} > Other times a few rows are printed out when {{show}} might be more > appropriate. Here is the output from binarizer_example > {noformat} > $ bin/spark-submit examples/src/main/python/ml/binarizer_example.py > 0.0 > > 1.0 > 0.0 > {noformat} > But would be much more useful to just {{show()}} the transformed DataFrame > {noformat} > +-+---+-+ > |label|feature|binarized_feature| > +-+---+-+ > |0|0.1| 0.0| > |1|0.8| 1.0| > |2|0.2| 0.0| > +-+---+-+ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16421) Improve output from ML examples
[ https://issues.apache.org/jira/browse/SPARK-16421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16421. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14308 [https://github.com/apache/spark/pull/14308] > Improve output from ML examples > --- > > Key: SPARK-16421 > URL: https://issues.apache.org/jira/browse/SPARK-16421 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Reporter: Bryan Cutler >Priority: Trivial > Fix For: 2.1.0 > > > In many ML examples, the output is useless. Sometimes {{show()}} is called > and any pertinent results are hidden. For example, here is the output of > max_abs_scaler > {noformat} > $ bin/spark-submit examples/src/main/python/ml/max_abs_scaler_example.py > +-+++ > |label|features| scaledFeatures| > +-+++ > | 0.0|(692,[127,128,129...|(692,[127,128,129...| > | 1.0|(692,[158,159,160...|(692,[158,159,160...| > | 1.0|(692,[124,125,126...|(692,[124,125,126...| > {noformat} > Other times a few rows are printed out when {{show}} might be more > appropriate. Here is the output from binarizer_example > {noformat} > $ bin/spark-submit examples/src/main/python/ml/binarizer_example.py > 0.0 > > 1.0 > 0.0 > {noformat} > But would be much more useful to just {{show()}} the transformed DataFrame > {noformat} > +-+---+-+ > |label|feature|binarized_feature| > +-+---+-+ > |0|0.1| 0.0| > |1|0.8| 1.0| > |2|0.2| 0.0| > +-+---+-+ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16826. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14488 [https://github.com/apache/spark/pull/14488] > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > Fix For: 2.1.0 > > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.<init>(URL.java:599) > java.net.URL.<init>(URL.java:490) > java.net.URL.<init>(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409878#comment-15409878 ] Sital Kedia commented on SPARK-16922: - PS - Rerunning the query with spark.sql.codegen.aggregate.map.columns.max=0 to disable vectorized aggregation for the columnar map also OOMs, but with a different stack trace. {code} [Stage 1:>(0 + 4) / 184]16/08/05 11:26:31 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 4, hadoop4774.prn2.facebook.com): java.lang.OutOfMemoryError: Java heap space at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} > Query failure due to executor OOM in Spark 2.0 > -- > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > 
Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 
100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query
[jira] [Assigned] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16905: Assignee: Apache Spark (was: Davies Liu) > Support SQL DDL: MSCK REPAIR TABLE > -- > > Key: SPARK-16905 > URL: https://issues.apache.org/jira/browse/SPARK-16905 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Apache Spark > > MSCK REPAIR TABLE could be used to recover the partitions in the metastore based > on the partitions in the file system. > Another syntax is: > ALTER TABLE table RECOVER PARTITIONS
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409829#comment-15409829 ] Sean Owen commented on SPARK-6305: -- The problem, I think, is that then this would not take effect if the user overrides with a custom logging configuration. If this, or the silencing that some of the examples do, is really otherwise blocking this change, I'd say yank it out so we can update and just live with the message, or at least provide the appropriate config in a default logging file that users can copy if desired. I'm not so worried about this aspect as much as maintaining the right POM configuration to exclude, and continue to exclude, all external usages of log4j 1 and do all the appropriate redirecting. That's going to be a small nightmare to keep correct; though if Spark stops using log4j directly, it's probably less problematic. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
[jira] [Commented] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409830#comment-15409830 ] Apache Spark commented on SPARK-16905: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14500 > Support SQL DDL: MSCK REPAIR TABLE > -- > > Key: SPARK-16905 > URL: https://issues.apache.org/jira/browse/SPARK-16905 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > > MSCK REPAIR TABLE could be used to recover the partitions in the metastore based > on the partitions in the file system. > Another syntax is: > ALTER TABLE table RECOVER PARTITIONS
[jira] [Assigned] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16905: Assignee: Davies Liu (was: Apache Spark) > Support SQL DDL: MSCK REPAIR TABLE > -- > > Key: SPARK-16905 > URL: https://issues.apache.org/jira/browse/SPARK-16905 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > > MSCK REPAIR TABLE could be used to recover the partitions in the metastore based > on the partitions in the file system. > Another syntax is: > ALTER TABLE table RECOVER PARTITIONS
[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409813#comment-15409813 ] Sital Kedia commented on SPARK-16922: - cc - [~rxin] > Query failure due to executor OOM in Spark 2.0 > -- > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16922) Query failure due to executor OOM in Spark 2.0
Sital Kedia created SPARK-16922: --- Summary: Query failure due to executor OOM in Spark 2.0 Key: SPARK-16922 URL: https://issues.apache.org/jira/browse/SPARK-16922 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.0.0 Reporter: Sital Kedia A query which used to work in Spark 1.6 fails with executor OOM in 2.0. Stack trace - {code} at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} Query plan in Spark 1.6 {code} == Physical Plan == TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) +- TungstenExchange 
hashpartitioning(field1#101,200), None +- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) +- Project [field1#101,field2#74] +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as decimal(20,0)) as bigint)], BuildRight :- ConvertToUnsafe : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] +- ConvertToUnsafe +- HiveTableScan [field1#101,field4#97], MetastoreRelation foo, table2, Some(b) {code} Query plan in 2.0 {code} == Physical Plan == *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) +- Exchange hashpartitioning(field1#160, 200) +- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 100.0))]) +- *Project [field2#133, field1#160] +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as decimal(20,0)) as bigint)], Inner, BuildRight :- *Filter isnotnull(field5#122L) : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 2013-12-31)] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as decimal(20,0)) as bigint))) +- *Filter isnotnull(field4#156) +- HiveTableScan [field4#156, field1#160], MetastoreRelation foo, table2, b {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409801#comment-15409801 ] Yin Huai commented on SPARK-16610: -- Sure. For ORC, {{orc.compression}} is the actual conf that lets the ORC writer know what codec to use. I think it is fine to have {{compression}}. But, if a user uses {{orc.compression}} and does not use {{compression}}, we cannot just ignore the value of {{orc.compression}} and use the default value of {{compression}} to override {{orc.compression}}. > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > > For the ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec, and its value will also be set to orc.compress (the orc conf > used for the codec). However, if a user only sets {{orc.compress}} in the writer > options, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}.
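The precedence Yin Huai describes can be sketched as a small resolution function. This is a hypothetical illustration, not Spark's actual option-handling code; the option names and the snappy default mirror the discussion above:

```python
# Hypothetical sketch of the codec precedence discussed in SPARK-16610:
# an explicit "compression" writer option wins; otherwise a user-supplied
# "orc.compress" (the native ORC conf) must be respected; only when neither
# is set should the default codec apply.

DEFAULT_CODEC = "snappy"

def resolve_orc_codec(options):
    if "compression" in options:       # explicit writer option takes priority
        return options["compression"]
    if "orc.compress" in options:      # native ORC conf set by the user
        return options["orc.compress"]
    return DEFAULT_CODEC               # fall back to the default only now

# The bug being reported is equivalent to skipping the middle branch,
# which would silently override a user-provided orc.compress with snappy.
assert resolve_orc_codec({}) == "snappy"
assert resolve_orc_codec({"orc.compress": "ZLIB"}) == "ZLIB"
assert resolve_orc_codec({"compression": "none", "orc.compress": "ZLIB"}) == "none"
```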
[jira] [Resolved] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5312. -- Resolution: Won't Fix > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > -Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead.- > There is a tool called [ClassUtil|http://software.clapper.org/classutil/] > that seems to help give this kind of information much more directly. We > should look into using that.
[jira] [Created] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers
Nicholas Chammas created SPARK-16921: Summary: RDD/DataFrame persist() and cache() should return Python context managers Key: SPARK-16921 URL: https://issues.apache.org/jira/browse/SPARK-16921 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Reporter: Nicholas Chammas Priority: Minor [Context managers|https://docs.python.org/3/reference/datamodel.html#context-managers] are a natural way to capture closely related setup and teardown code in Python. For example, they are commonly used when doing file I/O: {code} with open('/path/to/file') as f: contents = f.read() ... {code} Once the program exits the with block, {{f}} is automatically closed. I think it makes sense to apply this pattern to persisting and unpersisting DataFrames and RDDs. There are many cases when you want to persist a DataFrame for a specific set of operations and then unpersist it immediately afterwards. For example, take model training. Today, you might do something like this: {code} labeled_data.persist() model = pipeline.fit(labeled_data) labeled_data.unpersist() {code} If {{persist()}} returned a context manager, you could rewrite this as follows: {code} with labeled_data.persist(): model = pipeline.fit(labeled_data) {code} Upon exiting the {{with}} block, {{labeled_data}} would automatically be unpersisted. This can be done in a backwards-compatible way since {{persist()}} would still return the parent DataFrame or RDD as it does today, but add two methods to the object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
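The backwards-compatible shape of the proposal can be sketched in plain Python with a stand-in class ({{FakeDataFrame}} is hypothetical, for illustration only; it is not Spark's API):

```python
# Minimal, self-contained sketch of the SPARK-16921 proposal: persist()
# keeps returning the object itself (backwards compatible), but the object
# also implements __enter__/__exit__ so it works in a `with` block.
# FakeDataFrame is a stand-in for illustration, not Spark's actual class.

class FakeDataFrame:
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self          # unchanged contract: returns the parent object

    def unpersist(self):
        self.persisted = False
        return self

    def __enter__(self):
        return self          # entering the block hands back the frame

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.unpersist()     # teardown runs even if the block raises
        return False         # do not swallow exceptions

df = FakeDataFrame()
with df.persist():
    assert df.persisted      # cached inside the block
assert not df.persisted      # automatically unpersisted on exit
```

Because `persist()` still returns the frame, existing call sites that ignore the context-manager protocol keep working unchanged.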
[jira] [Closed] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-7505. --- Resolution: Invalid Closing this as invalid as I believe these issues are no longer important. > Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, > etc. > > > Key: SPARK-7505 > URL: https://issues.apache.org/jira/browse/SPARK-7505 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Nicholas Chammas >Priority: Minor > > The PySpark docs for DataFrame need the following fixes and improvements: > # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over > {{\_\_getattr\_\_}} and change all our examples accordingly. > # *We should say clearly that the API is experimental.* (That is currently > not the case for the PySpark docs.) > # We should provide an example of how to join and select from 2 DataFrames > that have identically named columns, because it is not obvious: > {code} > >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}'])) > >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}'])) > >>> df12 = df1.join(df2, df1['a'] == df2['a']) > >>> df12.select(df1['a'], df2['other']).show() > a other > > 4 I dunno {code} > # > [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy] > and > [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort] > should be marked as aliases if that's what they are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409767#comment-15409767 ] Nicholas Chammas commented on SPARK-5312: - [~boyork] - Shall we close this? It doesn't look like it has any momentum. > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > -Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead.- > There is a tool called [ClassUtil|http://software.clapper.org/classutil/] > that seems to help give this kind of information much more directly. We > should look into using that.
[jira] [Updated] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858
[ https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16920: -- Issue Type: Improvement (was: Bug) > Investigate and fix issues introduced in SPARK-15858 > > > Key: SPARK-16920 > URL: https://issues.apache.org/jira/browse/SPARK-16920 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg > > There were several issues with the PR resolving SPARK-15858; my comments > are available here: > https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93 > The two most important issues are: > 1. The PR did not add a stress test proving it resolved the issue it was > supposed to (though I have no doubt the optimization made is indeed correct). > 2. The PR introduced quadratic prediction time in terms of the number of > trees, which was previously linear. This needs to be investigated to determine > whether it causes problems for large numbers of trees (say, 1000); an > appropriate test should be added, and then the issue fixed.
[jira] [Created] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858
Vladimir Feinberg created SPARK-16920: - Summary: Investigate and fix issues introduced in SPARK-15858 Key: SPARK-16920 URL: https://issues.apache.org/jira/browse/SPARK-16920 Project: Spark Issue Type: Bug Components: MLlib Reporter: Vladimir Feinberg There were several issues with the PR resolving SPARK-15858; my comments are available here: https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93 The two most important issues are: 1. The PR did not add a stress test proving it resolved the issue it was supposed to (though I have no doubt the optimization made is indeed correct). 2. The PR introduced quadratic prediction time in terms of the number of trees, which was previously linear. This needs to be investigated to determine whether it causes problems for large numbers of trees (say, 1000); an appropriate test should be added, and then the issue fixed.
[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409691#comment-15409691 ] Shivaram Venkataraman commented on SPARK-16883: --- The thing to check then is how the serialization / deserialization is happening. That is, in SerDe.scala we should be writing the Decimal out as numeric for this to work correctly. > SQL decimal type is not properly cast to number when collecting SparkDataFrame > -- > > Key: SPARK-16883 > URL: https://issues.apache.org/jira/browse/SPARK-16883 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > To reproduce, run the following code. As you can see, "y" is a list of values. > {code} > registerTempTable(createDataFrame(iris), "iris") > str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y > from iris limit 5"))) > 'data.frame': 5 obs. of 2 variables: > $ x: num 1 1 1 1 1 > $ y:List of 5 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > ..$ : num 2 > {code}
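The fix Shivaram suggests can be illustrated by analogy in plain Python (the shapes below are illustrative only; SerDe.scala's actual writer API is not shown):

```python
# Analogy for the SPARK-16883 symptom and the suggested fix, in pure Python.
# If each Decimal is written out as an opaque boxed object, the R side sees
# a list of one-element values; converting to a plain numeric before writing
# yields a flat numeric vector. Names here are illustrative, not SerDe.scala's.

from decimal import Decimal

values = [Decimal("2"), Decimal("2"), Decimal("2"), Decimal("2"), Decimal("2")]

# Symptom: each element kept as its own boxed object -> "y: List of 5" shape.
as_objects = [[v] for v in values]

# Fix: write the Decimal out as a numeric -> "y: num 2 2 2 2 2" shape.
as_numeric = [float(v) for v in values]

assert len(as_objects) == 5 and all(isinstance(v, list) for v in as_objects)
assert as_numeric == [2.0, 2.0, 2.0, 2.0, 2.0]
```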
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409660#comment-15409660 ] Mikael Ståldal commented on SPARK-6305: --- For unit tests, you can put a logging configuration file in {{src/test/resources}}. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409655#comment-15409655 ] Mikael Ståldal commented on SPARK-6305: --- It should not be necessary to do explicit logging configuration for the REPL as done in Logging.scala:(130-141); you can include a log configuration file in {{src/main/resources}} in spark-repl instead. It is not as easy for the R and Python backends, though, since they are in the spark-core module. However, I think you should at least move that code to those classes rather than keeping it in Logging.scala. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409646#comment-15409646 ] Mikael Ståldal commented on SPARK-6305: --- It might not be necessary to load logging configuration explicitly as done in Logging.scala:(117-128). You could include a log configuration file in {{src/main/resources}} and then the user can override it with a Java system property. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
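As a sketch of that override, log4j 2.x honors the standard {{log4j.configurationFile}} system property; with Spark the user could pass it at launch time (the paths and jar name below are illustrative, not from this ticket):

```shell
# Override the default configuration bundled in src/main/resources by
# pointing log4j 2.x at an external file via its standard system property.
spark-submit \
  --driver-java-options "-Dlog4j.configurationFile=file:/etc/spark/log4j2.xml" \
  my-app.jar
```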
[jira] [Updated] (SPARK-16917) Spark streaming kafka version compatibility.
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudev updated SPARK-16917: -- Description: It would be nice to have Kafka version compatibility information in the official documentation. It's very confusing now. * If you look at this JIRA[1], it seems like Kafka is supported in Spark 2.0.0. * The documentation lists artifact for (Kafka 0.8) spark-streaming-kafka-0-8_2.11 Is Kafka 0.9 supported by Spark 2.0.0? Since I'm confused here even after an hour's effort of googling, I think someone should help add the compatibility matrix. [1] https://issues.apache.org/jira/browse/SPARK-12177 was: It would be nice to have Kafka version compatibility information in the official documentation. It's very confusing now. * If you look at this JIRA[1], it seems like Kafka is supported in Spark 2.0.0. * The documentation lists attifact for (Kafka 0.8) spark-streaming-kafka-0-8_2.11 Is Kafka 0.9 supported by Spark 2.0.0? Since I'm confused here even after an hour's effort of googling, I think someone should help add the compatibility matrix. [1] https://issues.apache.org/jira/browse/SPARK-12177 > Spark streaming kafka version compatibility. > - > > Key: SPARK-16917 > URL: https://issues.apache.org/jira/browse/SPARK-16917 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Sudev >Priority: Trivial > Labels: documentation > > It would be nice to have Kafka version compatibility information in the > official documentation. > It's very confusing now. > * If you look at this JIRA[1], it seems like Kafka is supported in Spark > 2.0.0. > * The documentation lists artifact for (Kafka 0.8) > spark-streaming-kafka-0-8_2.11 > Is Kafka 0.9 supported by Spark 2.0.0? > Since I'm confused here even after an hour's effort of googling, I > think someone should help add the compatibility matrix. 
> [1] https://issues.apache.org/jira/browse/SPARK-12177 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org