[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574502#comment-15574502 ]

Low Chin Wei commented on SPARK-13747:
--------------------------------------

I encountered this in 2.0.1. Is there any workaround? For example, would having a separate SparkSession help?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> ----------------------------------------------------------------
>
>                 Key: SPARK-13747
>                 URL: https://issues.apache.org/jira/browse/SPARK-13747
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Shixiong Zhu
>            Assignee: Andrew Or
>             Fix For: 2.0.0
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
>
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385)
> {code}
> This is because SparkContext.runJob can be suspended when using a ForkJoinPool (e.g. scala.concurrent.ExecutionContext.Implicits.global), as it calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). So when SparkContext.runJob is suspended, the ForkJoinPool will run another task in the same thread; however, the local properties have been polluted.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
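The failure mode described above can be modelled without Spark at all. The sketch below is a toy Python analogue (not Spark code): `with_new_execution_id` mimics the guard in `SQLExecution.withNewExecutionId`, and the nested call stands in for a ForkJoinPool running a second job on a thread that is blocked in `Await.ready`:

```python
import threading

_local = threading.local()  # stand-in for SparkContext's thread-local properties

def with_new_execution_id(body):
    # Mirrors the guard in SQLExecution.withNewExecutionId: a thread must
    # not start a new tracked SQL execution while one is already running.
    if getattr(_local, "execution_id", None) is not None:
        raise RuntimeError("spark.sql.execution.id is already set")
    _local.execution_id = object()
    try:
        return body()
    finally:
        _local.execution_id = None

def blocked_thread_steals_another_job():
    # A ForkJoinPool can reuse a thread that is blocked in Await.ready to
    # run a *different* submitted job, so two independent executions end up
    # nested on one thread -- modelled here as a direct nested call, which
    # trips the guard with exactly the IllegalArgumentException seen above.
    return with_new_execution_id(
        lambda: with_new_execution_id(lambda: "inner job"))
```

This also suggests why a dedicated fixed-size thread pool (one job per thread, no work stealing) avoids the problem, whereas the global ForkJoinPool does not.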
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574540#comment-15574540 ]

Apache Spark commented on SPARK-16845:
--------------------------------------

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15480

> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16845
>                 URL: https://issues.apache.org/jira/browse/SPARK-16845
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API, ML, MLlib
>    Affects Versions: 2.0.0
>            Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all columns, this fatal error occurs:
> ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
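The 64 KB limit being hit here is the JVM's cap on a single method's bytecode. The usual codegen remedy, and the general direction of fixes for generated orderings like this one, is to split the per-column comparisons across several helper methods instead of emitting one enormous method. Below is a toy Python code generator illustrating that splitting strategy; it is not Spark's generator, just a sketch of the idea:

```python
def gen_comparator(num_cols, chunk_size=100):
    """Generate Python source for a row comparator over `num_cols` columns,
    splitting the comparisons into helper functions of at most `chunk_size`
    columns each -- the analogue of keeping every generated JVM method
    under the 64 KB bytecode limit."""
    starts = list(range(0, num_cols, chunk_size))
    helpers = []
    for start in starts:
        lines = [
            f"    if a[{i}] != b[{i}]: return -1 if a[{i}] < b[{i}] else 1"
            for i in range(start, min(start + chunk_size, num_cols))
        ]
        helpers.append(
            f"def _cmp_{start}(a, b):\n" + "\n".join(lines) + "\n    return 0")
    # The top-level comparator just chains the helpers.
    calls = "\n".join(
        f"    r = _cmp_{start}(a, b)\n    if r != 0: return r" for start in starts)
    return "\n".join(helpers) + "\ndef compare(a, b):\n" + calls + "\n    return 0"

# Build a comparator for a 400-column row; the generated source contains
# four small helpers instead of one 400-branch function.
ns = {}
exec(gen_comparator(400), ns)
compare = ns["compare"]
```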
[jira] [Resolved] (SPARK-17903) MetastoreRelation should talk to external catalog instead of hive client
[ https://issues.apache.org/jira/browse/SPARK-17903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-17903.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

> MetastoreRelation should talk to external catalog instead of hive client
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17903
>                 URL: https://issues.apache.org/jira/browse/SPARK-17903
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.1.0
[jira] [Commented] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event
[ https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574639#comment-15574639 ]

Mario Briggs commented on SPARK-17917:
--------------------------------------

>> I don't have a strong feeling on this partly because I'm not sure what the action then is – kill the job? <<

Here is an example: let's say I am using a notebook and have kicked off some Spark actions that don't get executors because the user/org/group quotas of executors have been exhausted. These events can be used by the notebook implementor to surface the issue to the user via a UI update on that cell itself, perhaps additionally query the user/org/group quotas, show which apps are using them up, and let the user take whatever action is required (kill the other jobs, just wait on this job, etc.). So I am not looking to define on the event, in any way, what the set of possible actions is, since that would be very implementation-specific.

>> Maybe, I suppose it will be a little tricky to define what the event is here <<

Were you referring to the actual arguments of the event method? I can give a shot at defining them and then look for feedback.

> Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17917
>                 URL: https://issues.apache.org/jira/browse/SPARK-17917
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Mario Briggs
>            Priority: Minor
>
> When supporting Spark on a multi-tenant shared large cluster with quotas per tenant, a submitted taskSet often might not get executors because quotas have been exhausted or resources are unavailable. In these situations, firing a SparkListener event instead of just logging the issue (as done currently at https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192) would give applications/listeners an opportunity to handle this more appropriately as needed.
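A minimal sketch of what the proposed change could look like, written as a plain-Python observer pattern. The event name and fields (`UnschedulableTaskSetEvent`, `stage_id`, `reason`) are hypothetical, since the comment above notes the event's arguments were still to be defined:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class UnschedulableTaskSetEvent:
    # Hypothetical payload; the real arguments were still under discussion.
    stage_id: int
    reason: str

class ListenerBus:
    """Minimal stand-in for Spark's listener bus."""
    def __init__(self) -> None:
        self._listeners: List[Callable[[UnschedulableTaskSetEvent], None]] = []

    def add_listener(self, fn: Callable[[UnschedulableTaskSetEvent], None]) -> None:
        self._listeners.append(fn)

    def post(self, event: UnschedulableTaskSetEvent) -> None:
        for fn in self._listeners:
            fn(event)

def check_starvation(bus: ListenerBus, has_accepted_resources: bool,
                     stage_id: int) -> None:
    # Scheduler side: where TaskSchedulerImpl currently only logs a
    # warning, also post an event that listeners can act on.
    if not has_accepted_resources:
        bus.post(UnschedulableTaskSetEvent(
            stage_id, "Initial job has not accepted any resources"))
```

A notebook front end could then register a listener that updates the originating cell or queries tenant quotas, instead of users having to watch driver logs.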
[jira] [Comment Edited] (SPARK-16575) partition calculation mismatch with sc.binaryFiles
[ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572950#comment-15572950 ]

Tarun Kumar edited comment on SPARK-16575 at 10/14/16 8:38 AM:
---------------------------------------------------------------

[~rxin] I have now added support for openCostInBytes, similar to SQL (thanks for pointing that out). It now creates an optimized number of partitions. Request you to review and suggest. Once again, thanks for your suggestion; it worked like a charm!

was (Author: fidato13):
[~rxin] I have now added the support of openCostInBytes, similar to SQL (thanks for pointing out). It does now creates an optimized number of partitions. Request you to review and suggest. Once again, Thanks for your suggestion, It worked like a charm!

> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
>                 Key: SPARK-16575
>                 URL: https://issues.apache.org/jira/browse/SPARK-16575
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Suhas
>            Priority: Critical
>
> sc.binaryFiles is always creating an RDD with the number of partitions as 2.
> Steps to reproduce (tested on Databricks community edition):
> 1. Try to create an RDD using sc.binaryFiles. In this example, the airlines folder has 1922 files.
>    Ex: {noformat}val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. Check the number of partitions of the above RDD: binaryRDD.partitions.size = 2 (the expected value is more than 2).
> 3. If the RDD is created using sc.textFile, then the number of partitions is 1921.
> 4. Using the same sc.binaryFiles will create 1921 partitions in Spark 1.5.1.
> For an explanation with screenshots, please see the link below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
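For context on the openCostInBytes approach mentioned in the comment: Spark SQL packs input files into partitions by charging each file a fixed "open cost" on top of its size, so many small files share a partition without all collapsing into just two. A simplified Python model of that greedy packing follows; the constants mirror Spark SQL's defaults, but this is a sketch of the strategy, not Spark's actual implementation:

```python
def pack_partitions(file_sizes, open_cost=4 * 1024 * 1024,
                    max_split=128 * 1024 * 1024, parallelism=8):
    """Greedily pack files into partitions, charging each file an extra
    `open_cost` bytes. Tiny files therefore neither collapse into two
    partitions nor each get a partition of their own."""
    total = sum(file_sizes) + open_cost * len(file_sizes)
    target = min(max_split, max(open_cost, total // max(parallelism, 1)))
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost
        # Close the current partition once adding this file would
        # overshoot the target split size.
        if current and current_size + cost > target:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += cost
    if current:
        partitions.append(current)
    return partitions
```

With 1922 files of 10 KB each (as in the airlines example), this yields on the order of tens of partitions rather than 2 or 1922.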
[jira] [Resolved] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-10872.
-------------------------------
    Resolution: Not A Problem

I don't think this is something to be fixed, if I understand it correctly, in that it's not intended that you stop and start a context within one JVM.

> Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-10872
>                 URL: https://issues.apache.org/jira/browse/SPARK-10872
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0
>            Reporter: Dmytro Bielievtsov
>
> Starting from Spark 1.4.0 (it works well on 1.3.1), the following code fails with "XSDB6: Another instance of Derby may have already booted the database ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart the Spark context several times for isolated jobs, to prevent cache cluttering and GC errors. Here's a larger part of the full error trace:
> Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
> at > org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) > at java.security.AccessController.doPrivileged(Native Method) > at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) > at > javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) > at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HM
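XSDB6 is Derby's complaint that the embedded database directory is still locked by the first metastore client when the second HiveContext tries to boot it inside the same JVM. A toy Python model of that process-wide single-boot behaviour (not Derby or Spark code, just an illustration of the locking rule):

```python
class EmbeddedMetastore:
    """Toy model of the Derby behaviour behind error XSDB6: an embedded
    database directory may be booted by at most one client at a time in
    a process, and a second boot without a clean shutdown fails."""
    _booted = set()  # database paths currently locked, process-wide

    def __init__(self, path):
        if path in EmbeddedMetastore._booted:
            raise RuntimeError(
                "XSDB6: Another instance of Derby may have already "
                "booted the database " + path)
        EmbeddedMetastore._booted.add(path)
        self.path = path

    def shutdown(self):
        EmbeddedMetastore._booted.discard(self.path)

# In the report above, the first HiveContext's metastore client is never
# shut down when sc.stop() runs, so the second HiveContext hits the lock.
```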
[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
[ https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-17933: - Attachment: screenshot-2.png screenshot-1.png > Shuffle fails when driver is on one of the same machines as executor > > > Key: SPARK-17933 > URL: https://issues.apache.org/jira/browse/SPARK-17933 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 >Reporter: Frank Rosner > Attachments: screenshot-1.png, screenshot-2.png > > > h4. Problem > When I run a job that requires some shuffle, some tasks fail because the > executor cannot fetch the shuffle blocks from another executor. > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > 10-250-20-140:44042 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.nio.channels.UnresolvedAddressException > at sun.nio.ch.Net.checkAddress(Net.java:101) > at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622) > at > io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097) > at > io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannel
[jira] [Created] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
Frank Rosner created SPARK-17933: Summary: Shuffle fails when driver is on one of the same machines as executor Key: SPARK-17933 URL: https://issues.apache.org/jira/browse/SPARK-17933 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.6.2 Reporter: Frank Rosner Attachments: screenshot-1.png, screenshot-2.png h4. Problem When I run a job that requires some shuffle, some tasks fail because the executor cannot fetch the shuffle blocks from another executor. {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to connect to 10-250-20-140:44042 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 
3 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622) at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207) at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097) at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471) at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456) at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47) at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractCh
[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
[ https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-17933: - Description: h4. Problem When I run a job that requires some shuffle, some tasks fail because the executor cannot fetch the shuffle blocks from another executor. {noformat} org.apache.spark.shuffle.FetchFailedException: Failed to connect to 10-250-20-140:44042 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 
3 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622) at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207) at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097) at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471) at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456) at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47) at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471) at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456) at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50) at io.netty.channel.Abstra
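The root cause visible in the traces above is `java.nio.channels.UnresolvedAddressException`: the peer is advertised as `10-250-20-140`, an IPv4 address written with dashes (a common cloud-image hostname form) that DNS on the fetching executor cannot resolve. A hypothetical diagnostic helper (not a Spark API) that recovers the dotted address from such a name:

```python
import re

def dashed_hostname_to_ip(host):
    """If `host` is an IPv4 address written with dashes instead of dots
    (e.g. '10-250-20-140', as some cloud images name their hosts),
    return the dotted form; otherwise return None."""
    m = re.fullmatch(r"(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})", host)
    if m is None:
        return None
    octets = [int(g) for g in m.groups()]
    if any(o > 255 for o in octets):  # reject non-octet numbers
        return None
    return ".".join(str(o) for o in octets)
```

Checking whether each advertised block-manager host actually resolves (and whether it maps back to a reachable IP like this) is a quick way to confirm this diagnosis on a cluster.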
[jira] [Created] (SPARK-17934) Support percentile scale in ml.feature
Lei Wang created SPARK-17934:
-----------------------------

             Summary: Support percentile scale in ml.feature
                 Key: SPARK-17934
                 URL: https://issues.apache.org/jira/browse/SPARK-17934
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Lei Wang

Percentile scaling is often used in feature scaling. In my project, I need to use this scaler. Compared to MinMaxScaler, a PercentileScaler does not produce unstable results due to anomalously large values. About percentile scaling, refer to https://en.wikipedia.org/wiki/Percentile_rank
[jira] [Commented] (SPARK-17934) Support percentile scale in ml.feature
[ https://issues.apache.org/jira/browse/SPARK-17934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574787#comment-15574787 ] Sean Owen commented on SPARK-17934: --- This might be on the border of things that are widely used enough to include in MLlib. You can of course implement this as a UDF. > Support percentile scale in ml.feature > -- > > Key: SPARK-17934 > URL: https://issues.apache.org/jira/browse/SPARK-17934 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Lei Wang > > Percentile scaling is often used in feature scaling. > In my project, I need to use this scaler. > Compared to MinMaxScaler, a PercentileScaler would not produce unstable results > due to anomalously large values. > About percentile scaling, refer to https://en.wikipedia.org/wiki/Percentile_rank -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
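Since ml.feature has no percentile scaler, the UDF route means computing percentile ranks yourself. A minimal sketch of the underlying math (plain Python for illustration, not a Spark API; all names are ours):

```python
from bisect import bisect_left, bisect_right

def percentile_rank(sorted_values, x):
    # Percentile rank on a 0-100 scale:
    # (count strictly below x + half the count equal to x) / n.
    n = len(sorted_values)
    below = bisect_left(sorted_values, x)
    equal = bisect_right(sorted_values, x) - below
    return 100.0 * (below + 0.5 * equal) / n

values = sorted([1, 2, 3, 4, 1000000])  # one anomalously large value
scaled = [percentile_rank(values, v) for v in values]
# Unlike min-max scaling, the outlier does not compress the
# other values toward one end of the range.
```

In Spark SQL, a closely related quantity is available as the `percent_rank` window function, which could back such a UDF-based implementation.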
[jira] [Resolved] (SPARK-4257) Spark master can only be accessed by hostname
[ https://issues.apache.org/jira/browse/SPARK-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4257. -- Resolution: Not A Problem > Spark master can only be accessed by hostname > - > > Key: SPARK-4257 > URL: https://issues.apache.org/jira/browse/SPARK-4257 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Davies Liu >Priority: Critical > > After sbin/start-all.sh, the spark shell can not connect to standalone master > by spark://IP:7077, it works if replace IP by hostname. > In the docs[1], it says use `spark://IP:PORT` to connect to master. > [1] http://spark.apache.org/docs/latest/spark-standalone.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5113. -- Resolution: Won't Fix > Audit and document use of hostnames and IP addresses in Spark > - > > Key: SPARK-5113 > URL: https://issues.apache.org/jira/browse/SPARK-5113 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Priority: Critical > > Spark has multiple network components that start servers and advertise their > network addresses to other processes. > We should go through each of these components and make sure they have > consistent and/or documented behavior wrt (a) what interface(s) they bind to > and (b) what hostname they use to advertise themselves to other processes. We > should document this clearly and explain to people what to do in different > cases (e.g. EC2, dockerized containers, etc). > When Spark initializes, it will search for a network interface until it finds > one that is not a loopback address. Then it will do a reverse DNS lookup for > a hostname associated with that interface. Then the network components will > use that hostname to advertise the component to other processes. That > hostname is also the one used for the akka system identifier (akka supports > only supplying a single name which it uses both as the bind interface and as > the actor identifier). In some cases, that hostname is used as the bind > hostname also (e.g. I think this happens in the connection manager and > possibly akka) - which will likely internally result in a re-resolution of > this to an IP address. In other cases (the web UI and netty shuffle) we seem > to bind to all interfaces. 
> The best outcome would be to have three configs that can be set on each > machine: > {code} > SPARK_LOCAL_IP # Ip address we bind to for all services > SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within > the cluster > SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the > cluster (e.g. the UI) > {code} > It's not clear how easily we can support that scheme while providing > backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - > it's just an alias for what is now SPARK_PUBLIC_DNS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset
[ https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17929: Assignee: (was: Apache Spark) > Deadlock when AM restart and send RemoveExecutor on reset > - > > Key: SPARK-17929 > URL: https://issues.apache.org/jira/browse/SPARK-17929 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Weizhong >Priority: Minor > > We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala > {code} > protected def reset(): Unit = synchronized { > numPendingExecutors = 0 > executorsPendingToRemove.clear() > // Remove all the lingering executors that should be removed but not yet. > The reason might be > // because (1) disconnected event is not yet received; (2) executors die > silently. > executorDataMap.toMap.foreach { case (eid, _) => > driverEndpoint.askWithRetry[Boolean]( > RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager > re-registered."))) > } > } > {code} > but removeExecutor also needs the lock > "CoarseGrainedSchedulerBackend.this.synchronized", so this causes a deadlock: > the RPC send fails, and reset fails > {code} > private def removeExecutor(executorId: String, reason: > ExecutorLossReason): Unit = { > logDebug(s"Asked to remove executor $executorId with reason $reason") > executorDataMap.get(executorId) match { > case Some(executorInfo) => > // This must be synchronized because variables mutated > // in this block are read when requesting executors > val killed = CoarseGrainedSchedulerBackend.this.synchronized { > addressToExecutorId -= executorInfo.executorAddress > executorDataMap -= executorId > executorsPendingLossReason -= executorId > executorsPendingToRemove.remove(executorId).getOrElse(false) > } > ... 
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
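The lock interaction described above, reset() holding the backend monitor while waiting on an RPC whose handler needs the same monitor, can be sketched in miniature (Python threading; all names are hypothetical stand-ins for the Scala code, and the timeout exists only to make the deadlock observable instead of hanging):

```python
import threading

backend_lock = threading.Lock()  # stands in for CoarseGrainedSchedulerBackend's monitor
results = []

def remove_executor(executor_id):
    # Like removeExecutor: mutating the executor maps needs the backend lock.
    # A plain acquire() would block forever here; the timeout lets the
    # demo report the deadlock instead of hanging.
    if not backend_lock.acquire(timeout=0.5):
        results.append("deadlocked")
        return
    try:
        results.append("removed " + executor_id)
    finally:
        backend_lock.release()

def reset():
    # Like reset(): holds the lock, then blocks on an "RPC" that is
    # serviced by another thread which needs the same lock.
    with backend_lock:
        rpc_handler = threading.Thread(target=remove_executor, args=("exec-1",))
        rpc_handler.start()
        rpc_handler.join()  # askWithRetry effectively waits here

reset()
```

The fix direction implied by the report is to not hold the outer lock across the blocking `askWithRetry` call.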
[jira] [Commented] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
[ https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574802#comment-15574802 ] Sean Owen commented on SPARK-17933: --- I think this is related to a lot of existing JIRAs concerning the difference between advertised host names and IPs, like https://issues.apache.org/jira/browse/SPARK-4563 > Shuffle fails when driver is on one of the same machines as executor > > > Key: SPARK-17933 > URL: https://issues.apache.org/jira/browse/SPARK-17933 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 >Reporter: Frank Rosner > Attachments: screenshot-1.png, screenshot-2.png > > > h4. Problem > When I run a job that requires some shuffle, some tasks fail because the > executor cannot fetch the shuffle blocks from another executor. > {noformat} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > 10-250-20-140:44042 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.nio.channels.UnresolvedAddressException > at sun.nio.ch.Net.checkAddress(Net.java:101) > at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622) > at > io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207) > at > io.netty.channel.DefaultCh
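A common mitigation for this family of failures, assuming the root cause is that an executor advertises a hostname such as `10-250-20-140` that peers cannot resolve (note the `UnresolvedAddressException` above), is to make Spark bind to and advertise a routable IP on every node. A sketch, with a placeholder address; check `conf/spark-env.sh.template` for the variables your Spark version supports:

```shell
# conf/spark-env.sh on each node (placeholder address)
SPARK_LOCAL_IP=10.250.20.140        # interface to bind to
SPARK_LOCAL_HOSTNAME=10.250.20.140  # name advertised to other processes
```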
[jira] [Assigned] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset
[ https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17929: Assignee: Apache Spark > Deadlock when AM restart and send RemoveExecutor on reset > - > > Key: SPARK-17929 > URL: https://issues.apache.org/jira/browse/SPARK-17929 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Weizhong >Assignee: Apache Spark >Priority: Minor > > We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala > {code} > protected def reset(): Unit = synchronized { > numPendingExecutors = 0 > executorsPendingToRemove.clear() > // Remove all the lingering executors that should be removed but not yet. > The reason might be > // because (1) disconnected event is not yet received; (2) executors die > silently. > executorDataMap.toMap.foreach { case (eid, _) => > driverEndpoint.askWithRetry[Boolean]( > RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager > re-registered."))) > } > } > {code} > but removeExecutor also needs the lock > "CoarseGrainedSchedulerBackend.this.synchronized", so this causes a deadlock: > the RPC send fails, and reset fails > {code} > private def removeExecutor(executorId: String, reason: > ExecutorLossReason): Unit = { > logDebug(s"Asked to remove executor $executorId with reason $reason") > executorDataMap.get(executorId) match { > case Some(executorInfo) => > // This must be synchronized because variables mutated > // in this block are read when requesting executors > val killed = CoarseGrainedSchedulerBackend.this.synchronized { > addressToExecutorId -= executorInfo.executorAddress > executorDataMap -= executorId > executorsPendingLossReason -= executorId > executorsPendingToRemove.remove(executorId).getOrElse(false) > } > ... 
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset
[ https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574801#comment-15574801 ] Apache Spark commented on SPARK-17929: -- User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/15481 > Deadlock when AM restart and send RemoveExecutor on reset > - > > Key: SPARK-17929 > URL: https://issues.apache.org/jira/browse/SPARK-17929 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Weizhong >Priority: Minor > > We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala > {code} > protected def reset(): Unit = synchronized { > numPendingExecutors = 0 > executorsPendingToRemove.clear() > // Remove all the lingering executors that should be removed but not yet. > The reason might be > // because (1) disconnected event is not yet received; (2) executors die > silently. > executorDataMap.toMap.foreach { case (eid, _) => > driverEndpoint.askWithRetry[Boolean]( > RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager > re-registered."))) > } > } > {code} > but removeExecutor also needs the lock > "CoarseGrainedSchedulerBackend.this.synchronized", so this causes a deadlock: > the RPC send fails, and reset fails > {code} > private def removeExecutor(executorId: String, reason: > ExecutorLossReason): Unit = { > logDebug(s"Asked to remove executor $executorId with reason $reason") > executorDataMap.get(executorId) match { > case Some(executorInfo) => > // This must be synchronized because variables mutated > // in this block are read when requesting executors > val killed = CoarseGrainedSchedulerBackend.this.synchronized { > addressToExecutorId -= executorInfo.executorAddress > executorDataMap -= executorId > executorsPendingLossReason -= executorId > executorsPendingToRemove.remove(executorId).getOrElse(false) > } > ... 
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17898. --- Resolution: Not A Problem These are Gradle questions, really. It should be perfectly possible to access your repo and build an app jar with it. > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories need a username and password for access. > I can't find a way to declare the username and password when submitting a > Spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username > and password -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17572) Write.df is failing on spark cluster
[ https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17572. --- Resolution: Not A Problem > Write.df is failing on spark cluster > > > Key: SPARK-17572 > URL: https://issues.apache.org/jira/browse/SPARK-17572 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Sankar Mittapally > > Hi, > We have a Spark cluster with four nodes; all four nodes share an NFS > partition (there is no HDFS), and we have the same uid on all servers. When > we try to write data, we get the following exceptions. I am not sure whether > it is an error, and not sure whether I will lose data in the output. > The command I am using to save the data: > {code} > saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true") > {code} > {noformat} > 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > java.io.IOException: Failed to rename > DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv; > isDirectory=false; length=436486316; replication=1; blocksize=33554432; > modification_time=147409940; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false} to > file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) > at > 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 
at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(A
[jira] [Resolved] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-650. - Resolution: Duplicate > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action
[ https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16720. --- Resolution: Not A Problem > Loading CSV file with 2k+ columns fails during attribute resolution on action > - > > Key: SPARK-16720 > URL: https://issues.apache.org/jira/browse/SPARK-16720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > Example shell for repro: > {quote} > scala> val df =spark.read.format("csv").option("header", > "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv") > df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int > ... 2125 more fields] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(Date,StringType,true), StructField(Lifetime Total > Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), > StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged > Users,IntegerType,true), StructField(Weekly Page Engaged > Users,IntegerType,true), StructField(28 Days Page Engaged > Users,IntegerType,true), StructField(Daily Like Sources - On Your > Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), > StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total > Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), > StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days > Organic Reach,IntegerType,true), StructField(Daily T... 
> scala> df.take(1) > [GIANT LIST OF COLUMNS] > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61) > at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51) > at > 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48) > at > org.apache.spark.sq
[jira] [Resolved] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon
[ https://issues.apache.org/jira/browse/SPARK-17555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17555. --- Resolution: Not A Problem > ExternalShuffleBlockResolver fails randomly with External Shuffle Service > and Dynamic Resource Allocation on Mesos running under Marathon > -- > > Key: SPARK-17555 > URL: https://issues.apache.org/jira/browse/SPARK-17555 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1, 1.6.2, 2.0.0 > Environment: Mesos using docker with external shuffle service running > on Marathon. Running code from pyspark shell in client mode. >Reporter: Brad Willard > Labels: docker, dynamic_allocation, mesos, shuffle > > External Shuffle Service throws these errors about 90% of the time. It seems > to die between stages and work inconsistently with this style of errors > about missing files. I've tested this same behavior with all the Spark > versions listed on the jira using the pre-built Hadoop 2.6 distributions from > the Apache Spark download page. I also want to mention everything works > successfully with dynamic resource allocation turned off. > I have read other related bugs and have tried some of the > workarounds/suggestions. Seems like some people have blamed the switch from > akka to netty, which got me testing this in the 1.5* range with no luck. I'm > currently running these config options (informed by reading other bugs on jira > that seemed related to my problem). These settings have helped it work > sometimes instead of never. 
> {code} > spark.shuffle.service.port7338 > spark.shuffle.io.numConnectionsPerPeer 4 > spark.shuffle.io.connectionTimeout 18000s > spark.shuffle.service.enabled true > spark.dynamicAllocation.enabled true > {code} > on the driver for pyspark submit I'm sending along this config > {code} > --conf > spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1 > \ > --conf spark.shuffle.service.enabled=true \ > --conf spark.dynamicAllocation.enabled=true \ > --conf spark.mesos.coarse=true \ > --conf spark.cores.max=100 \ > --conf spark.executor.uri=$SPARK_EXECUTOR_URI \ > --conf spark.shuffle.service.port=7338 \ > --executor-memory 15g > {code} > Under Marathon I'm pinning each external shuffle service to an agent and > starting the service like this. > {code} > $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f > $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService* > {code} > On startup it seems like all is well > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Python version 3.5.2 (default, Jul 2 2016 17:52:12) > SparkContext available as sc, HiveContext available as sqlContext. > >>> 16/09/15 11:35:53 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now > >>> TASK_RUNNING > 16/09/15 11:35:53 INFO MesosExternalShuffleClient: Successfully registered > app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service. > 16/09/15 11:35:55 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now > TASK_RUNNING > 16/09/15 11:35:55 INFO MesosExternalShuffleClient: Successfully registered > app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service. 
> 16/09/15 11:35:56 INFO CoarseMesosSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (mesos-agent002.[redacted]:61281) with ID > 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 > 16/09/15 11:35:56 INFO ExecutorAllocationManager: New executor > 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 has registered (new total is 1) > 16/09/15 11:35:56 INFO BlockManagerMasterEndpoint: Registering block manager > mesos-agent002.[redacted]:46247 with 10.6 GB RAM, > BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1, > mesos-agent002.[redacted], 46247) > 16/09/15 11:35:59 INFO CoarseMesosSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (mesos-agent004.[redacted]:42738) with ID > 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0 > 16/09/15 11:35:59 INFO ExecutorAllocationManager: New executor > 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0 has registered (new total is 2) > 16/09/15 11:35:59 INFO BlockManagerMasterEndpoint: Registering block manager > mesos-agent004.[redacted]:11262 with 10.6 GB RAM, > BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0, > mesos-agent004.[redacted], 11262) > {code} > I'm running a simple sort command on a spark data frame loaded from hdfs from > a parquet file. These are the errors. T
[jira] [Created] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
zhangxinyu created SPARK-17935: -- Summary: Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module Key: SPARK-17935 URL: https://issues.apache.org/jira/browse/SPARK-17935 Project: Spark Issue Type: Improvement Components: SQL, Streaming Affects Versions: 2.0.0 Reporter: zhangxinyu Fix For: 2.0.0 Spark already supports a Kafka input stream. It would be useful to add a `KafkaForeachWriter` to output results to Kafka in the Structured Streaming module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
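For context, such a writer would follow Structured Streaming's open/process/close sink lifecycle. The sketch below is illustrative only, not the API proposed in the PR: it uses a local `SinkWriter` trait and an in-memory `FakeProducer` as stand-ins for Spark's `ForeachWriter` and a real Kafka producer, so it runs without Spark or Kafka.

```scala
import scala.collection.mutable.ArrayBuffer

// Local stand-in for the open/process/close lifecycle of Spark's
// ForeachWriter, so the sketch runs without a Spark dependency.
trait SinkWriter[T] {
  def open(partitionId: Long, version: Long): Boolean
  def process(value: T): Unit
  def close(errorOrNull: Throwable): Unit
}

// In-memory stand-in for a Kafka producer: records (topic, payload) pairs.
class FakeProducer {
  val sent = ArrayBuffer.empty[(String, String)]
  def send(topic: String, payload: String): Unit = sent.append((topic, payload))
  def flush(): Unit = ()
}

// Sketch of the proposed writer: send each row in process(), flush on close().
class KafkaForeachWriterSketch(topic: String, producer: FakeProducer)
    extends SinkWriter[String] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: String): Unit = producer.send(topic, value)
  override def close(errorOrNull: Throwable): Unit = producer.flush()
}
```

With a real producer, open() would typically create (or look up) the producer per partition and close() would flush and release it.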
[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangxinyu updated SPARK-17935: --- External issue URL: https://github.com/apache/spark/pull/15483 > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > Fix For: 2.0.0 > > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575005#comment-15575005 ] Luis Ramos commented on SPARK-636: -- I feel like the broadcasting mechanism doesn't get me "close" enough to solve my issue (initialization of a logging system). That's partly because initialization would be deferred (meaning a loss of useful logs), and partly because a built-in mechanism could guarantee that init code executes only once, rather than each user implementing that 'guarantee' themselves, which currently invites bad practices. > Add mechanism to run system management/configuration tasks on all workers > - > > Key: SPARK-636 > URL: https://issues.apache.org/jira/browse/SPARK-636 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen > > It would be useful to have a mechanism to run a task on all workers in order > to perform system management tasks, such as purging caches or changing system > properties. This is useful for automated experiments and benchmarking; I > don't envision this being used for heavy computation. > Right now, I can mimic this with something like > {code} > sc.parallelize(0 until numMachines, numMachines).foreach { } > {code} > but this does not guarantee that every worker runs a task and requires my > user code to know the number of workers. > One sample use case is setup and teardown for benchmark tests. For example, > I might want to drop cached RDDs, purge shuffle data, and call > {{System.gc()}} between test runs. It makes sense to incorporate some of > this functionality, such as dropping cached RDDs, into Spark itself, but it > might be helpful to have a general mechanism for running ad-hoc tasks like > {{System.gc()}}.
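The "implement the guarantee yourself" pattern the comment alludes to usually looks like an idempotent per-JVM initializer that every task calls. A minimal sketch (illustrative, not a Spark API):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// DIY "run once per worker JVM" initializer: every task calls
// ensureInitialized(), but only the first call actually runs the body.
object WorkerInit {
  private val done = new AtomicBoolean(false)

  def ensureInitialized(init: () => Unit): Unit =
    if (done.compareAndSet(false, true)) init()
}
```

A built-in mechanism would subsume this boilerplate and could also cover executors that join after the job starts, which this object cannot.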
[jira] [Assigned] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17935: Assignee: (was: Apache Spark) > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > Fix For: 2.0.0 > > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangxinyu updated SPARK-17935: --- External issue URL: (was: https://github.com/apache/spark/pull/15483) > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > Fix For: 2.0.0 > > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575007#comment-15575007 ] Apache Spark commented on SPARK-17935: -- User 'zhangxinyu1' has created a pull request for this issue: https://github.com/apache/spark/pull/15483 > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > Fix For: 2.0.0 > > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17935: Assignee: Apache Spark > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu >Assignee: Apache Spark > Fix For: 2.0.0 > > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangxinyu updated SPARK-17935: --- Description: Now spark already supports kafkaInputStream. It would be useful that we add `KafkaForeachWriter` to output results to kafka in structured streaming module. `KafkaForeachWriter.scala` is put in external kafka-0.8.0. was:Now spark already supports kafkaInputStream. It would be useful that we add `KafkaForeachWriter` to output results to kafka in structured streaming module. > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > Fix For: 2.0.0 > > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575005#comment-15575005 ] Luis Ramos edited comment on SPARK-636 at 10/14/16 11:07 AM: - I feel like the broadcasting mechanism doesn't get me "close" enough to solve my issue (initialization of a logging system). That's partly because my initialization would be deferred (meaning a loss of useful logs), and also it could enable us to have some 'init' code that is guaranteed to only be evaluated once as opposed to implementing that 'guarantee' yourself, which can currently lead to bad practices. was (Author: luisramos): I feel like the broadcasting mechanism doesn't get me "close" enough to solve my issue (initialization of a logging system). That's partly because initialization would be deferred (meaning a loss of useful logs), and also it could enable us to have init code that is 'guaranteed' to only be executed once as opposed to implement that 'guarantee' yourself, which currently can lead to bad practices. > Add mechanism to run system management/configuration tasks on all workers > - > > Key: SPARK-636 > URL: https://issues.apache.org/jira/browse/SPARK-636 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen > > It would be useful to have a mechanism to run a task on all workers in order > to perform system management tasks, such as purging caches or changing system > properties. This is useful for automated experiments and benchmarking; I > don't envision this being used for heavy computation. > Right now, I can mimic this with something like > {code} > sc.parallelize(0 until numMachines, numMachines).foreach { } > {code} > but this does not guarantee that every worker runs a task and requires my > user code to know the number of workers. > One sample use case is setup and teardown for benchmark tests. 
For example, > I might want to drop cached RDDs, purge shuffle data, and call > {{System.gc()}} between test runs. It makes sense to incorporate some of > this functionality, such as dropping cached RDDs, into Spark itself, but it > might be helpful to have a general mechanism for running ad-hoc tasks like > {{System.gc()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers
[ https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575005#comment-15575005 ] Luis Ramos edited comment on SPARK-636 at 10/14/16 11:11 AM: - I feel like the broadcasting mechanism doesn't get me "close" enough to solve my issue (initialization of a logging system). That's partly because my initialization would be deferred (meaning a loss of useful logs), and also it could enable us to have some 'init' code that is guaranteed to only be evaluated once as opposed to implementing that 'guarantee' yourself, which can currently lead to bad practices. Edit: For some context, I'm approaching this issue from SPARK-650 was (Author: luisramos): I feel like the broadcasting mechanism doesn't get me "close" enough to solve my issue (initialization of a logging system). That's partly because my initialization would be deferred (meaning a loss of useful logs), and also it could enable us to have some 'init' code that is guaranteed to only be evaluated once as opposed to implementing that 'guarantee' yourself, which can currently lead to bad practices. > Add mechanism to run system management/configuration tasks on all workers > - > > Key: SPARK-636 > URL: https://issues.apache.org/jira/browse/SPARK-636 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Josh Rosen > > It would be useful to have a mechanism to run a task on all workers in order > to perform system management tasks, such as purging caches or changing system > properties. This is useful for automated experiments and benchmarking; I > don't envision this being used for heavy computation. > Right now, I can mimic this with something like > {code} > sc.parallelize(0 until numMachines, numMachines).foreach { } > {code} > but this does not guarantee that every worker runs a task and requires my > user code to know the number of workers. > One sample use case is setup and teardown for benchmark tests. 
For example, > I might want to drop cached RDDs, purge shuffle data, and call > {{System.gc()}} between test runs. It makes sense to incorporate some of > this functionality, such as dropping cached RDDs, into Spark itself, but it > might be helpful to have a general mechanism for running ad-hoc tasks like > {{System.gc()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15402) PySpark ml.evaluation should support save/load
[ https://issues.apache.org/jira/browse/SPARK-15402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15402. - Resolution: Fixed Assignee: Yanbo Liang Fix Version/s: 2.1.0 > PySpark ml.evaluation should support save/load > -- > > Key: SPARK-15402 > URL: https://issues.apache.org/jira/browse/SPARK-15402 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > Since ml.evaluation has supported save/load at Scala side, supporting it at > Python side is very straightforward and easy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14634) Add BisectingKMeansSummary
[ https://issues.apache.org/jira/browse/SPARK-14634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-14634. - Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 2.1.0 > Add BisectingKMeansSummary > -- > > Key: SPARK-14634 > URL: https://issues.apache.org/jira/browse/SPARK-14634 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.1.0 > > > Add BisectingKMeansSummary -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17870. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15444 [https://github.com/apache/spark/pull/15444] > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > Fix For: 2.1.0 > > > The method that computes the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > The feature selection method ChiSquareSelector ranks features by > ChiSquareTestResult.statistic (the chi-square value) and selects the features with the > largest chi-square values. But the degrees of freedom (df) of the chi-square values > differ in Statistics.chiSqTest(RDD), and statistics with different df cannot be > compared, so you cannot select features on the chi-square value alone. > Because of this wrong way of using the chi-square value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > With selectKBest, feature 3 is selected. > With selectFpr, features 1 and 2 are selected. > This is inconsistent. > I used scikit-learn on the same data with the same parameters: > selectKBest selects feature 1, and selectFpr selects features 1 and 2. > That result makes sense, because in scikit-learn the df of each feature is > the same. > I plan to submit a PR for this problem.
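The degrees-of-freedom point can be made concrete: for a feature-vs-label contingency table, Pearson's statistic has df = (rows-1)*(cols-1), so a feature with more distinct values follows a different null distribution and its raw statistic cannot be ranked against the others. A small illustrative sketch (not Spark's implementation):

```scala
// Pearson chi-square statistic and degrees of freedom for a contingency
// table. Two features with different df yield statistics that are not
// directly comparable, which is the bug described above.
def chiSq(table: Array[Array[Double]]): (Double, Int) = {
  val rowSums = table.map(_.sum)
  val colSums = table.transpose.map(_.sum)
  val total   = rowSums.sum
  var stat = 0.0
  for (i <- table.indices; j <- table(i).indices) {
    val expected = rowSums(i) * colSums(j) / total
    val d = table(i)(j) - expected
    stat += d * d / expected
  }
  val df = (table.length - 1) * (table(0).length - 1)
  (stat, df)
}
```

Comparing p-values (which account for df) instead of raw statistics is what makes the scikit-learn results consistent.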
[jira] [Updated] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17870: -- Assignee: Peng Meng > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Assignee: Peng Meng >Priority: Critical > Fix For: 2.1.0 > > > The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > For feature selection method ChiSquareSelector, it is based on the > ChiSquareTestResult.statistic (ChiSqure value) to select the features. It > select the features with the largest ChiSqure value. But the Degree of > Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and > for different df, you cannot base on ChiSqure value to select features. > Because of the wrong method to count ChiSquare value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If use selectKBest to select: the feature 3 will be selected. > If use selectFpr to select: feature 1 and 2 will be selected. > This is strange. > I use scikit learn to test the same data with the same parameters. > When use selectKBest to select: feature 1 will be selected. > When use selectFpr to select: feature 1 and 2 will be selected. > This result is make sense. because the df of each feature in scikit learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17855) Spark worker throw Exception when uber jar's http url contains query string
[ https://issues.apache.org/jira/browse/SPARK-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17855. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15420 [https://github.com/apache/spark/pull/15420] > Spark worker throw Exception when uber jar's http url contains query string > --- > > Key: SPARK-17855 > URL: https://issues.apache.org/jira/browse/SPARK-17855 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Hao Ren > Fix For: 2.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > spark-submit supports jar URLs with the http protocol. > If the URL contains a query string, the *worker.DriverRunner.downloadUserJar* > method throws a "Did not see expected jar" exception, because it checks for the > existence of a downloaded jar whose name contains the query string. > This is a problem when your jar is located on a web service that requires > additional information to retrieve the file. For example, to download a > jar from an s3 bucket via http, the URL carries a signature, datetime, etc. as > the query string. > {code} > https://s3.amazonaws.com/deploy/spark-job.jar > ?X-Amz-Algorithm=AWS4-HMAC-SHA256 > &X-Amz-Credential=/20130721/us-east-1/s3/aws4_request > &X-Amz-Date=20130721T201207Z > &X-Amz-Expires=86400 > &X-Amz-SignedHeaders=host > &X-Amz-Signature= > {code} > The worker will look for a jar named > "spark-job.jar?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=/20130721/us-east-1/s3/aws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=" > instead of > "spark-job.jar". > Hence, the query string should be removed before checking for the jar's existence. > I created a PR to fix this; reviews are welcome. 
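The fix amounts to deriving the expected local file name from the URL path alone, ignoring any query string. A minimal sketch of that derivation (hedged: not the exact patch):

```scala
import java.net.URI

// Derive the local jar file name from a download URL. URI.getPath excludes
// the query string and fragment, so S3 pre-signed parameters are dropped.
def jarFileName(url: String): String = {
  val path = new URI(url).getPath
  path.substring(path.lastIndexOf('/') + 1)
}
```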
[jira] [Updated] (SPARK-17855) Spark worker throw Exception when uber jar's http url contains query string
[ https://issues.apache.org/jira/browse/SPARK-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17855: -- Assignee: Hao Ren > Spark worker throw Exception when uber jar's http url contains query string > --- > > Key: SPARK-17855 > URL: https://issues.apache.org/jira/browse/SPARK-17855 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Hao Ren >Assignee: Hao Ren > Fix For: 2.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > spark-submit support jar url with http protocol > If the url contains any query strings, *worker.DriverRunner.downloadUserJar * > method will throw "Did not see expected jar" exception. This is because this > method checks the existance of a downloaded jar whose name contains query > strings. > This is a problem when your jar is located on some web service which requires > some additional information to retrieve the file. For example, to download a > jar from s3 bucket via http, the url contains signature, datetime, etc as > query string. > {code} > https://s3.amazonaws.com/deploy/spark-job.jar > ?X-Amz-Algorithm=AWS4-HMAC-SHA256 > &X-Amz-Credential=/20130721/us-east-1/s3/aws4_request > &X-Amz-Date=20130721T201207Z > &X-Amz-Expires=86400 > &X-Amz-SignedHeaders=host > &X-Amz-Signature= > {code} > Worker will look for a jar named > "spark-job.jar?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=/20130721/us-east-1/s3/aws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=" > instead of > "spark-job.jar" > Hence, all the query string should be removed before checking jar existance. > I created a pr to fix this, if anyone can review it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused
[ https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575123#comment-15575123 ] Sean Owen commented on SPARK-17930: --- But this code is only ever called once per object. Reuse doesn't help here, and I don't think you can share the serializer across instances. > The SerializerInstance instance used when deserializing a TaskResult is not > reused > --- > > Key: SPARK-17930 > URL: https://issues.apache.org/jira/browse/SPARK-17930 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.1 >Reporter: Guoqiang Li > > The following code is called when the DirectTaskResult instance is > deserialized > {noformat} > def value(): T = { > if (valueObjectDeserialized) { > valueObject > } else { > // Each deserialization creates a new instance of SerializerInstance, > which is very time-consuming > val resultSer = SparkEnv.get.serializer.newInstance() > valueObject = resultSer.deserialize(valueBytes) > valueObjectDeserialized = true > valueObject > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
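Whether or not reuse pays off in this particular call path, the usual pattern for an expensive, non-thread-safe instance such as a serializer is a per-thread cache. A generic sketch of that shape (not the code Spark adopted for this issue):

```scala
// Generic per-thread cache for an expensive, non-thread-safe object,
// e.g. a SerializerInstance; the factory runs at most once per thread.
class PerThreadCache[T](factory: () => T) {
  @volatile var created = 0 // only to make the caching observable

  private val cache = new ThreadLocal[T] {
    override def initialValue(): T = { created += 1; factory() }
  }

  def get(): T = cache.get()
}
```

Repeated get() calls on the same thread return the same instance, so construction cost is paid once per thread rather than once per deserialized result.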
[jira] [Resolved] (SPARK-10681) DateTimeUtils needs a method to parse string to SQL's timestamp value
[ https://issues.apache.org/jira/browse/SPARK-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10681. --- Resolution: Not A Problem > DateTimeUtils needs a method to parse string to SQL's timestamp value > - > > Key: SPARK-10681 > URL: https://issues.apache.org/jira/browse/SPARK-10681 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now, {{DateTimeUtils.stringToTime}} returns a java.util.Date whose > getTime returns milliseconds. It will be great if we have a method to parse > string directly to SQL's timestamp value (microseconds). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
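One possible shape for such a method, assuming the "yyyy-MM-dd HH:mm:ss.ffffff" form that java.sql.Timestamp parses (a sketch, not the API that was eventually added): getTime gives milliseconds, and the sub-millisecond digits have to be recovered from getNanos.

```scala
import java.sql.Timestamp

// Parse a timestamp string to SQL-style epoch microseconds, keeping the
// sub-millisecond digits that getTime alone would drop.
def stringToMicros(s: String): Long = {
  val ts = Timestamp.valueOf(s) // millisecond epoch plus nanos-of-second
  ts.getTime * 1000L + (ts.getNanos / 1000) % 1000
}
```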
[jira] [Resolved] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10954. --- Resolution: Not A Problem > Parquet version in the "created_by" metadata field of Parquet files written > by Spark 1.5 and 1.6 is wrong > - > > Key: SPARK-10954 > URL: https://issues.apache.org/jira/browse/SPARK-10954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.6.0 >Reporter: Cheng Lian >Assignee: Gayathri Murali >Priority: Minor > > We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field > still says 1.6.0. This issue can be reproduced by generating any Parquet file > with Spark 1.5, and then check the metadata with {{parquet-meta}} CLI tool: > {noformat} > $ parquet-meta /tmp/parquet/dec > file: > file:/tmp/parquet/dec/part-r-0-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet > creator: parquet-mr version 1.6.0 > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]} > file schema: spark_schema > - > dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:10 TS:140 OFFSET:4 > - > dec: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 > ENC:PLAIN,BIT_PACKED,RLE > {noformat} > Note that this field is written by parquet-mr rather than Spark. However, > writing Parquet files using parquet-mr 1.7.0 directly without Spark 1.5 only > shows {{parquet-mr}} without any version number. Files written by parquet-mr > 1.8.1 without Spark look fine though. > Currently this isn't a big issue. But parquet-mr 1.8 checks for this field to > workaround PARQUET-251. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17777) Spark Scheduler Hangs Indefinitely
[ https://issues.apache.org/jira/browse/SPARK-17777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17777. --- Resolution: Not A Problem > Spark Scheduler Hangs Indefinitely > -- > > Key: SPARK-17777 > URL: https://issues.apache.org/jira/browse/SPARK-17777 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: AWS EMR 4.3, can also be reproduced locally >Reporter: Ameen Tayyebi > Attachments: jstack-dump.txt, repro.scala > > > We've identified a problem with Spark scheduling. The issue manifests itself > when an RDD calls SparkContext.parallelize within its getPartitions method. > This seemingly "recursive" call causes the problem. We have a repro case that > can easily be run. > Please advise on what the issue might be and how we can work around it in the > meantime. > I've attached repro.scala, which can simply be pasted into spark-shell to > reproduce the problem. > Why are we calling sc.parallelize in production within getPartitions? Well, > we have an RDD that is composed of several thousand Parquet files. To > compute the partitioning strategy for this RDD, we create an RDD to read all > file sizes from S3 in parallel, so that we can quickly determine the proper > partitions. We do this to avoid executing this serially from the master node, > which can result in significant slowness in the execution. Pseudo-code: > val splitInfo = sc.parallelize(filePaths).map(f => (f, > s3.getObjectSummary)).collect() > A similar logic is used in DataFrame by Spark itself: > https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902 > > Thanks, > -Ameen
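One way to keep the size lookup parallel without launching a Spark job from inside getPartitions is a plain driver-side thread pool. A sketch with a placeholder sizeOf (the real lookup would call S3, e.g. the getObjectSummary from the pseudo-code above):

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.JavaConverters._

// Hypothetical stand-in for the per-file S3 size lookup in the report.
def sizeOf(path: String): Long = path.length.toLong

// Fetch split info concurrently on the driver with an ExecutorService,
// so no nested Spark job runs inside getPartitions.
def splitInfo(paths: Seq[String], threads: Int = 8): Seq[(String, Long)] = {
  val pool = Executors.newFixedThreadPool(threads)
  try {
    val tasks = paths.map { p =>
      new Callable[(String, Long)] { def call(): (String, Long) = (p, sizeOf(p)) }
    }
    // invokeAll blocks until all tasks finish and preserves input order.
    pool.invokeAll(tasks.asJava).asScala.map(_.get()).toList
  } finally pool.shutdown()
}
```

Computing this before constructing the RDD (and passing the result into it) avoids the suspended-scheduler situation entirely.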
[jira] [Resolved] (SPARK-17787) spark submit throws error while using kafka Appender log4j:ERROR Could not instantiate class [kafka.producer.KafkaLog4jAppender]. java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender
[ https://issues.apache.org/jira/browse/SPARK-17787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17787. --- Resolution: Not A Problem > spark submit throws error while using kafka Appender log4j:ERROR Could not > instantiate class [kafka.producer.KafkaLog4jAppender]. > java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender > - > > Key: SPARK-17787 > URL: https://issues.apache.org/jira/browse/SPARK-17787 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2 >Reporter: Taukir > > While trying to push Spark app logs to Kafka, I am trying to use the > KafkaAppender. Please find the command below. It throws the following error: > spark-submit --class com.SampleApp --conf spark.ui.port=8081 --master > yarn-cluster > --files > files/home/hos/KafkaApp/log4j.properties#log4j.properties,/home/hos/KafkaApp/kafka_2.10-0.8.0.jar > --conf > spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/home/hos/KafkaApp/log4j.properties' > --conf "spark.driver.extraClassPath=/home/hos/KafkaApp/kafka_2.10-0.8.0.jar" > --conf > spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/home/hos/KafkaApp/log4j.properties' > --conf > "spark.executor.extraClassPath=/home/hos/KafkaApp/kafka_2.10-0.8.0.jar" > Kafka-App-assembly-1.0.jar > It throws a java.lang.ClassNotFoundException: > kafka.producer.KafkaLog4jAppender exception
[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
[ https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Rosner updated SPARK-17933:
---------------------------------
    Attachment: screenshot-1.png

> Shuffle fails when driver is on one of the same machines as executor
> --------------------------------------------------------------------
>
>                 Key: SPARK-17933
>                 URL: https://issues.apache.org/jira/browse/SPARK-17933
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.2
>            Reporter: Frank Rosner
>         Attachments: screenshot-1.png, screenshot-2.png
>
> h4. Problem
> When I run a job that requires some shuffle, some tasks fail because the
> executor cannot fetch the shuffle blocks from another executor.
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 10-250-20-140:44042
>     at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>     at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
>     at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>     at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     ... 3 more
> Caused by: java.nio.channels.UnresolvedAddressException
>     at sun.nio.ch.Net.checkAddress(Net.java:101)
>     at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
>     at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
> {noformat}
[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
[ https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Rosner updated SPARK-17933:
---------------------------------
    Attachment: (was: screenshot-1.png)
[jira] [Commented] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor
[ https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575232#comment-15575232 ]

Frank Rosner commented on SPARK-17933:
--------------------------------------
Thanks [~srowen]. I know of many discussions about the driver's advertised address, but I am not sure whether this is a configuration problem or a bug. The ticket you reference talks about the driver IP, whereas I am having trouble with the executor (I highlighted the line in the screenshot). Any ideas on how to address this problem?
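The trace bottoms out in `java.nio.channels.UnresolvedAddressException`: Netty is handed `10-250-20-140` as a host name and the OS cannot resolve it. That string looks like a hostname derived from a private IP by replacing dots with dashes, which is common on EC2-style setups. Below is a minimal sketch of that normalization, assuming the dashed form really is an encoded IPv4 address; the helper name is mine, not Spark's, and the usual Spark-side fix is instead to make each node advertise a resolvable address (e.g. via `SPARK_LOCAL_IP` or `spark.driver.host`).

```python
import re

def normalize_ec2_style_host(host: str) -> str:
    """Turn an EC2-style dashed-IP hostname (e.g. '10-250-20-140') into the
    dotted IPv4 address it encodes; leave anything else untouched.
    Illustrative workaround sketch only -- not Spark code."""
    m = re.fullmatch(r"(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})", host)
    if m:
        return ".".join(m.groups())  # '10-250-20-140' -> '10.250.20.140'
    return host

print(normalize_ec2_style_host("10-250-20-140"))  # -> 10.250.20.140
print(normalize_ec2_style_host("example.com"))    # -> example.com
```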
[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17935:
------------------------------
    Target Version/s:   (was: 2.0.0)
       Fix Version/s:   (was: 2.0.0)

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17935
>                 URL: https://issues.apache.org/jira/browse/SPARK-17935
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Streaming
>    Affects Versions: 2.0.0
>            Reporter: zhangxinyu
>
> Spark already supports a Kafka input stream. It would be useful to add a
> `KafkaForeachWriter` that outputs results to Kafka in the structured
> streaming module. `KafkaForeachWriter.scala` would be placed in external
> kafka-0.8.0.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
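Spark's structured streaming sink API exposes a `ForeachWriter` contract with `open(partitionId, version)`, `process(value)`, and `close(error)` hooks; the proposed `KafkaForeachWriter` would essentially be a Kafka-backed implementation of that contract. A rough Python sketch of the shape such a writer could take follows -- the class name and the stand-in producer are hypothetical, and a real implementation would be Scala wrapping a Kafka producer.

```python
class KafkaForeachWriterSketch:
    """Illustrative sketch of a Kafka-backed writer, modeled on Spark's
    ForeachWriter contract (open/process/close). The `sent` list stands in
    for a real Kafka producer."""

    def __init__(self, topic):
        self.topic = topic
        self.sent = []  # stand-in for a Kafka producer

    def open(self, partition_id, version):
        # Return True to accept this partition/epoch; a real writer might
        # skip already-committed (partition_id, version) pairs here.
        return True

    def process(self, value):
        # A real writer would call producer.send(topic, value) here.
        self.sent.append((self.topic, value))

    def close(self, error):
        # Flush/close the producer; re-raise if the epoch failed.
        if error is not None:
            raise error

writer = KafkaForeachWriterSketch("results")
if writer.open(partition_id=0, version=1):
    for row in ["a", "b"]:
        writer.process(row)
    writer.close(None)
print(writer.sent)  # [('results', 'a'), ('results', 'b')]
```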
[jira] [Created] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
Justin Miller created SPARK-17936:
-------------------------------------

             Summary: "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
                 Key: SPARK-17936
                 URL: https://issues.apache.org/jira/browse/SPARK-17936
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.1
            Reporter: Justin Miller

Greetings. I'm currently migrating a project from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift structs coming from Kafka into Parquet files stored in S3. This conversion works fine in 1.6.2, but I think there may be a bug in 2.0.1. The stack trace:

{noformat}
org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
    at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
    at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
    at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
    at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
{noformat}

Also, later on:

{noformat}
07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker-6,5,run-main-group-0]
java.lang.OutOfMemoryError: Java heap space
{noformat}

I've seen similar issues posted, but those were always on the query side. I have a hunch that this is happening at write time, as the error occurs after batchDuration. Here's the write snippet:

{code}
stream.
  flatMap {
    case Success(row) =>
      thriftParseSuccess += 1
      Some(row)
    case Failure(ex) =>
      thriftParseErrors += 1
      logger.error("Error during deserialization: ", ex)
      None
  }.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.context)
    transformer(sqlContext.createDataFrame(rdd, converter.schema))
      .coalesce(coalesceSize)
      .write
      .mode(Append)
      .partitionBy(partitioning: _*)
      .parquet(parquetPath)
  }
{code}

Please let me know if you can be of assistance and if there's anything I can do to help.

Best,
Justin
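For context on the error itself: the JVM caps a single method's bytecode at 64 KB, and Spark's generated `SpecificUnsafeProjection` for very wide rows (hundreds of columns) can exceed that cap. A mitigation sometimes suggested while such codegen bugs stand is to process column subsets in batches and combine the partial results. Below is a plain-Python sketch of just the batching step; the helper is hypothetical, and in Spark you would select each batch of columns and join or union the outputs.

```python
def chunk_columns(columns, size):
    """Split a wide column list into fixed-size batches so each generated
    projection/aggregation stays small. Hypothetical helper, not Spark API."""
    return [columns[i:i + size] for i in range(0, len(columns), size)]

cols = [f"c{i}" for i in range(400)]  # a 400-column table, as in SPARK-16845
batches = chunk_columns(cols, 100)
print(len(batches), batches[0][:3])  # 4 ['c0', 'c1', 'c2']
```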
[jira] [Resolved] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-17936.
-------------------------------
    Resolution: Duplicate

Duplicate of several JIRAs -- have a look through first.
[jira] [Commented] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575400#comment-15575400 ]

Apache Spark commented on SPARK-17868:
--------------------------------------
User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15484

> Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-17868
>                 URL: https://issues.apache.org/jira/browse/SPARK-17868
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Herman van Hovell
>
> We generate bitmasks for grouping sets during the parsing process and use
> these during analysis. These bitmasks are difficult to work with in practice
> and have led to numerous bugs. I suggest that we remove them and use actual
> sets instead; however, we would still need to generate these offsets for the
> grouping_id.
[jira] [Assigned] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17868:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17868:
------------------------------------
    Assignee: Apache Spark
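For context on what SPARK-17868 wants to remove: each grouping set is encoded as an integer bitmask over the GROUP BY expressions, and that mask is also what surfaces as the grouping_id. The sketch below shows the encoding for CUBE under one illustrative bit-order convention (bit i set means expression i is dropped from that grouping set); Spark's actual bit-order and offset details may differ.

```python
def cube_bitmasks(exprs):
    """Enumerate the grouping sets of CUBE(exprs) as (bitmask, kept-exprs)
    pairs. Bit i set means exprs[i] is *not* grouped on in that set --
    the kind of encoding grouping_id exposes. Illustrative only."""
    n = len(exprs)
    masks = []
    for mask in range(2 ** n):
        kept = [e for i, e in enumerate(exprs) if not (mask >> i) & 1]
        masks.append((mask, kept))
    return masks

for mask, kept in cube_bitmasks(["a", "b"]):
    print(mask, kept)
# 0 ['a', 'b']   -- GROUP BY a, b
# 1 ['b']        -- a aggregated away
# 2 ['a']        -- b aggregated away
# 3 []           -- grand total
```

The ticket's complaint is visible even here: the meaning of each set lives in bit arithmetic rather than in an explicit set of expressions, which is easy to get wrong once offsets shift.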
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575421#comment-15575421 ]

Harish commented on SPARK-16845:
--------------------------------
I have posted a scenario on Stack Overflow: http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe ... Let me know if you need any help on this. If this is already taken care of, you can ignore my comment.
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575452#comment-15575452 ]

Justin Miller commented on SPARK-17936:
---------------------------------------
I did look through them and I don't think they're related. Note that the error is different, and this occurs when writing data, not when reading large amounts of data.
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575471#comment-15575471 ]

Justin Miller commented on SPARK-17936:
---------------------------------------
I'd also note this wasn't an issue in Spark 1.6.2. The process would run fine for hours and never crashed with this error.
[jira] [Issue Comment Deleted] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Miller updated SPARK-17936:
----------------------------------
    Comment: was deleted

(was: I did look through them and I don't think they're related. Note that the error is different and this is trying to write data not read large amounts of data.)
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575476#comment-15575476 ] Sean Owen commented on SPARK-17936: --- I doubt it is a different cause given it is the same type of error in the same path. That is, do you expect the resolution differs? It can be reopened if so but otherwise I think this is more likely to fragment discussion about one problem. > "CodeGenerator - failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of" method Error > - > > Key: SPARK-17936 > URL: https://issues.apache.org/jira/browse/SPARK-17936 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Justin Miller > > Greetings. I'm currently in the process of migrating a project I'm working on > from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift > structs coming from Kafka into Parquet files stored in S3. This conversion > process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll > paste the stack trace below. > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) > at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242) > at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058) > Also, later on: > 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception > in thread Thread[Executor task launch worker-6,5,run-main-group-0] > java.lang.OutOfMemoryError: Java heap space > I've seen similar issues posted, but those were always on the query side. 
> I have a hunch that this is happening at write time as the error occurs after batchDuration. Here's the write snippet.
> stream.
>   flatMap {
>     case Success(row) =>
>       thriftParseSuccess += 1
>       Some(row)
>     case Failure(ex) =>
>       thriftParseErrors += 1
>       logger.error("Error during deserialization: ", ex)
>       None
>   }.foreachRDD { rdd =>
>     val sqlContext = SQLContext.getOrCreate(rdd.context)
>     transformer(sqlContext.createDataFrame(rdd, converter.schema))
>       .coalesce(coalesceSize)
>       .write
>       .mode(Append)
>       .partitionBy(partitioning: _*)
>       .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can do to help.
> Best,
> Justin
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575494#comment-15575494 ] Justin Miller commented on SPARK-17936: --- It's just strange that the issue seems to affect previous versions (if they're the same issue) but didn't impact me when I was using 1.6.2 and the 0.8 Kafka consumer. Is it possible that Scala 2.10 vs Scala 2.11 makes a difference? There are a lot of variables at play, unfortunately.
[jira] [Reopened] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justin Miller reopened SPARK-17936: --- I don't believe this is a duplicate. This occurs while trying to write to parquet (with very little data) and happens almost immediately after batchDuration.
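A workaround often suggested for "grows beyond 64 KB" codegen failures is to disable whole-stage code generation so Spark falls back to interpreted evaluation. This is a hedged sketch, not a confirmed fix for this ticket: the `spark.sql.codegen.wholeStage` flag exists in Spark 2.0.x, but whether it avoids this particular SpecificUnsafeProjection failure is an assumption.

```
# spark-defaults.conf (sketch; assumes the workaround applies to this failure)
spark.sql.codegen.wholeStage    false
```

The same flag can also be set per session with `spark.conf.set("spark.sql.codegen.wholeStage", "false")` before the write is triggered.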
[jira] [Created] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
Cody Koeninger created SPARK-17937: -- Summary: Clarify Kafka offset semantics for Structured Streaming Key: SPARK-17937 URL: https://issues.apache.org/jira/browse/SPARK-17937 Project: Spark Issue Type: Sub-task Reporter: Cody Koeninger
[jira] [Created] (SPARK-17938) Backpressure rate not adjusting
Samy Dindane created SPARK-17938: Summary: Backpressure rate not adjusting Key: SPARK-17938 URL: https://issues.apache.org/jira/browse/SPARK-17938 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 2.0.1, 2.0.0 Reporter: Samy Dindane spark-streaming and spark-streaming-kafka-0-10 are both at version 2.0.1. Same behavior with 2.0.0, though. spark.streaming.kafka.consumer.poll.ms is set to 3 spark.streaming.kafka.maxRatePerPartition is set to 10 spark.streaming.backpressure.enabled is set to true `batchDuration` of the streaming context is set to 1 second. I consume a Kafka topic using KafkaUtils.createDirectStream(). My system can handle 100k-record batches, but it takes more than 1 second to process them all. I'd thus expect the backpressure to reduce the number of records fetched in the next batch to keep the processing delay below 1 second. But this does not happen: the backpressure rate stays the same, stuck at `100.0`, no matter how the other variables change (processing time, error, etc.). Here's a log showing how all these variables change while the chosen rate stays the same: https://gist.github.com/Dinduks/d9fa67fc8a036d3cad8e859c508acdba (I would have attached a file, but I don't see how). Is this the expected behavior and I am missing something, or is this a bug? I'll gladly help by providing more information or writing code if necessary. Thank you.
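For reference, the setup described above corresponds to entries like the following (a sketch assembled from the report; only the two rate-related settings discussed are shown). Note that with backpressure enabled, `spark.streaming.kafka.maxRatePerPartition` acts as an upper bound on whatever rate the backpressure estimator chooses.

```
# spark-defaults.conf (sketch of the configuration described above)
spark.streaming.backpressure.enabled        true
spark.streaming.kafka.maxRatePerPartition   10
```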
[jira] [Created] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
Aleksander Eskilson created SPARK-17939: --- Summary: Spark-SQL Nullability: Optimizations vs. Enforcement Clarification Key: SPARK-17939 URL: https://issues.apache.org/jira/browse/SPARK-17939 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Aleksander Eskilson Priority: Critical The notion of Nullability of StructFields in DataFrames and Datasets creates some confusion. As has been pointed out previously [1], Nullability is a hint to the Catalyst optimizer, and is not meant to be a type-level enforcement. Allowing null fields can also help the reader successfully parse certain types of more loosely-typed data, like JSON and CSV, where null values are common, rather than just failing. There's already been some movement to clarify the meaning of Nullable in the API, but also some requests for a (perhaps completely separate) type-level implementation of Nullable that can act as an enforcement contract. This bug is logged here to discuss and clarify this issue.
[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksander Eskilson updated SPARK-17939: Description: The notion of Nullability of StructFields in DataFrames and Datasets creates some confusion. As has been pointed out previously [1], Nullability is a hint to the Catalyst optimizer, and is not meant to be a type-level enforcement. Allowing null fields can also help the reader successfully parse certain types of more loosely-typed data, like JSON and CSV, where null values are common, rather than just failing. There's already been some movement to clarify the meaning of Nullable in the API, but also some requests for a (perhaps completely separate) type-level implementation of Nullable that can act as an enforcement contract. This bug is logged here to discuss and clarify this issue. [1] - [https://issues.apache.org/jira/browse/SPARK-11319][https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535] [2] - https://github.com/apache/spark/pull/11785 was: The notion of Nullability of StructFields in DataFrames and Datasets creates some confusion. As has been pointed out previously [1], Nullability is a hint to the Catalyst optimizer, and is not meant to be a type-level enforcement. Allowing null fields can also help the reader successfully parse certain types of more loosely-typed data, like JSON and CSV, where null values are common, rather than just failing. There's already been some movement to clarify the meaning of Nullable in the API, but also some requests for a (perhaps completely separate) type-level implementation of Nullable that can act as an enforcement contract. This bug is logged here to discuss and clarify this issue.
[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.
[ https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575553#comment-15575553 ] Yanbo Liang commented on SPARK-17904: - [~felixcheung] I think the proposal I made in this JIRA is different from what you mentioned; they are two different scenarios. R users may call install.packages across the session, rather than installing all necessary libraries before they start the session. From the discussion, I found it's not easy to support this feature. As [~shivaram] suggested, we can try to add required packages when we start the session. This satisfies some users' requirements, but not all. All in all, I appreciate all your comments. Thanks.
> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside a function passed into {{spark.lapply}} or {{dapply}}, they must install the required R packages on each executor in advance.
> To install dependent R packages on each executor and check that the installation succeeded, we can run code similar to the following:
> (Note: The code is just an example, not the prototype of this proposal. The detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
>   function(x) install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party library, since SparkR is an interactive analytics tool and users may call lots of libraries during an analytics session. In native R, users can run {{install.packages()}} and {{library()}} across the interactive session. Should we provide one API to wrap the work mentioned above, so that users can install dependent R packages on each executor easily?
> I propose the following API: {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a local/HDFS path, so SparkR can install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to install from local directories.
> Since SparkR has its own library directories in which to install packages on each executor, I think it will not pollute the native R environment. I'd like to know whether this makes sense; feel free to correct me if there is a misunderstanding.
[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksander Eskilson updated SPARK-17939: Description: The notion of Nullability of StructFields in DataFrames and Datasets creates some confusion. As has been pointed out previously [1], Nullability is a hint to the Catalyst optimizer, and is not meant to be a type-level enforcement. Allowing null fields can also help the reader successfully parse certain types of more loosely-typed data, like JSON and CSV, where null values are common, rather than just failing. There's already been some movement to clarify the meaning of Nullable in the API, but also some requests for a (perhaps completely separate) type-level implementation of Nullable that can act as an enforcement contract. This bug is logged here to discuss and clarify this issue. [1] - [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535] [2] - https://github.com/apache/spark/pull/11785 was: The notion of Nullability of StructFields in DataFrames and Datasets creates some confusion. As has been pointed out previously [1], Nullability is a hint to the Catalyst optimizer, and is not meant to be a type-level enforcement. Allowing null fields can also help the reader successfully parse certain types of more loosely-typed data, like JSON and CSV, where null values are common, rather than just failing. There's already been some movement to clarify the meaning of Nullable in the API, but also some requests for a (perhaps completely separate) type-level implementation of Nullable that can act as an enforcement contract. This bug is logged here to discuss and clarify this issue. [1] - [https://issues.apache.org/jira/browse/SPARK-11319][https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535] [2] - https://github.com/apache/spark/pull/11785
[jira] [Comment Edited] (SPARK-17904) Add a wrapper function to install R packages on each executors.
[ https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575553#comment-15575553 ] Yanbo Liang edited comment on SPARK-17904 at 10/14/16 2:49 PM: --- [~felixcheung] The proposal I made in this JIRA is different from what you mentioned; they are two different scenarios. R users may call install.packages across the session, rather than installing all necessary libraries before they start the session. From the discussion, I found it's not easy to support this feature. As [~shivaram] suggested, we can try to add required packages when we start the session. This satisfies some users' requirements, but not all. I appreciate all your comments. Thanks. was (Author: yanboliang): [~felixcheung] I think the proposal I made in this JIRA is different from what you mentioned, I think they are two different scenarios. R users may call install.packages across the session, rather than installing all necessary libraries before they start the session. From the discussion, I found it's not easy to support this feature. Like what [~shivaram]'s suggestion, we can try to add required packages when we start the session. This can satisfy parts of users' requirements, but not all. All in all, I appreciate all your comments. Thanks.
[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-17937: --- Description: Possible events for which offsets are needed: # New partition is discovered # Offset out of range (aka, data has been lost) Possible sources of offsets: # Earliest position in log # Latest position in log # Fail and kill the query # Checkpoint position # User specified per topicpartition # Kafka commit log. Currently unsupported. This means users who want to migrate from existing kafka jobs need to jump through hoops. Even if we never want to support it, as soon as we take on SPARK-17815 we need to make sure Kafka commit log state is clearly documented and handled. # Timestamp. Currently unsupported. This could be supported with the old, inaccurate Kafka time api, or the upcoming time index # X offsets before or after latest / earliest position. Currently unsupported. I think the semantics of this are super unclear by comparison with timestamp, given that Kafka doesn't have a single range of offsets. Currently allowed pre-query configuration, all "ORs" are exclusive: # startingOffsets: earliest OR latest OR json per topicpartition (SPARK-17812) # failOnDataLoss: true (which implies Fail above) OR false (which implies Earliest above) In general, I see no reason this couldn't specify Latest as an option. Possible lifecycle times in which an offset-related event may happen: # At initial query start #* New partition: if startingOffsets is earliest or latest, use that. If startingOffsets is perTopicpartition, and the new partition isn't in the map, Fail. Note that this is effectively indistinguishable from 2a below, because partitions may have changed in between pre-query configuration and query start, but we treat it differently, and users in this case are SOL #* Offset out of range on driver: We don't technically have behavior for this case yet. Could use the value of failOnDataLoss, but it's possible people may want to know at startup that something was wrong, even if they're ok with earliest for a during-query out of range #* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss. # During query #* New partition: Earliest, only. This seems to be by fiat; I see no reason this can't be configurable. #* Offset out of range on driver: this ??probably?? doesn't happen, because we're doing explicit seeks to the latest position #* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss # At query restart #* New partition: Checkpoint, fall back to Earliest. Again, no reason this couldn't be configurable to fall back to Latest #* Offset out of range on driver: Fail or Earliest, based on failOnDataLoss #* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss I've probably missed something, chime in.
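The pre-query configuration enumerated above surfaces as options on the Kafka source. A minimal Scala sketch, assuming a `SparkSession` named `spark` and placeholder broker/topic names:

```scala
// Sketch of the currently allowed pre-query configuration described above.
// "startingOffsets" is earliest OR latest OR a JSON map per topicpartition
// (SPARK-17812); "failOnDataLoss" selects Fail (true) or Earliest (false).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
  .option("subscribe", "topic1")                     // placeholder topic
  // JSON per-topicpartition form; -2 = earliest, -1 = latest for a partition
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2}}""")
  .option("failOnDataLoss", "false")
  .load()
```

New partitions discovered during the query are not covered by these options; as the description notes, they currently start from Earliest by fiat.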
[jira] [Commented] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575599#comment-15575599 ] Aleksander Eskilson commented on SPARK-17939: - [~marmbrus] suggested the opening of this issue after a bit of discussion in the email list [1]. I'd like to clarify what I proposed in this newer context. First, clarification of what nullability means in the current API, like in the GitHub issue linked above, would be great. Second, it makes total sense for default nullability in the reader in many instances, to support more loosely-typed data sources, like JSON and CSV. However, apart from an analysis-time hint to the Catalyst optimizer, there are instances where a (potentially separate?) enforcement-level idea of nullability would be quite useful. With the possibility of writing custom encoders now open, other kinds of more strongly-typed data might be read into Datasets, e.g. Avro. Avro's UNION type with NULL gives us a harder notion of a truly nullable vs. non-nullable type. It was suggested in the other linked Jira issue above that the current contract is for users to make sure that they do not pass bad data into the reader (as it currently performs conversions that might surprise the user, like from null to 0). What I mean to suggest is that a type-level notion of nullability could help us fail faster and abide by our own data-contracts when we have data to read into Datasets that comes from more strongly-typed sources with known schemas. Thoughts on this?
[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-17937: --- Description: Possible events for which offsets are needed: # New partition is discovered # Offset out of range (aka, data has been lost) Possible sources of offsets: # Earliest position in log # Latest position in log # Fail and kill the query # Checkpoint position # User specified per topicpartition # Kafka commit log. Currently unsupported. This means users who want to migrate from existing kafka jobs need to jump through hoops. Even if we never want to support it, as soon as we take on SPARK-17815 we need to make sure Kafka commit log state is clearly documented and handled. # Timestamp. Currently unsupported. This could be supported with the old, inaccurate Kafka time API, or the upcoming time index # X offsets before or after latest / earliest position. Currently unsupported. I think the semantics of this are super unclear by comparison with timestamp, given that Kafka doesn't have a single range of offsets. Currently allowed pre-query configuration, all "ORs" are exclusive: # startingOffsets: earliest OR latest OR json per topicpartition (SPARK-17812) # failOnDataLoss: true (which implies Fail above) OR false (which implies Earliest above) In general, I see no reason this couldn't specify Latest as an option. Possible lifecycle times in which an offset-related event may happen: # At initial query start #* New partition: if startingOffsets is earliest or latest, use that. If startingOffsets is perTopicpartition, and the new partition isn't in the map, Fail. Note that this is effectively indistinguishable from a new partition during query, because partitions may have changed in between pre-query configuration and query start, but we treat it differently, and users in this case are SOL #* Offset out of range on driver: We don't technically have behavior for this case yet. 
Could use the value of failOnDataLoss, but it's possible people may want to know at startup that something was wrong, even if they're ok with earliest for a during-query out of range #* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss. # During query #* New partition: Earliest, only. This seems to be by fiat; I see no reason this can't be configurable. #* Offset out of range on driver: this _probably_ doesn't happen, because we're doing explicit seeks to the latest position #* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss # At query restart #* New partition: Checkpoint, fall back to Earliest. Again, no reason this couldn't be configurable to fall back to Latest #* Offset out of range on driver: Fail or Earliest, based on failOnDataLoss #* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss I've probably missed something, chime in. was: Possible events for which offsets are needed: # New partition is discovered # Offset out of range (aka, data has been lost) Possible sources of offsets: # Earliest position in log # Latest position in log # Fail and kill the query # Checkpoint position # User specified per topicpartition # Kafka commit log. Currently unsupported. This means users who want to migrate from existing kafka jobs need to jump through hoops. Even if we never want to support it, as soon as we take on SPARK-17815 we need to make sure Kafka commit log state is clearly documented and handled. # Timestamp. Currently unsupported. This could be supported with old, inaccurate Kafka time api, or upcoming time index # X offsets before or after latest / earliest position. Currently unsupported. I think the semantics of this are super unclear by comparison with timestamp, given that Kafka doesn't have a single range of offsets. 
Currently allowed pre-query configuration, all "ORs" are exclusive: # startingOffsets: earliest OR latest OR json per topicpartition (SPARK-17812) # failOnDataLoss: true (which implies Fail above) OR false (which implies Earliest above) In general, I see no reason this couldn't specify Latest as an option. Possible lifecycle times in which an offset-related event may happen: # At initial query start #* New partition: if startingOffsets is earliest or latest, use that. If startingOffsets is perTopicpartition, and the new partition isn't in the map, Fail. Note that this is effectively indistinguishable from 2a below, because partitions may have changed in between pre-query configuration and query start, but we treat it differently, and users in this case are SOL #* Offset out of range on driver: We don't technically have behavior for this case yet. Could use the value of failOnDataLoss, but it's possible people may want to know at startup that something was wrong, even if they're ok with earliest for a during-query out of range #* Offset out of range on executor: Fail or Ea
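The lifecycle table above can be read as a decision function. The following is an illustrative Python model of the described semantics; `Action`, `on_out_of_range`, and `on_new_partition` are hypothetical names and do not exist in Spark:

```python
# Hypothetical model of the offset-resolution table in this description;
# none of these names exist in Spark.
from enum import Enum, auto


class Action(Enum):
    EARLIEST = auto()
    LATEST = auto()
    FAIL = auto()
    CHECKPOINT = auto()
    USER_SPECIFIED = auto()


def on_out_of_range(fail_on_data_loss: bool) -> Action:
    # Same rule in every lifecycle phase per the table:
    # failOnDataLoss=true implies Fail, false implies Earliest.
    return Action.FAIL if fail_on_data_loss else Action.EARLIEST


def on_new_partition(phase: str, starting_offsets: str,
                     in_user_map: bool = False) -> Action:
    if phase == "initial_start":
        if starting_offsets == "earliest":
            return Action.EARLIEST
        if starting_offsets == "latest":
            return Action.LATEST
        # per-topicpartition map: use the user's offset if present, else Fail
        return Action.USER_SPECIFIED if in_user_map else Action.FAIL
    if phase == "during_query":
        return Action.EARLIEST    # currently by fiat, not configurable
    if phase == "restart":
        return Action.CHECKPOINT  # falls back to Earliest if absent
    raise ValueError(f"unknown phase: {phase}")
```

Writing the table this way makes the asymmetry visible: out-of-range handling is uniformly driven by failOnDataLoss, while new-partition handling differs by lifecycle phase.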
[jira] [Updated] (SPARK-17624) Flaky test? StateStoreSuite maintenance
[ https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Roberts updated SPARK-17624: - Affects Version/s: 2.1.0 > Flaky test? StateStoreSuite maintenance > --- > > Key: SPARK-17624 > URL: https://issues.apache.org/jira/browse/SPARK-17624 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.0.1, 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > I've noticed this test failing consistently (25x in a row) with a two-core > machine but not on an eight-core machine. > If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 > being the default in Spark), the test reliably passes; we can also gain > reliability by setting the master to be anything other than just local. > Is there a reason spark.rpc.numRetries is set to be 1? > I see this failure is also mentioned here, so it's been flaky for a while: > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html > If we run without the "quietly" code so we get debug info: > {code} > 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error > sending message [message = > VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)] > in 1 attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at 
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: org.apache.spark.SparkException: Could not find > StateStoreCoordinator. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcE
[jira] [Commented] (SPARK-17624) Flaky test? StateStoreSuite maintenance
[ https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575683#comment-15575683 ] Adam Roberts commented on SPARK-17624: -- Having another look at this now as it still fails intermittently. [~tdas] hi, as the author of this test and feature, can you please explain why numRetries was chosen to be 1 instead of the default (3) -- is it important for this test? My concern is that there's something actually wrong/unreliable here and that the failure is legitimate. [~jerryshao] HW used is 2x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 16 GB RAM (standard -Xmx3g used for the tests anyway), ext4 filesystem, Ubuntu 14.04 LTS > Flaky test? StateStoreSuite maintenance > --- > > Key: SPARK-17624 > URL: https://issues.apache.org/jira/browse/SPARK-17624 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.0.1, 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > I've noticed this test failing consistently (25x in a row) with a two core > machine but not on an eight core machine > If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 > being the default in Spark), the test reliably passes, we can also gain > reliability by setting the master to be anything other than just local. > Is there a reason spark.rpc.numRetries is set to be 1? 
> I see this failure is also mentioned here so it's been flaky for a while > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html > If we run without the "quietly" code so we get debug info: > {code} > 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error > sending message [message = > VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)] > in 1 attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227) > at > 
org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent
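The workaround described in the report above (raising spark.rpc.numRetries from the test's pinned value of 1, and using a master other than plain local) can be sketched as configuration. This is an illustration using the property names mentioned in the report, not the suite's actual setup:

{code}
# sketch: equivalent settings, e.g. in spark-defaults.conf or the suite's SparkConf
spark.rpc.numRetries   2            # the test pins this to 1; 2 reportedly passes reliably (Spark default: 3)
spark.master           local[2]     # per the report, any master other than plain "local" also helped
{code}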
[jira] [Created] (SPARK-17940) Typo in LAST function error message
Shuai Lin created SPARK-17940: - Summary: Typo in LAST function error message Key: SPARK-17940 URL: https://issues.apache.org/jira/browse/SPARK-17940 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shuai Lin Priority: Minor https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40 {code} throw new AnalysisException("The second argument of First should be a boolean literal.") {code} "First" should be "Last". Also the usage string can be improved to match the FIRST function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17940) Typo in LAST function error message
[ https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17940: Assignee: (was: Apache Spark) > Typo in LAST function error message > --- > > Key: SPARK-17940 > URL: https://issues.apache.org/jira/browse/SPARK-17940 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shuai Lin >Priority: Minor > > https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40 > {code} > throw new AnalysisException("The second argument of First should be a > boolean literal.") > {code} > "First" should be "Last". > Also the usage string can be improved to match the FIRST function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17940) Typo in LAST function error message
[ https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575694#comment-15575694 ] Apache Spark commented on SPARK-17940: -- User 'lins05' has created a pull request for this issue: https://github.com/apache/spark/pull/15487 > Typo in LAST function error message > --- > > Key: SPARK-17940 > URL: https://issues.apache.org/jira/browse/SPARK-17940 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shuai Lin >Priority: Minor > > https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40 > {code} > throw new AnalysisException("The second argument of First should be a > boolean literal.") > {code} > "First" should be "Last". > Also the usage string can be improved to match the FIRST function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17940) Typo in LAST function error message
[ https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17940: Assignee: Apache Spark > Typo in LAST function error message > --- > > Key: SPARK-17940 > URL: https://issues.apache.org/jira/browse/SPARK-17940 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shuai Lin >Assignee: Apache Spark >Priority: Minor > > https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40 > {code} > throw new AnalysisException("The second argument of First should be a > boolean literal.") > {code} > "First" should be "Last". > Also the usage string can be improved to match the FIRST function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17910) Allow users to update the comment of a column
[ https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17910: - Target Version/s: 2.1.0 > Allow users to update the comment of a column > - > > Key: SPARK-17910 > URL: https://issues.apache.org/jira/browse/SPARK-17910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Right now, once a user set the comment of a column with create table command, > he/she cannot update the comment. It will be useful to provide a public > interface to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet
Seth Hendrickson created SPARK-17941: Summary: Logistic regression test suites should use weights when comparing to glmnet Key: SPARK-17941 URL: https://issues.apache.org/jira/browse/SPARK-17941 Project: Spark Issue Type: Test Components: ML Reporter: Seth Hendrickson Priority: Minor Logistic regression suite currently has many test cases comparing to R's glmnet. Both libraries support weights, and to make the testing of weights in Spark LOR more robust, we should add weights to all the test cases. The current weight testing is quite minimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17910) Allow users to update the comment of a column
[ https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17910: - Description: Right now, once a user set the comment of a column with create table command, he/she cannot update the comment. It will be useful to provide a public interface (e.g. SQL) to do that. (was: Right now, once a user set the comment of a column with create table command, he/she cannot update the comment. It will be useful to provide a public interface to do that. ) > Allow users to update the comment of a column > - > > Key: SPARK-17910 > URL: https://issues.apache.org/jira/browse/SPARK-17910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Right now, once a user set the comment of a column with create table command, > he/she cannot update the comment. It will be useful to provide a public > interface (e.g. SQL) to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet
[ https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17941: Assignee: Apache Spark > Logistic regression test suites should use weights when comparing to glmnet > --- > > Key: SPARK-17941 > URL: https://issues.apache.org/jira/browse/SPARK-17941 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > Logistic regression suite currently has many test cases comparing to R's > glmnet. Both libraries support weights, and to make the testing of weights in > Spark LOR more robust, we should add weights to all the test cases. The > current weight testing is quite minimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet
[ https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575716#comment-15575716 ] Apache Spark commented on SPARK-17941: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/15488 > Logistic regression test suites should use weights when comparing to glmnet > --- > > Key: SPARK-17941 > URL: https://issues.apache.org/jira/browse/SPARK-17941 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Logistic regression suite currently has many test cases comparing to R's > glmnet. Both libraries support weights, and to make the testing of weights in > Spark LOR more robust, we should add weights to all the test cases. The > current weight testing is quite minimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet
[ https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17941: Assignee: (was: Apache Spark) > Logistic regression test suites should use weights when comparing to glmnet > --- > > Key: SPARK-17941 > URL: https://issues.apache.org/jira/browse/SPARK-17941 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > Logistic regression suite currently has many test cases comparing to R's > glmnet. Both libraries support weights, and to make the testing of weights in > Spark LOR more robust, we should add weights to all the test cases. The > current weight testing is quite minimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
Harish created SPARK-17942: -- Summary: OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize= Key: SPARK-17942 URL: https://issues.apache.org/jira/browse/SPARK-17942 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.1 Reporter: Harish My code snippet is at the location below. In that snippet I included only a few columns, but in my test case I have data with 10M rows and 10,000 columns. http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark I see the message below with a Spark 2.0.2 snapshot: # Stderr of the node OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled. OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize= # stdout of the node CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb bounds [0x7f32c500, 0x7f32d400, 0x7f32d400] total_blobs=41388 nmethods=40792 adapters=501 compilation: disabled (not enough contiguous free space left) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
[ https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harish updated SPARK-17942: --- Priority: Minor (was: Major) > OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using > -XX:ReservedCodeCacheSize= > - > > Key: SPARK-17942 > URL: https://issues.apache.org/jira/browse/SPARK-17942 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 >Reporter: Harish >Priority: Minor > > My code snipped is in below location. In that snippet i had put only few > columns, but in my test case i have data with 10M rows and 10,000 columns. > http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark > I see below message in spark 2.0.2 snapshot > # Stderr of the node > OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been > disabled. > OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using > -XX:ReservedCodeCacheSize= > # stdout of the node > CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb > bounds [0x7f32c500, 0x7f32d400, 0x7f32d400] > total_blobs=41388 nmethods=40792 adapters=501 > compilation: disabled (not enough contiguous free space left) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
[ https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575759#comment-15575759 ] Sean Owen commented on SPARK-17942: --- You probably need to increase this value -- is there more to it? > OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using > -XX:ReservedCodeCacheSize= > - > > Key: SPARK-17942 > URL: https://issues.apache.org/jira/browse/SPARK-17942 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 >Reporter: Harish >Priority: Minor > > My code snipped is in below location. In that snippet i had put only few > columns, but in my test case i have data with 10M rows and 10,000 columns. > http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark > I see below message in spark 2.0.2 snapshot > # Stderr of the node > OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been > disabled. > OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using > -XX:ReservedCodeCacheSize= > # stdout of the node > CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb > bounds [0x7f32c500, 0x7f32d400, 0x7f32d400] > total_blobs=41388 nmethods=40792 adapters=501 > compilation: disabled (not enough contiguous free space left) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
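For reference, the JVM option named in the warning is usually passed to Spark through the extra-Java-options properties. A sketch follows; the 512m value is an illustrative choice (roughly double the ~240 MB default visible in the stdout above), not a recommendation from this thread:

{code}
# spark-defaults.conf sketch: raise the JIT code cache on driver and executors
spark.driver.extraJavaOptions     -XX:ReservedCodeCacheSize=512m
spark.executor.extraJavaOptions   -XX:ReservedCodeCacheSize=512m
{code}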
[jira] [Updated] (SPARK-17940) Typo in LAST function error message
[ https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17940: -- Priority: Trivial (was: Minor) > Typo in LAST function error message > --- > > Key: SPARK-17940 > URL: https://issues.apache.org/jira/browse/SPARK-17940 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shuai Lin >Priority: Trivial > > https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40 > {code} > throw new AnalysisException("The second argument of First should be a > boolean literal.") > {code} > "First" should be "Last". > Also the usage string can be improved to match the FIRST function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575825#comment-15575825 ] Thomas Dunne commented on SPARK-13802: -- This is especially troublesome when combined with creating a DataFrame, while using your own schema. The data I am working on can contain a lot of empty fields, which makes the schema inference potentially have to scan every row to determine their type. Providing our own schema should fix this, right? Nope... Rather than matching up the keys of the Row, with the field names of the provided schema, lets just change the order of one (the Row), and naively use zip(row, schema.fields). This means that even keeping both schema field order, and Row key value is not enough, due to Rows sorting keys, we need to manually sort schema fields too. > Fields order in Row(**kwargs) is not consistent with Schema.toInternal method > - > > Key: SPARK-13802 > URL: https://issues.apache.org/jira/browse/SPARK-13802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Szymon Matejczyk > > When using Row constructor from kwargs, fields in the tuple underneath are > sorted by name. When Schema is reading the row, it is not using the fields in > this order. > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > schema = StructType([ > StructField("id", StringType()), > StructField("first_name", StringType())]) > row = Row(id="39", first_name="Szymon") > schema.toInternal(row) > Out[5]: ('Szymon', '39') > {code} > {code} > df = sqlContext.createDataFrame([row], schema) > df.show(1) > +--+--+ > |id|first_name| > +--+--+ > |Szymon|39| > +--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575825#comment-15575825 ] Thomas Dunne edited comment on SPARK-13802 at 10/14/16 4:45 PM: This is especially troublesome when combined with creating a DataFrame, while using your own schema. The data I am working on can contain a lot of empty fields, which makes the schema inference potentially have to scan every row to determine their type. Providing our own schema should fix this, right? Nope... Rather than matching up the keys of the Row, with the field names of the provided schema, lets just change the order of one (the Row), and naively use zip(row, schema.fields). This means that even keeping both schema field order, and Row key value is not enough, due to Rows sorting keys, we need to manually sort schema fields too. Doesn't seem consistent or desirable behavior at all. was (Author: thomas9): This is especially troublesome when combined with creating a DataFrame, while using your own schema. The data I am working on can contain a lot of empty fields, which makes the schema inference potentially have to scan every row to determine their type. Providing our own schema should fix this, right? Nope... Rather than matching up the keys of the Row, with the field names of the provided schema, lets just change the order of one (the Row), and naively use zip(row, schema.fields). This means that even keeping both schema field order, and Row key value is not enough, due to Rows sorting keys, we need to manually sort schema fields too. > Fields order in Row(**kwargs) is not consistent with Schema.toInternal method > - > > Key: SPARK-13802 > URL: https://issues.apache.org/jira/browse/SPARK-13802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Szymon Matejczyk > > When using Row constructor from kwargs, fields in the tuple underneath are > sorted by name. 
When Schema is reading the row, it is not using the fields in > this order. > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > schema = StructType([ > StructField("id", StringType()), > StructField("first_name", StringType())]) > row = Row(id="39", first_name="Szymon") > schema.toInternal(row) > Out[5]: ('Szymon', '39') > {code} > {code} > df = sqlContext.createDataFrame([row], schema) > df.show(1) > +--+--+ > |id|first_name| > +--+--+ > |Szymon|39| > +--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
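The ordering mismatch described above can be illustrated without PySpark. The sketch below is plain Python that mimics the mechanism (Row(**kwargs) stores its fields sorted by name, while the schema keeps declaration order) and the workaround of declaring the schema's fields in alphabetical order; it is an illustration of the reported behavior, not PySpark's actual implementation:

```python
# Plain-Python model of the mismatch reported above (no PySpark required).
values = {"id": "39", "first_name": "Szymon"}

row_fields = sorted(values)           # Row(**kwargs) sorts field names: ['first_name', 'id']
schema_fields = ["id", "first_name"]  # StructType keeps declaration order

row_values = [values[f] for f in row_fields]       # ['Szymon', '39'], as in toInternal
mismatched = dict(zip(schema_fields, row_values))  # naive zip pairs values with the wrong fields

# workaround: declare the schema's fields in sorted (alphabetical) order so both sides agree
aligned = dict(zip(sorted(schema_fields), row_values))

print(row_values)   # ['Szymon', '39']
print(mismatched)   # {'id': 'Szymon', 'first_name': '39'}
print(aligned)      # {'first_name': 'Szymon', 'id': '39'}
```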
[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575865#comment-15575865 ] Guo-Xun Yuan commented on SPARK-12664: -- Thank you, [~yanboliang]! So, just to confirm, will your PR cover just a method that exposes raw prediction scores in MultilayerPerceptronClassificationModel, or will it also cover the fix where MultilayerPerceptronClassificationModel is derived from ClassificationModel? Thanks! > Expose raw prediction scores in MultilayerPerceptronClassificationModel > --- > > Key: SPARK-12664 > URL: https://issues.apache.org/jira/browse/SPARK-12664 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Robert Dodier >Assignee: Yanbo Liang > > In > org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, > there isn't any way to get raw prediction scores; only an integer output > (from 0 to #classes - 1) is available via the `predict` method. > `mlpModel.predict` is called within the class to get the raw score, but > `mlpModel` is private so that isn't available to outside callers. > The raw score is useful when the user wants to interpret the classifier > output as a probability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17606) New batches are not created when there are 1000 created after restarting streaming from checkpoint.
[ https://issues.apache.org/jira/browse/SPARK-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575927#comment-15575927 ] etienne commented on SPARK-17606: - I'm not able to reproduce in local mode, either because the JobGenerator managed to restart, or because the streaming resulted in an OOM before the JobGenerator restarted. After a heap dump it appears the memory is full of MapPartitionRDD instances. Here are the results of my attempts. Batch interval: 500ms, real batch duration > 2s (I had to reduce the batch interval to generate batches faster). I proceeded as follows: start the streaming without a checkpoint, wait until a checkpoint is written, stop it, wait a period, and restart from the checkpoint. ||stopping time||restarting||batches during down time|| batches pending|| batches to reschedule|| starting of JobGenerator || last time* || |14:23:01|14:33:29|1320 - [14:22:30-14:33:29.500]|88 - [14:22:00-14:22:44]|1379-[14:22:00.500-14:33:29.500]|14:55:00 for time 14:33:30|14:25:11| |15:08:25|15:31:20|2777 - [15:08:18.500-15:31:26.500]|22 - [15:08:11-15:08:21.500]|2792-[15:08:11-15:31:26.500]|OOM| | |09:30:01|09:47:01|2338 - [09:27:33-09:47:01.500]|298 - [09:26:03-09:28:31.500]|2518 - [09:26:03-09:47:01.500]|OOM| | |12:52:49|12:46:01|1838- [12:49:11.500-13:04:30]|116 - [12:49:18.500-12:50:16]|1838- [12:45:11.500-13:04:30]|OOM| | \* last time to reschedule found in the log that was executed before the restart of the JobGenerator (strangely, those 8 minutes are not missing in the UI) All these OOMs make me think something is not cleaned up correctly. The JobGenerator is not started directly at the beginning (I have looked into the source and didn't find what is blocking) and this may induce a lag in batch generation. > New batches are not created when there are 1000 created after restarting > streaming from checkpoint. 
> --- > > Key: SPARK-17606 > URL: https://issues.apache.org/jira/browse/SPARK-17606 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: etienne > > When spark restarts from a checkpoint after being down for a while. > It recreates missing batch since the down time. > When there are few missing batches, spark creates new incoming batch every > batchTime, but when there is enough missing time to create 1000 batches no > new batch is created. > So when all these batch are completed the stream is idle ... > I think there is a rigid limit set somewhere. > I was expecting that spark continue to recreate missed batches, maybe not all > at once ( because it's look like it's cause driver memory problem ), and then > recreate batches each batchTime. > Another solution would be to not create missing batches but still restart the > direct input. > Right know for me the only solution to restart a stream after a long break it > to remove the checkpoint to allow the creation of a new stream. But losing > all my states. > ps : I'm speaking about direct Kafka input because it's the source I'm > currently using, I don't know what happens with other sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575938#comment-15575938 ] Xiao Li commented on SPARK-17709: - I can get an exactly same plan in the master branch, but my job can pass. {noformat} 'Join UsingJoin(Inner,List('companyid, 'productid)) :- Aggregate [companyid#5, productid#6], [companyid#5, productid#6, sum(cast(price#7 as bigint)) AS price#30L] : +- Project [companyid#5, productid#6, price#7, count#8] : +- SubqueryAlias testext2 :+- Relation[companyid#5,productid#6,price#7,count#8] parquet +- Aggregate [companyid#46, productid#47], [companyid#46, productid#47, sum(cast(count#49 as bigint)) AS count#41L] +- Project [companyid#46, productid#47, price#48, count#49] +- SubqueryAlias testext2 +- Relation[companyid#46,productid#47,price#48,count#49] parquet {noformat} The only difference is yours does not trigger deduplication of expression ids. Let me try it in the 2.0.1 branch. > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. 
> This same code above worked with Spark 1.6.2.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
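One workaround sometimes suggested for this class of self-join ambiguity (a sketch, not something proposed in this thread, and untested against this report) is to avoid the USING-style join on Seq("key1","key2") and instead rename the keys on one side and join on an explicit condition, so the analyzer never has to resolve two attributes that share a lineage:

{code}
// Hedged sketch, assuming the reporter's df1/df2 from the description above.
// Rename the keys on one side, join on an explicit condition, then drop
// the duplicated key columns.
val df2renamed = df2
  .withColumnRenamed("key1", "k1")
  .withColumnRenamed("key2", "k2")

val joined = df1
  .join(df2renamed, df1("key1") === df2renamed("k1") && df1("key2") === df2renamed("k2"))
  .drop("k1", "k2")
{code}

This simply removes the ambiguous USING columns from the equation; whether it sidesteps the resolution failure reported here would need to be verified on 2.0.x.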
[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
[ https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575945#comment-15575945 ] Justin Miller commented on SPARK-17936:
---

Hey Sean,

I did a bit more digging this morning looking at SpecificUnsafeProjection and saw this commit: https://github.com/apache/spark/commit/b1b47274bfeba17a9e4e9acebd7385289f31f6c8

I thought I'd try running with 2.1.0-SNAPSHOT to see how things went, and it appears to work great now!

[Stage 1:> (0 + 8) / 8]11:28:33.237 INFO c.p.o.ObservationPersister - (ObservationPersister) - Thrift Parse Success: 0 / Thrift Parse Errors: 0
[Stage 3:> (0 + 8) / 8]11:29:03.236 INFO c.p.o.ObservationPersister - (ObservationPersister) - Thrift Parse Success: 89 / Thrift Parse Errors: 0
[Stage 5:> (4 + 4) / 8]11:29:33.237 INFO c.p.o.ObservationPersister - (ObservationPersister) - Thrift Parse Success: 205 / Thrift Parse Errors: 0

Since we're still testing this out, that snapshot works great for now. Do you know when 2.1.0 might be available generally?

Best,
Justin

> "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17936
>                 URL: https://issues.apache.org/jira/browse/SPARK-17936
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift
> structs coming from Kafka into Parquet files stored in S3. This conversion
> process works fine in 1.6.2, but I think there may be a bug in 2.0.1. I'll
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
>         at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>         at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>         at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>         at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
>
> Also, later on:
>
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
>
> I've seen similar issues posted, but those were always on the query side. I
> have a hunch that this is happening at write time, as the error occurs after
> batchDuration. Here's the write snippet:
>
> stream.
>   flatMap {
>     case Success(row) =>
>       thriftParseSuccess += 1
>       Some(row)
>     case Failure(ex) =>
>       thriftParseErrors += 1
>       logger.error("Error during deserialization: ", ex)
>       None
>   }.foreachRDD { rdd =>
>     val sqlContext = SQLContext.getOrCreate(rdd.context)
>     transformer(sqlContext.createDataFrame(rdd, converter.schema))
>       .coalesce(coalesceSize)
>       .write
>       .mode(Append)
>       .partitionBy(partitioning: _*)
>       .parquet(parquetPath)
>   }
>
> Please let me know if you can be of assistance and if there's anything I can
> do to help.
>
> Best,
> Justin
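For context, the 64 KB figure in the exception is a JVM class-file limit: the bytecode of a single method may not exceed 65535 bytes, so a code generator that emits one flat method per projection will eventually overflow on very wide rows. If the Parquet sink does not actually need every field, one possible mitigation (a hypothetical sketch, not a fix proposed in this thread; `neededColumns` and its contents are invented names) is to prune the schema before writing so the generated projection covers fewer fields:

{code}
import org.apache.spark.sql.functions.col

// Hypothetical mitigation: project only the columns the sink needs, so the
// generated SpecificUnsafeProjection stays well under the 64 KB method limit.
val neededColumns: Seq[String] = Seq("colA", "colB", "colC") // invented names
transformer(sqlContext.createDataFrame(rdd, converter.schema))
  .select(neededColumns.map(col): _*)
  .coalesce(coalesceSize)
  .write
  .mode(Append)
  .partitionBy(partitioning: _*)
  .parquet(parquetPath)
{code}

This only helps when the full width isn't required downstream; otherwise the codegen fix referenced in the comment above is the real remedy.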
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575995#comment-15575995 ] Xiao Li commented on SPARK-17709:
-

Still works well in 2.0.1.

> spark 2.0 join - column resolution error
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575999#comment-15575999 ] Xiao Li commented on SPARK-17709:
-

Below are the statements I used to try to recreate the problem:
{noformat}
sql("CREATE TABLE testext2(companyid int, productid int, price int, count int) using parquet")
sql("insert into testext2 values (1, 1, 1, 1)")

val d1 = spark.sql("select * from testext2")
val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price"))
val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df1.join(df2, Seq("companyid", "productid")).show
{noformat}
Can you try it?

> spark 2.0 join - column resolution error
[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17709:
Labels:   (was: easyfix)

> spark 2.0 join - column resolution error