[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-14 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574502#comment-15574502
 ] 

Low Chin Wei commented on SPARK-13747:
--

I encountered this in 2.0.1. Is there any workaround, e.g. would having a 
separate SparkSession help?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, that thread's local properties have already been polluted.
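
For illustration, a minimal workaround sketch, assuming the goal is simply to run the queries concurrently: drive them from a fixed-size thread pool instead of a ForkJoinPool-backed context (parallel collections or ExecutionContext.Implicits.global), so a suspended runJob never has another query scheduled onto its thread. The object and variable names here are illustrative, not from the ticket.

{code}
// Hedged sketch: not Spark's fix, just one way to avoid the ForkJoinPool
// work-stealing described above. All names are illustrative.
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.sql.SparkSession

object ConcurrentSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("concurrent-sql").master("local[*]").getOrCreate()
    import spark.implicits._

    // A plain fixed thread pool: no work-stealing, so each query keeps its own
    // thread (and its own inherited local properties) for its whole lifetime.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

    val futures = (1 to 100).map { _ =>
      Future {
        spark.sparkContext.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
      }
    }
    futures.foreach(f => println(Await.result(f, Duration.Inf)))
    spark.stop()
  }
}
{code}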






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574540#comment-15574540
 ] 

Apache Spark commented on SPARK-16845:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15480

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
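
A hedged reproduction sketch (the reporter's actual ML pipeline is not shown in the ticket): ordering a very wide DataFrame makes codegen emit a single large compare() in GeneratedClass$SpecificOrdering, which can exceed the JVM's 64 KB method-size limit on affected 2.0.x builds. Names and the column count are illustrative.

{code}
// Sketch of the failure mode, not the reporter's code: 400 sort columns force
// GenerateOrdering to emit one compare() covering every column.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object WideOrderingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-ordering").master("local[*]").getOrCreate()

    val numCols = 400
    val schema = StructType((0 until numCols).map(i => StructField(s"c$i", DoubleType, nullable = false)))
    val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(Seq.fill(numCols)(1.0))))
    val df = spark.createDataFrame(rows, schema)

    // Sorting on all columns is where the generated SpecificOrdering is compiled;
    // on affected versions this is where the "grows beyond 64 KB" error surfaces.
    df.orderBy(schema.fieldNames.map(df.col): _*).count()

    spark.stop()
  }
}
{code}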






[jira] [Resolved] (SPARK-17903) MetastoreRelation should talk to external catalog instead of hive client

2016-10-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17903.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> MetastoreRelation should talk to external catalog instead of hive client
> 
>
> Key: SPARK-17903
> URL: https://issues.apache.org/jira/browse/SPARK-17903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Commented] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-14 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574639#comment-15574639
 ] 

Mario Briggs commented on SPARK-17917:
--

>>
I don't have a strong feeling on this partly because I'm not sure what the 
action then is – kill the job?
<<
Here is an example: let's say I am using a notebook and kicked off some Spark 
actions that don't get executors because the user/org/group quotas of executors 
have been exhausted. These events can be used by the notebook implementor to 
surface the issue to the user via a UI update on that cell itself, maybe even 
additionally query the user/org/group quotas, show which apps are using up the 
quotas, and let the user take whatever action is required (kill the other jobs, 
just wait on this job, etc.). Therefore I am not looking to define on the event, 
in any way, what the set of actions can be, since that would be very 
implementation specific.

>>
Maybe, I suppose it will be a little tricky to define what the event is here
<<
Were you referring to the actual arguments of the event method? I can give a 
shot at defining them and then look for feedback.
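
To make the intent concrete, a hedged sketch of how a notebook front end might consume such an event. The event class and its fields are hypothetical, since the ticket has not defined them yet; only SparkListener, SparkListenerEvent, and onOtherEvent are existing Spark API.

{code}
// Hypothetical event + listener sketch. SparkListenerTaskSetResourceStarved and its
// fields do not exist in Spark; they stand in for whatever the ticket ends up defining.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

case class SparkListenerTaskSetResourceStarved(stageId: Int, waitedMillis: Long)
  extends SparkListenerEvent

class ResourceStarvationListener(notifyUi: String => Unit) extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case SparkListenerTaskSetResourceStarved(stageId, waited) =>
      // Surface the condition on the notebook cell; what the user does next is up to them.
      notifyUi(s"Stage $stageId has been waiting $waited ms without being offered resources")
    case _ => // ignore all other events
  }
}

// Registration would look like: sc.addSparkListener(new ResourceStarvationListener(println))
{code}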


> Convert 'Initial job has not accepted any resources..' logWarning to a 
> SparkListener event
> --
>
> Key: SPARK-17917
> URL: https://issues.apache.org/jira/browse/SPARK-17917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mario Briggs
>Priority: Minor
>
> When supporting Spark on a large multi-tenant shared cluster with quotas per 
> tenant, a submitted taskSet often might not get executors because quotas have 
> been exhausted or resources are unavailable. In these situations, firing a 
> SparkListener event instead of just logging the issue (as done currently at 
> https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192),
>  would give applications/listeners an opportunity to handle this more 
> appropriately as needed.






[jira] [Comment Edited] (SPARK-16575) partition calculation mismatch with sc.binaryFiles

2016-10-14 Thread Tarun Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572950#comment-15572950
 ] 

Tarun Kumar edited comment on SPARK-16575 at 10/14/16 8:38 AM:
---

[~rxin] I have now added support for openCostInBytes, similar to SQL (thanks 
for pointing that out). It now creates an optimized number of partitions. 
Please review and suggest changes. Once again, thanks for your suggestion; it 
worked like a charm!


was (Author: fidato13):
[~rxin] I have now added the support of openCostInBytes, similar to SQL (thanks 
for pointing out). It does now creates an optimized number of partitions. 
Request you to review and suggest. Once again, Thanks for your suggestion, It 
worked like a charm!

> partition calculation mismatch with sc.binaryFiles
> --
>
> Key: SPARK-16575
> URL: https://issues.apache.org/jira/browse/SPARK-16575
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Suhas
>Priority: Critical
>
> sc.binaryFiles is always creating an RDD with number of partitions as 2.
> Steps to reproduce: (Tested this bug on databricks community edition)
> 1. Try to create an RDD using sc.binaryFiles. In this example, airlines 
> folder has 1922 files.
>  Ex: {noformat}val binaryRDD = 
> sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. check the number of partitions of the above RDD
> - binaryRDD.partitions.size = 2. (expected value is more than 2)
> 3. If the RDD is created using sc.textFile, then the number of partitions is 
> 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions in Spark 1.5.1.
> For explanation with screenshot, please look at the link below,
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
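
For context, a hedged sketch of the SQL-style cost model referenced in the comment above: each file is padded with openCostInBytes so that many small files still spread across cores, and the split size is capped by a maximum partition size. The constants mirror Spark SQL's defaults; the function itself is illustrative, not the patched binaryFiles code.

{code}
// Illustrative only: mirrors the maxSplitBytes formula Spark SQL uses for file
// sources, not the actual sc.binaryFiles implementation.
object PartitionSizingSketch {
  def maxSplitBytes(fileSizes: Seq[Long],
                    defaultParallelism: Int,
                    maxPartitionBytes: Long = 128L * 1024 * 1024,
                    openCostInBytes: Long = 4L * 1024 * 1024): Long = {
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // 1922 files of ~1 MB each on 8 cores: the open cost keeps the target split
    // well below 128 MB, so far more than 2 partitions are produced.
    val sizes = Seq.fill(1922)(1L * 1024 * 1024)
    println(maxSplitBytes(sizes, defaultParallelism = 8))
  }
}
{code}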






[jira] [Resolved] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10872.
---
Resolution: Not A Problem

I don't think this is something to be fixed, if I understand it correctly, in 
that stopping and restarting a context within one JVM is not an intended usage.
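
A hedged sketch of that intended pattern, shown in Scala rather than the reporter's PySpark and assuming cache isolation is the underlying goal: keep one SparkContext/HiveContext for the lifetime of the JVM and clear cached data between logical jobs, instead of stopping and restarting the context (which is what leaves Derby's metastore_db locked).

{code}
// Sketch of the "one context per JVM" pattern; the clearCache() call is a suggested
// alternative for cache isolation, not a fix shipped for this ticket.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SingleContextSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local[*]"))
    val sql = new HiveContext(sc)

    // "job 1"
    sql.createDataFrame(Seq(Tuple1(1))).collect()
    sql.clearCache() // drop cached tables/DataFrames instead of restarting the context

    // "job 2" reuses the same context
    sql.createDataFrame(Seq(Tuple1(1))).collect()

    sc.stop() // stop once, when the JVM is done
  }
}
{code}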

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HM

[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-17933:
-
Attachment: screenshot-2.png
screenshot-1.png

> Shuffle fails when driver is on one of the same machines as executor
> 
>
> Key: SPARK-17933
> URL: https://issues.apache.org/jira/browse/SPARK-17933
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.2
>Reporter: Frank Rosner
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> h4. Problem
> When I run a job that requires some shuffle, some tasks fail because the 
> executor cannot fetch the shuffle blocks from another executor.
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> 10-250-20-140:44042
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.nio.channels.UnresolvedAddressException
>   at sun.nio.ch.Net.checkAddress(Net.java:101)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannel

[jira] [Created] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Frank Rosner (JIRA)
Frank Rosner created SPARK-17933:


 Summary: Shuffle fails when driver is on one of the same machines 
as executor
 Key: SPARK-17933
 URL: https://issues.apache.org/jira/browse/SPARK-17933
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.6.2
Reporter: Frank Rosner
 Attachments: screenshot-1.png, screenshot-2.png

h4. Problem

When I run a job that requires some shuffle, some tasks fail because the 
executor cannot fetch the shuffle blocks from another executor.

{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
10-250-20-140:44042
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at 
io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
at 
io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
at 
io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractCh

[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-17933:
-
Description: 
h4. Problem

When I run a job that requires some shuffle, some tasks fail because the 
executor cannot fetch the shuffle blocks from another executor.

{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
10-250-20-140:44042
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at 
io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
at 
io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
at 
io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
at 
io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
at 
io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at 
io.netty.channel.Abstra


[jira] [Created] (SPARK-17934) Support percentile scale in ml.feature

2016-10-14 Thread Lei Wang (JIRA)
Lei Wang created SPARK-17934:


 Summary: Support percentile scale in ml.feature
 Key: SPARK-17934
 URL: https://issues.apache.org/jira/browse/SPARK-17934
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Lei Wang


Percentile scaling is often used in feature scaling.
In my project, I need to use this scaler.
Compared to MinMaxScaler, PercentileScaler will not produce unstable results due 
to anomalously large values.

About percentile scale, refer to https://en.wikipedia.org/wiki/Percentile_rank






[jira] [Commented] (SPARK-17934) Support percentile scale in ml.feature

2016-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574787#comment-15574787
 ] 

Sean Owen commented on SPARK-17934:
---

This might be on the border of things that are widely used enough to include in 
MLlib. You can of course implement this as a UDF.
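
For what it's worth, a hedged sketch of one way to get percentile-rank scaling today without a new MLlib transformer, using the built-in percent_rank window function rather than a hand-written UDF; column names are illustrative.

{code}
// Percentile-rank scaling of a numeric column via percent_rank.
// Note: a window with no partitionBy pulls all rows into one partition, so this
// is only a sketch, not something to run as-is on a very large column.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.percent_rank

object PercentileScaleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("percentile-scale").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0, 1000.0).toDF("value") // one anomalously large value
    // percent_rank maps each value to its rank in [0, 1]; unlike min-max scaling,
    // the outlier does not squash the rest of the column toward zero.
    df.withColumn("scaled", percent_rank().over(Window.orderBy($"value"))).show()

    spark.stop()
  }
}
{code}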

> Support percentile scale in ml.feature
> --
>
> Key: SPARK-17934
> URL: https://issues.apache.org/jira/browse/SPARK-17934
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Percentile scaling is often used in feature scaling.
> In my project, I need to use this scaler.
> Compared to MinMaxScaler, PercentileScaler will not produce unstable results 
> due to anomalously large values.
> About percentile scale, refer to https://en.wikipedia.org/wiki/Percentile_rank






[jira] [Resolved] (SPARK-4257) Spark master can only be accessed by hostname

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4257.
--
Resolution: Not A Problem

> Spark master can only be accessed by hostname
> -
>
> Key: SPARK-4257
> URL: https://issues.apache.org/jira/browse/SPARK-4257
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Priority: Critical
>
> After sbin/start-all.sh, the Spark shell cannot connect to the standalone master 
> via spark://IP:7077; it works if the IP is replaced by the hostname.
> In the docs [1], it says to use `spark://IP:PORT` to connect to the master.
> [1] http://spark.apache.org/docs/latest/spark-standalone.html






[jira] [Resolved] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5113.
--
Resolution: Won't Fix

> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. That 
> hostname is also the one used for the akka system identifier (akka supports 
> only supplying a single name which it uses both as the bind interface and as 
> the actor identifier). In some cases, that hostname is used as the bind 
> hostname also (e.g. I think this happens in the connection manager and 
> possibly akka) - which will likely internally result in a re-resolution of 
> this to an IP address. In other cases (the web UI and netty shuffle) we seem 
> to bind to all interfaces.
> The best outcome would be to have three configs that can be set on each 
> machine:
> {code}
> SPARK_LOCAL_IP # Ip address we bind to for all services
> SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within 
> the cluster
> SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the 
> cluster (e.g. the UI)
> {code}
> It's not clear how easily we can support that scheme while providing 
> backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - 
> it's just an alias for what is now SPARK_PUBLIC_DNS.






[jira] [Assigned] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17929:


Assignee: (was: Apache Spark)

> Deadlock when AM restart and send RemoveExecutor on reset
> -
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Priority: Minor
>
> We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
> numPendingExecutors = 0
> executorsPendingToRemove.clear()
> // Remove all the lingering executors that should be removed but not yet. 
> The reason might be
> // because (1) disconnected event is not yet received; (2) executors die 
> silently.
> executorDataMap.toMap.foreach { case (eid, _) =>
>   driverEndpoint.askWithRetry[Boolean](
> RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
> re-registered.")))
> }
>   }
> {code}
> but removeExecutor also needs the lock 
> "CoarseGrainedSchedulerBackend.this.synchronized", so this causes a deadlock: 
> the RPC send fails, and reset fails
> {code}
> private def removeExecutor(executorId: String, reason: 
> ExecutorLossReason): Unit = {
>   logDebug(s"Asked to remove executor $executorId with reason $reason")
>   executorDataMap.get(executorId) match {
> case Some(executorInfo) =>
>   // This must be synchronized because variables mutated
>   // in this block are read when requesting executors
>   val killed = CoarseGrainedSchedulerBackend.this.synchronized {
> addressToExecutorId -= executorInfo.executorAddress
> executorDataMap -= executorId
> executorsPendingLossReason -= executorId
> executorsPendingToRemove.remove(executorId).getOrElse(false)
>   }
>  ...
> {code}
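
A minimal, hedged sketch of the lock pattern being described, in plain JVM code rather than Spark's classes: the thread calling reset() holds the object lock and blocks waiting for removeExecutor() to run on another thread, but removeExecutor() needs the same lock, so the wait can only end by timing out.

{code}
// Illustrative deadlock pattern only; names mirror the snippet above but this is
// not Spark code. The await stands in for the blocking askWithRetry.
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object DeadlockSketch {
  private val lock = new Object
  private val endpointThread = Executors.newSingleThreadExecutor()

  def reset(): Unit = lock.synchronized {
    val done = new CountDownLatch(1)
    // "askWithRetry": hand RemoveExecutor to the endpoint thread and block on the reply.
    endpointThread.submit(new Runnable {
      def run(): Unit = { removeExecutor(); done.countDown() }
    })
    // Never succeeds: removeExecutor needs `lock`, which this thread still holds.
    val replied = done.await(5, TimeUnit.SECONDS)
    println(s"reply received: $replied") // false -> in Spark the ask times out and reset fails
  }

  def removeExecutor(): Unit = lock.synchronized {
    println("removing executor")
  }

  def main(args: Array[String]): Unit = {
    reset()
    endpointThread.shutdown()
  }
}
{code}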






[jira] [Commented] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574802#comment-15574802
 ] 

Sean Owen commented on SPARK-17933:
---

I think this is related to a lot of existing JIRAs concerning the difference 
between advertised host names and IPs, like 
https://issues.apache.org/jira/browse/SPARK-4563

> Shuffle fails when driver is on one of the same machines as executor
> 
>
> Key: SPARK-17933
> URL: https://issues.apache.org/jira/browse/SPARK-17933
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.2
>Reporter: Frank Rosner
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> h4. Problem
> When I run a job that requires some shuffle, some tasks fail because the 
> executor cannot fetch the shuffle blocks from another executor.
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> 10-250-20-140:44042
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.nio.channels.UnresolvedAddressException
>   at sun.nio.ch.Net.checkAddress(Net.java:101)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
>   at 
> io.netty.channel.DefaultCh

[jira] [Assigned] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17929:


Assignee: Apache Spark

> Deadlock when AM restart and send RemoveExecutor on reset
> -
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Assignee: Apache Spark
>Priority: Minor
>
> We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
> numPendingExecutors = 0
> executorsPendingToRemove.clear()
> // Remove all the lingering executors that should be removed but not yet. 
> The reason might be
> // because (1) disconnected event is not yet received; (2) executors die 
> silently.
> executorDataMap.toMap.foreach { case (eid, _) =>
>   driverEndpoint.askWithRetry[Boolean](
> RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
> re-registered.")))
> }
>   }
> {code}
> but removeExecutor also needs the lock 
> "CoarseGrainedSchedulerBackend.this.synchronized", so this causes a deadlock: 
> the RPC send fails, and reset fails
> {code}
> private def removeExecutor(executorId: String, reason: 
> ExecutorLossReason): Unit = {
>   logDebug(s"Asked to remove executor $executorId with reason $reason")
>   executorDataMap.get(executorId) match {
> case Some(executorInfo) =>
>   // This must be synchronized because variables mutated
>   // in this block are read when requesting executors
>   val killed = CoarseGrainedSchedulerBackend.this.synchronized {
> addressToExecutorId -= executorInfo.executorAddress
> executorDataMap -= executorId
> executorsPendingLossReason -= executorId
> executorsPendingToRemove.remove(executorId).getOrElse(false)
>   }
>  ...
> {code}






[jira] [Commented] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

2016-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574801#comment-15574801
 ] 

Apache Spark commented on SPARK-17929:
--

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/15481

> Deadlock when AM restart and send RemoveExecutor on reset
> -
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Priority: Minor
>
> We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
> numPendingExecutors = 0
> executorsPendingToRemove.clear()
> // Remove all the lingering executors that should be removed but not yet. 
> The reason might be
> // because (1) disconnected event is not yet received; (2) executors die 
> silently.
> executorDataMap.toMap.foreach { case (eid, _) =>
>   driverEndpoint.askWithRetry[Boolean](
> RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
> re-registered.")))
> }
>   }
> {code}
> but removeExecutor also needs the lock 
> "CoarseGrainedSchedulerBackend.this.synchronized", so this causes a deadlock: 
> the RPC send fails, and reset fails
> {code}
> private def removeExecutor(executorId: String, reason: 
> ExecutorLossReason): Unit = {
>   logDebug(s"Asked to remove executor $executorId with reason $reason")
>   executorDataMap.get(executorId) match {
> case Some(executorInfo) =>
>   // This must be synchronized because variables mutated
>   // in this block are read when requesting executors
>   val killed = CoarseGrainedSchedulerBackend.this.synchronized {
> addressToExecutorId -= executorInfo.executorAddress
> executorDataMap -= executorId
> executorsPendingLossReason -= executorId
> executorsPendingToRemove.remove(executorId).getOrElse(false)
>   }
>  ...
> {code}






[jira] [Resolved] (SPARK-17898) --repositories needs username and password

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17898.
---
Resolution: Not A Problem

These are Gradle questions, really. It should be perfectly possible to access 
your repo and build an app jar with it.

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repositories need a username and password to access.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username 
> and password.






[jira] [Resolved] (SPARK-17572) Write.df is failing on spark cluster

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17572.
---
Resolution: Not A Problem

> Write.df is failing on spark cluster
> 
>
> Key: SPARK-17572
> URL: https://issues.apache.org/jira/browse/SPARK-17572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Sankar Mittapally
>
> Hi,
>  We have a Spark cluster with four nodes; all four nodes share an NFS partition 
> (there is no HDFS), and we have the same uid on all servers. When we try to 
> write data we get the following exception. I am not sure whether it is an error 
> or not, nor whether I will lose data in the output.
> The command which I am using to save the data.
> {code}
> saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true")
> {code}
> {noformat}
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.io.IOException: Failed to rename 
> DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv;
>  isDirectory=false; length=436486316; replication=1; blocksize=33554432; 
> modification_time=147409940; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false} to 
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(A

[jira] [Resolved] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-650.
-
Resolution: Duplicate

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries






[jira] [Resolved] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16720.
---
Resolution: Not A Problem

> Loading CSV file with 2k+ columns fails during attribute resolution on action
> -
>
> Key: SPARK-16720
> URL: https://issues.apache.org/jira/browse/SPARK-16720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: holdenk
>
> Example shell for repro:
> {quote}
> scala> val df =spark.read.format("csv").option("header", 
> "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv")
> df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int 
> ... 2125 more fields]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(Date,StringType,true), StructField(Lifetime Total 
> Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), 
> StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged 
> Users,IntegerType,true), StructField(Weekly Page Engaged 
> Users,IntegerType,true), StructField(28 Days Page Engaged 
> Users,IntegerType,true), StructField(Daily Like Sources - On Your 
> Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), 
> StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total 
> Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), 
> StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days 
> Organic Reach,IntegerType,true), StructField(Daily T...
> scala> df.take(1)
> [GIANT LIST OF COLUMNS]
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sq

[jira] [Resolved] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17555.
---
Resolution: Not A Problem

> ExternalShuffleBlockResolver fails randomly with External Shuffle Service  
> and Dynamic Resource Allocation on Mesos running under Marathon
> --
>
> Key: SPARK-17555
> URL: https://issues.apache.org/jira/browse/SPARK-17555
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1, 1.6.2, 2.0.0
> Environment: Mesos using docker with external shuffle service running 
> on Marathon. Running code from pyspark shell in client mode.
>Reporter: Brad Willard
>  Labels: docker, dynamic_allocation, mesos, shuffle
>
> The External Shuffle Service throws these errors about 90% of the time. It seems 
> to die between stages and works inconsistently, with this style of error 
> about missing files. I've tested the same behavior with all the Spark 
> versions listed on this JIRA, using the pre-built Hadoop 2.6 distributions from 
> the Apache Spark download page. I also want to mention that everything works 
> successfully with dynamic resource allocation turned off.
> I have read other related bugs and have tried some of the 
> workarounds/suggestions. It seems some people have blamed the switch from 
> Akka to Netty, which got me testing this in the 1.5.* range with no luck. I'm 
> currently running these config options (informed by reading other bugs on JIRA 
> that seemed related to my problem). These settings have helped it work 
> sometimes instead of never.
> {code}
> spark.shuffle.service.port7338
> spark.shuffle.io.numConnectionsPerPeer  4
> spark.shuffle.io.connectionTimeout  18000s
> spark.shuffle.service.enabled   true
> spark.dynamicAllocation.enabled true
> {code}
> On the driver, for the pyspark submit, I'm sending along this config:
> {code}
> --conf 
> spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1
>  \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.mesos.coarse=true \
> --conf spark.cores.max=100 \
> --conf spark.executor.uri=$SPARK_EXECUTOR_URI \
> --conf spark.shuffle.service.port=7338 \
> --executor-memory 15g
> {code}
> Under Marathon I'm pinning each external shuffle service to an agent and 
> starting the service like this.
> {code}
> $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f 
> $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService*
> {code}
> On startup it seems like all is well
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Python version 3.5.2 (default, Jul  2 2016 17:52:12)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> 16/09/15 11:35:53 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now 
> >>> TASK_RUNNING
> 16/09/15 11:35:53 INFO MesosExternalShuffleClient: Successfully registered 
> app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service.
> 16/09/15 11:35:55 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now 
> TASK_RUNNING
> 16/09/15 11:35:55 INFO MesosExternalShuffleClient: Successfully registered 
> app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service.
> 16/09/15 11:35:56 INFO CoarseMesosSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (mesos-agent002.[redacted]:61281) with ID 
> 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1
> 16/09/15 11:35:56 INFO ExecutorAllocationManager: New executor 
> 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 has registered (new total is 1)
> 16/09/15 11:35:56 INFO BlockManagerMasterEndpoint: Registering block manager 
> mesos-agent002.[redacted]:46247 with 10.6 GB RAM, 
> BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1, 
> mesos-agent002.[redacted], 46247)
> 16/09/15 11:35:59 INFO CoarseMesosSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (mesos-agent004.[redacted]:42738) with ID 
> 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0
> 16/09/15 11:35:59 INFO ExecutorAllocationManager: New executor 
> 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0 has registered (new total is 2)
> 16/09/15 11:35:59 INFO BlockManagerMasterEndpoint: Registering block manager 
> mesos-agent004.[redacted]:11262 with 10.6 GB RAM, 
> BlockManagerId(55300bb1-aca1-4dd1-8647-ce4a50f6d661-S10/0, 
> mesos-agent004.[redacted], 11262)
> {code}
> I'm running a simple sort command on a spark data frame loaded from hdfs from 
> a parquet file. These are the errors. T

[jira] [Created] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread zhangxinyu (JIRA)
zhangxinyu created SPARK-17935:
--

 Summary: Add KafkaForeachWriter in external kafka-0.8.0 for 
structured streaming module
 Key: SPARK-17935
 URL: https://issues.apache.org/jira/browse/SPARK-17935
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Streaming
Affects Versions: 2.0.0
Reporter: zhangxinyu
 Fix For: 2.0.0


Spark already supports kafkaInputStream. It would be useful to add a 
`KafkaForeachWriter` to output results to Kafka in the structured streaming module.
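For context, users can already express this pattern with the generic ForeachWriter 
sink that structured streaming ships in 2.0. The sketch below is only an illustration 
of what a Kafka-backed writer could look like, not the proposed `KafkaForeachWriter`: 
it assumes the new kafka-clients producer API rather than the 0.8.0 Scala producer 
this ticket targets, and the class name and constructor parameters are made up.

{code}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Illustrative sketch: each partition/epoch opens its own producer and writes
// every row's string form to a fixed topic.
class KafkaStringForeachWriter(bootstrapServers: String, topic: String)
    extends ForeachWriter[String] {

  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true // process every partition/version; no deduplication attempted here
  }

  override def process(value: String): Unit = {
    producer.send(new ProducerRecord[String, String](topic, value))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}
{code}

A Dataset[String] stream could then be wired to it with something like 
{{stream.writeStream.foreach(new KafkaStringForeachWriter(brokers, topic)).start()}}.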



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread zhangxinyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangxinyu updated SPARK-17935:
---
External issue URL: https://github.com/apache/spark/pull/15483

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
> Fix For: 2.0.0
>
>
> Spark already supports kafkaInputStream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers

2016-10-14 Thread Luis Ramos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575005#comment-15575005
 ] 

Luis Ramos commented on SPARK-636:
--

I feel like the broadcasting mechanism doesn't get me "close" enough to solve 
my issue (initialization of a logging system). That's partly because 
initialization would be deferred (meaning a loss of useful logs), and also 
because such a mechanism could give us init code that is 'guaranteed' to be 
executed only once, as opposed to implementing that 'guarantee' yourself, which 
currently can lead to bad practices.

> Add mechanism to run system management/configuration tasks on all workers
> -
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It would be useful to have a mechanism to run a task on all workers in order 
> to perform system management tasks, such as purging caches or changing system 
> properties.  This is useful for automated experiments and benchmarking; I 
> don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { } 
> {code}
> but this does not guarantee that every worker runs a task and requires my 
> user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests.  For example, 
> I might want to drop cached RDDs, purge shuffle data, and call 
> {{System.gc()}} between test runs.  It makes sense to incorporate some of 
> this functionality, such as dropping cached RDDs, into Spark itself, but it 
> might be helpful to have a general mechanism for running ad-hoc tasks like 
> {{System.gc()}}.
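For what it's worth, a best-effort version of the workaround sketched above can be 
packaged as a small helper. It still does not guarantee that every executor runs the 
task, which is exactly the gap this ticket asks to close; the helper name and the use 
of getExecutorMemoryStatus to estimate the executor count are my own assumptions, not 
a Spark API intended for this purpose.

{code}
import org.apache.spark.SparkContext

// Best-effort sketch: over-partition so that every executor is very likely to
// receive at least one task, then run the supplied body in each task.
def runOnAllExecutorsBestEffort(sc: SparkContext)(body: () => Unit): Unit = {
  // getExecutorMemoryStatus also lists the driver's block manager, hence the -1;
  // fall back to 1 in local mode.
  val numExecutors = math.max(sc.getExecutorMemoryStatus.size - 1, 1)
  val numTasks = numExecutors * 10
  sc.parallelize(0 until numTasks, numTasks).foreachPartition(_ => body())
}

// Example: request a GC on (most likely) every executor between benchmark runs.
// runOnAllExecutorsBestEffort(sc)(() => System.gc())
{code}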



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17935:


Assignee: (was: Apache Spark)

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
> Fix For: 2.0.0
>
>
> Spark already supports kafkaInputStream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread zhangxinyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangxinyu updated SPARK-17935:
---
External issue URL:   (was: https://github.com/apache/spark/pull/15483)

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
> Fix For: 2.0.0
>
>
> Spark already supports kafkaInputStream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575007#comment-15575007
 ] 

Apache Spark commented on SPARK-17935:
--

User 'zhangxinyu1' has created a pull request for this issue:
https://github.com/apache/spark/pull/15483

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
> Fix For: 2.0.0
>
>
> Spark already supports kafkaInputStream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17935:


Assignee: Apache Spark

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Spark already supports kafkaInputStream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread zhangxinyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangxinyu updated SPARK-17935:
---
Description: 
Spark already supports kafkaInputStream. It would be useful to add a 
`KafkaForeachWriter` to output results to Kafka in the structured streaming module.
`KafkaForeachWriter.scala` is placed in the external kafka-0.8.0 module.

  was:Now spark already supports kafkaInputStream. It would be useful that we 
add `KafkaForeachWriter` to output results to kafka in structured streaming 
module.


> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
> Fix For: 2.0.0
>
>
> Spark already supports kafkaInputStream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.
> `KafkaForeachWriter.scala` is placed in the external kafka-0.8.0 module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers

2016-10-14 Thread Luis Ramos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575005#comment-15575005
 ] 

Luis Ramos edited comment on SPARK-636 at 10/14/16 11:07 AM:
-

I feel like the broadcasting mechanism doesn't get me "close" enough to solve 
my issue (initialization of a logging system). That's partly because my 
initialization would be deferred (meaning a loss of useful logs), and also it 
could enable us to have some 'init' code that is guaranteed to only be 
evaluated once as opposed to implementing that 'guarantee' yourself, which can 
currently lead to bad practices.


was (Author: luisramos):
I feel like the broadcasting mechanism doesn't get me "close" enough to solve 
my issue (initialization of a logging system). That's partly because 
initialization would be deferred (meaning a loss of useful logs), and also it 
could enable us to have init code that is 'guaranteed' to only be executed once 
as opposed to implement that 'guarantee' yourself, which currently can lead to 
bad practices.

> Add mechanism to run system management/configuration tasks on all workers
> -
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It would be useful to have a mechanism to run a task on all workers in order 
> to perform system management tasks, such as purging caches or changing system 
> properties.  This is useful for automated experiments and benchmarking; I 
> don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { } 
> {code}
> but this does not guarantee that every worker runs a task and requires my 
> user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests.  For example, 
> I might want to drop cached RDDs, purge shuffle data, and call 
> {{System.gc()}} between test runs.  It makes sense to incorporate some of 
> this functionality, such as dropping cached RDDs, into Spark itself, but it 
> might be helpful to have a general mechanism for running ad-hoc tasks like 
> {{System.gc()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers

2016-10-14 Thread Luis Ramos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575005#comment-15575005
 ] 

Luis Ramos edited comment on SPARK-636 at 10/14/16 11:11 AM:
-

I feel like the broadcasting mechanism doesn't get me "close" enough to solve 
my issue (initialization of a logging system). That's partly because my 
initialization would be deferred (meaning a loss of useful logs), and also it 
could enable us to have some 'init' code that is guaranteed to only be 
evaluated once as opposed to implementing that 'guarantee' yourself, which can 
currently lead to bad practices.

Edit: For some context, I'm approaching this issue from SPARK-650


was (Author: luisramos):
I feel like the broadcasting mechanism doesn't get me "close" enough to solve 
my issue (initialization of a logging system). That's partly because my 
initialization would be deferred (meaning a loss of useful logs), and also it 
could enable us to have some 'init' code that is guaranteed to only be 
evaluated once as opposed to implementing that 'guarantee' yourself, which can 
currently lead to bad practices.

> Add mechanism to run system management/configuration tasks on all workers
> -
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It would be useful to have a mechanism to run a task on all workers in order 
> to perform system management tasks, such as purging caches or changing system 
> properties.  This is useful for automated experiments and benchmarking; I 
> don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { } 
> {code}
> but this does not guarantee that every worker runs a task and requires my 
> user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests.  For example, 
> I might want to drop cached RDDs, purge shuffle data, and call 
> {{System.gc()}} between test runs.  It makes sense to incorporate some of 
> this functionality, such as dropping cached RDDs, into Spark itself, but it 
> might be helpful to have a general mechanism for running ad-hoc tasks like 
> {{System.gc()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15402) PySpark ml.evaluation should support save/load

2016-10-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-15402.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.1.0

> PySpark ml.evaluation should support save/load
> --
>
> Key: SPARK-15402
> URL: https://issues.apache.org/jira/browse/SPARK-15402
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> Since ml.evaluation has supported save/load at Scala side, supporting it at 
> Python side is very straightforward and easy. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14634) Add BisectingKMeansSummary

2016-10-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14634.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.1.0

> Add BisectingKMeansSummary
> --
>
> Key: SPARK-14634
> URL: https://issues.apache.org/jira/browse/SPARK-14634
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.1.0
>
>
> Add BisectingKMeansSummary



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17870.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15444
[https://github.com/apache/spark/pull/15444]

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
> Fix For: 2.1.0
>
>
> The method used to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector selects features based on 
> ChiSquareTestResult.statistic (the ChiSquare value), keeping the features with 
> the largest ChiSquare values. But the degrees of freedom (df) of the ChiSquare 
> values from Statistics.chiSqTest(RDD) differ between features, and with 
> different df you cannot compare ChiSquare values directly to select features.
> Because of this wrong way of computing the ChiSquare value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 will be selected.
> If selectFpr is used: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  
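To make the argument above concrete: when the degrees of freedom differ across 
features, the raw chi-square statistics are not directly comparable, but their 
p-values are. The snippet below is only an illustration of that point using 
commons-math3 (which Spark already depends on); it is not the MLlib implementation 
or the eventual fix, and the names are made up.

{code}
import org.apache.commons.math3.distribution.ChiSquaredDistribution

case class FeatureTest(feature: Int, statistic: Double, df: Int)

// Rank features by p-value rather than by raw statistic, so that features with
// different df can be compared on equal footing.
def selectKBestByPValue(tests: Seq[FeatureTest], k: Int): Seq[Int] = {
  tests
    .map { t =>
      val pValue = 1.0 - new ChiSquaredDistribution(t.df).cumulativeProbability(t.statistic)
      (t.feature, pValue)
    }
    .sortBy(_._2) // smaller p-value = stronger evidence of dependence
    .take(k)
    .map(_._1)
}
{code}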



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17870:
--
Assignee: Peng Meng

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Assignee: Peng Meng
>Priority: Critical
> Fix For: 2.1.0
>
>
> The method used to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector selects features based on 
> ChiSquareTestResult.statistic (the ChiSquare value), keeping the features with 
> the largest ChiSquare values. But the degrees of freedom (df) of the ChiSquare 
> values from Statistics.chiSqTest(RDD) differ between features, and with 
> different df you cannot compare ChiSquare values directly to select features.
> Because of this wrong way of computing the ChiSquare value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 will be selected.
> If selectFpr is used: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17855) Spark worker throw Exception when uber jar's http url contains query string

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17855.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15420
[https://github.com/apache/spark/pull/15420]

> Spark worker throw Exception when uber jar's http url contains query string
> ---
>
> Key: SPARK-17855
> URL: https://issues.apache.org/jira/browse/SPARK-17855
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Hao Ren
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> spark-submit supports jar URLs with the http protocol.
> If the URL contains a query string, the *worker.DriverRunner.downloadUserJar* 
> method will throw a "Did not see expected jar" exception. This is because the 
> method checks for the existence of a downloaded jar whose name contains the 
> query string.
> This is a problem when your jar is located on a web service which requires 
> some additional information to retrieve the file. For example, to download a 
> jar from an S3 bucket via http, the URL contains the signature, datetime, etc. 
> as a query string.
> {code}
> https://s3.amazonaws.com/deploy/spark-job.jar
> ?X-Amz-Algorithm=AWS4-HMAC-SHA256
> &X-Amz-Credential=/20130721/us-east-1/s3/aws4_request
> &X-Amz-Date=20130721T201207Z
> &X-Amz-Expires=86400
> &X-Amz-SignedHeaders=host
> &X-Amz-Signature=  
> {code}
> Worker will look for a jar named
> "spark-job.jar?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=/20130721/us-east-1/s3/aws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature="
> instead of
> "spark-job.jar"
> Hence, the query string should be removed before checking for the jar's existence.
> I have created a PR to fix this; it would be great if anyone can review it.
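The actual fix is in the linked pull request; the snippet below is just a sketch of 
the idea of deriving the expected local file name from the URL path alone, ignoring 
the query string. The helper name is made up for illustration.

{code}
import java.io.File
import java.net.URI

// Take only the path component of the URL (query string dropped) and keep its
// base name, so the existence check looks for "spark-job.jar" rather than the
// full signed URL tail.
def expectedJarName(jarUrl: String): String = {
  val path = new URI(jarUrl).getPath // e.g. "/deploy/spark-job.jar"
  new File(path).getName             // e.g. "spark-job.jar"
}

// expectedJarName("https://s3.amazonaws.com/deploy/spark-job.jar?X-Amz-Expires=86400")
// => "spark-job.jar"
{code}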



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17855) Spark worker throw Exception when uber jar's http url contains query string

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17855:
--
Assignee: Hao Ren

> Spark worker throw Exception when uber jar's http url contains query string
> ---
>
> Key: SPARK-17855
> URL: https://issues.apache.org/jira/browse/SPARK-17855
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Hao Ren
>Assignee: Hao Ren
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> spark-submit supports jar URLs with the http protocol.
> If the URL contains a query string, the *worker.DriverRunner.downloadUserJar* 
> method will throw a "Did not see expected jar" exception. This is because the 
> method checks for the existence of a downloaded jar whose name contains the 
> query string.
> This is a problem when your jar is located on a web service which requires 
> some additional information to retrieve the file. For example, to download a 
> jar from an S3 bucket via http, the URL contains the signature, datetime, etc. 
> as a query string.
> {code}
> https://s3.amazonaws.com/deploy/spark-job.jar
> ?X-Amz-Algorithm=AWS4-HMAC-SHA256
> &X-Amz-Credential=/20130721/us-east-1/s3/aws4_request
> &X-Amz-Date=20130721T201207Z
> &X-Amz-Expires=86400
> &X-Amz-SignedHeaders=host
> &X-Amz-Signature=  
> {code}
> Worker will look for a jar named
> "spark-job.jar?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=/20130721/us-east-1/s3/aws4_request&X-Amz-Date=20130721T201207Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature="
> instead of
> "spark-job.jar"
> Hence, the query string should be removed before checking for the jar's existence.
> I have created a PR to fix this; it would be great if anyone can review it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused

2016-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575123#comment-15575123
 ] 

Sean Owen commented on SPARK-17930:
---

But this code is only ever called once per object. Reuse doesn't help here, and 
I don't think you can share the serializer across instances.

> The SerializerInstance instance used when deserializing a TaskResult is not 
> reused 
> ---
>
> Key: SPARK-17930
> URL: https://issues.apache.org/jira/browse/SPARK-17930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
>Reporter: Guoqiang Li
>
> The following code is called when the DirectTaskResult instance is 
> deserialized
> {noformat}
>   def value(): T = {
> if (valueObjectDeserialized) {
>   valueObject
> } else {
>   // Each deserialization creates a new instance of SerializerInstance, 
> which is very time-consuming
>   val resultSer = SparkEnv.get.serializer.newInstance()
>   valueObject = resultSer.deserialize(valueBytes)
>   valueObjectDeserialized = true
>   valueObject
> }
>   }
> {noformat}
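For illustration only, this is the kind of reuse the report seems to suggest: caching 
one SerializerInstance per thread instead of calling newInstance() for every result. 
As noted in the comment above, this only pays off if the same thread deserializes many 
results, and it assumes a SerializerInstance must not be shared across threads. The 
object name is made up.

{code}
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.SerializerInstance

// Sketch: one cached SerializerInstance per thread (SerializerInstance is not
// assumed to be thread-safe, hence the ThreadLocal).
object CachedResultSerializer {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance =
      SparkEnv.get.serializer.newInstance()
  }
  def get: SerializerInstance = cached.get()
}

// A value() implementation could then call CachedResultSerializer.get instead of
// SparkEnv.get.serializer.newInstance().
{code}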



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10681) DateTimeUtils needs a method to parse string to SQL's timestamp value

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10681.
---
Resolution: Not A Problem

> DateTimeUtils needs a method to parse string to SQL's timestamp value
> -
>
> Key: SPARK-10681
> URL: https://issues.apache.org/jira/browse/SPARK-10681
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, {{DateTimeUtils.stringToTime}} returns a java.util.Date whose 
> getTime returns milliseconds. It would be great to have a method that parses a 
> string directly to SQL's timestamp value (microseconds).
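In the meantime, the milliseconds returned by the existing method can be converted to 
the microsecond representation that SQL timestamps use. A minimal sketch, ignoring 
overflow and sub-millisecond precision, and assuming the catalyst DateTimeUtils 
mentioned above:

{code}
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// stringToTime returns a java.util.Date; getTime is in milliseconds, so
// multiplying by 1000 yields the microsecond value SQL timestamps expect.
def stringToSqlTimestampMicros(s: String): Long =
  DateTimeUtils.stringToTime(s).getTime * 1000L
{code}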



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10954.
---
Resolution: Not A Problem

> Parquet version in the "created_by" metadata field of Parquet files written 
> by Spark 1.5 and 1.6 is wrong
> -
>
> Key: SPARK-10954
> URL: https://issues.apache.org/jira/browse/SPARK-10954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Gayathri Murali
>Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field 
> still says 1.6.0. This issue can be reproduced by generating any Parquet file 
> with Spark 1.5 and then checking the metadata with the {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:
> file:/tmp/parquet/dec/part-r-0-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator: parquet-mr version 1.6.0
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> -
> dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> -
> dec:  FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 
> ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However, 
> Parquet files written directly with parquet-mr 1.7.0 (without Spark 1.5) only 
> show {{parquet-mr}} without any version number. Files written by parquet-mr 
> 1.8.1 without Spark look fine though.
> Currently this isn't a big issue, but parquet-mr 1.8 checks this field to 
> work around PARQUET-251.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17777) Spark Scheduler Hangs Indefinitely

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1.
---
Resolution: Not A Problem

> Spark Scheduler Hangs Indefinitely
> --
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: AWS EMR 4.3, can also be reproduced locally
>Reporter: Ameen Tayyebi
> Attachments: jstack-dump.txt, repro.scala
>
>
> We've identified a problem with Spark scheduling. The issue manifests itself 
> when an RDD calls SparkContext.parallelize within its getPartitions method. 
> This seemingly "recursive" call causes the problem. We have a repro case that 
> can easily be run.
> Please advise on what the issue might be and how we can work around it in the 
> meantime.
> I've attached repro.scala, which can simply be pasted into spark-shell to 
> reproduce the problem.
> Why are we calling sc.parallelize in production within getPartitions? We have 
> an RDD that is composed of several thousand Parquet files. To compute the 
> partitioning strategy for this RDD, we create an RDD to read all file sizes 
> from S3 in parallel, so that we can quickly determine the proper partitions. 
> We do this to avoid executing the listing serially from the master node, 
> which can result in significant slowness. Pseudo-code:
> val splitInfo = sc.parallelize(filePaths).map(f => (f, 
> s3.getObjectSummary)).collect()
> A similar logic is used in DataFrame by Spark itself:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
>  
> Thanks,
> -Ameen
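Given the resolution, the practical workaround is to keep SparkContext.parallelize out 
of getPartitions entirely and compute the split information eagerly on the driver, 
before the custom RDD is constructed. A minimal sketch, where filePaths and 
fetchObjectSize stand in for the reporter's own S3 code:

{code}
import org.apache.spark.SparkContext

// Runs the parallel size lookup as a normal, up-front job, so the custom RDD's
// getPartitions only has to slice an in-memory array.
def computeSplits(sc: SparkContext,
                  filePaths: Seq[String],
                  fetchObjectSize: String => Long): Array[(String, Long)] =
  sc.parallelize(filePaths).map(p => (p, fetchObjectSize(p))).collect()
{code}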



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17787) spark submit throws error while using kafka Appender log4j:ERROR Could not instantiate class [kafka.producer.KafkaLog4jAppender]. java.lang.ClassNotFoundException: kafk

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17787.
---
Resolution: Not A Problem

> spark submit throws error while using kafka Appender log4j:ERROR Could not 
> instantiate class [kafka.producer.KafkaLog4jAppender]. 
> java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender
> -
>
> Key: SPARK-17787
> URL: https://issues.apache.org/jira/browse/SPARK-17787
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Taukir
>
> While trying to push Spark application logs to Kafka, I am using the 
> KafkaAppender. Please find the command below. It throws the following error:
> spark-submit --class com.SampleApp --conf spark.ui.port=8081 --master 
> yarn-cluster
> --files 
> files/home/hos/KafkaApp/log4j.properties#log4j.properties,/home/hos/KafkaApp/kafka_2.10-0.8.0.jar
> --conf 
> spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/home/hos/KafkaApp/log4j.properties'
> --conf "spark.driver.extraClassPath=/home/hos/KafkaApp/kafka_2.10-0.8.0.jar"
> --conf 
> spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/home/hos/KafkaApp/log4j.properties'
> --conf 
> "spark.executor.extraClassPath=/home/hos/KafkaApp/kafka_2.10-0.8.0.jar" 
> Kafka-App-assembly-1.0.jar
> It throws a java.lang.ClassNotFoundException: 
> kafka.producer.KafkaLog4jAppender exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-17933:
-
Attachment: screenshot-1.png

> Shuffle fails when driver is on one of the same machines as executor
> 
>
> Key: SPARK-17933
> URL: https://issues.apache.org/jira/browse/SPARK-17933
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.2
>Reporter: Frank Rosner
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> h4. Problem
> When I run a job that requires some shuffle, some tasks fail because the 
> executor cannot fetch the shuffle blocks from another executor.
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> 10-250-20-140:44042
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.nio.channels.UnresolvedAddressException
>   at sun.nio.ch.Net.checkAddress(Net.java:101)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
>   

[jira] [Updated] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-17933:
-
Attachment: (was: screenshot-1.png)

> Shuffle fails when driver is on one of the same machines as executor
> 
>
> Key: SPARK-17933
> URL: https://issues.apache.org/jira/browse/SPARK-17933
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.2
>Reporter: Frank Rosner
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> h4. Problem
> When I run a job that requires some shuffle, some tasks fail because the 
> executor cannot fetch the shuffle blocks from another executor.
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> 10-250-20-140:44042
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.nio.channels.UnresolvedAddressException
>   at sun.nio.ch.Net.checkAddress(Net.java:101)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:47

[jira] [Commented] (SPARK-17933) Shuffle fails when driver is on one of the same machines as executor

2016-10-14 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575232#comment-15575232
 ] 

Frank Rosner commented on SPARK-17933:
--

Thanks [~srowen]. I know of a lot of discussions about the driver advertised 
address, etc., but I am not sure whether this is a configuration problem or a 
bug. The ticket you reference talks about the driver IP, but I am having 
trouble with the executor (I highlighted the line in the screenshot).

Any ideas on how to address this problem?
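Not a definitive answer, but one common workaround for an UnresolvedAddressException 
like this, assuming the executor is advertising a hostname (10-250-20-140) that the 
other machines cannot resolve, is to make each node advertise a resolvable address, 
for example via conf/spark-env.sh, or to make the advertised names resolvable:

{code}
# conf/spark-env.sh on every worker (sketch; pick one approach)
SPARK_LOCAL_IP=<this node's routable IP address>
# or ensure the advertised hostname resolves, e.g. with an /etc/hosts entry:
# <routable IP>   10-250-20-140
{code}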

> Shuffle fails when driver is on one of the same machines as executor
> 
>
> Key: SPARK-17933
> URL: https://issues.apache.org/jira/browse/SPARK-17933
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.2
>Reporter: Frank Rosner
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> h4. Problem
> When I run a job that requires some shuffle, some tasks fail because the 
> executor cannot fetch the shuffle blocks from another executor.
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> 10-250-20-140:44042
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.nio.channels.UnresolvedAddressException
>   at sun.nio.ch.Net.checkAddress(Net.java:101)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketCh

[jira] [Updated] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17935:
--
Target Version/s:   (was: 2.0.0)
   Fix Version/s: (was: 2.0.0)

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Spark already supports a Kafka input stream. It would be useful to add 
> `KafkaForeachWriter` so that results can be written to Kafka from the structured 
> streaming module.
> `KafkaForeachWriter.scala` would live in external kafka-0.8.0.
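
As a rough illustration of what such a sink could look like, here is a minimal sketch 
built on the existing ForeachWriter API and the newer Kafka producer client. The class 
name, broker list and topic are hypothetical, and the actual kafka-0.8.0 client API 
differs, so this only shows the general shape:

{code}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Hypothetical sketch only: class name, brokers and topic are illustrative.
class KafkaSinkWriter(brokers: String, topic: String) extends ForeachWriter[String] {
  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true // accept every partition/version in this sketch
  }

  override def process(value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))

  override def close(errorOrException: Throwable): Unit =
    if (producer != null) producer.close()
}
{code}

Such a writer would be plugged in via writeStream.foreach(new KafkaSinkWriter(...)) on a 
streaming Dataset[String].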



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)
Justin Miller created SPARK-17936:
-

 Summary: "CodeGenerator - failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of" method Error
 Key: SPARK-17936
 URL: https://issues.apache.org/jira/browse/SPARK-17936
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Justin Miller


Greetings. I'm currently in the process of migrating a project I'm working on 
from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
structs coming from Kafka into Parquet files stored in S3. This conversion 
process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll paste 
the stack trace below.

org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)

Also, later on:
07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
in thread Thread[Executor task launch worker-6,5,run-main-group-0]
java.lang.OutOfMemoryError: Java heap space

I've seen similar issues posted, but those were always on the query side. I 
have a hunch that this is happening at write time as the error occurs after 
batchDuration. Here's the write snippet.

stream.
  flatMap {
case Success(row) =>
  thriftParseSuccess += 1
  Some(row)
case Failure(ex) =>
  thriftParseErrors += 1
  logger.error("Error during deserialization: ", ex)
  None
  }.foreachRDD { rdd =>
val sqlContext = SQLContext.getOrCreate(rdd.context)
transformer(sqlContext.createDataFrame(rdd, converter.schema))
  .coalesce(coalesceSize)
  .write
  .mode(Append)
  .partitionBy(partitioning: _*)
  .parquet(parquetPath)
  }

Please let me know if you can be of assistance and if there's anything I can do 
to help.

Best,
Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17936.
---
Resolution: Duplicate

Duplicate of several JIRAs -- have a look through first.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS

2016-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575400#comment-15575400
 ] 

Apache Spark commented on SPARK-17868:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15484

> Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-17868
> URL: https://issues.apache.org/jira/browse/SPARK-17868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> We generate bitmasks for grouping sets during the parsing process, and use 
> these during analysis. These bitmasks are difficult to work with in practice 
> and have led to numerous bugs. I suggest that we remove them and use actual 
> sets instead; however, we would then need to generate the offsets for the 
> grouping_id.
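
As a toy illustration of the difference being described (this is not Spark's internal 
representation; the column names and bit convention are made up), compare a bitmask 
encoding of ROLLUP(a, b, c) with a plain set encoding:

{code}
// Toy illustration only -- not Spark's internals.
val columns = Seq("a", "b", "c")

// Bitmask encoding: here bit i set means column i is excluded from the
// grouping set; readers must know both the bit order and the convention.
val rollupMasks: Seq[Int] = Seq(0, 4, 6, 7) // (a,b,c), (a,b), (a), ()

// Set encoding: self-describing and order-independent.
val rollupSets: Seq[Set[String]] =
  Seq(Set("a", "b", "c"), Set("a", "b"), Set("a"), Set.empty)

// Recovering column names from a mask needs the shared convention:
def decode(mask: Int): Set[String] =
  columns.zipWithIndex.collect { case (c, i) if (mask & (1 << i)) == 0 => c }.toSet
{code}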



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17868:


Assignee: (was: Apache Spark)

> Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-17868
> URL: https://issues.apache.org/jira/browse/SPARK-17868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> We generate bitmasks for grouping sets during the parsing process, and use 
> these during analysis. These bitmasks are difficult to work with in practice 
> and have led to numerous bugs. I suggest that we remove them and use actual 
> sets instead; however, we would then need to generate the offsets for the 
> grouping_id.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17868:


Assignee: Apache Spark

> Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-17868
> URL: https://issues.apache.org/jira/browse/SPARK-17868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> We generate bitmasks for grouping sets during the parsing process, and use 
> these during analysis. These bitmasks are difficult to work with in practice 
> and have led to numerous bugs. I suggest that we remove them and use actual 
> sets instead; however, we would then need to generate the offsets for the 
> grouping_id.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-14 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575421#comment-15575421
 ] 

Harish commented on SPARK-16845:


I have posted a scenario on Stack Overflow: 
http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe
Let me know if you need any help on this. If this is already taken care of, 
you can ignore my comment.
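
For reference, a minimal sketch (in Scala, with made-up column names and sizes) of the 
kind of wide per-column aggregation described in that question; it is very wide schemas 
like this that push the generated ordering/projection code past the JVM's 64 KB method 
limit:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical sketch: column count and values are made up.
val spark = SparkSession.builder().appName("wide-agg-sketch").getOrCreate()
import spark.implicits._

val numCols = 1000
val colNames = (1 to numCols).map(i => s"c$i")

// A wide DataFrame with many numeric columns derived from a single id column.
val df = spark.range(0, 10000)
  .select(colNames.map(n => ($"id" * 2).cast("double").as(n)): _*)

// One agg call computes a mean per column in a single job, but the amount of
// generated code still grows with the number of columns.
val means = df.agg(avg(colNames.head), colNames.tail.map(c => avg(c)): _*)
means.show()
{code}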

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575452#comment-15575452
 ] 

Justin Miller commented on SPARK-17936:
---

I did look through them and I don't think they're related. Note that the error 
is different, and this occurs while writing data, not while reading large amounts of data.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575471#comment-15575471
 ] 

Justin Miller commented on SPARK-17936:
---

I'd also note this wasn't an issue in Spark 1.6.2. The process would run fine 
for hours and never crashed with this error.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Miller updated SPARK-17936:
--
Comment: was deleted

(was: I did look through them and I don't think they're related. Note that the 
error is different, and this occurs while writing data, not while reading large 
amounts of data.)

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575476#comment-15575476
 ] 

Sean Owen commented on SPARK-17936:
---

I doubt it is a different cause, given it is the same type of error in the same 
path. That is, do you expect the resolution to differ? It can be reopened if so, 
but otherwise I think this is more likely to fragment discussion about one 
problem. 

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575494#comment-15575494
 ] 

Justin Miller commented on SPARK-17936:
---

It's just strange that the issue seems to affect previous versions (if they're 
the same issue) but didn't impact me when I was using 1.6.2 and the 0.8 Kafka 
consumer. Is it possible that Scala 2.10 vs. Scala 2.11 makes a difference? 
There are a lot of variables at play, unfortunately.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Miller reopened SPARK-17936:
---

I don't believe this is a duplicate. This occurs while trying to write to 
parquet (with very little data) and happens almost immediately after 
batchDuration.

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-17937:
--

 Summary: Clarify Kafka offset semantics for Structured Streaming
 Key: SPARK-17937
 URL: https://issues.apache.org/jira/browse/SPARK-17937
 Project: Spark
  Issue Type: Sub-task
Reporter: Cody Koeninger






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17938) Backpressure rate not adjusting

2016-10-14 Thread Samy Dindane (JIRA)
Samy Dindane created SPARK-17938:


 Summary: Backpressure rate not adjusting
 Key: SPARK-17938
 URL: https://issues.apache.org/jira/browse/SPARK-17938
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Samy Dindane


spark-streaming and spark-streaming-kafka-0-10 are both at version 2.0.1; the 
behavior is the same with 2.0.0, though.

spark.streaming.kafka.consumer.poll.ms is set to 3
spark.streaming.kafka.maxRatePerPartition is set to 10
spark.streaming.backpressure.enabled is set to true

`batchDuration` of the streaming context is set to 1 second.

I consume a Kafka topic using KafkaUtils.createDirectStream().

My system can handle 100k-record batches, but it takes more than 1 second to 
process them all. I'd thus expect backpressure to reduce the number of records 
fetched in the next batch so that the processing delay stays below 1 second.

However, this does not happen and the backpressure rate stays the same: stuck 
at `100.0`, no matter how the other variables change (processing time, error, 
etc.).

Here's a log showing how all these variables change but the chosen rate stays 
the same: https://gist.github.com/Dinduks/d9fa67fc8a036d3cad8e859c508acdba (I 
would have attached a file but I don't see how).

Is this the expected behavior and am I missing something, or is this a bug?

I'll gladly help by providing more information or writing code if necessary.

Thank you.
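
For context, here is a minimal sketch of the configuration and stream setup being 
described. The broker address, group id, topic name and local master are placeholders; 
only the configuration keys and values mirror the report above:

{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

// Configuration values as reported above; broker, group id and topic are placeholders.
val conf = new SparkConf()
  .setAppName("backpressure-sketch")
  .setMaster("local[2]") // local master only for the sketch
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "10")
  .set("spark.streaming.kafka.consumer.poll.ms", "3")

val ssc = new StreamingContext(conf, Seconds(1)) // batchDuration of 1 second

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
)
{code}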



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-10-14 Thread Aleksander Eskilson (JIRA)
Aleksander Eskilson created SPARK-17939:
---

 Summary: Spark-SQL Nullability: Optimizations vs. Enforcement 
Clarification
 Key: SPARK-17939
 URL: https://issues.apache.org/jira/browse/SPARK-17939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Aleksander Eskilson
Priority: Critical


The notion of Nullability of StructFields in DataFrames and Datasets creates 
some confusion. As has been pointed out previously [1], Nullability is a hint 
to the Catalyst optimizer, and is not meant to be a type-level enforcement. 
Allowing null fields can also help the reader successfully parse certain types 
of more loosely-typed data, like JSON and CSV, where null values are common, 
rather than just failing. 

There's already been some movement to clarify the meaning of Nullable in the 
API, but also some requests for a (perhaps completely separate) type-level 
implementation of Nullable that can act as an enforcement contract.

This bug is logged here to discuss and clarify this issue.
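
For illustration, a minimal sketch (hypothetical schema and data) of the point above: 
the nullable flag is accepted and reported, but nothing enforces it when data is 
supplied, and what happens downstream varies by version and operation:

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("nullability-sketch").getOrCreate()

// Declare the column as non-nullable...
val schema = StructType(Seq(StructField("name", StringType, nullable = false)))

// ...yet a null value can still be handed in; the flag acts as an optimizer
// hint rather than a checked contract.
val rows = spark.sparkContext.parallelize(Seq(Row("alice"), Row(null)))
val df = spark.createDataFrame(rows, schema)

df.printSchema() // still reports: name: string (nullable = false)
df.show()
{code}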



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-10-14 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-17939:

Description: 
The notion of Nullability of StructFields in DataFrames and Datasets creates 
some confusion. As has been pointed out previously [1], Nullability is a hint 
to the Catalyst optimizer, and is not meant to be a type-level enforcement. 
Allowing null fields can also help the reader successfully parse certain types 
of more loosely-typed data, like JSON and CSV, where null values are common, 
rather than just failing. 

There's already been some movement to clarify the meaning of Nullable in the 
API, but also some requests for a (perhaps completely separate) type-level 
implementation of Nullable that can act as an enforcement contract.

This bug is logged here to discuss and clarify this issue.

[1] - 
[https://issues.apache.org/jira/browse/SPARK-11319][https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
[2] - https://github.com/apache/spark/pull/11785

  was:
The notion of Nullability of StructFields in DataFrames and Datasets creates 
some confusion. As has been pointed out previously [1], Nullability is a hint 
to the Catalyst optimizer, and is not meant to be a type-level enforcement. 
Allowing null fields can also help the reader successfully parse certain types 
of more loosely-typed data, like JSON and CSV, where null values are common, 
rather than just failing. 

There's already been some movement to clarify the meaning of Nullable in the 
API, but also some requests for a (perhaps completely separate) type-level 
implementation of Nullable that can act as an enforcement contract.

This bug is logged here to discuss and clarify this issue.


> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> --
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets 
> creates some confusion. As has been pointed out previously [1], Nullability 
> is a hint to the Catalyst optimizer, and is not meant to be a type-level 
> enforcement. Allowing null fields can also help the reader successfully parse 
> certain types of more loosely-typed data, like JSON and CSV, where null 
> values are common, rather than just failing. 
> There's already been some movement to clarify the meaning of Nullable in the 
> API, but also some requests for a (perhaps completely separate) type-level 
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - 
> [https://issues.apache.org/jira/browse/SPARK-11319][https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575553#comment-15575553
 ] 

Yanbo Liang commented on SPARK-17904:
-

[~felixcheung] I think the proposal I made in this JIRA is different from what 
you mentioned; I think they are two different scenarios. R users may call 
install.packages across the session, rather than installing all necessary 
libraries before they start the session. From the discussion, I found it's not 
easy to support this feature. As [~shivaram] suggested, we can try to add the 
required packages when we start the session. This can satisfy some users' 
requirements, but not all. 
All in all, I appreciate all your comments. Thanks.

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that the 
> installation succeeded, we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may load many 
> libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR will install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install the packages 
> on each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-10-14 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-17939:

Description: 
The notion of Nullability of StructFields in DataFrames and Datasets creates 
some confusion. As has been pointed out previously [1], Nullability is a hint 
to the Catalyst optimizer, and is not meant to be a type-level enforcement. 
Allowing null fields can also help the reader successfully parse certain types 
of more loosely-typed data, like JSON and CSV, where null values are common, 
rather than just failing. 

There's already been some movement to clarify the meaning of Nullable in the 
API, but also some requests for a (perhaps completely separate) type-level 
implementation of Nullable that can act as an enforcement contract.

This bug is logged here to discuss and clarify this issue.

[1] - 
[https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
[2] - https://github.com/apache/spark/pull/11785

  was:
The notion of Nullability of StructFields in DataFrames and Datasets creates 
some confusion. As has been pointed out previously [1], Nullability is a hint 
to the Catalyst optimizer, and is not meant to be a type-level enforcement. 
Allowing null fields can also help the reader successfully parse certain types 
of more loosely-typed data, like JSON and CSV, where null values are common, 
rather than just failing. 

There's already been some movement to clarify the meaning of Nullable in the 
API, but also some requests for a (perhaps completely separate) type-level 
implementation of Nullable that can act as an enforcement contract.

This bug is logged here to discuss and clarify this issue.

[1] - 
[https://issues.apache.org/jira/browse/SPARK-11319][https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
[2] - https://github.com/apache/spark/pull/11785


> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> --
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets 
> creates some confusion. As has been pointed out previously [1], Nullability 
> is a hint to the Catalyst optimizer, and is not meant to be a type-level 
> enforcement. Allowing null fields can also help the reader successfully parse 
> certain types of more loosely-typed data, like JSON and CSV, where null 
> values are common, rather than just failing. 
> There's already been some movement to clarify the meaning of Nullable in the 
> API, but also some requests for a (perhaps completely separate) type-level 
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - 
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575553#comment-15575553
 ] 

Yanbo Liang edited comment on SPARK-17904 at 10/14/16 2:49 PM:
---

[~felixcheung] The proposal I made in this JIRA is different from what you 
mentioned; I think they are two different scenarios. R users may call 
install.packages across the session, rather than installing all necessary 
libraries before they start the session. From the discussion, I found it's not 
easy to support this feature. As [~shivaram] suggested, we can try to add the 
required packages when we start the session. This can satisfy some users' 
requirements, but not all. 
I appreciate all your comments. Thanks.


was (Author: yanboliang):
[~felixcheung] I think the proposal I made in this JIRA is different from what 
you mentioned; I think they are two different scenarios. R users may call 
install.packages across the session, rather than installing all necessary 
libraries before they start the session. From the discussion, I found it's not 
easy to support this feature. As [~shivaram] suggested, we can try to add the 
required packages when we start the session. This can satisfy some users' 
requirements, but not all. 
All in all, I appreciate all your comments. Thanks.

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that the 
> installation succeeded, we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> install.packages("Matrix"))
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library; since SparkR is an interactive analytics tool, users may load many 
> libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR will install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install the packages 
> on each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense; feel free to correct me if there is a 
> misunderstanding.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# Earliest position in log
# Latest position in log
# Fail and kill the query
# Checkpoint position
# User specified per topicpartition
# Kafka commit log.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# Timestamp.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# X offsets before or after latest / earliest position.  Currently unsupported. 
 I think the semantics of this are super unclear by comparison with timestamp, 
given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is earliest or latest, use that.  If 
startingOffsets is perTopicpartition, and the new partition isn't in the map, 
Fail.  Note that this is effectively indistinguishable from 2a below, because 
partitions may have changed in between pre-query configuration and query start, 
but we treat it differently, and users in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss.
# During query
#* New partition:  Earliest, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this ??probably?? doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  Fail or Earliest, based on failOnDataLoss
# At query restart 
#* New partition: Checkpoint, fall back to Earliest.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  Fail or Earliest, based on FailOnDataLoss
#* Offset out of range on executor:  Fail or Earliest, based on FailOnDataLoss


I've probably missed something, chime in.
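
For reference, a rough sketch of how the pre-query configuration listed above surfaces 
on the structured streaming Kafka source. The option names follow the SPARK-17812 
discussion and may differ in the final implementation; the broker, topic and offsets 
are placeholders:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offsets-sketch").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "example-topic")               // placeholder
  // startingOffsets: "earliest" OR "latest" OR a JSON map per topicpartition
  .option("startingOffsets", """{"example-topic":{"0":1234,"1":-2}}""")
  // failOnDataLoss: true implies Fail above, false implies Earliest above
  .option("failOnDataLoss", "false")
  .load()
{code}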


> Clarify Kafka offset semantics for Structured Streaming
> ---
>
> Key: SPARK-17937
> URL: https://issues.apache.org/jira/browse/SPARK-17937
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Cody Koeninger
>
> Possible events for which offsets are needed:
> # New partition is discovered
> # Offset out of range (aka, data has been lost)
> Possible sources of offsets:
> # Earliest position in log
> # Latest position in log
> # Fail and kill the query
> # Checkpoint position
> # User specified per topicpartition
> # Kafka commit log.  Currently unsupported.  This means users who want to 
> migrate from existing kafka jobs need to jump through hoops.  Even if we 
> never want to support it, as soon as we take on SPARK-17815 we need to make 
> sure Kafka commit log state is clearly documented and handled.
> # Timestamp.  Currently unsupported.  This could be supported with old, 
> inaccurate Kafka time api, or upcoming time index
> # X offsets before or after latest / earliest position.  Currently 
> unsupported.  I think the semantics of this are super unclear by comparison 
> with timestamp, given that Kafka doesn't have a single range of offsets.
> Currently allowed pre-query configuration, all "ORs" are exclusive:
> # startingOffsets: earliest OR latest OR json per topicpartition  
> (SPARK-17812)
> # failOnDataLoss: true (which implies Fail above) OR false (which implies 
> Earliest above)  In general, I see no reason this couldn't specify Latest as 
> an option.
> Possible lifecycle times in which an offset-related event may happen:
> # At initial query start
> #* New partition: if startingOffsets is earliest or latest, use that.  If 
> startingOffsets is perTopicpartition, and the new partition isn't in the map, 
> Fail.  Note that this is effectively indistinguishable from 2a below, because 
> partitions may have changed in between pre-query configuration and query 
> start, but we treat it diff

[jira] [Commented] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2016-10-14 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575599#comment-15575599
 ] 

Aleksander Eskilson commented on SPARK-17939:
-

[~marmbrus] suggested opening this issue after a bit of discussion on the 
mailing list [1]. 

I'd like to clarify what I proposed in this newer context. First, clarification 
of what nullability means in the current API, as in the GitHub issue linked 
above, would be great. Second, default nullability in the reader makes total 
sense in many instances, to support more loosely typed data sources like JSON 
and CSV. 

However, apart from an analysis-time hint to the Catalyst optimizer, there are 
instances where a (potentially separate?) enforcement-level idea of nullability 
would be quite useful. With the possibility of writing custom encoders now 
open, other kinds of more strongly typed data might be read into Datasets, e.g. 
Avro. Avro's UNION type with NULL gives us a harder notion of a truly nullable 
vs. non-nullable type. It was suggested in the other linked JIRA issue above 
that the current contract is for users to make sure they do not pass bad data 
into the reader (as it currently performs conversions that might surprise the 
user, like from null to 0). 

What I mean to suggest is that a type-level notion of nullability could help us 
fail faster and abide by our own data contracts when we have data to read into 
Datasets that comes from more strongly typed sources with known schemas. 

Thoughts on this?
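
To make the null-to-0 concern concrete, here is a hedged sketch (hypothetical schema and 
data); the exact outcome varies by version and code path, which is precisely the 
fail-fast gap being described:

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("null-to-zero-sketch").getOrCreate()

// The schema promises a non-null Long, but nothing stops a null being supplied.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = false)))

val rows = spark.sparkContext.parallelize(Seq(Row("alice", 30L), Row("bob", null)))
val people = spark.createDataFrame(rows, schema)

// Depending on the version and the physical plan, the null age may surface as
// 0, as a null in spite of the declared schema, or as an error only much
// later -- rather than failing fast where the bad row was supplied.
people.show()
{code}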

> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> --
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets 
> creates some confusion. As has been pointed out previously [1], Nullability 
> is a hint to the Catalyst optimizer, and is not meant to be a type-level 
> enforcement. Allowing null fields can also help the reader successfully parse 
> certain types of more loosely-typed data, like JSON and CSV, where null 
> values are common, rather than just failing. 
> There's already been some movement to clarify the meaning of Nullable in the 
> API, but also some requests for a (perhaps completely separate) type-level 
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - 
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# Earliest position in log
# Latest position in log
# Fail and kill the query
# Checkpoint position
# User specified per topicpartition
# Kafka commit log.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# Timestamp.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# X offsets before or after latest / earliest position.  Currently unsupported. 
 I think the semantics of this are super unclear by comparison with timestamp, 
given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is earliest or latest, use that.  If 
startingOffsets is perTopicpartition, and the new partition isn't in the map, 
Fail.  Note that this is effectively indistinguishable from a new partition 
during query, because partitions may have changed in between pre-query 
configuration and query start, but we treat it differently, and users in this 
case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss.
# During query
#* New partition:  Earliest, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this ??probably?? doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  Fail or Earliest, based on failOnDataLoss
# At query restart 
#* New partition: Checkpoint, fall back to Earliest.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  Fail or Earliest, based on FailOnDataLoss
#* Offset out of range on executor:  Fail or Earliest, based on FailOnDataLoss


I've probably missed something, chime in.
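
For reference, a sketch of how the pre-query configuration above looks with the 
option names from SPARK-17812; the servers, topic and offsets are illustrative, 
and in the per-partition JSON, -2 stands for earliest and -1 for latest:

{code}
// assuming an existing SparkSession `spark`
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA")
  // per-topicpartition starting offsets; could also be "earliest" or "latest"
  .option("startingOffsets", """{"topicA":{"0":23,"1":-2}}""")
  .option("failOnDataLoss", "false")
  .load()
{code}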


  was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# Earliest position in log
# Latest position in log
# Fail and kill the query
# Checkpoint position
# User specified per topicpartition
# Kafka commit log.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# Timestamp.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# X offsets before or after latest / earliest position.  Currently unsupported. 
 I think the semantics of this are super unclear by comparison with timestamp, 
given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is earliest or latest, use that.  If 
startingOffsets is perTopicpartition, and the new partition isn't in the map, 
Fail.  Note that this is effectively indistinguishable from 2a below, because 
partitions may have changed in between pre-query configuration and query start, 
but we treat it differently, and users in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: Fail or Ea

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# *New partition* is discovered
# *Offset out of range* (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: *Fail* or *Earliest*, based on 
failOnDataLoss.
# During query
#* New partition:  *Earliest*, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this _probably_ doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss
# At query restart 
#* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  *Fail* or *Earliest*, based on failOnDataLoss
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss


I've probably missed something, chime in.


  was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# Earliest position in log
# Latest position in log
# Fail and kill the query
# Checkpoint position
# User specified per topicpartition
# Kafka commit log.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# Timestamp.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# X offsets before or after latest / earliest position.  Currently unsupported. 
 I think the semantics of this are super unclear by comparison with timestamp, 
given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is earliest or latest, use that.  If 
startingOffsets is perTopicpartition, and the new partition isn't in the map, 
Fail.  Note that this is effectively indistinguishable from a new partition 
during query, because partitions may have changed in between pre-query 
configuration and query start, but we treat it differently, and users in this 
case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
ea

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# *New partition* is discovered
# *Offset out of range* (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR User specified json per topicpartition 
 (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: *Fail* or *Earliest*, based on 
failOnDataLoss.
# During query
#* New partition:  *Earliest*, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this _probably_ doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss
# At query restart 
#* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  *Fail* or *Earliest*, based on failOnDataLoss
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss


I've probably missed something, chime in.


  was:
Possible events for which offsets are needed:
# *New partition* is discovered
# *Offset out of range* (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at st

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: *earliest* OR *latest* OR *User specified* json per 
topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies *Fail* above) OR false (which implies 
*Earliest* above)  In general, I see no reason this couldn't specify Latest as 
an option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: *Fail* or *Earliest*, based on 
failOnDataLoss.
# During query
#* New partition:  *Earliest*, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this _probably_ doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss
# At query restart 
#* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  *Fail* or *Earliest*, based on failOnDataLoss
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss


I've probably missed something, chime in.


  was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR User specified json per topicpartition 
 (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# Earliest position in log
# Latest position in log
# Fail and kill the query
# Checkpoint position
# User specified per topicpartition
# Kafka commit log.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# Timestamp.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# X offsets before or after latest / earliest position.  Currently unsupported. 
 I think the semantics of this are super unclear by comparison with timestamp, 
given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is earliest or latest, use that.  If 
startingOffsets is perTopicpartition, and the new partition isn't in the map, 
Fail.  Note that this is effectively indistinguishable from a new partition 
during query, because partitions may have changed in between pre-query 
configuration and query start, but we treat it differently, and users in this 
case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: Fail or Earliest, based on failOnDataLoss.
# During query
#* New partition:  Earliest, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this _probably_ doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  Fail or Earliest, based on failOnDataLoss
# At query restart 
#* New partition: Checkpoint, fall back to Earliest.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  Fail or Earliest, based on FailOnDataLoss
#* Offset out of range on executor:  Fail or Earliest, based on FailOnDataLoss


I've probably missed something, chime in.


  was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# Earliest position in log
# Latest position in log
# Fail and kill the query
# Checkpoint position
# User specified per topicpartition
# Kafka commit log.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# Timestamp.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# X offsets before or after latest / earliest position.  Currently unsupported. 
 I think the semantics of this are super unclear by comparison with timestamp, 
given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR json per topicpartition  (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is earliest or latest, use that.  If 
startingOffsets is perTopicpartition, and the new partition isn't in the map, 
Fail.  Note that this is effectively indistinguishable from a new partition 
during query, because partitions may have changed in between pre-query 
configuration and query start, but we treat it differently, and users in this 
case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on exe

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-10-14 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17937:
---
Description: 
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR User specified json per topicpartition 
 (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want to know at startup that something was wrong, even if they're ok with 
earliest for a during-query out of range
#* Offset out of range on executor: *Fail* or *Earliest*, based on 
failOnDataLoss.
# During query
#* New partition:  *Earliest*, only.  This seems to be by fiat, I see no reason 
this can't be configurable.
#* Offset out of range on driver:  this _probably_ doesn't happen, because 
we're doing explicit seeks to the latest position
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss
# At query restart 
#* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason this 
couldn't be configurable fall back to Latest
#* Offset out of range on driver:  *Fail* or *Earliest*, based on failOnDataLoss
#* Offset out of range on executor:  *Fail* or *Earliest*, based on 
failOnDataLoss


I've probably missed something, chime in.


  was:
Possible events for which offsets are needed:
# *New partition* is discovered
# *Offset out of range* (aka, data has been lost)

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*.  Currently unsupported.  This means users who want to 
migrate from existing kafka jobs need to jump through hoops.  Even if we never 
want to support it, as soon as we take on SPARK-17815 we need to make sure 
Kafka commit log state is clearly documented and handled.
# *Timestamp*.  Currently unsupported.  This could be supported with old, 
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position.  Currently 
unsupported.  I think the semantics of this are super unclear by comparison 
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: earliest OR latest OR User specified json per topicpartition 
 (SPARK-17812)
# failOnDataLoss: true (which implies Fail above) OR false (which implies 
Earliest above)  In general, I see no reason this couldn't specify Latest as an 
option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
startingOffsets is *User specified* perTopicpartition, and the new partition 
isn't in the map, *Fail*.  Note that this is effectively indistinguishable from 
a new partition during query, because partitions may have changed in between 
pre-query configuration and query start, but we treat it differently, and users 
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this 
case yet.  Could use the value of failOnDataLoss, but it's possible people may 
want t

[jira] [Updated] (SPARK-17624) Flaky test? StateStoreSuite maintenance

2016-10-14 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-17624:
-
Affects Version/s: 2.1.0

> Flaky test? StateStoreSuite maintenance
> ---
>
> Key: SPARK-17624
> URL: https://issues.apache.org/jira/browse/SPARK-17624
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> I've noticed this test failing consistently (25x in a row) on a two-core 
> machine but not on an eight-core machine.
> If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
> being the default in Spark), the test reliably passes; we can also gain 
> reliability by setting the master to be anything other than just local.
> Is there a reason spark.rpc.numRetries is set to be 1?
> I see this failure is also mentioned here so it's been flaky for a while 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
> If we run without the "quietly" code so we get debug info:
> {code}
> 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: org.apache.spark.SparkException: Could not find 
> StateStoreCoordinator.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129)
> at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcE

[jira] [Commented] (SPARK-17624) Flaky test? StateStoreSuite maintenance

2016-10-14 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575683#comment-15575683
 ] 

Adam Roberts commented on SPARK-17624:
--

Having another look at this now as it still fails intermittently.

[~tdas] hi, as the author of this test and feature, can you please explain why 
numRetries was chosen to be 1 instead of the default (3) -- is it important for 
this test? My concern is that there's something actually wrong/unreliable here 
and that the failure is legitimate.

[~jerryshao] HW used is a 2x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 16 
GB RAM (standard -Xmx3g used for the tests anyway), ext4 filesystem, Ubuntu 
14.04 LTS.
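
If it helps the discussion, a purely illustrative sketch of bumping the retry 
count; the property name is real, but the surrounding setup is assumed and is 
not how the suite actually wires its context:

{code}
val conf = new org.apache.spark.SparkConf()
  .setMaster("local[2]")
  .setAppName("StateStoreSuite")
  .set("spark.rpc.numRetries", "3")  // the test currently pins this to 1
{code}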

> Flaky test? StateStoreSuite maintenance
> ---
>
> Key: SPARK-17624
> URL: https://issues.apache.org/jira/browse/SPARK-17624
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> I've noticed this test failing consistently (25x in a row) on a two-core 
> machine but not on an eight-core machine.
> If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
> being the default in Spark), the test reliably passes; we can also gain 
> reliability by setting the master to be anything other than just local.
> Is there a reason spark.rpc.numRetries is set to be 1?
> I see this failure is also mentioned here so it's been flaky for a while 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
> If we run without the "quietly" code so we get debug info:
> {code}
> 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent

[jira] [Created] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-17940:
-

 Summary: Typo in LAST function error message
 Key: SPARK-17940
 URL: https://issues.apache.org/jira/browse/SPARK-17940
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Shuai Lin
Priority: Minor


https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40

{code}
  throw new AnalysisException("The second argument of First should be a 
boolean literal.")
{code} 

"First" should be "Last".

Also the usage string can be improved to match the FIRST function.
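
For reference, a sketch of the corrected line (the actual PR may also reword 
the message or the usage string):

{code}
  throw new AnalysisException("The second argument of Last should be a boolean literal.")
{code}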



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17940:


Assignee: (was: Apache Spark)

> Typo in LAST function error message
> ---
>
> Key: SPARK-17940
> URL: https://issues.apache.org/jira/browse/SPARK-17940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shuai Lin
>Priority: Minor
>
> https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40
> {code}
>   throw new AnalysisException("The second argument of First should be a 
> boolean literal.")
> {code} 
> "First" should be "Last".
> Also the usage string can be improved to match the FIRST function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575694#comment-15575694
 ] 

Apache Spark commented on SPARK-17940:
--

User 'lins05' has created a pull request for this issue:
https://github.com/apache/spark/pull/15487

> Typo in LAST function error message
> ---
>
> Key: SPARK-17940
> URL: https://issues.apache.org/jira/browse/SPARK-17940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shuai Lin
>Priority: Minor
>
> https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40
> {code}
>   throw new AnalysisException("The second argument of First should be a 
> boolean literal.")
> {code} 
> "First" should be "Last".
> Also the usage string can be improved to match the FIRST function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17940:


Assignee: Apache Spark

> Typo in LAST function error message
> ---
>
> Key: SPARK-17940
> URL: https://issues.apache.org/jira/browse/SPARK-17940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shuai Lin
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40
> {code}
>   throw new AnalysisException("The second argument of First should be a 
> boolean literal.")
> {code} 
> "First" should be "Last".
> Also the usage string can be improved to match the FIRST function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17910) Allow users to update the comment of a column

2016-10-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17910:
-
Target Version/s: 2.1.0

> Allow users to update the comment of a column
> -
>
> Key: SPARK-17910
> URL: https://issues.apache.org/jira/browse/SPARK-17910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, once a user sets the comment of a column with the create table 
> command, he/she cannot update the comment. It would be useful to provide a 
> public interface to do that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17941:


 Summary: Logistic regression test suites should use weights when 
comparing to glmnet
 Key: SPARK-17941
 URL: https://issues.apache.org/jira/browse/SPARK-17941
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


Logistic regression suite currently has many test cases comparing to R's 
glmnet. Both libraries support weights, and to make the testing of weights in 
Spark LOR more robust, we should add weights to all the test cases. The current 
weight testing is quite minimal.
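
For context, a minimal sketch of weighted fitting on the Spark side (the column 
names and trainingDF are assumed for illustration, not taken from the suite):

{code}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")  // per-instance weights, mirroring glmnet's weights argument
val model = lr.fit(trainingDF)
{code}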



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17910) Allow users to update the comment of a column

2016-10-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17910:
-
Description: Right now, once a user sets the comment of a column with the create 
table command, he/she cannot update the comment. It would be useful to provide a 
public interface (e.g. SQL) to do that.   (was: Right now, once a user set the 
comment of a column with create table command, he/she cannot update the 
comment. It will be useful to provide a public interface to do that. )

> Allow users to update the comment of a column
> -
>
> Key: SPARK-17910
> URL: https://issues.apache.org/jira/browse/SPARK-17910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, once a user sets the comment of a column with the create table 
> command, he/she cannot update the comment. It would be useful to provide a 
> public interface (e.g. SQL) to do that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17941:


Assignee: Apache Spark

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575716#comment-15575716
 ] 

Apache Spark commented on SPARK-17941:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15488

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17941:


Assignee: (was: Apache Spark)

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

2016-10-14 Thread Harish (JIRA)
Harish created SPARK-17942:
--

 Summary: OpenJDK 64-Bit Server VM warning: Try increasing the code 
cache size using -XX:ReservedCodeCacheSize=
 Key: SPARK-17942
 URL: https://issues.apache.org/jira/browse/SPARK-17942
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
Reporter: Harish


My code snippet is at the location below. In that snippet I had put only a few 
columns, but in my test case I have data with 10M rows and 10,000 columns.
http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark

I see the below message in the Spark 2.0.2 snapshot:
# Stderr of the node
OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
-XX:ReservedCodeCacheSize=

# stdout of the node
CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
 bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
 total_blobs=41388 nmethods=40792 adapters=501
 compilation: disabled (not enough contiguous free space left)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

2016-10-14 Thread Harish (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harish updated SPARK-17942:
---
Priority: Minor  (was: Major)

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snippet is at the location below. In that snippet I had put only a few 
> columns, but in my test case I have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see the below message in the Spark 2.0.2 snapshot:
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

2016-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575759#comment-15575759
 ] 

Sean Owen commented on SPARK-17942:
---

You probably need to increase this value -- is there more to it?
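
A minimal sketch of what that could look like; the 512m value is just an 
example, and the driver-side flag has to be passed before the JVM starts 
(e.g. via spark-submit or spark-defaults.conf), not in application code:

{code}
// executor side can be set programmatically before the context is created
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:ReservedCodeCacheSize=512m")
// driver side has to be passed before the JVM starts, e.g.:
//   spark-submit --conf spark.driver.extraJavaOptions=-XX:ReservedCodeCacheSize=512m
{code}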

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snippet is at the location below. In that snippet I had put only a few 
> columns, but in my test case I have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see the below message in the Spark 2.0.2 snapshot:
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17940:
--
Priority: Trivial  (was: Minor)

> Typo in LAST function error message
> ---
>
> Key: SPARK-17940
> URL: https://issues.apache.org/jira/browse/SPARK-17940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shuai Lin
>Priority: Trivial
>
> https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40
> {code}
>   throw new AnalysisException("The second argument of First should be a 
> boolean literal.")
> {code} 
> "First" should be "Last".
> Also the usage string can be improved to match the FIRST function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-10-14 Thread Thomas Dunne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575825#comment-15575825
 ] 

Thomas Dunne commented on SPARK-13802:
--

This is especially troublesome when combined with creating a DataFrame while 
using your own schema.

The data I am working on can contain a lot of empty fields, which makes schema 
inference potentially have to scan every row to determine their types. 
Providing our own schema should fix this, right?

Nope... Rather than matching up the keys of the Row with the field names of 
the provided schema, it just changes the order of one (the Row) and naively 
uses zip(row, schema.fields). This means that even keeping the schema field 
order and the Row keys consistent is not enough; because Rows sort their keys, 
we need to manually sort the schema fields too.

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id|first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-10-14 Thread Thomas Dunne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575825#comment-15575825
 ] 

Thomas Dunne edited comment on SPARK-13802 at 10/14/16 4:45 PM:


This is especially troublesome when combined with creating a DataFrame using your own schema.

The data I am working with can contain a lot of empty fields, which means schema inference may have to scan every row to determine their types. Providing our own schema should fix this, right?

Nope... Rather than matching the Row's keys against the field names of the provided schema, Spark changes the order of one of them (the Row) and naively uses zip(row, schema.fields). So even keeping the schema field order and the Row key order consistent is not enough: because Row sorts its keys, we have to manually sort the schema fields as well.

This doesn't seem like consistent or desirable behavior at all.


was (Author: thomas9):
This is especially troublesome when combined with creating a DataFrame using your own schema.

The data I am working with can contain a lot of empty fields, which means schema inference may have to scan every row to determine their types. Providing our own schema should fix this, right?

Nope... Rather than matching the Row's keys against the field names of the provided schema, Spark changes the order of one of them (the Row) and naively uses zip(row, schema.fields). So even keeping the schema field order and the Row key order consistent is not enough: because Row sorts its keys, we have to manually sort the schema fields as well.

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-14 Thread Guo-Xun Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575865#comment-15575865
 ] 

Guo-Xun Yuan commented on SPARK-12664:
--

Thank you, [~yanboliang]! So, just to confirm, will your PR only cover a method 
that exposes raw prediction scores in MultilayerPerceptronClassificationModel, 
or will it also cover making MultilayerPerceptronClassificationModel derive 
from ClassificationModel?

Thanks!

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Yanbo Liang
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17606) New batches are not created when there are 1000 created after restarting streaming from checkpoint.

2016-10-14 Thread etienne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575927#comment-15575927
 ] 

etienne commented on SPARK-17606:
-

I'm not able to reproduce this in local mode, either because the JobGenerator managed to restart or because the streaming job hit an OOM before the JobGenerator was restarted.

After a heap dump, it appears the memory is full of MapPartitionsRDD instances.

Here are the results of my attempts.
Batch interval: 500ms, real batch duration > 2s (I had to reduce the batch interval to generate batches faster).

I proceeded as follows: start the streaming job without a checkpoint, wait until a checkpoint is written, stop it, wait for a period, and restart from the checkpoint.

||stopping time||restarting||batches during down time||batches pending||batches to reschedule||start of JobGenerator||last time*||
|14:23:01|14:33:29|1320 - [14:22:30-14:33:29.500]|88 - [14:22:00-14:22:44]|1379 - [14:22:00.500-14:33:29.500]|14:55:00 for time 14:33:30|14:25:11|
|15:08:25|15:31:20|2777 - [15:08:18.500-15:31:26.500]|22 - [15:08:11-15:08:21.500]|2792 - [15:08:11-15:31:26.500]|OOM| |
|09:30:01|09:47:01|2338 - [09:27:33-09:47:01.500]|298 - [09:26:03-09:28:31.500]|2518 - [09:26:03-09:47:01.500]|OOM| |
|12:52:49|12:46:01|1838 - [12:49:11.500-13:04:30]|116 - [12:49:18.500-12:50:16]|1838 - [12:45:11.500-13:04:30]|OOM| |

\* last "time to reschedule" found in the log that was executed before the JobGenerator restarted (strangely, these 8 minutes are not missing in the UI)

All these OOMs make me think something is not being cleaned up correctly.

The JobGenerator is not started immediately at startup (I looked into the source and didn't find what is blocking it), which may induce a lag in batch generation.


> New batches are not created when there are 1000 created after restarting 
> streaming from checkpoint.
> ---
>
> Key: SPARK-17606
> URL: https://issues.apache.org/jira/browse/SPARK-17606
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: etienne
>
> When Spark restarts from a checkpoint after being down for a while, it recreates the batches missed during the down time.
> When only a few batches are missing, Spark creates a new incoming batch every batchTime, but when there is enough missing time to create 1000 batches, no new batches are created.
> So when all these batches are completed, the stream is idle...
> I think there is a rigid limit set somewhere.
> I was expecting Spark to keep recreating the missed batches, maybe not all at once (because that looks like it causes driver memory problems), and then to create new batches each batchTime.
> Another solution would be to not recreate the missing batches but still restart the direct input.
> Right now the only way for me to restart a stream after a long break is to remove the checkpoint and allow the creation of a new stream, but that loses all my state.
> PS: I'm speaking about the direct Kafka input because it's the source I'm currently using; I don't know what happens with other sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575938#comment-15575938
 ] 

Xiao Li commented on SPARK-17709:
-

I can get exactly the same plan in the master branch, but my job passes.

{noformat}
'Join UsingJoin(Inner,List('companyid, 'productid))
:- Aggregate [companyid#5, productid#6], [companyid#5, productid#6, sum(cast(price#7 as bigint)) AS price#30L]
:  +- Project [companyid#5, productid#6, price#7, count#8]
:     +- SubqueryAlias testext2
:        +- Relation[companyid#5,productid#6,price#7,count#8] parquet
+- Aggregate [companyid#46, productid#47], [companyid#46, productid#47, sum(cast(count#49 as bigint)) AS count#41L]
   +- Project [companyid#46, productid#47, price#48, count#49]
      +- SubqueryAlias testext2
         +- Relation[companyid#46,productid#47,price#48,count#49] parquet
{noformat}

The only difference is that yours does not trigger deduplication of expression IDs. 
Let me try it in the 2.0.1 branch. 

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two DataFrames that originated from the same initial 
> DataFrame, which was loaded using a spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code also worked with Spark 1.6.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575945#comment-15575945
 ] 

Justin Miller commented on SPARK-17936:
---

Hey Sean,

I did a bit more digging this morning looking at SpecificUnsafeProjection and 
saw this commit: 
https://github.com/apache/spark/commit/b1b47274bfeba17a9e4e9acebd7385289f31f6c8

I thought I'd try running with 2.1.0-SNAPSHOT to see how things went, and it 
appears to work great now!

[Stage 1:> (0 + 8) / 8]11:28:33.237 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 0 / Thrift Parse Errors: 0
[Stage 3:> (0 + 8) / 8]11:29:03.236 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 89 / Thrift Parse Errors: 0
[Stage 5:> (4 + 4) / 8]11:29:33.237 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 205 / Thrift Parse Errors: 0

Since we're still testing this out, that snapshot works great for now. Do you 
know when 2.1.0 might be generally available?

Best,
Justin
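
For anyone else who wants to test against the snapshot before 2.1.0 ships, a minimal build.sbt sketch might look like the following (the resolver URL and exact module list are assumptions, not taken from this ticket):

{code}
// Hypothetical build.sbt fragment for testing against a nightly snapshot.
// Assumes the Apache snapshots repository publishes 2.1.0-SNAPSHOT artifacts.
resolvers += "Apache Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"                  % "2.1.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-streaming"            % "2.1.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0-SNAPSHOT"
)
{code}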


> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575995#comment-15575995
 ] 

Xiao Li commented on SPARK-17709:
-

Still works well in 2.0.1

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two DataFrames that originated from the same initial 
> DataFrame, which was loaded using a spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code also worked with Spark 1.6.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575999#comment-15575999
 ] 

Xiao Li commented on SPARK-17709:
-

Below are the statements I used to recreate the problem:

{noformat}
sql("CREATE TABLE testext2(companyid int, productid int, price int, count 
int) using parquet")
sql("insert into testext2 values (1, 1, 1, 1)")
val d1 = spark.sql("select * from testext2")
val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price"))
val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df1.join(df2, Seq("companyid", "productid")).show
{noformat}

Can you try it?
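
If the original failure does reproduce, one possible workaround sketch (hypothetical, not taken from this ticket, and not verified against the reporter's data) is to alias the two aggregates and join on qualified columns instead of the Seq("companyid", "productid") form:

{code}
import org.apache.spark.sql.functions.col

// Hypothetical workaround: alias each aggregate so the join keys resolve
// against distinct relations rather than the shared parent DataFrame.
val l = df1.as("l")
val r = df2.as("r")
l.join(r, col("l.companyid") === col("r.companyid") &&
          col("l.productid") === col("r.productid"))
  .select(col("l.companyid"), col("l.productid"), col("l.price"), col("r.count"))
  .show()
{code}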

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two DataFrames that originated from the same initial 
> DataFrame, which was loaded using a spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code also worked with Spark 1.6.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17709:

Labels:   (was: easyfix)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>
> If I try to inner-join two DataFrames that originated from the same initial 
> DataFrame, which was loaded using a spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code also worked with Spark 1.6.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


