[jira] [Commented] (SPARK-16609) Single function for parsing timestamps/dates

2016-08-05 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410501#comment-15410501
 ] 

Sandeep Singh commented on SPARK-16609:
---

I can work on this.

> Single function for parsing timestamps/dates
> 
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>
> Today, if you want to parse a date or timestamp, you have to use the unix 
> time function and then cast to a timestamp.  It's a little odd that there isn't a 
> single function that does both.  I propose we add
> {code}
> to_date(<string>, <format>) / to_timestamp(<string>, <format>)
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(<data type>, <expression>)}}. See: 
> https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(<value>, <format>)}}. See: 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast(<string> as timestamp 
> format '<format>')}}
> MySQL: {{STR_TO_DATE(<string>, <format>)}}. This returns a datetime when you 
> define both date and time parts. See: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
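For illustration, a minimal Scala sketch of the workaround described above versus the proposed call, assuming a DataFrame {{df}} with a string column {{ts_string}} (hypothetical names):

{code}
import org.apache.spark.sql.functions._

// Current workaround: parse to epoch seconds with unix_timestamp, then cast back.
val parsed = df.select(unix_timestamp(col("ts_string"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))

// Proposed single call (shape taken from the description above; not an existing function yet):
// val parsed2 = df.select(to_timestamp(col("ts_string"), "yyyy-MM-dd HH:mm:ss"))
{code}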



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410488#comment-15410488
 ] 

Apache Spark commented on SPARK-16932:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/14516

> Programming-guide Accumulator section should be more clear w.r.t new API
> 
>
> Key: SPARK-16932
> URL: https://issues.apache.org/jira/browse/SPARK-16932
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Bryan Cutler
>Priority: Trivial
>
> The programming-guide section on Accumulators starts off describing the old 
> API, which is deprecated now, and then shows examples with the new API and 
> ends with another code snippet of the old API.  For Scala, at least, there 
> should only be mention of the new API so that it is clear to the user what is 
> recommended.
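For reference, a short sketch of the two Scala APIs in question, assuming an existing SparkContext {{sc}}:

{code}
// Old API, deprecated in 2.0: the guide should stop leading with this for Scala.
val oldAcc = sc.accumulator(0, "old counter")

// New API (Spark 2.0+): what the guide should show consistently.
val acc = sc.longAccumulator("counter")
sc.parallelize(1 to 10).foreach(x => acc.add(x))
println(acc.value)  // 55
{code}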



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16932:


Assignee: Apache Spark

> Programming-guide Accumulator section should be more clear w.r.t new API
> 
>
> Key: SPARK-16932
> URL: https://issues.apache.org/jira/browse/SPARK-16932
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>
> The programming-guide section on Accumulators starts off describing the old 
> API, which is deprecated now, and then shows examples with the new API and 
> ends with another code snippet of the old API.  For Scala, at least, there 
> should only be mention of the new API so that it is clear to the user what is 
> recommended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16932:


Assignee: (was: Apache Spark)

> Programming-guide Accumulator section should be more clear w.r.t new API
> 
>
> Key: SPARK-16932
> URL: https://issues.apache.org/jira/browse/SPARK-16932
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Bryan Cutler
>Priority: Trivial
>
> The programming-guide section on Accumulators starts off describing the old 
> API, which is deprecated now, and then shows examples with the new API and 
> ends with another code snippet of the old API.  For Scala, at least, there 
> should only be mention of the new API so that it is clear to the user what is 
> recommended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16932) Programming-guide Accumulator section should be more clear w.r.t new API

2016-08-05 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-16932:


 Summary: Programming-guide Accumulator section should be more 
clear w.r.t new API
 Key: SPARK-16932
 URL: https://issues.apache.org/jira/browse/SPARK-16932
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Bryan Cutler
Priority: Trivial


The programming-guide section on Accumulators starts off describing the old 
API, which is deprecated now, and then shows examples with the new API and ends 
with another code snippet of the old API.  For Scala, at least, there should 
only be mention of the new API so that it is clear to the user what is recommended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15702) Update document programming-guide accumulator section

2016-08-05 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler closed SPARK-15702.

Resolution: Fixed

> Update document programming-guide accumulator section
> -
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Update document programming-guide accumulator section



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16889) Add formatMessage Column expression for formatting strings in java.text.MessageFormat style in Scala API

2016-08-05 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410484#comment-15410484
 ] 

Sandeep Singh commented on SPARK-16889:
---

Why not use something like 
{code}
"Argument '%s' shall not be negative. The given value was %f.".format("12",2.0)
{code}
Scala also supports string interpolation. 

> Add formatMessage Column expression for formatting strings in 
> java.text.MessageFormat style in Scala API 
> -
>
> Key: SPARK-16889
> URL: https://issues.apache.org/jira/browse/SPARK-16889
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> format_string formats the arguments in printf style and has the following major 
> cons compared to the proposed java.text.MessageFormat-based function:
> 1. MessageFormat syntax is more readable since it is more explicit
> java.util.Formatter syntax: "Argument '%s' shall not be negative. The given 
> value was %f."
> java.text.MessageFormat syntax: "Argument '{0}' shall not be negative. The 
> given value was {1}."
> 2. Formatter forces user to declare the argument type (e.g. "%s" or "%f"), 
> while MessageFormat infers it from the object type. For example if the 
> argument could be a string or a number, then Formatter forces us to use the 
> "%s" type (passing a string to "%f" causes an exception). However a number 
> formatted with "%s" is formatted using Number.toString(), which produces an 
> unlocalized value. By contrast, MessageFormat produces localized values 
> dynamically for all recognized types.
> To address these drawbacks, a MessageFormat function should be added.
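To make the comparison concrete, a small Scala sketch of the two formatting styles using the plain JVM APIs (not the proposed Column expression):

{code}
import java.text.MessageFormat

// java.util.Formatter style: the argument type is fixed in the pattern ("%s", "%f").
val a = "Argument '%s' shall not be negative. The given value was %f.".format("x", -1.5)

// java.text.MessageFormat style: argument types are inferred and values are localized.
val b = MessageFormat.format(
  "Argument '{0}' shall not be negative. The given value was {1}.",
  "x", Double.box(-1.5))
{code}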



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16852) RejectedExecutionException when exit at some times

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410481#comment-15410481
 ] 

Sean Owen commented on SPARK-16852:
---

Does it cause any problem?

> RejectedExecutionException when exit at some times
> --
>
> Key: SPARK-16852
> URL: https://issues.apache.org/jira/browse/SPARK-16852
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weizhong
>Priority: Minor
>
> If we run a huge job, some times when exit will print 
> RejectedExecutionException
> {noformat}
> 16/05/27 08:30:40 ERROR client.TransportResponseHandler: Still have 3 
> requests outstanding when connection from HGH117808/10.184.66.104:41980 
> is closed
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@6b66dba rejected from 
> java.util.concurrent.ThreadPoolExecutor@60725736[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 269]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.tryFailure(Promise.scala:115)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$1.apply(NettyRpcEnv.scala:214)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$1.apply(NettyRpcEnv.scala:214)
>   at 
> org.apache.spark.rpc.netty.RpcOutboxMessage.onFailure(Outbox.scala:74)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.failOutstandingRequests(TransportResponseHandler.java:90)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
>   at 
> 

[jira] [Commented] (SPARK-16326) Evaluate sparklyr package from RStudio

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410480#comment-15410480
 ] 

Sean Owen commented on SPARK-16326:
---

Is there an action for the Spark code here? I'm inclined to close this as I 
don't know what outcome there would be that would reflect in the Spark code.

> Evaluate sparklyr package from RStudio
> --
>
> Key: SPARK-16326
> URL: https://issues.apache.org/jira/browse/SPARK-16326
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SparkR
>Reporter: Sun Rui
>
> Rstudio has developed sparklyr (https://github.com/rstudio/sparklyr) 
> connecting R community to Spark. A rough review shows that sparklyr provides 
> a dplyr backend and a new API for MLlib and for calling Spark from R. Of 
> course, sparklyr internally uses the low level mechanism in SparkR.
> We can discuss how to position SparkR with sparklyr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16717:
--
Priority: Minor  (was: Major)

> Dataframe (jdbc) is missing a way to link an external function to get a 
> connection
> ---
>
> Key: SPARK-16717
> URL: https://issues.apache.org/jira/browse/SPARK-16717
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.2
>Reporter: Marco Colombo
>Priority: Minor
>
> In JdbcRDD it was possible to use a function to get a JDBC connection. This 
> allowed external handling of the connections, which is no longer 
> possible with DataFrames. 
> Please consider an addition to DataFrames for using an externally provided 
> connection factory (such as a connection pool) in order to make data loading 
> more efficient, avoiding connection close/recreation. Connections should be 
> taken from the provided function and returned to a second function when no 
> longer used by the RDD. This would make JDBC handling more efficient.
> I.e. extending DataFrame class with a method like 
> jdbc(Function0 getConnection, Function0 
> releaseConnection(java.sql.Connection))
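For context, a minimal sketch of the connection-factory style from the old JdbcRDD that the description refers to, assuming an existing SparkContext {{sc}} and a made-up JDBC URL and table; the proposed DataFrame-level equivalent does not exist yet:

{code}
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// JdbcRDD accepts a () => Connection, so a pool's borrow method can be plugged in.
// The query must contain two '?' placeholders for the partition bounds.
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass"),
  "SELECT id, name FROM people WHERE ? <= id AND id <= ?",
  lowerBound = 1, upperBound = 1000, numPartitions = 10,
  mapRow = (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))
{code}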



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16666) Kryo encoder for custom complex classes

2016-08-05 Thread Sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410454#comment-15410454
 ] 

Sam commented on SPARK-16666:
-

[~clockfly] in your code sample, there is a case class for Point, not esri's 
point class.

> Kryo encoder for custom complex classes
> ---
>
> Key: SPARK-16666
> URL: https://issues.apache.org/jira/browse/SPARK-16666
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Sam
>
> I'm trying to create a dataset with some geo data using Spark and Esri. If 
> `Foo` has only a `Point` field, it works, but if I add some other fields 
> beyond a `Point`, I get an ArrayIndexOutOfBoundsException.
> {code:scala}
> import com.esri.core.geometry.Point
> import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> object Main {
> 
>   case class Foo(position: Point, name: String)
> 
>   object MyEncoders {
> implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
> 
> implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
>   }
> 
>   def main(args: Array[String]): Unit = {
> val sc = new SparkContext(new 
> SparkConf().setAppName("app").setMaster("local"))
> val sqlContext = new SQLContext(sc)
> import MyEncoders.{FooEncoder, PointEncoder}
> import sqlContext.implicits._
> Seq(new Foo(new Point(0, 0), "bar")).toDS.show
>   }
> }
> {code}
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
>  
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>  
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
>  
> at 
> org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
>  
> at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) 
> at 
> org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
>  
> at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) 
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193) 
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201) 
> at Main$.main(Main.scala:24) 
> at Main.main(Main.scala)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16929) Bad synchronization with regard to speculation

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410441#comment-15410441
 ] 

Sean Owen edited comment on SPARK-16929 at 8/6/16 3:55 AM:
---

At least, one easy optimization we could make is to let checkSpeculatableTasks 
short-circuit. It actually loops through all tasks even if the first one 
returns true. I wonder if in practice that would solve this problem? EDIT: not 
sure that's valid actually. It seems like it wants to check for all possible 
speculative tasks, not just any.

I agree, I don't see the purpose of that lock. It is not protecting rootPool 
since no other access to it is synchronized. I think you can make a PR for this.


was (Author: srowen):
At least, one easy optimization we could make is to let checkSpeculatableTasks 
short-circuit. It actually loops through all tasks even if the first one 
returns true. I wonder if in practice that would solve this problem?

> Bad synchronization with regard to speculation
> --
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working, I looked 
> into it and noticed that stderr was saying some tasks were taking almost an 
> hour to run even though in the application logs on the nodes that task only 
> took a minute or so to run.  Digging into the thread dump for the master node 
> I noticed a number of threads are blocked, apparently by the speculation thread.  
> At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while 
> it looks through the tasks to see what needs to be rerun.  Unfortunately that 
> code loops through each of the tasks, so when you have even just a couple 
> hundred thousand tasks to run that can be prohibitively slow to run inside of 
> a synchronized block.  Once I disabled speculation, the job went back to 
> having acceptable performance.
> There are no comments around that lock indicating why it was added, and the 
> git history seems to have a couple refactorings so it's hard to find where it 
> was added.  I'm tempted to believe it is the result of someone assuming that 
> an extra synchronized block never hurt anyone (in reality I've probably seen just 
> as many bugs caused by over-synchronization as by too little) as it looks too 
> broad to be actually guarding any potential concurrency issue.  But, since 
> concurrency issues can be tricky to reproduce (and yes, I understand that's 
> an extreme understatement) I'm not sure just blindly removing it without 
> being familiar with the history is necessarily safe.  
> Can someone look into this?  Or at least make a note in the documentation 
> that speculation should not be used with large clusters?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16929) Bad synchronization with regard to speculation

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410441#comment-15410441
 ] 

Sean Owen commented on SPARK-16929:
---

At least, one easy optimization we could make is to let checkSpeculatableTasks 
short-circuit. It actually loops through all tasks even if the first one 
returns true. I wonder if in practice that would solve this problem?

> Bad synchronization with regard to speculation
> --
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working, I looked 
> into it and noticed that stderr was saying some tasks were taking almost an 
> hour to run even though in the application logs on the nodes that task only 
> took a minute or so to run.  Digging into the thread dump for the master node 
> I noticed a number of threads are blocked, apparently by the speculation thread.  
> At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while 
> it looks through the tasks to see what needs to be rerun.  Unfortunately that 
> code loops through each of the tasks, so when you have even just a couple 
> hundred thousand tasks to run that can be prohibitively slow to run inside of 
> a synchronized block.  Once I disabled speculation, the job went back to 
> having acceptable performance.
> There are no comments around that lock indicating why it was added, and the 
> git history seems to have a couple refactorings so it's hard to find where it 
> was added.  I'm tempted to believe it is the result of someone assuming that 
> an extra synchronized block never hurt anyone (in reality I've probably seen just 
> as many bugs caused by over-synchronization as by too little) as it looks too 
> broad to be actually guarding any potential concurrency issue.  But, since 
> concurrency issues can be tricky to reproduce (and yes, I understand that's 
> an extreme understatement) I'm not sure just blindly removing it without 
> being familiar with the history is necessarily safe.  
> Can someone look into this?  Or at least make a note in the documentation 
> that speculation should not be used with large clusters?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15899) file scheme should be used correctly

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15899:
--
  Assignee: Alexander Ulanov
  Priority: Major  (was: Minor)
Issue Type: Bug  (was: Improvement)

> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Assignee: Alexander Ulanov
>
> [An RFC|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
> {{file://host/}} or {{file:///}}. See also 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme].
> [Some 
> code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  uses a different prefix such as {{file:}}.
> It would be good to prepare a utility method that correctly adds the {{file://host}} 
> or {{file://}} prefix.
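A minimal sketch of what such a helper could look like (a hypothetical utility, not existing Spark code), assuming the goal is the RFC 1738 {{file:///}} form for local paths:

{code}
import java.net.URI
import java.nio.file.Paths

// Leave values that already carry a scheme untouched; turn bare local paths
// into the file:///absolute/path form.
def toFileUri(path: String): String = {
  val uri = new URI(path)
  if (uri.getScheme != null) path
  else Paths.get(path).toAbsolutePath.toUri.toString
}

toFileUri("/tmp/spark-warehouse")       // file:///tmp/spark-warehouse
toFileUri("hdfs://nn:8020/warehouse")   // unchanged
{code}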



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16847) Prevent potentially reading corrupt statistics on binary in Parquet via VectorizedReader

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16847:
--
Assignee: Hyukjin Kwon

> Prevent potentially reading corrupt statistics on binary in Parquet via 
> VectorizedReader
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter pushdown on 
> binary columns in Spark because of it.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility with Parquet files written by older Spark versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16847) Prevent potentially reading corrupt statistics on binary in Parquet via VectorizedReader

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16847.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14450
[https://github.com/apache/spark/pull/14450]

> Prevent potentially reading corrupt statistics on binary in Parquet via 
> VectorizedReader
> ---
>
> Key: SPARK-16847
> URL: https://issues.apache.org/jira/browse/SPARK-16847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> It is still possible to read corrupt Parquet statistics.
> This problem was found in PARQUET-251, and we previously disabled filter pushdown on 
> binary columns in Spark because of it.
> We re-enabled it after upgrading Parquet, but it seems there is a potential 
> incompatibility with Parquet files written by older Spark versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16928:
--
   Priority: Minor  (was: Major)
Description: 
In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
with the following code pattern: 

{code}
  public int getInt(int rowId) {
if (dictionary == null) {
  return intData[rowId];
} else {
  return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
}
  }
{code}

As dictionaryIds is also a ColumnVector, this results in a recursive call of 
getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.


  was:
In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
with the following code pattern: 

  public int getInt(int rowId) {
if (dictionary == null) {
  return intData[rowId];
} else {
  return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
}
  }

As dictionaryIds is also a ColumnVector, this results in a recursive call of 
getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.



> Recursive call of ColumnVector::getInt() breaks JIT inlining
> 
>
> Key: SPARK-16928
> URL: https://issues.apache.org/jira/browse/SPARK-16928
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>Priority: Minor
>
> In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
> with the following code pattern: 
> {code}
>   public int getInt(int rowId) {
> if (dictionary == null) {
>   return intData[rowId];
> } else {
>   return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
> }
>   }
> {code}
> As dictionaryIds is also a ColumnVector, this results in a recursive call of 
> getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15702) Update document programming-guide accumulator section

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410431#comment-15410431
 ] 

Sean Owen commented on SPARK-15702:
---

I don't think this should be reopened to add a new and somewhat distinct 
change. You're further changing the same section for a related but different 
purpose. Just make another JIRA.

> Update document programming-guide accumulator section
> -
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Update document programming-guide accumulator section



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16864) Comprehensive version info

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410430#comment-15410430
 ] 

Sean Owen commented on SPARK-16864:
---

What would an app do with that info at runtime -- you'd really branch logic 
based on a git commit? 
If you're grabbing binaries you still have version or date/time to distinguish 
the build.

> Comprehensive version info 
> ---
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
>  Issue Type: Improvement
>Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on 
> startup, but otherwise, there is no programmatic/reliable way to get version 
> information.
> Also there is no git commit id, etc., so precise version checking isn't 
> possible.
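For reference, what is available programmatically today is only the release string, e.g. (Scala sketch with a local master):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("version-check").setMaster("local"))
println(sc.version)  // e.g. "2.0.0" -- the release string, but no git commit hash
{code}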



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16610:


Assignee: Apache Spark

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec, and its value is also set to {{orc.compress}} (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of {{compression}} (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.
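For illustration, the two ways a user can try to pick the codec today, assuming a DataFrame {{df}} (paths are made up); per the description, the second form is currently overridden by the snappy default of {{compression}}:

{code}
// Writer option: takes effect and is also propagated to orc.compress.
df.write.format("orc").option("compression", "zlib").save("/tmp/orc_zlib")

// Only orc.compress set: should be respected, but today the default "compression"
// value (snappy) wins.
df.write.format("orc").option("orc.compress", "ZLIB").save("/tmp/orc_conf")
{code}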



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16610:


Assignee: (was: Apache Spark)

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec, and its value is also set to {{orc.compress}} (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of {{compression}} (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410393#comment-15410393
 ] 

Apache Spark commented on SPARK-16610:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14518

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec, and its value is also set to {{orc.compress}} (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of {{compression}} (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16856) Link the application's executor page to the master's UI

2016-08-05 Thread Tao Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Lin updated SPARK-16856:

Summary: Link the application's executor page to the master's UI  (was: 
Link application summary page and detail page to the master page)

> Link the application's executor page to the master's UI
> ---
>
> Key: SPARK-16856
> URL: https://issues.apache.org/jira/browse/SPARK-16856
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16931) PySpark access to data-frame bucketing api

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16931:


Assignee: (was: Apache Spark)

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>
> Attached is a patch that enables bucketing for PySpark using the DataFrame 
> API.
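For orientation, the Scala side of the bucketing API that the patch would expose to Python (sketch; table and column names are made up):

{code}
df.write
  .bucketBy(4, "user_id")
  .sortBy("user_id")
  .saveAsTable("users_bucketed")
{code}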



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16931) PySpark access to data-frame bucketing api

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410351#comment-15410351
 ] 

Apache Spark commented on SPARK-16931:
--

User 'GregBowyer' has created a pull request for this issue:
https://github.com/apache/spark/pull/14517

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>
> Attached is a patch that enables bucketing for PySpark using the DataFrame 
> API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16931) PySpark access to data-frame bucketing api

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16931:


Assignee: Apache Spark

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>Assignee: Apache Spark
>
> Attached is a patch that enables bucketing for PySpark using the DataFrame 
> API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16931) PySpark access to data-frame bucketing api

2016-08-05 Thread Greg Bowyer (JIRA)
Greg Bowyer created SPARK-16931:
---

 Summary: PySpark access to data-frame bucketing api
 Key: SPARK-16931
 URL: https://issues.apache.org/jira/browse/SPARK-16931
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.0.0
Reporter: Greg Bowyer


Attached is a patch that enables bucketing for PySpark using the DataFrame API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-05 Thread Junyang Qian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410296#comment-15410296
 ] 

Junyang Qian commented on SPARK-16508:
--

It seems that there are still some warnings in my local check, e.g. 
undocumented arguments in as.data.frame "row.names", "optional". I was 
wondering if I missed something or if we should deal with those?

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16930) ApplicationMaster's code that waits for SparkContext is race-prone

2016-08-05 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-16930:
--

 Summary: ApplicationMaster's code that waits for SparkContext is 
race-prone
 Key: SPARK-16930
 URL: https://issues.apache.org/jira/browse/SPARK-16930
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin
Priority: Minor


While taking a look at SPARK-15937 and checking if there's something wrong with 
the code, I noticed two races that explain the behavior.

Because they're really narrow races, I'm a little wary of declaring them the 
cause of that bug, especially since the logs posted there don't really explain what 
went wrong (and don't really look like a SparkContext was run at all).

The races I found are:

- it's possible, but very unlikely, for an application to instantiate a 
SparkContext and stop it before the AM enters the loop where it checks for the 
instance.

- it's possible, but very unlikely, for an application to stop the SparkContext 
after the AM is already waiting for one, has been notified of its creation, but 
hasn't yet stored the SparkContext reference in a local variable.

I'll fix those and clean up the code a bit in the process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16929) Bad synchronization with regard to speculation

2016-08-05 Thread Nicholas Brown (JIRA)
Nicholas Brown created SPARK-16929:
--

 Summary: Bad synchronization with regard to speculation
 Key: SPARK-16929
 URL: https://issues.apache.org/jira/browse/SPARK-16929
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Nicholas Brown


Our cluster has been running slowly since I got speculation working, I looked 
into it and noticed that stderr was saying some tasks were taking almost an 
hour to run even though in the application logs on the nodes that task only 
took a minute or so to run.  Digging into the thread dump for the master node I 
noticed a number of threads are blocked, apparently by the speculation thread.  At 
line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while it 
looks through the tasks to see what needs to be rerun.  Unfortunately that code 
loops through each of the tasks, so when you have even just a couple hundred 
thousand tasks to run that can be prohibitively slow to run inside of a 
synchronized block.  Once I disabled speculation, the job went back to having 
acceptable performance.

There are no comments around that lock indicating why it was added, and the git 
history seems to have a couple refactorings so it's hard to find where it was 
added.  I'm tempted to believe it is the result of someone assuming that an 
extra synchronized block never hurt anyone (in reality I've probably seen just as 
many bugs caused by over-synchronization as by too little) as it looks too broad 
to be actually guarding any potential concurrency issue.  But, since 
concurrency issues can be tricky to reproduce (and yes, I understand that's an 
extreme understatement) I'm not sure just blindly removing it without being 
familiar with the history is necessarily safe.  

Can someone look into this?  Or at least make a note in the documentation that 
speculation should not be used with large clusters?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15354) Topology aware block replication strategies

2016-08-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410269#comment-15410269
 ] 

Davies Liu commented on SPARK-15354:


The strategy used in HDFS balances the write traffic (for performance) against 
the durability (or availability) of blocks. But blocks in Spark are quite 
different: they can be lost and recovered by re-computation, so we usually have 
only one copy, rarely two. Even with two copies, they are actually placed 
randomly to get a better balance for computing.

Overall, implementing this strategy for Spark may not be that useful. Could you 
share more information on your use case?
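For reference, the only replication knob today is the storage level, e.g. (a sketch assuming an existing {{rdd}}); where the second copy lands is not topology aware:

{code}
import org.apache.spark.storage.StorageLevel

// Keep two replicas of each cached block.
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
{code}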

> Topology aware block replication strategies
> ---
>
> Key: SPARK-15354
> URL: https://issues.apache.org/jira/browse/SPARK-15354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>
> Implement, for different resource managers, resilient block replication 
> strategies that replicate the 3-replica strategy used by HDFS, where 
> the first replica is on an executor, the second replica is within the same rack 
> as the executor, and a third replica is on a different rack. 
> The implementation involves providing two pluggable classes: one running in 
> the driver that provides topology information for every host at cluster start, 
> and a second that prioritizes a list of peer BlockManagerIds. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-08-05 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410261#comment-15410261
 ] 

Stavros Kontopoulos edited comment on SPARK-11638 at 8/5/16 11:08 PM:
--

[~radekg]  [~mgummelt] what do you think?


was (Author: skonto):
[~radekg][~mgummelt] what do you think?

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different than the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assuming a port {{6666}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{6666}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> all above ports are by default {{0}} (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and 
> {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 2.4.x yet.
> What we want is the back port of mentioned {{akka-remote}} settings to 
> {{2.3.x}} versions. These patches are attached to this ticket - 
> 

[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-08-05 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410261#comment-15410261
 ] 

Stavros Kontopoulos commented on SPARK-11638:
-

[~radekg][~mgummelt] what do you think?

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different than the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assuming a port {{6666}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such task is posted to 
> Marathon, Mesos will give 4 ports in range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such container results in 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{6666}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx port assigned by Mesos.
> Spark currently does not allow any of that.
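> For illustration, a minimal sketch of how the proposed settings could be wired up 
> (the {{advertisedPort}} properties come from the patches attached to this ticket, not 
> from stock Spark; the driver bind port and the 31xxx host ports are hypothetical 
> examples of a Mesos/Marathon port mapping):
> {code}
> import org.apache.spark.SparkConf
> 
> // Bind ports inside the container vs. ports advertised to the outside world.
> val conf = new SparkConf()
>   .set("spark.driver.port", "6666")                  // hypothetical bind port
>   .set("spark.driver.advertisedPort", "31000")       // host port assigned by Mesos
>   .set("spark.fileserver.port", "6677")
>   .set("spark.fileserver.advertisedPort", "31001")
>   .set("spark.broadcast.port", "6688")
>   .set("spark.broadcast.advertisedPort", "31002")
>   .set("spark.replClassServer.port", "23456")
>   .set("spark.replClassServer.advertisedPort", "31003")
> {code}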
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, the Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and 
> {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the 
> {{2.3.x}} versions. These patches are attached to this ticket - the 
> {{2.3.4.patch}} and {{2.3.11.patch}} files provide patches for the respective 
> Akka versions. These add the mentioned settings and 

[jira] [Resolved] (SPARK-16901) Hive settings in hive-site.xml may be overridden by Hive's default values

2016-08-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16901.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14497
[https://github.com/apache/spark/pull/14497]

> Hive settings in hive-site.xml may be overridden by Hive's default values
> -
>
> Key: SPARK-16901
> URL: https://issues.apache.org/jira/browse/SPARK-16901
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.1, 2.1.0
>
>
> When we create the HiveConf for metastore client, we use a Hadoop Conf as the 
> base, which may contain Hive settings in hive-site.xml 
> (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49).
>  However, HiveConf's initialize function basically ignores the base Hadoop 
> Conf and always uses its default values (i.e. settings with non-null default 
> values) as the base 
> (https://github.com/apache/hive/blob/release-1.2.1/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2687).
>  So, even if a user puts {{javax.jdo.option.ConnectionURL}} in hive-site.xml, it 
> is not used and Hive will use its default, which is 
> {{jdbc:derby:;databaseName=metastore_db;create=true}}.
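> For example, a user-provided hive-site.xml entry like the following sketch (the MySQL 
> URL is purely hypothetical) is silently ignored in favor of the embedded Derby default 
> shown above:
> {code}
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
> </property>
> {code}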



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16864) Comprehensive version info

2016-08-05 Thread Jan Gorecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410238#comment-15410238
 ] 

Jan Gorecki edited comment on SPARK-16864 at 8/5/16 10:52 PM:
--

Hi, the git commit is relevant to applications at runtime as long as the subject 
for an application (in any dimension) is Spark itself. I don't understand why 
that info would not be included. This may not be a problem for people who build 
from source; they can eventually put that metadata in a plaintext file (still an 
overhead). The bigger problem is for those who grab binaries and, for example, 
just want to track performance in their cluster over Spark's git history. The git 
commit hash is a natural key for the source code of a project; you won't find a 
better field to reference the source code. Referencing a release version is a 
different thing.


was (Author: jangorecki):
Hi, git commit is relevant to applications at runtime as long as the subject 
for an application (in any dimension) is spark itself. I don't understand why 
that info would not be included. This may not be a problem for people who build 
from source, they can eventually put that metadata in plaintext file (still an 
overhead). The bigger problem is for those who grab binaries and for example 
just want to track performance in their cluster over spark git history. Git 
commit hash is a natural key for a source code a project, you won't find better 
field to references the source code. Referencing release versions is simply a 
different thing.

> Comprehensive version info 
> ---
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
>  Issue Type: Improvement
>Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on 
> startup, but otherwise, there is no programmatic/reliable way to get version 
> information.
> Also there is no git commit id, etc.  So precise version checking isn't 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16864) Comprehensive version info

2016-08-05 Thread Jan Gorecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410238#comment-15410238
 ] 

Jan Gorecki edited comment on SPARK-16864 at 8/5/16 10:51 PM:
--

Hi, git commit is relevant to applications at runtime as long as the subject 
for an application (in any dimension) is spark itself. I don't understand why 
that info would not be included. This may not be a problem for people who build 
from source, they can eventually put that metadata in plaintext file (still an 
overhead). The bigger problem is for those who grab binaries and for example 
just want to track performance in their cluster over spark git history. Git 
commit hash is a natural key for a source code a project, you won't find better 
field to references the source code. Referencing release versions is simply a 
different thing.


was (Author: jangorecki):
Hi, git commit is relevant to applications at runtime as long as the subject 
for an application (in any dimension) is spark itself. I don't understand why 
that info would not be included. This may not be a problem for people who build 
from source, they can eventually put that metadata in plaintext file (still an 
overhead). The bigger problem is for those who just grab binaries and for 
example just want to track performance in their cluster over spark git history. 
Git commit hash is a natural key for a source code a project, you won't find 
better field to references the source code. Referencing release versions is 
simply a different thing.

> Comprehensive version info 
> ---
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
>  Issue Type: Improvement
>Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on 
> startup, but otherwise, there is no programmatic/reliable way to get version 
> information.
> Also there is no git commit id, etc.  So precise version checking isn't 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16864) Comprehensive version info

2016-08-05 Thread Jan Gorecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410238#comment-15410238
 ] 

Jan Gorecki commented on SPARK-16864:
-

Hi, git commit is relevant to applications at runtime as long as the subject 
for an application (in any dimension) is spark itself. I don't understand why 
that info would not be included. This may not be a problem for people who build 
from source, they can eventually put that metadata in plaintext file (still an 
overhead). The bigger problem is for those who just grab binaries and for 
example just want to track performance in their cluster over spark git history. 
Git commit hash is a natural key for a source code a project, you won't find 
better field to references the source code. Referencing release versions is 
simply a different thing.

> Comprehensive version info 
> ---
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
>  Issue Type: Improvement
>Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on 
> startup, but otherwise, there is no programmatic/reliable way to get version 
> information.
> Also there is no git commit id, etc.  So precise version checking isn't 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15702) Update document programming-guide accumulator section

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15702:


Assignee: Weichen Xu  (was: Apache Spark)

> Update document programming-guide accumulator section
> -
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Update document programming-guide accumulator section



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15702) Update document programming-guide accumulator section

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410239#comment-15410239
 ] 

Apache Spark commented on SPARK-15702:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/14516

> Update document programming-guide accumulator section
> -
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Update document programming-guide accumulator section



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15702) Update document programming-guide accumulator section

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15702:


Assignee: Apache Spark  (was: Weichen Xu)

> Update document programming-guide accumulator section
> -
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Reporter: Weichen Xu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Update document programming-guide accumulator section



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15702) Update document programming-guide accumulator section

2016-08-05 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reopened SPARK-15702:
--

I'm reopening this because I think the current programming guide accumulator 
section is confusing.  It mixes both the old and new APIs.
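
For reference, a minimal sketch of the new (Spark 2.0) accumulator API that the cleaned-up 
section would presumably standardize on, runnable in spark-shell where {{sc}} is available 
(the accumulator name is illustrative):

{code}
// New AccumulatorV2-based API: create accumulators via the SparkContext helpers.
val accum = sc.longAccumulator("example counter")

sc.parallelize(1 to 4).foreach(x => accum.add(x))

// Only the driver should read the value.
println(accum.value)  // 10
{code}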

> Update document programming-guide accumulator section
> -
>
> Key: SPARK-15702
> URL: https://issues.apache.org/jira/browse/SPARK-15702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.0.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> Update document programming-guide accumulator section



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-16924:
---
Issue Type: Improvement  (was: Bug)

> DataStreamReader can not support option("inferSchema", true/false) for csv 
> and json file source
> ---
>
> Key: SPARK-16924
> URL: https://issues.apache.org/jira/browse/SPARK-16924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Currently DataStreamReader can not support option("inferSchema", true|false) 
> for csv and json file source. It only takes SQLConf setting 
> "spark.sql.streaming.schemaInference", which needs to be set at session 
> level. 
> For example:
> {code}
> scala> val in = spark.readStream.format("json").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/json/t1")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> scala> val in = spark.readStream.format("csv").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/csv")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> {code}
> In the example, even though users specify option("inferSchema", true), it 
> is not honored. But for batch data, DataFrameReader does accept it:
> {code}
> scala> val in = spark.read.format("csv").option("header", 
> true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
> in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
> {code}
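> Until the option is honored, the workarounds appear to be the session-level conf 
> mentioned above or supplying the schema explicitly; a hedged sketch (paths and field 
> names are hypothetical):
> {code}
> // Option A: session-level schema inference via the SQLConf setting named above.
> spark.conf.set("spark.sql.streaming.schemaInference", "true")
> val in1 = spark.readStream.format("json").load("/path/to/json")
> 
> // Option B: provide the schema explicitly to the DataStreamReader.
> import org.apache.spark.sql.types._
> val schema = new StructType().add("signal", StringType).add("flash", IntegerType)
> val in2 = spark.readStream.schema(schema).format("csv").load("/path/to/csv")
> {code}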



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16926) Partition columns are present in columns metadata for partition but not table

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410202#comment-15410202
 ] 

Apache Spark commented on SPARK-16926:
--

User 'dafrista' has created a pull request for this issue:
https://github.com/apache/spark/pull/14515

> Partition columns are present in columns metadata for partition but not table
> -
>
> Key: SPARK-16926
> URL: https://issues.apache.org/jira/browse/SPARK-16926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>
> A change introduced in SPARK-14388 removes partition columns from the column 
> metadata of tables, but not for partitions. This causes TableReader to 
> believe that the schema is different and create an unnecessary conversion 
> object inspector, taking the else codepath in TableReader below:
> {code}
> val soi = if 
> (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) {
>   rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector]
> } else {
>   ObjectInspectorConverters.getConvertedOI(
> rawDeser.getObjectInspector,
> tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector]
> }
> {code}
> Printing the properties as debug output confirms the difference for the Hive 
> table.
> Table properties (tableDesc.getProperties):
> {code}
> 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
> string:bigint:string:bigint:bigint:array
> {code}
> Partition properties (partProps):
> {code}
> 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
> string:bigint:string:bigint:bigint:array:string:string:string
> {code}
> The final three string columns are partition columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16926) Partition columns are present in columns metadata for partition but not table

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16926:


Assignee: Apache Spark

> Partition columns are present in columns metadata for partition but not table
> -
>
> Key: SPARK-16926
> URL: https://issues.apache.org/jira/browse/SPARK-16926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Apache Spark
>
> A change introduced in SPARK-14388 removes partition columns from the column 
> metadata of tables, but not for partitions. This causes TableReader to 
> believe that the schema is different and create an unnecessary conversion 
> object inspector, taking the else codepath in TableReader below:
> {code}
> val soi = if 
> (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) {
>   rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector]
> } else {
>   ObjectInspectorConverters.getConvertedOI(
> rawDeser.getObjectInspector,
> tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector]
> }
> {code}
> Printing the properties as debug output confirms the difference for the Hive 
> table.
> Table properties (tableDesc.getProperties):
> {code}
> 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
> string:bigint:string:bigint:bigint:array
> {code}
> Partition properties (partProps):
> {code}
> 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
> string:bigint:string:bigint:bigint:array:string:string:string
> {code}
> The final three string columns are partition columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16926) Partition columns are present in columns metadata for partition but not table

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16926:


Assignee: (was: Apache Spark)

> Partition columns are present in columns metadata for partition but not table
> -
>
> Key: SPARK-16926
> URL: https://issues.apache.org/jira/browse/SPARK-16926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>
> A change introduced in SPARK-14388 removes partition columns from the column 
> metadata of tables, but not for partitions. This causes TableReader to 
> believe that the schema is different and create an unnecessary conversion 
> object inspector, taking the else codepath in TableReader below:
> {code}
> val soi = if 
> (rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) {
>   rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector]
> } else {
>   ObjectInspectorConverters.getConvertedOI(
> rawDeser.getObjectInspector,
> tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector]
> }
> {code}
> Printing the properties as debug output confirms the difference for the Hive 
> table.
> Table properties (tableDesc.getProperties):
> {code}
> 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
> string:bigint:string:bigint:bigint:array
> {code}
> Partition properties (partProps):
> {code}
> 16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
> string:bigint:string:bigint:bigint:array:string:string:string
> {code}
> The final three string columns are partition columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16928:


Assignee: Apache Spark

> Recursive call of ColumnVector::getInt() breaks JIT inlining
> 
>
> Key: SPARK-16928
> URL: https://issues.apache.org/jira/browse/SPARK-16928
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>Assignee: Apache Spark
>
> In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
> with the following code pattern: 
>   public int getInt(int rowId) {
> if (dictionary == null) {
>   return intData[rowId];
> } else {
>   return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
> }
>   }
> As dictionaryIds is also a ColumnVector, this results in a recursive call of 
> getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410173#comment-15410173
 ] 

Apache Spark commented on SPARK-16928:
--

User 'ooq' has created a pull request for this issue:
https://github.com/apache/spark/pull/14513

> Recursive call of ColumnVector::getInt() breaks JIT inlining
> 
>
> Key: SPARK-16928
> URL: https://issues.apache.org/jira/browse/SPARK-16928
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>
> In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
> with the following code pattern: 
>   public int getInt(int rowId) {
> if (dictionary == null) {
>   return intData[rowId];
> } else {
>   return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
> }
>   }
> As dictionaryIds is also a ColumnVector, this results in a recursive call of 
> getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16928:


Assignee: (was: Apache Spark)

> Recursive call of ColumnVector::getInt() breaks JIT inlining
> 
>
> Key: SPARK-16928
> URL: https://issues.apache.org/jira/browse/SPARK-16928
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>
> In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
> with the following code pattern: 
>   public int getInt(int rowId) {
> if (dictionary == null) {
>   return intData[rowId];
> } else {
>   return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
> }
>   }
> As dictionaryIds is also a ColumnVector, this results in a recursive call of 
> getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-08-05 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16928:


 Summary: Recursive call of ColumnVector::getInt() breaks JIT 
inlining
 Key: SPARK-16928
 URL: https://issues.apache.org/jira/browse/SPARK-16928
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Qifan Pu


In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
with the following code pattern: 

  public int getInt(int rowId) {
if (dictionary == null) {
  return intData[rowId];
} else {
  return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
}
  }

As dictionaryIds is also a ColumnVector, this results in a recursive call of 
getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16927) Mesos Cluster Dispatcher default properties

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16927:


Assignee: Apache Spark

> Mesos Cluster Dispatcher default properties
> ---
>
> Key: SPARK-16927
> URL: https://issues.apache.org/jira/browse/SPARK-16927
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> Add the capability to set default driver properties for all jobs submitted 
> through the dispatcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16927) Mesos Cluster Dispatcher default properties

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16927:


Assignee: (was: Apache Spark)

> Mesos Cluster Dispatcher default properties
> ---
>
> Key: SPARK-16927
> URL: https://issues.apache.org/jira/browse/SPARK-16927
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> Add the capability to set default driver properties for all jobs submitted 
> through the dispatcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16923:


Assignee: Apache Spark

> Mesos cluster scheduler duplicates config vars by setting them in the 
> environment and as --conf
> ---
>
> Key: SPARK-16923
> URL: https://issues.apache.org/jira/browse/SPARK-16923
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> I don't think this introduces any bugs, but we should fix it nonetheless



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16927) Mesos Cluster Dispatcher default properties

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410129#comment-15410129
 ] 

Apache Spark commented on SPARK-16927:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14511

> Mesos Cluster Dispatcher default properties
> ---
>
> Key: SPARK-16927
> URL: https://issues.apache.org/jira/browse/SPARK-16927
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> Add the capability to set default driver properties for all jobs submitted 
> through the dispatcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16923:


Assignee: (was: Apache Spark)

> Mesos cluster scheduler duplicates config vars by setting them in the 
> environment and as --conf
> ---
>
> Key: SPARK-16923
> URL: https://issues.apache.org/jira/browse/SPARK-16923
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I don't think this introduces any bugs, but we should fix it nonetheless



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410130#comment-15410130
 ] 

Apache Spark commented on SPARK-16923:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14511

> Mesos cluster scheduler duplicates config vars by setting them in the 
> environment and as --conf
> ---
>
> Key: SPARK-16923
> URL: https://issues.apache.org/jira/browse/SPARK-16923
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I don't think this introduces any bugs, but we should fix it nonetheless



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16927) Mesos Cluster Dispatcher default properties

2016-08-05 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16927:
---

 Summary: Mesos Cluster Dispatcher default properties
 Key: SPARK-16927
 URL: https://issues.apache.org/jira/browse/SPARK-16927
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Michael Gummelt


Add the capability to set default driver properties for all jobs submitted 
through the dispatcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16260) ML Example Improvements and Cleanup

2016-08-05 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-16260:
-
Description: 
This parent task is to track a few possible improvements and cleanup for 
PySpark ML examples I noticed during 2.0 QA.  These include:

* Parity between different API examples
* Ensure input format is documented in example
* Ensure results of example are clear and demonstrate functionality
* Cleanup unused imports
* Fix minor issues

  was:
This parent task is to track a few possible improvements and cleanup for 
PySpark ML examples I noticed during 2.0 QA.  These include:

* Parity with Scala ML examples
* Ensure input format is documented in example
* Ensure results of example are clear and demonstrate functionality
* Cleanup unused imports
* Fix minor issues


> ML Example Improvements and Cleanup
> ---
>
> Key: SPARK-16260
> URL: https://issues.apache.org/jira/browse/SPARK-16260
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> This parent task is to track a few possible improvements and cleanup for 
> PySpark ML examples I noticed during 2.0 QA.  These include:
> * Parity between different API examples
> * Ensure input format is documented in example
> * Ensure results of example are clear and demonstrate functionality
> * Cleanup unused imports
> * Fix minor issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16260) ML Example Improvements and Cleanup

2016-08-05 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-16260:
-
Summary: ML Example Improvements and Cleanup  (was: PySpark ML Example 
Improvements and Cleanup)

> ML Example Improvements and Cleanup
> ---
>
> Key: SPARK-16260
> URL: https://issues.apache.org/jira/browse/SPARK-16260
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> This parent task is to track a few possible improvements and cleanup for 
> PySpark ML examples I noticed during 2.0 QA.  These include:
> * Parity with Scala ML examples
> * Ensure input format is documented in example
> * Ensure results of example are clear and demonstrate functionality
> * Cleanup unused imports
> * Fix minor issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-05 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410105#comment-15410105
 ] 

Hossein Falaki commented on SPARK-16883:


Thanks [~shivaram]! This may require changing the serialization. Who thought 
this might happen! ;) 

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> To reproduce run following code. As you can see "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16817) Enable storing of shuffle data in Alluxio

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16817.
---
Resolution: Won't Fix

We removed the Tachyon dependency a while ago, although its purpose was 
somewhat different. I don't know that this is a good use for Alluxio. You 
indeed really just want fast local storage and not something distributed or 
fault tolerant.

> Enable storing of shuffle data in Alluxio
> -
>
> Key: SPARK-16817
> URL: https://issues.apache.org/jira/browse/SPARK-16817
> Project: Spark
>  Issue Type: New Feature
>Reporter: Tim Bisson
>
> If one is using Alluxio for storage, it would also be useful if Spark can 
> store shuffle spill data in Alluxio. For example:
> spark.local.dir="alluxio://host:port/path"
> Several users on the Alluxio mailing list have asked for this feature:
> https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/90pRZWRVi0s/mgLWLS5aAgAJ
> https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/s9H93PnDebw/v_1_FMjR7vEJ



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16784) Configurable log4j settings

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16784.
---
Resolution: Not A Problem

Reopen if that's not what you meant

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.
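> For reference, the usual client-mode approach looks like the sketch below (the 
> properties-file path and the application class/jar are placeholders); the cluster-mode 
> pain point described above remains, because the referenced file must already exist on 
> whichever node ends up running the driver:
> {code}
> spark-submit \
>   --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/custom-log4j.properties" \
>   --class com.example.MyApp myapp.jar
> {code}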



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410078#comment-15410078
 ] 

Apache Spark commented on SPARK-16925:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14510

> Spark tasks which cause JVM to exit with a zero exit code may cause app to 
> hang in Standalone mode
> --
>
> Key: SPARK-16925
> URL: https://issues.apache.org/jira/browse/SPARK-16925
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> If you have a Spark standalone cluster which runs a single application and 
> you have a Spark task which repeatedly fails by causing the executor JVM to 
> exit with a _zero_ exit code then this may temporarily freeze / hang the 
> Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die but those executors won't be 
> replaced unless another Spark application or worker joins or leaves the 
> cluster. This is caused by a bug in the standalone Master where 
> {{schedule()}} is only called on executor exit when the exit code is 
> non-zero, whereas I think that we should always call {{schedule()}} even on a 
> "clean" executor shutdown since {{schedule()}} should always be safe to call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode

2016-08-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16925:
---
Description: 
If you have a Spark standalone cluster which runs a single application and you 
have a Spark task which repeatedly fails by causing the executor JVM to exit 
with a _zero_ exit code then this may temporarily freeze / hang the Spark 
application.

For example, running

{code}
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
{code}

on a cluster will cause all executors to die but those executors won't be 
replaced unless another Spark application or worker joins or leaves the 
cluster. This is caused by a bug in the standalone Master where {{schedule()}} 
is only called on executor exit when the exit code is non-zero, whereas I think 
that we should always call {{schedule()}} even on a "clean" executor shutdown 
since {{schedule()}} should always be safe to call.

  was:
If you have a Spark standalone cluster which runs a single application and you 
have a Spark task which repeatedly fails by causing the executor JVM to exit 
with a _zero_ exit code then this may temporarily freeze / hang the Spark 
application.

For example, running

{code}
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
{code}

on a cluster will cause all executors to die but those executors won't be 
replaced unless another Spark application or worker joins or leaves the 
cluster. This is caused by a bug in the standalone Master where {{schedule()}} 
is only called on executor exit when the exit code is non-zero, whereas I think 
that we should always call {{schedule()}} even on a "clean" executor shutdown 
since {{schedule()}} should be idempotent.


> Spark tasks which cause JVM to exit with a zero exit code may cause app to 
> hang in Standalone mode
> --
>
> Key: SPARK-16925
> URL: https://issues.apache.org/jira/browse/SPARK-16925
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> If you have a Spark standalone cluster which runs a single application and 
> you have a Spark task which repeatedly fails by causing the executor JVM to 
> exit with a _zero_ exit code then this may temporarily freeze / hang the 
> Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die but those executors won't be 
> replaced unless another Spark application or worker joins or leaves the 
> cluster. This is caused by a bug in the standalone Master where 
> {{schedule()}} is only called on executor exit when the exit code is 
> non-zero, whereas I think that we should always call {{schedule()}} even on a 
> "clean" executor shutdown since {{schedule()}} should always be safe to call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16709.
---
Resolution: Duplicate

> Task with commit failed will retry infinite when speculation set to true
> 
>
> Key: SPARK-16709
> URL: https://issues.apache.org/jira/browse/SPARK-16709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Hong Shen
> Attachments: commit failed.png
>
>
> In our cluster, we set spark.speculation=true,  but when a task throw 
> exception at SparkHadoopMapRedUtil.performCommit(), this task can retry 
> infinite.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16499) Improve applyInPlace function for matrix in ANN code

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16499.
---
Resolution: Won't Fix

> Improve applyInPlace function for matrix in ANN code
> 
>
> Key: SPARK-16499
> URL: https://issues.apache.org/jira/browse/SPARK-16499
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ApplyInPlace function in the ANN code:
> {code}
> def apply(x: BDM[Double], y: BDM[Double], func: Double => Double): Unit
> {code}
> is implemented using a simple while loop.
> It should be replaced with a breeze BDM matrix operation, so that it
> can take advantage of the breeze library's optimizations.
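> A minimal sketch of the two styles being compared (simplified and standalone; not the 
> actual MLlib code):
> {code}
> import breeze.linalg.{DenseMatrix => BDM}
> 
> // Current style: explicit while loops over every element.
> def applyLoop(x: BDM[Double], y: BDM[Double], func: Double => Double): Unit = {
>   var j = 0
>   while (j < x.cols) {
>     var i = 0
>     while (i < x.rows) {
>       y(i, j) = func(x(i, j))
>       i += 1
>     }
>     j += 1
>   }
> }
> 
> // Proposed style: delegate the element-wise application to breeze.
> def applyBreeze(x: BDM[Double], y: BDM[Double], func: Double => Double): Unit = {
>   y := x.map(func)
> }
> {code}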



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16455.
---
Resolution: Won't Fix

> Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling 
> new tasks when cluster is restarting
> 
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a new mechanism which will let the driver 
> survive when the cluster is temporarily down and restarting. So when the service 
> provided by the cluster is not available, the scheduler should stop scheduling new 
> tasks. I added a hook inside the CoarseGrainedSchedulerBackend class in order to 
> avoid scheduling new tasks when necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode

2016-08-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16925:
---
Summary: Spark tasks which cause JVM to exit with a zero exit code may 
cause app to hang in Standalone mode  (was: Spark tasks which cause JVM to exit 
with a zero exit code may cause app to hang)

> Spark tasks which cause JVM to exit with a zero exit code may cause app to 
> hang in Standalone mode
> --
>
> Key: SPARK-16925
> URL: https://issues.apache.org/jira/browse/SPARK-16925
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> If you have a Spark standalone cluster which runs a single application and 
> you have a Spark task which repeatedly fails by causing the executor JVM to 
> exit with a _zero_ exit code then this may temporarily freeze / hang the 
> Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die but those executors won't be 
> replaced unless another Spark application or worker joins or leaves the 
> cluster. This is caused by a bug in the standalone Master where 
> {{schedule()}} is only called on executor exit when the exit code is 
> non-zero, whereas I think that we should always call {{schedule()}} even on a 
> "clean" executor shutdown since {{schedule()}} should be idempotent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang

2016-08-05 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-16925:
--

 Summary: Spark tasks which cause JVM to exit with a zero exit code 
may cause app to hang
 Key: SPARK-16925
 URL: https://issues.apache.org/jira/browse/SPARK-16925
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.0.0, 1.6.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical


If you have a Spark standalone cluster which runs a single application and you 
have a Spark task which repeatedly fails by causing the executor JVM to exit 
with a _zero_ exit code then this may temporarily freeze / hang the Spark 
application.

For example, running

{code}
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
{code}

on a cluster will cause all executors to die but those executors won't be 
replaced unless another Spark application or worker joins or leaves the 
cluster. This is caused by a bug in the standalone Master where {{schedule()}} 
is only called on executor exit when the exit code is non-zero, whereas I think 
that we should always call {{schedule()}} even on a "clean" executor shutdown 
since {{schedule()}} should be idempotent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16926) Partition columns are present in columns metadata for partition but not table

2016-08-05 Thread Brian Cho (JIRA)
Brian Cho created SPARK-16926:
-

 Summary: Partition columns are present in columns metadata for 
partition but not table
 Key: SPARK-16926
 URL: https://issues.apache.org/jira/browse/SPARK-16926
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Brian Cho


A change introduced in SPARK-14388 removes partition columns from the column 
metadata of tables, but not for partitions. This causes TableReader to believe 
that the schema is different and create an unnecessary conversion object 
inspector, taking the else codepath in TableReader below:

{code}
val soi = if 
(rawDeser.getObjectInspector.equals(tableDeser.getObjectInspector)) {
  rawDeser.getObjectInspector.asInstanceOf[StructObjectInspector]
} else {
  ObjectInspectorConverters.getConvertedOI(
rawDeser.getObjectInspector,
tableDeser.getObjectInspector).asInstanceOf[StructObjectInspector]
}
{code}

Printing the properties as debug output confirms the difference for the Hive 
table.

Table properties (tableDesc.getProperties):
{code}
16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
string:bigint:string:bigint:bigint:array
{code}

Partition properties (partProps):
{code}
16/08/04 20:36:58 DEBUG HadoopTableReader: columns.types, 
string:bigint:string:bigint:bigint:array:string:string:string
{code}

The final three string columns are partition columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16924:


Assignee: Apache Spark

> DataStreamReader can not support option("inferSchema", true/false) for csv 
> and json file source
> ---
>
> Key: SPARK-16924
> URL: https://issues.apache.org/jira/browse/SPARK-16924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>
> Currently DataStreamReader can not support option("inferSchema", true|false) 
> for csv and json file source. It only takes SQLConf setting 
> "spark.sql.streaming.schemaInference", which needs to be set at session 
> level. 
> For example:
> {code}
> scala> val in = spark.readStream.format("json").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/json/t1")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> scala> val in = spark.readStream.format("csv").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/csv")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> {code}
> In the example, even though users specify option("inferSchema", true), it 
> is not honored. But for batch data, DataFrameReader does accept it:
> {code}
> scala> val in = spark.read.format("csv").option("header", 
> true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
> in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16924:


Assignee: (was: Apache Spark)

> DataStreamReader can not support option("inferSchema", true/false) for csv 
> and json file source
> ---
>
> Key: SPARK-16924
> URL: https://issues.apache.org/jira/browse/SPARK-16924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Currently DataStreamReader does not support option("inferSchema", true|false) 
> for the csv and json file sources. It only honors the SQLConf setting 
> "spark.sql.streaming.schemaInference", which needs to be set at the session 
> level. 
> For example:
> {code}
> scala> val in = spark.readStream.format("json").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/json/t1")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> scala> val in = spark.readStream.format("csv").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/csv")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> {code}
> In the example, even though users specify option("inferSchema", true), the 
> option is not honored. But for batch data, DataFrameReader does honor it:
> {code}
> scala> val in = spark.read.format("csv").option("header", 
> true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
> in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410059#comment-15410059
 ] 

Apache Spark commented on SPARK-16924:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/14509

> DataStreamReader can not support option("inferSchema", true/false) for csv 
> and json file source
> ---
>
> Key: SPARK-16924
> URL: https://issues.apache.org/jira/browse/SPARK-16924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Currently DataStreamReader does not support option("inferSchema", true|false) 
> for the csv and json file sources. It only honors the SQLConf setting 
> "spark.sql.streaming.schemaInference", which needs to be set at the session 
> level. 
> For example:
> {code}
> scala> val in = spark.readStream.format("json").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/json/t1")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> scala> val in = spark.readStream.format("csv").option("inferSchema", 
> true).load("/Users/xinwu/spark-test/data/csv")
> java.lang.IllegalArgumentException: Schema must be specified when creating a 
> streaming source DataFrame. If some files already exist in the directory, 
> then depending on the file format you may be able to create a static 
> DataFrame on that directory with 'spark.read.load(directory)' and infer 
> schema from it.
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>   ... 48 elided
> {code}
> In the example, even though users specify option("inferSchema", true), the 
> option is not honored. But for batch data, DataFrameReader does honor it:
> {code}
> scala> val in = spark.read.format("csv").option("header", 
> true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
> in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16924) DataStreamReader can not support option("inferSchema", true/false) for csv and json file source

2016-08-05 Thread Xin Wu (JIRA)
Xin Wu created SPARK-16924:
--

 Summary: DataStreamReader can not support option("inferSchema", 
true/false) for csv and json file source
 Key: SPARK-16924
 URL: https://issues.apache.org/jira/browse/SPARK-16924
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


Currently DataStreamReader does not support option("inferSchema", true|false) 
for the csv and json file sources. It only honors the SQLConf setting 
"spark.sql.streaming.schemaInference", which needs to be set at the session level. 

For example:
{code}
scala> val in = spark.readStream.format("json").option("inferSchema", 
true).load("/Users/xinwu/spark-test/data/json/t1")
java.lang.IllegalArgumentException: Schema must be specified when creating a 
streaming source DataFrame. If some files already exist in the directory, then 
depending on the file format you may be able to create a static DataFrame on 
that directory with 'spark.read.load(directory)' and infer schema from it.
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
  ... 48 elided

scala> val in = spark.readStream.format("csv").option("inferSchema", 
true).load("/Users/xinwu/spark-test/data/csv")
java.lang.IllegalArgumentException: Schema must be specified when creating a 
streaming source DataFrame. If some files already exist in the directory, then 
depending on the file format you may be able to create a static DataFrame on 
that directory with 'spark.read.load(directory)' and infer schema from it.
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
  at 
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
  ... 48 elided

{code}
In the example, even though users specify option("inferSchema", true), the 
option is not honored. But for batch data, DataFrameReader does honor it:
{code}
scala> val in = spark.read.format("csv").option("header", 
true).option("inferSchema", true).load("/Users/xinwu/spark-test/data/csv1")
in: org.apache.spark.sql.DataFrame = [signal: string, flash: int]
{code}
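
Until this is addressed, a session-level workaround is possible. The sketch below 
is only an illustration: the paths are placeholders, the column names are borrowed 
from the batch example above, and the conf key is the one mentioned in the 
description.
{code}
// Workaround 1: enable streaming schema inference for the whole session
spark.conf.set("spark.sql.streaming.schemaInference", "true")
val fromJson = spark.readStream.format("json").load("/path/to/json")

// Workaround 2: supply an explicit schema so no inference is needed
import org.apache.spark.sql.types._
val schema = new StructType().add("signal", StringType).add("flash", IntegerType)
val fromCsv = spark.readStream.format("csv").option("header", "true")
  .schema(schema).load("/path/to/csv")
{code}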



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16923) Mesos cluster scheduler duplicates config vars by setting them in the environment and as --conf

2016-08-05 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16923:
---

 Summary: Mesos cluster scheduler duplicates config vars by setting 
them in the environment and as --conf
 Key: SPARK-16923
 URL: https://issues.apache.org/jira/browse/SPARK-16923
 Project: Spark
  Issue Type: Task
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Michael Gummelt


I don't think this introduces any bugs, but we should fix it nonetheless.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13238) Add ganglia dmax parameter

2016-08-05 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13238.

   Resolution: Fixed
 Assignee: Ekasit Kijsipongse
Fix Version/s: 2.1.0

> Add ganglia dmax parameter
> --
>
> Key: SPARK-13238
> URL: https://issues.apache.org/jira/browse/SPARK-13238
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Ekasit Kijsipongse
>Assignee: Ekasit Kijsipongse
>Priority: Minor
> Fix For: 2.1.0
>
>
> The current Ganglia reporter doesn't set a metric expiration time (dmax). The 
> metrics of all finished applications are left displayed indefinitely in the 
> Ganglia web UI. The dmax parameter allows the user to set the lifetime of the 
> metrics. The default value is 0 for compatibility with previous versions.
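
For reference, a sketch of how such a sink is typically wired up in 
conf/metrics.properties. The existing keys follow the documented Ganglia sink 
options; the exact key name for the new parameter ({{dmax}}) and the values shown 
are assumptions, not taken from the merged patch:
{code}
# conf/metrics.properties (sketch; dmax key name assumed)
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=239.2.11.71
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
# expire metrics after 300 seconds instead of keeping them forever (0, the default)
*.sink.ganglia.dmax=300
{code}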



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16586) spark-class crash with "[: too many arguments" instead of displaying the correct error message

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409984#comment-15409984
 ] 

Apache Spark commented on SPARK-16586:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14508

> spark-class crash with "[: too many arguments" instead of displaying the 
> correct error message
> --
>
> Key: SPARK-16586
> URL: https://issues.apache.org/jira/browse/SPARK-16586
> Project: Spark
>  Issue Type: Bug
>Reporter: Xiang Gao
>Priority: Minor
>
> When trying to run Spark on a machine that cannot provide enough memory for 
> Java to use, instead of printing the correct error message, spark-class will 
> crash with {{spark-class: line 83: [: too many arguments}}.
> Simple shell commands to trigger this problem are:
> {code}
> ulimit -v 10
> ./sbin/start-master.sh
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16260) PySpark ML Example Improvements and Cleanup

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16260.
---
Resolution: Done
  Assignee: Bryan Cutler

> PySpark ML Example Improvements and Cleanup
> ---
>
> Key: SPARK-16260
> URL: https://issues.apache.org/jira/browse/SPARK-16260
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> This parent task is to track a few possible improvements and cleanup for 
> PySpark ML examples I noticed during 2.0 QA.  These include:
> * Parity with Scala ML examples
> * Ensure input format is documented in example
> * Ensure results of example are clear and demonstrate functionality
> * Cleanup unused imports
> * Fix minor issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16826:
--
Assignee: Sylvain Zimmer

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Sylvain Zimmer
> Fix For: 2.1.0
>
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!
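
One possible mitigation, sketched below under the assumption that only the host 
needs to be extracted: java.net.URI does not consult the JVM-wide stream-handler 
Hashtable that java.net.URL synchronizes on, so a URI-based UDF sidesteps the 
contention shown in the thread dump. This is an illustration, not necessarily the 
fix that was merged; the UDF name and the {{logs}} table are placeholders.
{code}
import java.net.URI
import scala.util.Try

// Extract the host without constructing java.net.URL (which locks a global
// Hashtable of stream handlers); returns null for malformed URLs.
def hostOf(url: String): String = Try(new URI(url).getHost).getOrElse(null)

spark.udf.register("parse_host", hostOf _)
spark.sql("SELECT parse_host(url) FROM logs")
{code}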



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16421) Improve output from ML examples

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16421:
--
Assignee: Bryan Cutler
Priority: Major  (was: Trivial)

> Improve output from ML examples
> ---
>
> Key: SPARK-16421
> URL: https://issues.apache.org/jira/browse/SPARK-16421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.1.0
>
>
> In many ML examples, the output is useless.  Sometimes {{show()}} is called 
> and any pertinent results are hidden.  For example, here is the output of 
> max_abs_scaler
> {noformat}
> $ bin/spark-submit examples/src/main/python/ml/max_abs_scaler_example.py 
> +-+++
> |label|features|  scaledFeatures|
> +-+++
> |  0.0|(692,[127,128,129...|(692,[127,128,129...|
> |  1.0|(692,[158,159,160...|(692,[158,159,160...|
> |  1.0|(692,[124,125,126...|(692,[124,125,126...|
> {noformat}
> Other times a few rows are printed out when {{show}} might be more 
> appropriate.  Here is the output from binarizer_example
> {noformat}
> $ bin/spark-submit examples/src/main/python/ml/binarizer_example.py 
> 0.0   
>   
> 1.0
> 0.0
> {noformat}
> But it would be much more useful to just {{show()}} the transformed DataFrame:
> {noformat}
> +-+---+-+
> |label|feature|binarized_feature|
> +-+---+-+
> |0|0.1|  0.0|
> |1|0.8|  1.0|
> |2|0.2|  0.0|
> +-+---+-+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16421) Improve output from ML examples

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16421.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14308
[https://github.com/apache/spark/pull/14308]

> Improve output from ML examples
> ---
>
> Key: SPARK-16421
> URL: https://issues.apache.org/jira/browse/SPARK-16421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Bryan Cutler
>Priority: Trivial
> Fix For: 2.1.0
>
>
> In many ML examples, the output is useless.  Sometimes {{show()}} is called 
> and any pertinent results are hidden.  For example, here is the output of 
> max_abs_scaler
> {noformat}
> $ bin/spark-submit examples/src/main/python/ml/max_abs_scaler_example.py 
> +-+++
> |label|features|  scaledFeatures|
> +-+++
> |  0.0|(692,[127,128,129...|(692,[127,128,129...|
> |  1.0|(692,[158,159,160...|(692,[158,159,160...|
> |  1.0|(692,[124,125,126...|(692,[124,125,126...|
> {noformat}
> Other times a few rows are printed out when {{show}} might be more 
> appropriate.  Here is the output from binarizer_example
> {noformat}
> $ bin/spark-submit examples/src/main/python/ml/binarizer_example.py 
> 0.0   
>   
> 1.0
> 0.0
> {noformat}
> But it would be much more useful to just {{show()}} the transformed DataFrame:
> {noformat}
> +-+---+-+
> |label|feature|binarized_feature|
> +-+---+-+
> |0|0.1|  0.0|
> |1|0.8|  1.0|
> |2|0.2|  0.0|
> +-+---+-+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16826.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14488
[https://github.com/apache/spark/pull/14488]

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
> Fix For: 2.1.0
>
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0

2016-08-05 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409878#comment-15409878
 ] 

Sital Kedia commented on SPARK-16922:
-

PS - Rerunning the query with spark.sql.codegen.aggregate.map.columns.max=0 to 
disable vectorized aggregation for the columnar map also OOMs, but with a 
different stack trace. 

{code}
[Stage 1:>(0 + 4) / 
184]16/08/05 11:26:31 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 4, 
hadoop4774.prn2.facebook.com): java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}
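
For anyone reproducing this, the conf mentioned above can be set on the session 
before re-running the query; a minimal sketch (the key is taken verbatim from the 
comment, and this only switches the aggregation code path for diagnosis, it is 
not a fix):
{code}
// Disable the vectorized (columnar) aggregation hash map, then re-run the query
spark.conf.set("spark.sql.codegen.aggregate.map.columns.max", "0")
{code}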

> Query failure due to executor OOM in Spark 2.0
> --
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query 

[jira] [Assigned] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16905:


Assignee: Apache Spark  (was: Davies Liu)

> Support SQL DDL: MSCK REPAIR TABLE
> --
>
> Key: SPARK-16905
> URL: https://issues.apache.org/jira/browse/SPARK-16905
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> MSCK REPAIR TABLE could be used to recover the partitions in the metastore based 
> on the partitions in the file system.
> Another syntax is:
> ALTER TABLE table RECOVER PARTITIONS
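
For illustration, once this lands the two syntaxes quoted above could be issued 
from Spark SQL as below; {{my_table}} is a placeholder for a partitioned table 
whose partition directories already exist on the file system:
{code}
// Recover partitions present on the file system but missing from the metastore
spark.sql("MSCK REPAIR TABLE my_table")
// Proposed equivalent syntax from this issue
spark.sql("ALTER TABLE my_table RECOVER PARTITIONS")
{code}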



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409829#comment-15409829
 ] 

Sean Owen commented on SPARK-6305:
--

The problem, I think, is that this would then not take effect if the user 
overrides it with a custom logging configuration.
If this, or the silencing that some of the examples do, is really blocking this 
change otherwise, I'd say yank it out so we can update, and either just live with 
the message or at least provide the appropriate config in a default logging 
file that users can copy if desired. 

I'm not so worried about this aspect as I am about maintaining the right POM 
configuration to exclude, and continue to exclude, all external usages of log4j 1 
and do all the appropriate redirecting. That's going to be a small nightmare to 
keep correct, though it will probably be less problematic if Spark stops using 
log4j directly.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE

2016-08-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409830#comment-15409830
 ] 

Apache Spark commented on SPARK-16905:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14500

> Support SQL DDL: MSCK REPAIR TABLE
> --
>
> Key: SPARK-16905
> URL: https://issues.apache.org/jira/browse/SPARK-16905
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> MSCK REPAIR TABLE could be used to recover the partitions in the metastore based 
> on the partitions in the file system.
> Another syntax is:
> ALTER TABLE table RECOVER PARTITIONS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16905) Support SQL DDL: MSCK REPAIR TABLE

2016-08-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16905:


Assignee: Davies Liu  (was: Apache Spark)

> Support SQL DDL: MSCK REPAIR TABLE
> --
>
> Key: SPARK-16905
> URL: https://issues.apache.org/jira/browse/SPARK-16905
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> MSCK REPAIR TABLE could be used to recover the partitions in the metastore based 
> on the partitions in the file system.
> Another syntax is:
> ALTER TABLE table RECOVER PARTITIONS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0

2016-08-05 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409813#comment-15409813
 ] 

Sital Kedia commented on SPARK-16922:
-

cc - [~rxin] 

> Query failure due to executor OOM in Spark 2.0
> --
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16922) Query failure due to executor OOM in Spark 2.0

2016-08-05 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-16922:
---

 Summary: Query failure due to executor OOM in Spark 2.0
 Key: SPARK-16922
 URL: https://issues.apache.org/jira/browse/SPARK-16922
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.0.0
Reporter: Sital Kedia


A query which used to work in Spark 1.6 fails with executor OOM in 2.0.

Stack trace - 
{code}
at 
org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}

Query plan in Spark 1.6

{code}
== Physical Plan ==
TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
+- TungstenExchange hashpartitioning(field1#101,200), None
   +- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
  +- Project [field1#101,field2#74]
 +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
decimal(20,0)) as bigint)], BuildRight
:- ConvertToUnsafe
:  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation foo, 
table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
+- ConvertToUnsafe
   +- HiveTableScan [field1#101,field4#97], MetastoreRelation foo, 
table2, Some(b)
{code}


Query plan in 2.0
{code}
== Physical Plan ==
*HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
+- Exchange hashpartitioning(field1#160, 200)
   +- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
100.0))])
  +- *Project [field2#133, field1#160]
 +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
decimal(20,0)) as bigint)], Inner, BuildRight
:- *Filter isnotnull(field5#122L)
:  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
2013-12-31)]
+- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
decimal(20,0)) as bigint)))
   +- *Filter isnotnull(field4#156)
  +- HiveTableScan [field4#156, field1#160], MetastoreRelation 
foo, table2, b
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409801#comment-15409801
 ] 

Yin Huai commented on SPARK-16610:
--

Sure. For ORC, {{orc.compression}} is the actual conf that lets the ORC writer 
know which codec to use. I think it is fine to have {{compression}}. But, if a 
user sets {{orc.compression}} and does not set {{compression}}, we just cannot 
ignore the value of {{orc.compression}} and override it with the default value 
of {{compression}}. 

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec; its value is also set to orc.compress (the ORC conf used for 
> the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of "compression" (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.
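
To make the precedence question concrete, a sketch of the two ways a user might 
request a codec when writing ORC ({{df}} stands for any DataFrame and the paths 
are placeholders); per this issue, the second form should be respected rather 
than overridden by the default of {{compression}}:
{code}
// Spark SQL's own writer option
df.write.option("compression", "zlib").orc("/path/to/out1")
// The native ORC conf passed directly as a writer option, as described here
df.write.option("orc.compress", "ZLIB").orc("/path/to/out2")
{code}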



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2016-08-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5312.
--
Resolution: Won't Fix

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> -Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.-
> There is a tool called [ClassUtil|http://software.clapper.org/classutil/] 
> that seems to help give this kind of information much more directly. We 
> should look into using that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers

2016-08-05 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16921:


 Summary: RDD/DataFrame persist() and cache() should return Python 
context managers
 Key: SPARK-16921
 URL: https://issues.apache.org/jira/browse/SPARK-16921
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core, SQL
Reporter: Nicholas Chammas
Priority: Minor


[Context 
managers|https://docs.python.org/3/reference/datamodel.html#context-managers] 
are a natural way to capture closely related setup and teardown code in Python.

For example, they are commonly used when doing file I/O:

{code}
with open('/path/to/file') as f:
contents = f.read()
...
{code}

Once the program exits the {{with}} block, {{f}} is automatically closed.

I think it makes sense to apply this pattern to persisting and unpersisting 
DataFrames and RDDs. There are many cases when you want to persist a DataFrame 
for a specific set of operations and then unpersist it immediately afterwards.

For example, take model training. Today, you might do something like this:

{code}
labeled_data.persist()
model = pipeline.fit(labeled_data)
labeled_data.unpersist()
{code}

If {{persist()}} returned a context manager, you could rewrite this as follows:

{code}
with labeled_data.persist():
model = pipeline.fit(labeled_data)
{code}

Upon exiting the {{with}} block, {{labeled_data}} would automatically be 
unpersisted.

This can be done in a backwards-compatible way since {{persist()}} would still 
return the parent DataFrame or RDD as it does today, while also adding two methods 
to the object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-08-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-7505.
---
Resolution: Invalid

Closing this as invalid as I believe these issues are no longer important.

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
> etc.
> 
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
> {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently 
> not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames 
> that have identically named columns, because it is not obvious:
>   {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other   
> 
> 4 I dunno  {code}
> # 
> [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
>  and 
> [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
>  should be marked as aliases if that's what they are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2016-08-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409767#comment-15409767
 ] 

Nicholas Chammas commented on SPARK-5312:
-

[~boyork] - Shall we close this? It doesn't look like it has any momentum.

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> -Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.-
> There is a tool called [ClassUtil|http://software.clapper.org/classutil/] 
> that seems to help give this kind of information much more directly. We 
> should look into using that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2016-08-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16920:
--
Issue Type: Improvement  (was: Bug)

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> There were several issues with the PR resolving SPARK-15858; my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This needs to be investigated to determine 
> whether it causes problems for large numbers of trees (say, 1000); an 
> appropriate test should be added, and the regression then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2016-08-05 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16920:
-

 Summary: Investigate and fix issues introduced in SPARK-15858
 Key: SPARK-16920
 URL: https://issues.apache.org/jira/browse/SPARK-16920
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Vladimir Feinberg


There were several issues with the PR resolving SPARK-15858; my comments 
are available here:

https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93

The two most important issues are:

1. The PR did not add a stress test proving it resolved the issue it was 
supposed to (though I have no doubt the optimization made is indeed correct).
2. The PR introduced quadratic prediction time in terms of the number of trees, 
which was previously linear. This needs to be investigated to determine whether it 
causes problems for large numbers of trees (say, 1000); an appropriate test 
should be added, and the regression then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409691#comment-15409691
 ] 

Shivaram Venkataraman commented on SPARK-16883:
---

The thing to check, then, is how the serialization / deserialization is 
happening. That is, in SerDe.scala we should be writing the Decimal out as a 
numeric for this to work correctly.

> SQL decimal type is not properly cast to number when collecting SparkDataFrame
> --
>
> Key: SPARK-16883
> URL: https://issues.apache.org/jira/browse/SPARK-16883
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> To reproduce, run the following code. As you can see, "y" is a list of values.
> {code}
> registerTempTable(createDataFrame(iris), "iris")
> str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
> from iris limit 5")))
> 'data.frame': 5 obs. of  2 variables:
>  $ x: num  1 1 1 1 1
>  $ y:List of 5
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
>   ..$ : num 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409660#comment-15409660
 ] 

Mikael Ståldal commented on SPARK-6305:
---

For unit tests, you can put a logging configuration file in 
{{src/test/resources}}.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409655#comment-15409655
 ] 

Mikael Ståldal commented on SPARK-6305:
---

It should not be necessary to do explicit logging configuration for the REPL as 
done in Logging.scala (lines 130-141); you can include a log configuration file 
in {{src/main/resources}} in spark-repl instead.

This is not as easy for the R and Python backends, though, since they are in the 
spark-core module. However, I think you should at least move that code to those 
classes rather than keeping it in Logging.scala.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409646#comment-15409646
 ] 

Mikael Ståldal commented on SPARK-6305:
---

It might not be necessary to load the logging configuration explicitly as done in 
Logging.scala (lines 117-128). You could include a log configuration file in 
{{src/main/resources}}, and then the user can override it with a Java system 
property.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-05 Thread Sudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudev updated SPARK-16917:
--
Description: 
It would be nice to have Kafka version compatibility information in the 
official documentation. 

It's very confusing now. 
* If you look at this JIRA[1], it seems like Kafka is supported in Spark 2.0.0.
* The documentation lists artifact for (Kafka 0.8)  
spark-streaming-kafka-0-8_2.11

Is Kafka 0.9 supported by Spark 2.0.0 ?

Since I'm confused here even after an hours effort googling on the same, I 
think someone should help add the compatibility matrix.

[1] https://issues.apache.org/jira/browse/SPARK-12177



  was:
It would be nice to have Kafka version compatibility information in the 
official documentation. 

It's very confusing now. 
* If you look at this JIRA[1], it seems like Kafka is supported in Spark 2.0.0.
* The documentation lists attifact for (Kafka 0.8)  
spark-streaming-kafka-0-8_2.11

Is Kafka 0.9 supported by Spark 2.0.0 ?

Since I'm confused here even after an hours effort googling on the same, I 
think someone should help add the compatibility matrix.

[1] https://issues.apache.org/jira/browse/SPARK-12177




> Spark streaming kafka version compatibility. 
> -
>
> Key: SPARK-16917
> URL: https://issues.apache.org/jira/browse/SPARK-16917
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Sudev
>Priority: Trivial
>  Labels: documentation
>
> It would be nice to have Kafka version compatibility information in the 
> official documentation. 
> It's very confusing now. 
> * If you look at this JIRA [1], it seems like Kafka is supported in Spark 
> 2.0.0.
> * The documentation lists the artifact for Kafka 0.8 
> (spark-streaming-kafka-0-8_2.11).
> Is Kafka 0.9 supported by Spark 2.0.0?
> Since I'm still confused after an hour's effort googling this, I 
> think someone should help add a compatibility matrix.
> [1] https://issues.apache.org/jira/browse/SPARK-12177



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


