[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105086#comment-14105086
 ] 

Saisai Shao commented on SPARK-3129:


Hi Hari, I have some high-level questions about this:

1. In the design doc, you mention "Once the RDD is generated, the RDD is 
checkpointed to HDFS - at which point it is fully recoverable". I'm not sure 
whether you checkpoint only the RDD's metadata or also its data? RDD 
checkpointing is a little expensive for every batch if the batch duration is 
quite short.
2. If we keep executors alive when the driver dies, do we still need to keep 
receivers receiving data from the external source? If so, I think there are 
potentially some problems: first, memory usage will accumulate since no data is 
consumed; second, when the driver comes back, how do we balance processing 
priority? Old data needs to be processed first, which will delay processing of 
newly arriving data and lead to unwanted issues if the latency is larger than 
the batch duration.
3. In some scenarios we operate on a DStream together with an RDD (like joining 
real-time data with a history log); normally that RDD is cached in the 
BlockManager's memory. I think we also need to recover this RDD's metadata, not 
only the streaming data, if we want to recover the processing.

Maybe there are many other details to think about, because driver HA is quite 
complex. Please correct me if I have misunderstood something. Thanks a 
lot.
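
For reference on question 1, here is a minimal sketch of how checkpointing is 
enabled in a streaming program today (the directory, host/port and batch 
interval are placeholders; this only illustrates the existing API, not the 
driver-HA design under discussion):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointSketch")
    val ssc = new StreamingContext(conf, Seconds(2))
    // The DStream graph and generated RDD metadata (plus data for stateful
    // operations) are periodically written to this fault-tolerant directory.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
    ssc.socketTextStream("localhost", 9999)
      .countByWindow(Seconds(30), Seconds(2))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}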


 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The attached document has more details. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105126#comment-14105126
 ] 

Sandy Ryza commented on SPARK-2978:
---

So I started looking into this a little more and wanted to bring up a semantics 
issue I came across.

The proposed implementation would be to use a similar path to that used by 
sortByKey in each reduce task, and then wrap the Iterator over sorted records 
with an Iterator that groups them, i.e. wrap the Iterator[(K, V)] in an 
Iterator[(K, Iterator[V])].  The question is how to handle the validity of an 
inner V iterator with respect to the outer Iterator.  The options as I see it 
are:
1. Calling next() or hasNext() on the outer iterator invalidates the current 
inner V iterator.
2. The inner V iterator must be exhausted before calling next() or hasNext() on 
the outer iterator.
3. On each next() call on the outer iterator, scan over all the values for that 
key and put them in a separate buffer. 

The MapReduce approach, where the outer iterator is replaced by a sequence of 
calls to the reduce function, is similar to (1).

When the Iterators returned by groupByKey are eventually disk-backed, we'll 
face the same issue, so we probably want to make the semantics there consistent 
with whatever we decide here.
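
To make option (1) concrete, here is a rough sketch of such a wrapper over a 
key-sorted iterator (names and details invented, not a proposed API): advancing 
or probing the outer iterator skips whatever is left of the current key, which 
is what "invalidates" the inner iterator.

{code}
// Wraps a key-sorted Iterator[(K, V)] as an Iterator[(K, Iterator[V])] with
// option (1) semantics: calling next() or hasNext() on the outer iterator
// invalidates the inner iterator handed out for the previous key.
class GroupedIterator[K, V](sorted: Iterator[(K, V)])
  extends Iterator[(K, Iterator[V])] {

  private val buf = sorted.buffered
  private var currentKey: Option[K] = None

  // Skip any values of the previous key that the caller did not consume.
  private def skipCurrent(): Unit = currentKey.foreach { k =>
    while (buf.hasNext && buf.head._1 == k) buf.next()
  }

  override def hasNext: Boolean = { skipCurrent(); currentKey = None; buf.hasNext }

  override def next(): (K, Iterator[V]) = {
    skipCurrent()
    val key = buf.head._1
    currentKey = Some(key)
    val values = new Iterator[V] {
      def hasNext = buf.hasNext && buf.head._1 == key
      def next() = buf.next()._2
    }
    (key, values)
  }
}
{code}

With these semantics, a pattern like it.foreach { case (k, vs) => vs.foreach(...) } 
works, but stashing the inner iterators away for later use does not - which is 
exactly the trade-off described above.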


 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105128#comment-14105128
 ] 

Sandy Ryza commented on SPARK-2978:
---

[~jerryshao], if I understand correctly, ShuffleRDD already supports what's 
needed here, and satisfying that need is independent of whether we sort on the 
map side.  That said, I think the changes you proposed on SPARK-2926 could 
definitely make this more performant, and we would likely see the same 
improvements you benchmarked for sortByKey.

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2014-08-21 Thread Hanwei Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105221#comment-14105221
 ] 

Hanwei Jin commented on SPARK-2096:
---

I think I have almost solved the issue. 

I have passed the (currently ignored) test case in JsonSuite, "Complex field and 
type inferring (Ignored)", with a small modification.

Modified test part:
checkAnswer(
  sql("select arrayOfStruct.field1, arrayOfStruct.field2 from jsonTable"),
  (Seq(true, false, null), Seq("str1", null, null)) :: Nil
)

However, another open question is that repeated nested structures are still a 
problem, e.g. arrayOfStruct.field1.arrayOfStruct.field1 or 
arrayOfStruct[0].field1.arrayOfStruct[0].field1.

I plan to ignore that problem for now and try to add support for "select 
arrayOfStruct.field1, arrayOfStruct.field2 from jsonTable where 
arrayOfStruct.field1==true".

Besides, my friend anyweil (Wei Li) solved the problem of arrayOfStruct.field1 
and its Filter part (i.e. WHERE parsing).

I am new here but will continue working on Spark :)


 Correctly parse dot notations for accessing an array of structs
 ---

 Key: SPARK-2096
 URL: https://issues.apache.org/jira/browse/SPARK-2096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Priority: Minor
  Labels: starter

 For example, arrayOfStruct is an array of structs and every element of this 
 array has a field called field1. arrayOfStruct[0].field1 means to access 
 the value of field1 for the first element of arrayOfStruct, but the SQL 
 parser (in sql-core) treats field1 as an alias. Also, 
 arrayOfStruct.field1 means to access all values of field1 in this array 
 of structs and then returns those values as an array. But the SQL parser 
 cannot resolve it.
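 As a hypothetical illustration of the intended semantics (assuming the 1.1 JSON 
 API, jsonRDD/registerTempTable; table name and data made up):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 object DotNotationExample {
   def main(args: Array[String]) {
     val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("DotNotationExample"))
     val sqlContext = new SQLContext(sc)

     val json = sc.parallelize(
       """{"arrayOfStruct": [{"field1": true, "field2": "str1"}, {"field1": false}]}""" :: Nil)
     sqlContext.jsonRDD(json).registerTempTable("jsonTable")

     // First element only: intended result is true (currently mis-parsed as an alias).
     sqlContext.sql("SELECT arrayOfStruct[0].field1 FROM jsonTable").collect().foreach(println)
     // Every element: intended result is the array [true, false]; this is the
     // query the parser currently cannot resolve.
     sqlContext.sql("SELECT arrayOfStruct.field1 FROM jsonTable").collect().foreach(println)
   }
 }
 {code}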



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2988) Port repl to scala 2.11.

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105238#comment-14105238
 ] 

Apache Spark commented on SPARK-2988:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/2079

 Port repl to scala 2.11.
 

 Key: SPARK-2988
 URL: https://issues.apache.org/jira/browse/SPARK-2988
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Prashant Sharma
Assignee: Prashant Sharma





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-21 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta reopened SPARK-2963:
---


 The description about building to use HiveServer and CLI is incomplete
 --

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.0


 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but its description is incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2963) The description about how to build for using CLI and Thrift JDBC server is absent in proper document

2014-08-21 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Summary: The description about how to build for using CLI and Thrift JDBC 
server is absent in proper document   (was: The description about building to 
use HiveServer and CLI is incomplete)

 The description about how to build for using CLI and Thrift JDBC server is 
 absent in proper document 
 -

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.0


 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but its description is incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2963) The description about how to build for using CLI and Thrift JDBC server is absent in proper document

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105273#comment-14105273
 ] 

Apache Spark commented on SPARK-2963:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2080

 The description about how to build for using CLI and Thrift JDBC server is 
 absent in proper document 
 -

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.0


 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
 the -Phive-thriftserver option when building, but its description is incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3169) make-distribution.sh failed

2014-08-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105277#comment-14105277
 ] 

Sean Owen commented on SPARK-3169:
--

Same as https://issues.apache.org/jira/browse/SPARK-2798? It's resolving 
similar problems in the Flume build.

 make-distribution.sh failed
 ---

 Key: SPARK-3169
 URL: https://issues.apache.org/jira/browse/SPARK-3169
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Guoqiang Li
Priority: Blocker

 {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
 -Dhadoop.version=2.3.0 
 {code}
  =
 {noformat}
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
 Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
 signature in TestSuiteBase.class refers to term dstream
 in package org.apache.spark.streaming which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 TestSuiteBase.class.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system

2014-08-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105303#comment-14105303
 ] 

Sean Owen commented on SPARK-1449:
--

[~pwendell] can you or someone else on the PMC zap this one? Should be 
straightforward.

 Please delete old releases from mirroring system
 

 Key: SPARK-1449
 URL: https://issues.apache.org/jira/browse/SPARK-1449
 Project: Spark
  Issue Type: Task
Affects Versions: 0.8.0, 0.8.1, 0.9.0, 0.9.1
Reporter: Sebb

 To reduce the load on the ASF mirrors, projects are required to delete old 
 releases [1]
 Please can you remove all non-current releases?
 Thanks!
 [Note that older releases are always available from the ASF archive server]
 Any links to older releases on download pages should first be adjusted to 
 point to the archive server.
 [1] http://www.apache.org/dev/release.html#when-to-archive



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2014-08-21 Thread Hanwei Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105329#comment-14105329
 ] 

Hanwei Jin commented on SPARK-2096:
---

I checked the problem of "where arrayOfStruct.field1==true".

Supporting it would require modifying every kind of comparisonExpression, and I 
don't think it makes sense to add that, so I am dropping it.

Over.

 Correctly parse dot notations for accessing an array of structs
 ---

 Key: SPARK-2096
 URL: https://issues.apache.org/jira/browse/SPARK-2096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Priority: Minor
  Labels: starter

 For example, arrayOfStruct is an array of structs and every element of this 
 array has a field called field1. arrayOfStruct[0].field1 means to access 
 the value of field1 for the first element of arrayOfStruct, but the SQL 
 parser (in sql-core) treats field1 as an alias. Also, 
 arrayOfStruct.field1 means to access all values of field1 in this array 
 of structs and then returns those values as an array. But the SQL parser 
 cannot resolve it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2291) Update EC2 scripts to use instance storage on m3 instance types

2014-08-21 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105342#comment-14105342
 ] 

Daniel Darabos commented on SPARK-2291:
---

I don't know if something has changed on Amazon's end or if I'm missing 
something. (I'm pretty clueless.) But we still see missing SSDs. This change 
fixed it for us: https://github.com/apache/spark/pull/2081/files. The block 
device mapping entries are necessary according to 
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios.

I guess you tested PR #1156. Actually it seemed to have worked for us too for a 
while. But now some of the machines come up without SSDs. (/dev/sdb and 
/dev/sdc do not exist.) So I read the docs and tried adding the block device 
mappings. Seems to work. With PR #2081 all machines have the SSDs.

Hope this makes sense.

 Update EC2 scripts to use instance storage on m3 instance types
 ---

 Key: SPARK-2291
 URL: https://issues.apache.org/jira/browse/SPARK-2291
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Alessandro Andrioni

 [On January 
 21|https://aws.amazon.com/about-aws/whats-new/2014/01/21/announcing-new-amazon-ec2-m3-instance-sizes-and-lower-prices-for-amazon-s3-and-amazon-ebs/],
  Amazon added SSD-backed instance storages for m3 instances, and also added 
 two new types: m3.medium and m3.large.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105353#comment-14105353
 ] 

Apache Spark commented on SPARK-2096:
-

User 'chuxi' has created a pull request for this issue:
https://github.com/apache/spark/pull/2082

 Correctly parse dot notations for accessing an array of structs
 ---

 Key: SPARK-2096
 URL: https://issues.apache.org/jira/browse/SPARK-2096
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yin Huai
Priority: Minor
  Labels: starter

 For example, arrayOfStruct is an array of structs and every element of this 
 array has a field called field1. arrayOfStruct[0].field1 means to access 
 the value of field1 for the first element of arrayOfStruct, but the SQL 
 parser (in sql-core) treats field1 as an alias. Also, 
 arrayOfStruct.field1 means to access all values of field1 in this array 
 of structs and then returns those values as an array. But the SQL parser 
 cannot resolve it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3150) NullPointerException in Spark recovery after simultaneous fall of master and driver

2014-08-21 Thread Tatiana Borisova (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tatiana Borisova updated SPARK-3150:


Description: 
The issue happens when Spark is run standalone on a cluster.

When the master and the driver fall simultaneously on one node in a cluster, the 
master tries to recover its state and restart the Spark driver.
While restarting the driver, it fails with an NPE (stacktrace below).
After failing, the master restarts, tries to recover its state, and restarts the 
Spark driver again, over and over in an infinite cycle.

Namely, Spark tries to read the DriverInfo state from ZooKeeper, but after 
reading, DriverInfo.worker happens to be null.

Stacktrace (on version 1.0.0, but reproducible on version 1.0.2, too):

[2014-08-14 21:44:59,519] ERROR  (akka.actor.OneForOneStrategy)
java.lang.NullPointerException
at 
org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
at 
org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
at 
org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

How to reproduce: when running Spark standalone on a cluster, kill all Spark 
processes on the node where the driver runs (i.e. kill the driver, master and 
worker simultaneously).

  was:
The issue happens when Spark is run standalone on a cluster.

When master and driver fall simultaneously on one node in a cluster, master 
tries to recover its state and restart spark driver.
While restarting driver, it falls with NPE exception (stacktrace is below).
After falling, it restarts and tries to recover its state and restart Spark 
driver again. It happens over and over in an infinite cycle.

Namely, Spark tries to read DriverInfo state from zookeeper, but after reading 
it happens to be null in DriverInfo.worker.

Stacktrace (on version 1.0.0, but reproduceable on version 1.0.2, too)

2014-08-14 21:44:59,519] ERROR  (akka.actor.OneForOneStrategy)
java.lang.NullPointerException
at 
org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
at 
org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
at 
org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

How to reproduce: kill both master and driver processes on some cluster node 
when running Spark standalone on a cluster.


 NullPointerException in Spark recovery after simultaneous fall of master and 
 driver
 ---

 Key: SPARK-3150
   

[jira] [Created] (SPARK-3172) Distinguish between shuffle spill on the map and reduce side

2014-08-21 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3172:
-

 Summary: Distinguish between shuffle spill on the map and reduce 
side
 Key: SPARK-3172
 URL: https://issues.apache.org/jira/browse/SPARK-3172
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3070) Kryo deserialization without using the custom registrator

2014-08-21 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105536#comment-14105536
 ] 

Daniel Darabos commented on SPARK-3070:
---

I think this is almost certainly a duplicate of 
https://issues.apache.org/jira/browse/SPARK-2878. Which is FIXED, thanks to 
Graham Dennis! Can you please check the repro against the fixed code to see if 
this can be closed? Thanks :).

 Kryo deserialization without using the custom registrator
 -

 Key: SPARK-3070
 URL: https://issues.apache.org/jira/browse/SPARK-3070
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andras Nemeth

 If an RDD partition is cached on executor1 and used by a task on executor2, 
 then the partition needs to be serialized and sent over. For this particular 
 serialization/deserialization use case, when using Kryo, it appears that the 
 custom registrator will not be used on the deserialization side. This of 
 course results in some totally misleading Kryo deserialization errors.
 The cause for this behavior seems to be that the thread running this 
 deserialization has a classloader which does not have the jars specified in 
 the SparkConf on its classpath. So it fails to load the Registrator with a 
 ClassNotFoundException, but it catches the exception and happily continues 
 without a registrator. (A bug in its own right, in my opinion.)
 To reproduce, have two RDDs partitioned the same way (as in, with the same 
 partitioner) but with corresponding partitions cached on different machines, 
 then join them. See below for a somewhat convoluted way to achieve this. If you 
 run the program below on a Spark cluster with two workers, each with one core, 
 you will be able to trigger the bug. Basically it runs two counts in 
 parallel, which ensures that the two RDDs will be computed in parallel, and 
 as a consequence on different executors.
 {code:java}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.HashPartitioner
 import org.apache.spark.SparkConf
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.serializer.KryoRegistrator
 import scala.actors.Actor
 case class MyClass(a: Int)
 class MyKryoRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
     kryo.register(classOf[MyClass])
   }
 }
 class CountActor(rdd: RDD[_]) extends Actor {
   def act() {
     println("Start count")
     println(rdd.count)
     println("Stop count")
   }
 }
 object KryBugExample {
   def main(args: Array[String]) {
     val sparkConf = new SparkConf()
       .setMaster(args(0))
       .setAppName("KryBugExample")
       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .set("spark.kryo.registrator", "MyKryoRegistrator")
       .setJars(Seq("target/scala-2.10/krybugexample_2.10-0.1-SNAPSHOT.jar"))
     val sc = new SparkContext(sparkConf)
     val partitioner = new HashPartitioner(1)
     val rdd1 = sc
       .parallelize((0 until 10).map(i => (i, MyClass(i))), 1)
       .partitionBy(partitioner).cache
     val rdd2 = sc
       .parallelize((0 until 10).map(i => (i, MyClass(i * 2))), 1)
       .partitionBy(partitioner).cache
     new CountActor(rdd1).start
     new CountActor(rdd2).start
     println(rdd1.join(rdd2).count)
     while (true) {}
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions

2014-08-21 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105537#comment-14105537
 ] 

William Benton commented on SPARK-2863:
---

I wrote up how Hive handles type coercions in a blog post:

http://chapeau.freevariable.com/2014/08/existing-system-coercion.html

The short version is that strings can be coerced to doubles or decimals and (in 
Hive 0.13) decimals can be coerced to doubles for numeric functions.  As a 
first pass, I propose extending the numeric function helpers to handle strings.
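
To illustrate the coercion rule described above in isolation (plain Scala, not 
Catalyst's actual expression machinery; the helper names are made up):

{code}
object HiveLikeCoercion {
  // Coerce the argument types Hive accepts for numeric functions down to
  // Double: strings and decimals become doubles, per the rules above.
  def coerceToDouble(v: Any): Option[Double] = v match {
    case d: Double       => Some(d)
    case dec: BigDecimal => Some(dec.toDouble)
    case s: String       => try Some(s.toDouble) catch { case _: NumberFormatException => None }
    case i: Int          => Some(i.toDouble)
    case l: Long         => Some(l.toDouble)
    case _               => None
  }

  // A natively implemented SQRT could then accept "2" the way Hive does.
  def sqrtLikeHive(arg: Any): Option[Double] = coerceToDouble(arg).map(math.sqrt)

  def main(args: Array[String]) {
    println(sqrtLikeHive("2"))           // Some(1.4142135623730951)
    println(sqrtLikeHive(BigDecimal(2))) // Some(1.4142135623730951)
  }
}
{code}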

 Emulate Hive type coercion in native reimplementations of Hive functions
 

 Key: SPARK-2863
 URL: https://issues.apache.org/jira/browse/SPARK-2863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton

 Native reimplementations of Hive functions no longer have the same 
 type-coercion behavior as they would if executed via Hive.  As [Michael 
 Armbrust points 
 out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries 
 like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if 
 {{SQRT}} is implemented natively.
 Spark SQL should have Hive-compatible type coercions for arguments to 
 natively-implemented functions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3173) Timestamp support in the parser

2014-08-21 Thread Zdenek Farana (JIRA)
Zdenek Farana created SPARK-3173:


 Summary: Timestamp support in the parser
 Key: SPARK-3173
 URL: https://issues.apache.org/jira/browse/SPARK-3173
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: Zdenek Farana


If you have a table with a TIMESTAMP column, that column can't be used properly 
in a WHERE clause - it is not evaluated properly.

For example, SELECT * FROM a WHERE timestamp='2014-08-21 00:00:00.0' would 
return nothing even if there were a row with such a timestamp. The literal is 
not interpreted as a timestamp.

The workaround SELECT * FROM a WHERE timestamp=CAST('2014-08-21 00:00:00.0' AS 
TIMESTAMP) fails because the parser does not allow anything but STRING in the 
CAST dataType expression.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3065) Add Locale setting to HiveCompatibilitySuite

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105628#comment-14105628
 ] 

Apache Spark commented on SPARK-3065:
-

User 'byF' has created a pull request for this issue:
https://github.com/apache/spark/pull/2084

 Add Locale setting to HiveCompatibilitySuite
 

 Key: SPARK-3065
 URL: https://issues.apache.org/jira/browse/SPARK-3065
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
 Environment: CentOS release 6.3 (Final)
Reporter: luogankun
 Fix For: 1.0.2


 Running the udf_unix_timestamp test case of 
 org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
 with a TimeZone other than America/Los_Angeles throws: 
 [info] - udf_unix_timestamp *** FAILED ***
 [info]   Results do not match for udf_unix_timestamp:
 [info]   SELECT
 [info] '2009 Mar 20 11:30:01 am',
 [info] unix_timestamp('2009 Mar 20 11:30:01 am', ' MMM dd h:mm:ss a')
 [info]   FROM oneline
 [info]   == Logical Plan ==
 [info]   Project [2009 Mar 20 11:30:01 am AS 
 c_0#25,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2009
  Mar 20 11:30:01 am, MMM dd h:mm:ss a) AS c_1#26L]
 [info]MetastoreRelation default, oneline, None
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Project [2009 Mar 20 11:30:01 am AS 
 c_0#25,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2009
  Mar 20 11:30:01 am, MMM dd h:mm:ss a) AS c_1#26L]
 [info]MetastoreRelation default, oneline, None
 [info]   
 [info]   == Physical Plan ==
 [info]   Project [2009 Mar 20 11:30:01 am AS 
 c_0#25,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2009
  Mar 20 11:30:01 am, MMM dd h:mm:ss a) AS c_1#26L]
 [info]HiveTableScan [], (MetastoreRelation default, oneline, None), None
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   (2) MappedRDD[37] at map at HiveContext.scala:350
 [info] MapPartitionsRDD[36] at mapPartitions at basicOperators.scala:42
 [info] MapPartitionsRDD[35] at mapPartitions at TableReader.scala:112
 [info] MappedRDD[34] at map at TableReader.scala:240
 [info] HadoopRDD[33] at HadoopRDD at TableReader.scala:230
 [info]   c_0c_1
 [info]   !== HIVE - 1 row(s) ==== CATALYST - 1 row(s) ==
 [info]   !2009 Mar 20 11:30:01 am   1237573801   2009 Mar 20 11:30:01 am  
   NULL (HiveComparisonTest.scala:367)
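 Pinning the JVM default Locale and TimeZone around the suite would make the 
 golden answers stable; a minimal sketch of that pattern (names are illustrative, 
 not the actual HiveCompatibilitySuite change; the format pattern is assumed to 
 be "yyyy MMM dd h:mm:ss a"):
 {code}
 import java.util.{Locale, TimeZone}

 object WithFixedLocale {
   // Run a block with the JVM default TimeZone and Locale pinned to the
   // values the Hive golden answers assume, then restore the originals.
   def apply[T](body: => T): T = {
     val originalTz     = TimeZone.getDefault
     val originalLocale = Locale.getDefault
     TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
     Locale.setDefault(Locale.US)
     try body finally {
       TimeZone.setDefault(originalTz)
       Locale.setDefault(originalLocale)
     }
   }

   def main(args: Array[String]) {
     WithFixedLocale {
       val fmt = new java.text.SimpleDateFormat("yyyy MMM dd h:mm:ss a")
       println(fmt.parse("2009 Mar 20 11:30:01 am").getTime / 1000) // 1237573801
     }
   }
 }
 {code}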



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3174) Under YARN, add and remove executors based on load

2014-08-21 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3174:
-

 Summary: Under YARN, add and remove executors based on load
 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza


A common complaint with Spark in a multi-tenant environment is that 
applications have a fixed allocation that doesn't grow and shrink with their 
resource needs.  We're blocked on YARN-1197 for dynamically changing the 
resources within executors, but we can still allocate and discard whole 
executors.

I think it would be useful to have some heuristics that
* Request more executors when many pending tasks are building up
* Request more executors when RDDs can't fit in memory
* Discard executors when few tasks are running / pending and there's not much 
in memory

Bonus points: migrate blocks from executors we're about to discard to executors 
with free space.
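
A toy sketch of the kind of heuristic described above (all names and thresholds 
invented; not a proposed API):

{code}
object ExecutorScalingHeuristic {
  // Return how many executors to add (positive) or release (negative),
  // given a rough view of the application's state.
  def desiredDelta(
      pendingTasks: Int,
      runningTasks: Int,
      cachedBytes: Long,
      memoryPerExecutor: Long,
      currentExecutors: Int): Int = {
    val backlogged     = pendingTasks > 2 * currentExecutors
    val memoryPressure = cachedBytes > 0.8 * memoryPerExecutor * currentExecutors
    val mostlyIdle     = pendingTasks == 0 && runningTasks < currentExecutors / 2

    if (backlogged || memoryPressure) currentExecutors               // ask to double (up to some cap)
    else if (mostlyIdle && !memoryPressure) -(currentExecutors / 4)  // shed a quarter
    else 0
  }

  def main(args: Array[String]) {
    // Heavy backlog: request more executors.
    println(desiredDelta(pendingTasks = 100, runningTasks = 10,
      cachedBytes = 0L, memoryPerExecutor = 4L << 30, currentExecutors = 10)) // 10
  }
}
{code}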



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3175) Branch-1.1 SBT build failed for Yarn-Alpha

2014-08-21 Thread Chester (JIRA)
Chester created SPARK-3175:
--

 Summary: Branch-1.1 SBT build failed for Yarn-Alpha
 Key: SPARK-3175
 URL: https://issues.apache.org/jira/browse/SPARK-3175
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.1
Reporter: Chester
 Fix For: 1.1.1


When trying to build yarn-alpha on branch-1.1

᚛ |branch-1.1|$  sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha projects

[info] Loading project definition from /Users/chester/projects/spark/project

org.apache.maven.model.building.ModelBuildingException: 1 problem was 
encountered while building the effective model for 
org.apache.spark:spark-yarn-alpha_2.10:1.1.0

[FATAL] Non-resolvable parent POM: Could not find artifact 
org.apache.spark:yarn-parent_2.10:pom:1.1.0 in central ( 
http://repo.maven.apache.org/maven2) and 'parent.relativePath' points at wrong 
local POM @ line 20, column 11







--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-21 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105928#comment-14105928
 ] 

Hari Shreedharan commented on SPARK-3129:
-

[~tgraves] - Thanks for the pointers. Yes, using HDFS also allows us to use the 
same file, with some protection, to store the keys. This is something that might 
need some design and discussion first. 

I will also update the PR with the reflection code.

[~jerryshao]:
1. Today RDDs already get checkpointed at the end of every job when the runJob 
method gets called. Nothing is changing here. The entire graph does get 
checkpointed today already.
2. No, this is something that will need to be taken care of. When the driver 
dies, blocks can no longer be batched into RDDs - which means generating blocks 
without the driver makes no sense. Also, when the driver comes back online, new 
receivers get created, which would start receiving the data now. The only 
reason the executors are being kept around is to get the data in their memory - 
any processing/receiving should be killed.
3. Since it is an RDD, there is nothing that stops it from being recovered, 
right? It is recovered by the usual method of regenerating it. Only DStream 
data that has not been converted into an RDD is really lost - so getting the 
RDD back should not be a concern at all (of course, the cache is gone, but it 
can get pulled back into cache once the driver comes back up).


 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The attached document has more details. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3176) Implement math function 'POWER' and 'ABS' for sql

2014-08-21 Thread Xinyun Huang (JIRA)
Xinyun Huang created SPARK-3176:
---

 Summary: Implement math function 'POWER' and 'ABS' for sql
 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0


Add support for the mathematical function POWER and ABS within spark sql.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3177) Yarn-alpha ClientBaseSuite Unit test failed

2014-08-21 Thread Chester (JIRA)
Chester created SPARK-3177:
--

 Summary: Yarn-alpha ClientBaseSuite Unit test failed
 Key: SPARK-3177
 URL: https://issues.apache.org/jira/browse/SPARK-3177
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.1
Reporter: Chester
Priority: Minor
 Fix For: 1.1.1


The yarn-alpha ClientBaseSuite unit test fails due to a difference in the 
MRJobConfig API between yarn-stable and yarn-alpha. 

The class field
MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH 

in yarn-alpha 
returns a String array,

in yarn-stable 
returns a String. 

The method below works for yarn-stable but fails for yarn-alpha, as it tries to 
cast a String array to a String. 

val knownDefMRAppCP: Seq[String] =
  getFieldValue[String, Seq[String]](classOf[MRJobConfig],
    "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
    Seq[String]())(a => a.split(","))
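
One way to make such a helper tolerant of both declarations is to read the field 
reflectively and pattern match on the runtime type; a self-contained sketch 
(not the actual ClientBase code):

{code}
object MRAppClasspath {
  // Read a static field reflectively and normalize it to Seq[String],
  // whether it is declared as a String (yarn-stable) or as a String
  // array (yarn-alpha).
  def classpathField(clazz: Class[_], fieldName: String): Seq[String] =
    try {
      clazz.getField(fieldName).get(null) match {
        case s: String          => s.split(",").toSeq
        case arr: Array[String] => arr.toSeq
        case _                  => Seq.empty
      }
    } catch {
      case _: NoSuchFieldException => Seq.empty
    }

  def main(args: Array[String]) {
    // Stand-in example that runs without Hadoop on the classpath:
    println(classpathField(classOf[java.io.File], "separator"))
    // With YARN present one would pass classOf[MRJobConfig] and
    // "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH" instead.
  }
}
{code}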




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3177) Yarn-alpha ClientBaseSuite Unit test failed

2014-08-21 Thread Chester (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105984#comment-14105984
 ] 

Chester commented on SPARK-3177:


This issue should exist on the master branch as well. It has been there for a 
while. 

 Yarn-alpha ClientBaseSuite Unit test failed
 ---

 Key: SPARK-3177
 URL: https://issues.apache.org/jira/browse/SPARK-3177
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.1
Reporter: Chester
Priority: Minor
  Labels: test
 Fix For: 1.1.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 The yarn-alpha ClientBaseSuite unit test fails due to a difference in the 
 MRJobConfig API between yarn-stable and yarn-alpha. 
 The class field
 MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH 
 in yarn-alpha 
 returns a String array,
 in yarn-stable 
 returns a String. 
 The method below works for yarn-stable but fails for yarn-alpha, as it tries to 
 cast a String array to a String. 
 val knownDefMRAppCP: Seq[String] =
   getFieldValue[String, Seq[String]](classOf[MRJobConfig],
     "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
     Seq[String]())(a => a.split(","))



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-08-21 Thread Jon Haddad (JIRA)
Jon Haddad created SPARK-3178:
-

 Summary: setting SPARK_WORKER_MEMORY to a value without a label (m 
or g) sets the worker memory limit to zero
 Key: SPARK-3178
 URL: https://issues.apache.org/jira/browse/SPARK-3178
 Project: Spark
  Issue Type: Bug
 Environment: osx
Reporter: Jon Haddad


This should either default to m or just completely fail.  Starting a worker 
with zero memory isn't very helpful.
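
A sketch of the validation being asked for, as a standalone helper (this is not 
Spark's actual memory-string parsing code):

{code}
object WorkerMemory {
  // Parse a SPARK_WORKER_MEMORY-style setting. A bare number is treated as
  // megabytes instead of silently becoming zero; garbage fails loudly.
  def toMegabytes(setting: String): Int = {
    val s = setting.trim.toLowerCase
    if (s.endsWith("g")) s.dropRight(1).toInt * 1024
    else if (s.endsWith("m")) s.dropRight(1).toInt
    else if (s.nonEmpty && s.forall(_.isDigit)) s.toInt // default to megabytes
    else sys.error("Invalid memory setting: '" + setting + "' (expected e.g. 512m or 2g)")
  }

  def main(args: Array[String]) {
    println(toMegabytes("2g"))   // 2048
    println(toMegabytes("512m")) // 512
    println(toMegabytes("512"))  // 512, rather than a zero-memory worker
  }
}
{code}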



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3176) Implement 'POWER', 'ABS and 'LAST' for sql

2014-08-21 Thread Xinyun Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyun Huang updated SPARK-3176:


Description: Add support for the mathematical functions POWER and ABS, 
and the analytic function LAST to return a subset of the rows satisfying a 
query, within Spark SQL.  (was: Add support for the mathematical function 
POWER and ABS within spark sql.)
Summary: Implement 'POWER', 'ABS and 'LAST' for sql  (was: Implement 
math function 'POWER' and 'ABS' for sql)

 Implement 'POWER', 'ABS and 'LAST' for sql
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical function POWER and ABS and  the analytic 
 function last to return a subset of the rows satisfying a query within 
 spark sql.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3111) Implement the LAST analytic function for sql

2014-08-21 Thread Xinyun Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106023#comment-14106023
 ] 

Xinyun Huang commented on SPARK-3111:
-

Combined with SPARK-3176.

 Implement the LAST analytic function for sql
 

 Key: SPARK-3111
 URL: https://issues.apache.org/jira/browse/SPARK-3111
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
  Labels: sql
 Fix For: 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Add support for the analytic function last to return a subset of the rows 
 satisfying a query within spark sql.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2621) Update task InputMetrics incrementally

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106118#comment-14106118
 ] 

Apache Spark commented on SPARK-2621:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/2087

 Update task InputMetrics incrementally
 --

 Key: SPARK-2621
 URL: https://issues.apache.org/jira/browse/SPARK-2621
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3179) Add task OutputMetrics

2014-08-21 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3179:
-

 Summary: Add task OutputMetrics
 Key: SPARK-3179
 URL: https://issues.apache.org/jira/browse/SPARK-3179
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza


Track the bytes that tasks write to HDFS or other output destinations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2014-08-21 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106164#comment-14106164
 ] 

Marcelo Vanzin commented on SPARK-1537:
---

No concrete timeline at the moment. I'm just starting to look at the 2.5.0 
version of ATS so I can incorporate things into my patch.

 Add integration with Yarn's Application Timeline Server
 ---

 Key: SPARK-1537
 URL: https://issues.apache.org/jira/browse/SPARK-1537
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 It would be nice to have Spark integrate with Yarn's Application Timeline 
 Server (see YARN-321, YARN-1530). This would allow users running Spark on 
 Yarn to have a single place to go for all their history needs, and avoid 
 having to manage a separate service (Spark's built-in server).
 At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
 although there is still some ongoing work. But the basics are there, and I 
 wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3180) Better control of security groups

2014-08-21 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-3180:
---

 Summary: Better control of security groups
 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira


Two features can be combined together to provide better control of security 
group policies:

- The ability to specify the address authorized to access the default security 
group (instead of letting everyone: 0.0.0.0/0)
- The possibility to place the created machines on a custom security group

One can use the combinations of the two flags to restrict external access to 
the provided security group (e.g by setting the authorized address to 
127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3180) Better control of security groups

2014-08-21 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106171#comment-14106171
 ] 

Allan Douglas R. de Oliveira commented on SPARK-3180:
-

PR: https://github.com/apache/spark/pull/2088

 Better control of security groups
 -

 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

 Two features can be combined together to provide better control of security 
 group policies:
 - The ability to specify the address authorized to access the default 
 security group (instead of letting everyone: 0.0.0.0/0)
 - The possibility to place the created machines on a custom security group
 One can use the combinations of the two flags to restrict external access to 
 the provided security group (e.g by setting the authorized address to 
 127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3180) Better control of security groups

2014-08-21 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106177#comment-14106177
 ] 

Allan Douglas R. de Oliveira commented on SPARK-3180:
-

Perhaps it also solves SPARK-2528

 Better control of security groups
 -

 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

 Two features can be combined together to provide better control of security 
 group policies:
 - The ability to specify the address authorized to access the default 
 security group (instead of letting everyone: 0.0.0.0/0)
 - The possibility to place the created machines on a custom security group
 One can use the combinations of the two flags to restrict external access to 
 the provided security group (e.g by setting the authorized address to 
 127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3180) Better control of security groups

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106176#comment-14106176
 ] 

Apache Spark commented on SPARK-3180:
-

User 'douglaz' has created a pull request for this issue:
https://github.com/apache/spark/pull/2088

 Better control of security groups
 -

 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira

 Two features can be combined together to provide better control of security 
 group policies:
 - The ability to specify the address authorized to access the default 
 security group (instead of letting everyone: 0.0.0.0/0)
 - The possibility to place the created machines on a custom security group
 One can use the combinations of the two flags to restrict external access to 
 the provided security group (e.g by setting the authorized address to 
 127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-21 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106200#comment-14106200
 ] 

Evan Chan commented on SPARK-2360:
--

+1 for this feature.

I just had to write something for importing tab-delimited CSVs and converting 
the types of each column.

As for the API, it really needs to do type conversion into the built-in types; 
otherwise it really hurts caching compression efficiency and query speed, as 
well as what functions can be run on it. I think this is crucial. 

Maybe one can pass in a Map[String, ColumnType] or something like that. If a 
type is not specified for a column, then it is assumed to be String.
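
A rough sketch of what that per-column conversion could look like, independent 
of any SchemaRDD integration (ColumnType and all names here are invented for 
illustration):

{code}
object CsvTypes {
  sealed trait ColumnType
  case object StringType extends ColumnType
  case object IntType    extends ColumnType
  case object DoubleType extends ColumnType

  // Convert one delimited line using a per-column type map; columns not
  // mentioned in the map default to String.
  def convert(line: String, header: Seq[String], types: Map[String, ColumnType],
              sep: String = "\t"): Seq[Any] =
    header.zip(line.split(sep, -1)).map { case (col, raw) =>
      types.getOrElse(col, StringType) match {
        case IntType    => raw.toInt
        case DoubleType => raw.toDouble
        case StringType => raw
      }
    }

  def main(args: Array[String]) {
    val header = Seq("name", "age", "score")
    val types  = Map("age" -> IntType, "score" -> DoubleType)
    println(convert("alice\t30\t9.5", header, types)) // List(alice, 30, 9.5)
  }
}
{code}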

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Hossein Falaki

 I think the first step is to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106446#comment-14106446
 ] 

Apache Spark commented on SPARK-2871:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2091

 Missing API in PySpark
 --

 Key: SPARK-2871
 URL: https://issues.apache.org/jira/browse/SPARK-2871
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu

 There are several APIs missing in PySpark:
 RDD.collectPartitions()
 RDD.histogram()
 RDD.zipWithIndex()
 RDD.zipWithUniqueId()
 RDD.min(comp)
 RDD.max(comp)
 A bunch of APIs related to approximate jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106449#comment-14106449
 ] 

Apache Spark commented on SPARK-2871:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2092

 Missing API in PySpark
 --

 Key: SPARK-2871
 URL: https://issues.apache.org/jira/browse/SPARK-2871
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu

 There are several APIs missing in PySpark:
 RDD.collectPartitions()
 RDD.histogram()
 RDD.zipWithIndex()
 RDD.zipWithUniqueId()
 RDD.min(comp)
 RDD.max(comp)
 A bunch of APIs related to approximate jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106456#comment-14106456
 ] 

Apache Spark commented on SPARK-2871:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2093

 Missing API in PySpark
 --

 Key: SPARK-2871
 URL: https://issues.apache.org/jira/browse/SPARK-2871
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu

 There are several APIs missing in PySpark:
 RDD.collectPartitions()
 RDD.histogram()
 RDD.zipWithIndex()
 RDD.zipWithUniqueId()
 RDD.min(comp)
 RDD.max(comp)
 A bunch of APIs related to approximate jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106462#comment-14106462
 ] 

Apache Spark commented on SPARK-2871:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2094

 Missing API in PySpark
 --

 Key: SPARK-2871
 URL: https://issues.apache.org/jira/browse/SPARK-2871
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu

 There are several APIs missing in PySpark:
 RDD.collectPartitions()
 RDD.histogram()
 RDD.zipWithIndex()
 RDD.zipWithUniqueId()
 RDD.min(comp)
 RDD.max(comp)
 A bunch of APIs related to approximate jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106481#comment-14106481
 ] 

Apache Spark commented on SPARK-2871:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2095

 Missing API in PySpark
 --

 Key: SPARK-2871
 URL: https://issues.apache.org/jira/browse/SPARK-2871
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu

 There are several APIs missing in PySpark:
 RDD.collectPartitions()
 RDD.histogram()
 RDD.zipWithIndex()
 RDD.zipWithUniqueId()
 RDD.min(comp)
 RDD.max(comp)
 A bunch of APIs related to approximate jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org