Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
some other solution. On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi wrote: > Ok, so latency problem is being generated because I'm using SQL as source? > how about csv, hive, or another source? > > On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra > wrote: > >> It is no

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
ncy it may > not be a good fit. > > M > > On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi wrote: > > Ok, so latency problem is being generated because I'm using SQL as source? > how about csv, hive, or another source? > > On Tue, Dec 1, 2015 at 9:18 PM, Mark

Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
> > It is not designed for interactive queries. You might want to ask the designers of Spark, Spark SQL, and particularly some things built on top of Spark (such as BlinkDB) about their intent with regard to interactive queries. Interactive queries are not the only designed use of Spark, but it

Re: Spark on yarn vs spark standalone

2015-11-30 Thread Mark Hamstra
Standalone mode also supports running the driver on a cluster node. See "cluster" mode in http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications . Also, http://spark.apache.org/docs/latest/spark-standalone.html#high-availability On Mon, Nov 30, 2015 at 9:47 AM, Ja

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Mark Hamstra
> > In the near future, I guess GUI interfaces of Spark will be available > soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all. > They can analyze their data by clicking a few buttons, instead of writing > the programs. : ) That's not in the future. :) On Mon, Nov 23, 201

Re: Slow stage?

2015-11-11 Thread Mark Hamstra
Those are from the Application Web UI -- look for the "DAG Visualization" and "Event Timeline" elements on Job and Stage pages. On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote: > Hi Simone, > I'm afraid I don't have an answer to your question. However I noticed the > DAG figures in the att

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workar
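A minimal sketch of the SparkContext#union approach (assuming an existing SparkContext `sc` and a hypothetical list of input paths):

    val rdds = inputPaths.map(path => sc.textFile(path))
    val combined = sc.union(rdds)  // one flat UnionRDD, instead of a lineage that grows with each RDD#union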

Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Mark Hamstra
Hah! No, that is not a "starter" issue. It touches on some fairly deep Spark architecture, and there have already been a few attempts to resolve the issue -- none entirely satisfactory, but you should definitely search out the work that has already been done. On Mon, Nov 2, 2015 at 5:51 AM, Jace

Re: foreachPartition

2015-10-30 Thread Mark Hamstra
The closure is sent to and executed on an Executor, so you need to be looking at the stdout of the Executors, not on the Driver. On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I'm just trying to do some operation inside foreachPartition, but I can't > even
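A minimal sketch (the RDD and record names are hypothetical); the println output lands in each Executor's stdout log, visible from the Executors tab of the web UI, not in the driver console:

    rdd.foreachPartition { iter =>
      // this closure runs on an Executor
      iter.foreach(record => println(s"got record: $record"))
    }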

Re: SQL queries in Spark / YARN

2015-09-28 Thread Mark Hamstra
Yes. On Mon, Sep 28, 2015 at 12:46 PM, Robert Grandl wrote: > Hi guys, > > I was wondering if it's possible to submit SQL queries to Spark SQL, when > Spark is running atop YARN instead of standalone mode. > > Thanks, > Robert >

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Mark Hamstra
Where do you see a race in the DAGScheduler? On a quick look at your stack trace, this just looks to me like a Job where a Stage failed and then the DAGScheduler aborted the failed Job. On Thu, Sep 24, 2015 at 12:00 PM, robin_up wrote: > Hi > > After upgrade to 1.5, we found a possible racing c

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Mark Hamstra
There is the Async API ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala), which makes use of FutureAction ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala). You could als
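A rough sketch of timing out a laggard job with the Async API (the RDD and timeout are assumptions, and the AsyncRDDActions implicit conversion is assumed to be in scope); countAsync returns a FutureAction, which is a cancellable scala.concurrent.Future:

    import scala.concurrent.Await
    import scala.concurrent.duration._
    import java.util.concurrent.TimeoutException

    val future = rdd.countAsync()
    try {
      val n = Await.result(future, 10.minutes)
    } catch {
      case _: TimeoutException => future.cancel()  // give up on the laggard job
    }

Note that this cancels the whole job, not an individual stage.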

Re: Set Job Descriptions for Scala application

2015-08-05 Thread Mark Hamstra
SparkContext#setJobDescription or SparkContext#setJobGroup On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica wrote: > Hello, > > My Spark application is written in Scala and submitted to a Spark cluster > in standalone mode. The Spark Jobs for my application are listed in the > Spark UI like this:
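A minimal sketch (assuming an existing SparkContext `sc` and an `events` RDD; the group id and description strings are hypothetical):

    sc.setJobGroup("nightly-etl", "Nightly ETL over event logs")
    sc.setJobDescription("Count distinct users")  // appears in the description column of the Jobs page
    val distinctUsers = events.map(_.userId).distinct().count()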

Re: TCP/IP speedup

2015-08-01 Thread Mark Hamstra
https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: > Hi All! > > How important would be a significant performance improvement to TCP/IP > itself, in terms of > overall job performance improvement. Which part would be most

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Mark Hamstra
No. He is collecting the results of the SQL query, not the whole dataset. The REPL does retain references to prior results, so it's not really the best tool to be using when you want no-longer-needed results to be automatically garbage collected. On Mon, Jun 29, 2015 at 9:13 AM, ayan guha wrote:

Re: Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Mark Hamstra
Do you want to transform the RDD, or just produce some side effect with its contents? If the latter, you want foreachPartition, not mapPartitions. On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > In rdd.mapPartition(…) if I try to iterate through
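A short sketch of the distinction (the connection handling and parseRecord are hypothetical):

    // side effect only, no new RDD produced: foreachPartition
    rdd.foreachPartition { iter =>
      val conn = openConnection()
      try iter.foreach(conn.write) finally conn.close()
    }

    // transforming each partition into new elements: mapPartitions
    val parsed = rdd.mapPartitions(iter => iter.map(parseRecord))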

Re: Fully in-memory shuffles

2015-06-11 Thread Mark Hamstra
> > I would guess in such shuffles the bottleneck is serializing the data > rather than raw IO, so I'm not sure explicitly buffering the data in the > JVM process would yield a large improvement. Good guess! It is very hard to beat the performance of retrieving shuffle outputs from the OS buffe

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
t; Collect() would involve gathering all the data on a single machine as well. > > Thanks, > Raghav > > On Tuesday, June 9, 2015, Mark Hamstra wrote: > >> Correct. Trading away scalability for increased performance is not an >> option for the standard Spark API. >

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
Correct. Trading away scalability for increased performance is not an option for the standard Spark API. On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > It would be even faster to load the data on the driver and sort it there > without using Spark :).

Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
That would constitute a major change in Spark's architecture. It's not happening anytime soon. On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote: > Possibly in future, if and when spark architecture allows workers to > launch spark jobs (the functions passed to transformation or action APIs o

Re: Spark error "value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]"

2015-06-08 Thread Mark Hamstra
Correct; and PairRDDFunctions#join does still exist in versions of Spark that do have DataFrame, so you don't necessarily have to use DataFrame to do this even then (although there are advantages to using the DataFrame approach.) Your basic problem is that you have an RDD of tuples, where each tup
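A minimal sketch of the reshaping (`rdd` and `other` are hypothetical; the point is that join is defined on RDD[(K, V)] via PairRDDFunctions, not on arbitrary tuples):

    // rdd: RDD[((String, String), String, String)]
    val keyed = rdd.map { case (key, a, b) => (key, (a, b)) }  // RDD[((String, String), (String, String))]
    val joined = keyed.join(other)  // other must also be an RDD keyed by (String, String)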

Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Mark Hamstra
This discussion belongs on the dev list. Please post any replies there. On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park wrote: > Hi, > > I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to > confirm whether these are bugs or not before opening a jira. > > *1)* I can no longer

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Mark Hamstra
If you don't send jobs to different pools, then they will all end up in the default pool. If you leave the intra-pool scheduling policy as the default FIFO, then this will effectively be the same thing as using the default FIFO scheduling. Depending on what you are trying to accomplish, you need

Re: Skipped Jobs

2015-04-19 Thread Mark Hamstra
Almost. Jobs don't get skipped. Stages and Tasks do if the needed results are already available. On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee wrote: > The job is skipped because the results are available in memory from a > prior run. More info at: > http://mail-archives.apache.org/mod_mbox/spar

Re: Using 'fair' scheduler mode

2015-04-01 Thread Mark Hamstra
> > I am using the Spark ‘fair’ scheduler mode. What do you mean by this? Fair scheduling mode is not one thing in Spark, but allows for multiple configurations and usages. Presumably, at a minimum you are using SparkConf to set spark.scheduler.mode to "FAIR", but then how are you setting up s
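A minimal configuration sketch (the pool name and allocation-file path are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // each thread that submits jobs selects its pool via a local property
    sc.setLocalProperty("spark.scheduler.pool", "production")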

Re: What is the meaning to of 'STATE' in a worker/ an executor?

2015-03-29 Thread Mark Hamstra
A LOADING Executor is on the way to RUNNING, but hasn't yet been registered with the Master, so it isn't quite ready to do useful work. > On Mar 29, 2015, at 9:09 PM, Niranda Perera wrote: > > Hi, > > I have noticed in the Spark UI, workers and executors run on several states, > ALIVE, LOAD

Re: Combining Many RDDs

2015-03-26 Thread Mark Hamstra
RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen wrote: > Hi Noorul, > > Thank you for your suggestion. I tried that, but ran out of memory. I did > some search and found some suggestions > that we should try to avoid rdd.union( > http://stackoverf

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Mark Hamstra
You can also always take the more extreme approach of using SparkContext#runJob (or submitJob) to write a custom Action that does what you want in one pass. Usually that's not worth the extra effort. On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen wrote: > To avoid computing twice you need to persis
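A sketch of such a one-pass custom action using SparkContext#runJob (assuming rdd: RDD[Int]; count and sum are computed together so the RDD is evaluated only once):

    val perPartition: Array[(Long, Long)] = sc.runJob(rdd, (iter: Iterator[Int]) => {
      var count = 0L
      var sum = 0L
      iter.foreach { x => count += 1; sum += x }
      (count, sum)
    })
    val (totalCount, totalSum) = perPartition.reduce((a, b) => (a._1 + b._1, a._2 + b._2))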

Re: Priority queue in spark

2015-03-16 Thread Mark Hamstra
http://apache-spark-developers-list.1001551.n3.nabble.com/Job-priority-td10076.html#a10079 On Mon, Mar 16, 2015 at 10:26 PM, abhi wrote: > If i understand correctly , the above document creates pool for priority > which is static in nature and has to be defined before submitting the job . > .in

Re: Numbering RDD members Sequentially

2015-03-11 Thread Mark Hamstra
> > not quite sure why it is called zipWithIndex since zipping is not involved > It isn't? http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis wrote: > > -- Forwarded message -- > From: Steve Lewis > Date: W

Re: how to map and filter in one step?

2015-02-26 Thread Mark Hamstra
rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be pipelined into a single stage, so there generally isn't any need to complect the map and filter into a single function. Additionally, there is RDD#collect[U](f: PartialFunction[T, U])(implicit arg0: ClassTag[U]): RDD[U], wh
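A minimal illustration (foo and bar are the hypothetical map and filter functions from above):

    val result1 = rdd.filter(bar).map(foo)                     // pipelined into a single stage
    val result2 = rdd.collect { case x if bar(x) => foo(x) }   // partial-function collect, also a single stage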

Re: How to print more lines in spark-shell

2015-02-23 Thread Mark Hamstra
intln) is the most > straightforward thing but yeah you can probably change shell default > behavior too. > > On Mon, Feb 23, 2015 at 7:15 PM, Mark Hamstra > wrote: > > That will produce very different output than just the 10 items that Manas > > wants. > > > &

Re: How to print more lines in spark-shell

2015-02-23 Thread Mark Hamstra
That will produce very different output than just the 10 items that Manas wants. This is essentially a Scala shell issue, so this should apply: http://stackoverflow.com/questions/9516567/settings-maxprintstring-for-scala-2-9-repl On Mon, Feb 23, 2015 at 10:25 AM, Akhil Das wrote: > You can do i

Re: percentil UDAF in spark 1.2.0

2015-02-19 Thread Mark Hamstra
ailable in a release > ? > > > On Thu, Feb 19, 2015 at 3:27 PM, Mark Hamstra > wrote: > >> Already fixed: https://github.com/apache/spark/pull/2802 >> >> >> On Thu, Feb 19, 2015 at 3:17 PM, Mohnish Kodnani < >> mohnish.kodn...@gmail.com> w

Re: percentil UDAF in spark 1.2.0

2015-02-19 Thread Mark Hamstra
Already fixed: https://github.com/apache/spark/pull/2802 On Thu, Feb 19, 2015 at 3:17 PM, Mohnish Kodnani wrote: > Hi, > I am trying to use percentile and getting the following error. I am using > spark 1.2.0. Does UDAF percentile exist in that code line and do i have to > do something to get t

Re: Counters in Spark

2015-02-13 Thread Mark Hamstra
Except that transformations don't have an exactly-once guarantee, so this way of doing counters may produce different answers across various forms of failures and speculative execution. On Fri, Feb 13, 2015 at 8:56 AM, Sean McNamara wrote: > .map is just a transformation, so no work will actual
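A sketch of the safer pattern hinted at here: update the accumulator inside an action rather than a transformation, since Spark only guarantees that accumulator updates made in actions are applied once per successfully completed task (`lines` and isValid are hypothetical):

    val invalidLines = sc.accumulator(0L, "invalid lines")
    lines.foreach { line => if (!isValid(line)) invalidLines += 1 }
    println(invalidLines.value)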

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
No, only each group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet wrote: > Doesn't iter still need to fit entirely into memory? > > On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra > wrote: > >> rdd.mapPartitions { iter => >> val grouped

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize) for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet wrote: > I think the word "partition" here is a tad different than the term > "partition" that we use in Spark. Basically, I want something similar
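A runnable variant of that sketch (batchSize and processBatch are hypothetical); Iterator#grouped turns each partition's iterator into an iterator of groups of at most batchSize elements:

    val batched = rdd.mapPartitions { iter =>
      iter.grouped(batchSize).map(group => processBatch(group))
    }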

Re: Reg Job Server

2015-02-05 Thread Mark Hamstra
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan wrote: > Yes, I want to know, the reason about the job being slow. > I will look at YourKit. > Can you redirect me to that, some tutorial in how to use? > > T

Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
remain to be run to complete a job is a surprisingly tricky problem -- take a look at the discussion that went into Josh's Job page PR to get an idea of the issues and subtleties involved: https://github.com/apache/spark/pull/3009 On Thu, Feb 5, 2015 at 1:27 AM, Mark Hamstra wrote:

Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
ange the code. So I rather find a > way of doing this automatically if possible. > > On 4 February 2015 at 19:41, Mark Hamstra wrote: > >> But there isn't a 1-1 mapping from operations to stages since multiple >> operations will be pipelined into a single stage if no shu

Re: How many stages in my application?

2015-02-04 Thread Mark Hamstra
But there isn't a 1-1 mapping from operations to stages since multiple operations will be pipelined into a single stage if no shuffle is required. To determine the number of stages in a job you really need to be looking for shuffle boundaries. On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das wrote: >

Re: StackOverflowError on RDD.union

2015-02-03 Thread Mark Hamstra
Use SparkContext#union[T](rdds: Seq[RDD[T]]) On Tue, Feb 3, 2015 at 7:43 PM, Thomas Kwan wrote: > I am trying to combine multiple RDDs into 1 RDD, and I am using the union > function. I wonder if anyone has seen StackOverflowError as follows: > > Exception in thread "main" java.lang.StackOverflo

Re: "Loading" status

2015-02-02 Thread Mark Hamstra
b 2, 2015 at 10:49 PM, Mark Hamstra > wrote: > >> Curious. I guess the first question is whether we've got some sort of >> Listener/UI error so that the UI is not accurately reflecting the >> Executor's actual state, or whether the "LOADING" Executor reall

Re: "Loading" status

2015-02-02 Thread Mark Hamstra
cess of being created, but not yet doing anything useful" state. If you can figure out a little more of what is going on or how to reproduce this state, please do file a JIRA. On Mon, Feb 2, 2015 at 8:28 AM, Ami Khandeshi wrote: > Yes > > > On Monday, February 2, 2015, Mark Ha

Re: "Loading" status

2015-02-02 Thread Mark Hamstra
LOADING is just the state in which new Executors are created but before they have everything they need and are fully registered to transition to state RUNNING and begin doing actual work: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L35

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Mark Hamstra
A map followed by a filter will not be two stages, but rather one stage that pipelines the map and filter. > On Jan 20, 2015, at 10:26 AM, Kane Kim wrote: > > Related question - is execution of different stages optimized? I.e. > map followed by a filter will require 2 loops or they will be com

Re: Job priority

2015-01-10 Thread Mark Hamstra
-dev, +user http://spark.apache.org/docs/latest/job-scheduling.html On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta wrote: > Is it possible to specify a priority level for a job, such that the active > jobs might be scheduled in order of priority? > > Alex >

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Mark Hamstra
That's what you want to see. The computation of a stage is skipped if the results for that stage are still available from the evaluation of a prior job run: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L163 On Wed, Jan 7, 2015

Re: a vague question, but perhaps it might ring a bell

2015-01-05 Thread Mark Hamstra
"[T]his sort of Exception" is deeply misleading since what Michael posted is just the very tail-end of the process of Spark shutting down when an unhandled exception is thrown somewhere else. "[T]his sort of Exception" is not the root cause of the problem, but rather will be the common outcome fro

Re: v1.2.0 (re?)introduces Wrong FS behavior in thriftserver

2014-12-20 Thread Mark Hamstra
This makes no sense. There is no difference between v1.2.0-rc2 and v1.2.0: https://github.com/apache/spark/compare/v1.2.0-rc2...v1.2.0 On Sat, Dec 20, 2014 at 12:44 PM, Matt Mead wrote: > First, thanks for the efforts and contribution to such a useful software > stack! Spark is great! > > I ha

Re: Accessing rows of a row in Spark

2014-12-15 Thread Mark Hamstra
[1,orange],[2,apple])] > > I tried to iterate the items as you suggested but no luck. > > Best Regards, > > Jerry > > > On Mon, Dec 15, 2014 at 2:18 PM, Mark Hamstra > wrote: >> >> scala> val items = Row(1 -> "orange", 2 -> "apple&quo

Re: Accessing rows of a row in Spark

2014-12-15 Thread Mark Hamstra
scala> val items = Row(1 -> "orange", 2 -> "apple") items: org.apache.spark.sql.catalyst.expressions.Row = [(1,orange),(2,apple)] If you literally want an iterator, then this: scala> items.toIterator.count { case (user_id, name) => user_id == 1 } res0: Int = 1 ...else: scala> items.count

Re: Spark SQL API Doc & IsCached as SQL command

2014-12-12 Thread Mark Hamstra
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory On Fri, Dec 12, 2014 at 3:14 PM, Judy Nash wrote: > > Hello, > > > > Few questions on Spark SQL: > > > > 1) Does Spark SQL support equivalent SQL Query for Scala command: > IsCached() ? > > > > 2) Is

Re: drop table if exists throws exception

2014-12-05 Thread Mark Hamstra
And that is no different from how Hive has worked for a long time. On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust wrote: > The command run fine for me on master. Note that Hive does print an > exception in the logs, but that exception does not propogate to user code. > > On Thu, Dec 4, 2014

Re: map function

2014-12-04 Thread Mark Hamstra
rdd.flatMap { case (k, coll) => coll.map { elem => (elem, k) } } On Thu, Dec 4, 2014 at 1:26 AM, Yifan LI wrote: > Hi, > > I have a RDD like below: > (1, (10, 20)) > (2, (30, 40, 10)) > (3, (30)) > … > > Is there any way to map it to this: > (10,1) > (20,1) > (30,2) > (40,2) > (10,2) > (30,3) >
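Applied to the sample data from the question, a sketch (representing the inner collections as Seqs):

    val rdd = sc.parallelize(Seq(1 -> Seq(10, 20), 2 -> Seq(30, 40, 10), 3 -> Seq(30)))
    val inverted = rdd.flatMap { case (k, coll) => coll.map(elem => (elem, k)) }
    // inverted.collect(): Array((10,1), (20,1), (30,2), (40,2), (10,2), (30,3))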

Re: Getting spark job progress programmatically

2014-11-19 Thread Mark Hamstra
This is already being covered by SPARK-2321 and SPARK-4145. There are pull requests that are already merged or already very far along -- e.g., https://github.com/apache/spark/pull/3009 If there is anything that needs to be added, please add it to those issues or PRs. On Wed, Nov 19, 2014 at 7:55

Re: Debian package for spark?

2014-11-08 Thread Mark Hamstra
No change from 1.1.0 to 1.1.1-SNAPSHOT. The deb profile hasn't changed since before the 1.0.2 release. On Sat, Nov 8, 2014 at 3:12 PM, Kevin Burton wrote: > Weird… I’m using a 1.1.0 source tar.gz … > > but if it’s fixed in 1.1.1 that’s good. > > On Sat, Nov 8, 2014 at 2

Re: Debian package for spark?

2014-11-08 Thread Mark Hamstra
The building of the Debian package in Spark works just fine for me -- I just did it using a clean check-out of 1.1.1-SNAPSHOT and `mvn -U -Pdeb -DskipTests clean package`. There's likely something else amiss in your build. Actually, that's not quite true. There is one small problem with the Debi

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Mark Hamstra
Sometime after Nov. 15: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage On Wed, Oct 29, 2014 at 5:28 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > Thanks for your update. Any idea when will Spark 1.2 be GA? > > Regards > Arthur > > > On 29 Oct, 2014, at

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Mark Hamstra
I believe that you are overstating your case. If you want to work with with Spark, then the Java API is entirely adequate with a very few exceptions -- unfortunately, though, one of those exceptions is with something that you are interested in, JdbcRDD. If you want to work on Spark -- customizing

Re: Change RDDs using map()

2014-09-17 Thread Mark Hamstra
You don't. That's what filter or the partial function version of collect are for: val transformedRDD = yourRDD.collect { case (k, v) if k == 1 => v } On Wed, Sep 17, 2014 at 3:24 AM, Deep Pradhan wrote: > Hi, > I want to make the following changes in the RDD (create new RDD from the > existing

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
d be hesitant to do > that as a maintenance release on 1.1.x and 1.0.x since it would require > nontrivial changes to the build that could break things on Scala 2.10. > > Matei > > On September 15, 2014 at 12:19:04 PM, Mark Hamstra ( > m...@clearstorydata.com) wrote: > > Are we go

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
Scala 2.11 work is under way in open pull requests though, so hopefully it > will be in soon. > > Matei > > On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) > wrote: > > ah...thanks! > > On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10. On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi wrote: > Folks, > I understand Spark SQL uses quasiquotes. Does that mean Spark has now > moved to Scala 2.11? > > Mohit. >

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Sorry, posting too late at night. That should be "...transformations, that produce further RDDs; and actions, that return values to the driver program." On Sat, Sep 13, 2014 at 12:45 AM, Mark Hamstra wrote: > Again, RDD operations are of two basic varieties: transformations, t

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Again, RDD operations are of two basic varieties: transformations, that produce further RDDs; and operations, that return values to the driver program. You've used several RDD transformations and then finally the top(1) action, which returns an array of one element to your driver program. That is

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
This is all covered in http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations By definition, RDD transformations take an RDD to another RDD; actions produce some other type as a value on the driver program. On Fri, Sep 12, 2014 at 11:15 PM, Deep Pradhan wrote: > Is it always
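A two-line illustration of the distinction (`lines` is a hypothetical RDD[String]):

    val lengths = lines.map(_.length)   // transformation: produces another RDD, evaluated lazily
    val total = lengths.reduce(_ + _)   // action: returns an Int to the driver program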

Re: Release date for new pyspark

2014-07-16 Thread Mark Hamstra
You should expect master to compile and run: patches aren't merged unless they build and pass tests on Jenkins. You shouldn't expect new features to be added to stable code in maintenance releases (e.g. 1.0.1). AFAIK, we're still on track with Spark 1.1.0 development, which means that it should b

Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Mark Hamstra
And if you can relax your constraints even further to only require RDD[List[Int]], then it's even simpler: rdd.mapPartitions(_.grouped(batchedDegree)) On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson wrote: > If you don't really care about the batchedDegree, but rather just want to > do operati

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Mark Hamstra
See Working with Key-Value Pairs. In particular: "In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org

Re: Hive classes for Catalyst

2014-06-11 Thread Mark Hamstra
And the code is right here: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala On Wed, Jun 11, 2014 at 5:38 PM, Michael Armbrust wrote: > You will need to compile spark with SPARK_HIVE=true. > > > On Wed, Jun 11, 2014 at 5:37 PM, Step

Re: Are "scala.MatchError" messages a problem?

2014-06-08 Thread Mark Hamstra
> > The solution is either to add a default case which does nothing, or > probably better to add a .filter such that you filter out anything that's > not a command before matching. > And you probably want to push down that filter into the cluster -- collecting all of the elements of an RDD only to

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-04 Thread Mark Hamstra
Actually, what the stack trace is showing is the result of an exception being thrown by the DAGScheduler's event processing actor. What happens is that the Supervisor tries to shut down Spark when an exception is thrown by that actor. As part of the shutdown procedure, the DAGScheduler tries to c

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Mark Hamstra
Are you using spark-submit to run your application? On Wed, Jun 4, 2014 at 8:49 AM, ajatix wrote: > I am also getting the exact error, with the exact logs when I run Spark > 1.0.0 > in coarse-grained mode. > Coarse grained mode works perfectly with earlier versions that I tested - > 0.9.1 and 0

Re: PySpark & Mesos random crashes

2014-05-25 Thread Mark Hamstra
The end of your example is the same as SPARK-1749. When a Mesos job causes an exception to be thrown in the DAGScheduler, that causes the DAGScheduler to need to shutdown the system. As part of that shutdown procedure, the DAGScheduler tries to kill any running jobs; but Mesos doesn't support tha

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread Mark Hamstra
After the several fixes that we have made to exception handling in Spark 1.0.0, I expect that this behavior will be quite different from 0.9.1. Executors should be far more likely to shutdown cleanly in the event of errors, allowing easier restarts. But I expect that there will be more bugs to fi

Re: Counting things only once

2014-05-16 Thread Mark Hamstra
https://spark-project.atlassian.net/browse/SPARK-732 On Fri, May 16, 2014 at 9:05 AM, Daniel Siegmann wrote: > I want to use accumulators to keep counts of things like invalid lines > found and such, for reporting purposes. Similar to Hadoop counters. This > may seem simple, but my case is a bit

Re: Counting things only once

2014-05-16 Thread Mark Hamstra
Better, the current location: https://issues.apache.org/jira/browse/SPARK-732 On Fri, May 16, 2014 at 1:47 PM, Mark Hamstra wrote: > https://spark-project.atlassian.net/browse/SPARK-732 > > > On Fri, May 16, 2014 at 9:05 AM, Daniel Siegmann > wrote: > >> I want to

Re: Spark unit testing best practices

2014-05-15 Thread Mark Hamstra
Local mode does serDe, so it should expose serialization problems. On Wed, May 14, 2014 at 10:53 AM, Philip Ogren wrote: > Have you actually found this to be true? I have found Spark local mode to > be quite good about blowing up if there is something non-serializable and > so my unit tests hav

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mark Hamstra
Which is what you shouldn't be doing as an API user, since that implementation code might change. The documentation doesn't mention a row ordering guarantee, so none should be assumed. It is hard enough for us to correctly document all of the things that the API does do. We really shouldn't be f

Re: packaging time

2014-04-29 Thread Mark Hamstra
Tip: read the wiki -- https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Tue, Apr 29, 2014 at 12:48 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Tips from my experience. Disable scaladoc: > > sources in doc in Compile := List() > > Do not package the s

Re: Adding to an RDD

2014-04-21 Thread Mark Hamstra
As long as the function that you are mapping over the RDD is pure, preserving referential transparency so that anytime you map the same function over the same initial RDD elements you get the same result elements, then there is no problem in doing what you suggest. In fact, it's common practice.

Re: running tests selectively

2014-04-20 Thread Mark Hamstra
You should add the hub command line wrapper of git for github to that wiki page: https://github.com/github/hub -- doesn't look like I have edit access to the wiki, or I've forgotten a password, or something. Once you've got hub installed and aliased, you've got some nice additional options, suc

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Mark Hamstra
Please file an issue: Spark Project JIRA On Fri, Apr 18, 2014 at 10:25 AM, Aureliano Buendia wrote: > Hi, > > I just notices that sc.makeRDD() does not make all values given with input > type of NumericRange, try this in spark shell: > > > $ MASTER=l

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Mark Hamstra
https://issues.apache.org/jira/browse/SPARK-1081?jql=project%20%3D%20SPARK%20AND%20text%20~%20Annotate On Thu, Apr 3, 2014 at 9:24 AM, Philip Ogren wrote: > I can appreciate the reluctance to expose something like the > JobProgressListener as a public interface. It's exactly the sort of thing

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
Will be in 1.0.0 On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas wrote: > Ah, now I see what Aaron was referring to. So I'm guessing we will get > this in the next release or two. Thank you. > > > > On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra wrote: > >>

Re: Spark output compression on HDFS

2014-04-02 Thread Mark Hamstra
che/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 >> >> For saveAsSequenceFile yep, I think Mark is right, you need an option. >> >> >> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra wrote: >> >>> http://www.scala-lang.org/

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
There is a repartition method in pyspark master: https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128 On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas wrote: > Update: I'm now using this ghetto function to partition the RDD I get back > when I call textFile() on a gzipped fil

Re: Spark output compression on HDFS

2014-04-02 Thread Mark Hamstra
http://www.scala-lang.org/api/2.10.3/index.html#scala.Option The signature is 'def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None)', but you are providing a Class, not an Option[Class]. Try counts.saveAsSequenceFile(output, Some(classOf[org.apache.hadoop.io.co
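The complete call would look something like the following (the codec class is an assumption, since the original message is cut off above; any available CompressionCodec subclass can be passed):

    import org.apache.hadoop.io.compress.GzipCodec
    counts.saveAsSequenceFile(output, Some(classOf[GzipCodec]))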

Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Mark Hamstra
Some related discussion: https://github.com/apache/spark/pull/246 On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren wrote: > Hi DB, > > Just wondering if you ever got an answer to your question about monitoring > progress - either offline or through your own investigation. Any findings > would be ap

Re: function state lost when next RDD is processed

2014-03-28 Thread Mark Hamstra
As long as the amount of state being passed is relatively small, it's probably easiest to send it back to the driver and to introduce it into RDD transformations as the zero value of a fold. On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu wrote: > I'd like to resurrect this thread since I don't
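A small sketch of that pattern, assuming the carried state is a single running maximum over RDD[Long] batches (max is idempotent, so it is safe for the prior value to appear as the per-partition zero value of the fold):

    var runningMax = 0L
    for (batch <- batches) {                           // batches: Seq[RDD[Long]], hypothetical
      runningMax = batch.fold(runningMax)(math.max)    // state goes out as the zero value, comes back to the driver
    }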

Re: RDD usage

2014-03-24 Thread Mark Hamstra
No, it won't. The return type of RDD#foreach is Unit, so it doesn't return an RDD. The utility of foreach is purely for the side effects it generates, not for its return value -- and modifying an RDD in place via foreach is generally not a very good idea. On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng

Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra
It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Hey there fellow Dukes of Data, > > How can I tell how many partitions my RDD is split into? > > I'm interested in knowing because, from what I gather, having a good >

Re: Apache Spark 0.9.0 Build Error

2014-03-17 Thread Mark Hamstra
Try ./sbt/sbt assembly On Mon, Mar 17, 2014 at 9:06 PM, wapisani wrote: > Good morning! I'm attempting to build Apache Spark 0.9.0 on Windows 8. I've > installed all prerequisites (except Hadoop) and run "sbt/sbt assembly" > while > in the root directory. I'm getting an error after the line "Se

Re: spark config params conventions

2014-03-12 Thread Mark Hamstra
That's the whole reason why some of the intended configuration changes were backed out just before the 0.9.0 release. It's a well-known issue, even if a completely satisfactory solution isn't as well-known and is probably something on which we should do another iteration. On Wed, Mar 12, 2014 at 9:

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread Mark Hamstra
https://github.com/apache/incubator-spark/pull/421 Works pretty good, but really needs to be enhanced to work with AsyncRDDActions. On Tue, Mar 11, 2014 at 4:50 PM, wallacemann wrote: > In a similar vein, it would be helpful to have an Iterable way to access > the > data inside an RDD. The co

Re: is spark.cleaner.ttl safe?

2014-03-11 Thread Mark Hamstra
Actually, TD's work-in-progress is probably more what you want: https://github.com/apache/spark/pull/126 On Tue, Mar 11, 2014 at 1:58 PM, Michael Allman wrote: > Hello, > > I've been trying to run an iterative spark job that spills 1+ GB to disk > per iteration on a system with limited disk spa
