Re: almost sorted data

2013-10-28 Thread Arun Kumar
I will try using per-partition sorted data. Can I also use groupBy and join per partition? Basically I want to restrict the computation to each partition, e.g. data.mapPartitions(_.toList.sortBy(...).toIterator). Is there a more direct way to create an RDD that does partition-wise operations?
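A minimal sketch of the mapPartitions approach mentioned above (data is assumed to be an RDD[(String, Int)]; the sort/group key is a placeholder). A per-partition join could follow the same pattern by building a Map from one side inside the closure.

    // Sort and group within each partition only -- no shuffle is triggered.
    val perPartition = data.mapPartitions { iter =>
      val rows = iter.toList
      val sorted = rows.sortBy(_._1)        // per-partition sort
      sorted.groupBy(_._1).iterator         // per-partition groupBy
    }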

Re: Stage failures

2013-10-28 Thread Tom Vacek
Yes, I looked at the log, and the serialized tasks were about 2k bytes as well. Is there anything I can do to move this along? On Thu, Oct 24, 2013 at 2:05 PM, Josh Rosen rosenvi...@gmail.com wrote: Maybe this is a bug in the ClosureCleaner. If you look at the 13/10/23 14:16:39 INFO

Task output before a shuffle

2013-10-28 Thread Ufuk Celebi
Hey everybody, I just watched the Spark Internals presentation [1] from the December 2012 dev meetup and have a couple of questions regarding the output of tasks before a shuffle. 1. Can anybody confirm that the default is still to persist stage output to RAM/disk and then have the following

Re: Job duration

2013-10-28 Thread Ewen Cheslack-Postava
Well, he did mention that not everything was staying in the cache, so even with an ongoing job they're probably re-reading from Cassandra. It sounds to me like the first issue to address is why things are being evicted. -Ewen - Ewen Cheslack-Postava StraightUp | http://readstraightup.com

compare/contrast Spark with Cascading

2013-10-28 Thread Philip Ogren
My team is investigating a number of technologies in the Big Data space. A team member recently got turned on to Cascading http://www.cascading.org/about-cascading/ as an application layer for orchestrating complex workflows/scenarios. He asked me if Spark had an application layer. My

Re: Spark integration with HDFS and Cassandra simultaneously

2013-10-28 Thread Rohit Rai
Hello Thunder, We don't use the hive branch underneath the current Calliope release, as it focuses on Spark and Cassandra integration. In the next EA release, coming later this month, we plan to bring in the cas-handler to support Shark on Cassandra. Regards, Rohit On Mon, Oct 28, 2013 at 9:53 PM,

Re: Spark Build using Scala 2.10 on Windows

2013-10-28 Thread Vadim Chekan
I think the mesos repository is the legacy one and, after becoming an Apache project, you need to use Apache's repo: https://github.com/apache/incubator-spark/tree/scala-2.10 It has more recent patches. Vadim. On Thu, Oct 24, 2013 at 5:11 PM, Yogesh Shetty yogesh.she...@gmail.com wrote: It is bit

Modeling and implementation

2013-10-28 Thread Amit Mor
Hello friends. Newbie here, at least when it comes to Spark. I would be very thankful for data modeling suggestions for this scenario: I have 3 types of logs, with more than 48 columns each. For simplicity I modeled each as Tuple(PKsTuple, FinanceDataTuple, AuxData), i.e. a tuple of tuples.
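For concreteness, a sketch of the tuple-of-tuples layout described above (all field names and types are hypothetical stand-ins for the 48+ real columns):

    // One record per log line: (primary keys, finance columns, auxiliary columns)
    type PKs       = (String, String, Long)     // e.g. account, symbol, timestamp
    type Finance   = (Double, Double, Double)   // e.g. price, quantity, fee
    type Aux       = (String, String)           // e.g. source system, comment
    type LogRecord = (PKs, Finance, Aux)
    // Keying an RDD[LogRecord] by its PKs tuple for later joins/groupBys:
    // val byKey = logs.map(rec => (rec._1, rec))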

Re: Reading custom inputformat from hadoop dfs

2013-10-28 Thread Silvio Fiorito
I was having the same problem trying to read from HCatalog with the Scala API. The way around this was to create a wrapper InputFormat in Java that uses Spark's SerializableWritable. I hacked this up Friday afternoon, tested it a few times, and it seemed to work well. Here's an example:
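Not Silvio's actual wrapper, but a minimal Scala sketch of the general pattern: read through a Hadoop InputFormat and copy the (reused, non-serializable) Writables into plain values right away. TextInputFormat and the path stand in for the custom format discussed in the thread.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "inputformat-example")
    // Read via the InputFormat, then convert each Writable to a plain Scala
    // value immediately so the resulting RDD is serializable and cacheable.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/data")
    val lines = raw.map { case (offset, text) => (offset.get, text.toString) }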

Re: Job duration

2013-10-28 Thread Patrick Wendell
Hey Lucas, This code still needs to read the entire initial dataset from Cassandra, so that's probably what's taking most of the time. Also, this doesn't show the operations you are actually doing. What happens when you look in the Spark web UI or the logs? Can you tell which stages are

Re: Spark Build using Scala 2.10 on Windows

2013-10-28 Thread Yogesh Shetty
Thanks Vadim. I was able to resolve it; I'm successfully using Spark on Scala 2.10. On Mon, Oct 28, 2013 at 2:01 PM, Vadim Chekan kot.bege...@gmail.com wrote: I think the mesos repository is the legacy one and, after becoming an Apache project, you need to use Apache's repo:

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Paco Nathan
Hi Philip, Cascading is relatively agnostic about the distributed topology underneath it, especially as of the 2.0 release over a year ago. There's been some discussion about writing a flow planner for Spark -- e.g., which would replace the Hadoop flow planner. Not sure if there's active work on

Re: Job duration

2013-10-28 Thread Lucas Fernandes Brunialti
Hello, I count events per date/time after that code, like the code below: JavaPairRDD<String, Integer> eventPerDate = events.map( new PairFunction<Tuple2<String, String>, String, Integer>() { @Override public
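For reference, roughly the same count-per-date aggregation sketched in the Scala API (the original code uses the Java API's PairFunction; the field layout of events is an assumption):

    // events is assumed to be an RDD of (dateTime, payload) string pairs;
    // reduceByKey sums a 1 per event to get a count per date/time key.
    val eventPerDate = events
      .map { case (dateTime, _) => (dateTime, 1) }
      .reduceByKey(_ + _)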

Re: Job duration

2013-10-28 Thread Patrick Wendell
Hey Lucas, How many unique keys do you have when you do these aggregations? Also, when you look in the web UI, can you tell how much in-memory storage is being used overall by the events RDD and the casRDD? - Patrick On Mon, Oct 28, 2013 at 1:21 PM, Lucas Fernandes Brunialti

Re: set up spark in eclipse

2013-10-28 Thread Philip Ogren
Hi Arun, I had recent success getting a Spark project set up in Eclipse Juno. Here are the notes I wrote down for the rest of my team, which you may find useful: Spark version 0.8.0 requires Scala version 2.9.3. This is a bit inconvenient because Scala is now on version 2.10.3
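As a reference point, a minimal build.sbt sketch pairing those versions (the coordinates below are the ones Spark 0.8.0 was published under; treat them as an assumption, not part of Philip's notes):

    // build.sbt -- pin Scala to 2.9.3 to match Spark 0.8.0
    scalaVersion := "2.9.3"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.0-incubating"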

Re: oome from blockmanager

2013-10-28 Thread Stephen Haberman
Hey guys, As a follow up, I raised our target partition size to 600mb (up from 64mb), which split this report's 500gb of tiny S3 files into ~700 partitions, and everything ran much smoother. In retrospect, this was the same issue we'd run into before, having too many partitions, and had
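One way to get larger partitions out of many tiny S3 files (not necessarily what Stephen's tooling does) is to coalesce after reading; a sketch with a hypothetical path and the ~700 figure from the report above:

    // Many tiny input files yield many tiny partitions; coalesce merges them
    // into fewer, larger partitions without a shuffle.
    val raw = sc.textFile("s3n://bucket/tiny-files/*")
    val compacted = raw.coalesce(700)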

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
1) When you say "Cascading is relatively agnostic about the distributed topology underneath it", I take that as a hedge suggesting that while it could be possible to run Spark underneath Cascading, this is not something commonly done, nor would it necessarily be straightforward. Is this an unfair

Re: set up spark in eclipse

2013-10-28 Thread Josh Rosen
It would be awesome if someone could edit these Eclipse instructions and add them to the IDE Setup section of the Contributing to Spark wiki page: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Mon, Oct 28, 2013 at 2:30 PM, Philip Ogren philip.og...@oracle.com wrote:

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
And I didn't mean to skip over you, Koert. I'm just more familiar with what Oscar said on the subject than with your opinion. On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra m...@clearstorydata.com wrote: Hmmm... I was unaware of this concept that Spark is for medium to large datasets but not

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY caching is the input to each reduce task. Those currently don't spill to disk. The solution if datasets are large is to add more reduce tasks, whereas Hadoop would run along with a small number of tasks that do lots
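A minimal illustration of the "add more reduce tasks" advice (key/value types and the partition count are arbitrary):

    // Passing an explicit partition count to a shuffle operation spreads the
    // reduce-side input across more, smaller tasks.
    val counts = pairs.reduceByKey(_ + _, 2000)
    // groupByKey(2000) and join(other, 2000) accept the same argument.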

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Matei Zaharia
By the way, the reason we have this goal is simple -- nobody wants to be managing different compute engines for the same computation. For established MapReduce users, it may be easy to write the same code on MR, but we have lots of users who've never installed MR and don't want to manage it. So

Re: Task output before a shuffle

2013-10-28 Thread Matei Zaharia
Hi Ufuk, Yes, we still write out data after these tasks in Spark 0.8, and it needs to be written out before any stage that reads it can start. The main reason is simplicity when there are faults, as well as more flexible scheduling (you don't have to decide where each reduce task is in

Questions about the files that Spark will produce during its running

2013-10-28 Thread Shangyu Luo
Hello, I have some questions about the files that Spark creates and uses while running. (1) I am running a Python program on Spark with an EC2 cluster. The data comes from the HDFS file system. I have hit the following error in the console of the master node: java.io.FileNotFoundException:

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Mark Hamstra
i am actually not familiar with what oscar has said on this. can you share or point me to the conversation thread? One of the places was this panel discussion: http://www.meetup.com/hadoopsf/events/141368262/, but it doesn't look like there is a recording of it available, so I guess that's

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
Matei, We have some jobs where even the input for a single key in a groupBy would not fit in the task's memory. We rely on mapred to stream from disk to disk as it reduces. I think Spark should be able to handle that situation to truly claim it can replace map-red (or not?). Best,

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Prashant Sharma
Hey Koert, Can you give me steps to reproduce this? On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers ko...@tresata.com wrote: Matei, We have some jobs where even the input for a single key in a groupBy would not fit in the task's memory. We rely on mapred to stream from disk to disk as