I will try using per-partition sorted data. Can I also use groupBy and join
per partition? Basically I want to restrict the computation per partition,
like data.mapPartitions(_.toList.sortBy(...).toIterator). Is
there a more direct way to create an RDD that does partition-wise operations?
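Something like this is what I have in mind (just a sketch, assuming data is an
RDD of key/value pairs; not tested):

  val perPartition = data.mapPartitions { iter =>
    // everything below happens within a single partition, no shuffle
    val rows = iter.toList
    rows.sortBy(_._1)          // partition-local sort
        .groupBy(_._1)         // partition-local groupBy: Map[K, List[(K, V)]]
        .iterator
  }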
Yes, I looked at the log, and the serialized tasks were about 2k bytes as
well. Is there anything I can do to move this along?
On Thu, Oct 24, 2013 at 2:05 PM, Josh Rosen rosenvi...@gmail.com wrote:
Maybe this is a bug in the ClosureCleaner. If you look at the
13/10/23 14:16:39 INFO
Hey everybody,
I just watched the Spark Internals presentation [1] from the December 2012 dev
meetup and have a couple of questions regarding the output of tasks before a
shuffle.
1. Can anybody confirm that the default is still to persist stage output to
RAM/disk and then have the following
Well, he did mention that not everything was staying in the cache, so even
with an ongoing job they're probably re-reading from Cassandra. It
sounds to me like the first issue to address is why things are being
evicted.
-Ewen
-
Ewen Cheslack-Postava
StraightUp | http://readstraightup.com
My team is investigating a number of technologies in the Big Data
space. A team member recently got turned on to Cascading
(http://www.cascading.org/about-cascading/) as an application layer for
orchestrating complex workflows/scenarios. He asked me if Spark had an
application layer. My
Hello Thunder,
We don't use the hive branch underneath the current Calliope release, as it
focuses on Spark and Cassandra integration. In the next EA release, coming later
this month, we plan to bring in the cas-handler to support Shark on
Cassandra.
Regards,
Rohit
On Mon, Oct 28, 2013 at 9:53 PM,
I think the mesos repository is the legacy one; after becoming an Apache
project, you need to use Apache's repo:
https://github.com/apache/incubator-spark/tree/scala-2.10
It has more recent patches.
Vadim.
On Thu, Oct 24, 2013 at 5:11 PM, Yogesh Shetty yogesh.she...@gmail.com wrote:
It is a bit
Hello friends. Newbie here, at least when it goes to Spark. I would be very
thankful for data modeling suggestions for this scenario : I have 3 types
of logs, with more than 48 columns each. For simplicity I modeled each as
Tuple(PKsTuple, FinanceDataTuple, AuxData), i.e. Tuple of tuples.
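Roughly, each parsed line ends up looking like this (a toy sketch with made-up
column names and types; the real tuples have many more fields):

  // Hypothetical parse of one log line into the tuple-of-tuples shape above.
  def parseLine(line: String): ((String, Long), (Double, String), (String, String)) = {
    val cols = line.split('\t')
    ((cols(0), cols(1).toLong),     // PKsTuple: id, timestamp
     (cols(2).toDouble, cols(3)),   // FinanceDataTuple: amount, currency
     (cols(4), cols(5)))            // AuxData
  }
  val records = logLines.map(parseLine)   // logLines: RDD[String]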
I was having the same problem trying to read from HCatalog with the Scala API.
The way around this was to create a wrapper InputFormat in Java that uses
Spark's SerializableWritable.
I hacked this up Friday afternoon, tested a few times, and it seemed to work
well.
Here's an example:
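(The actual wrapper code isn't pasted here; in spirit it boils down to
something like this, with placeholder names and untested:)

  import org.apache.hadoop.io.Writable
  import org.apache.spark.SerializableWritable

  // Wrap the non-serializable Writables that come back from the InputFormat so
  // Spark can ship them between tasks. hcatRdd stands in for the RDD created
  // from the wrapper InputFormat.
  def wrap[W <: Writable](w: W) = new SerializableWritable(w)
  val shippable = hcatRdd.map { case (k, v) => (wrap(k), wrap(v)) }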
Hey Lucas,
This code still needs to read the entire initial dataset from Cassandra, so
that's probably what's taking most of the time. Also, the code here doesn't show
the operations you are actually doing.
What happens when you look in the Spark web UI or the logs? Can you tell
which stages are
Thanks Vadim. I was able to resolve it, successfully using Spark on Scala
2.10.
On Mon, Oct 28, 2013 at 2:01 PM, Vadim Chekan kot.bege...@gmail.com wrote:
I think the mesos repository is the legacy one; after becoming an Apache
project, you need to use Apache's repo:
Hi Philip,
Cascading is relatively agnostic about the distributed topology underneath
it, especially as of the 2.0 release over a year ago. There's been some
discussion about writing a flow planner for Spark -- e.g., one that would
replace the Hadoop flow planner. Not sure if there's active work on
Hello,
I count events per date/time after that code, like the code below:
JavaPairRDD<String, Integer> eventPerDate = events.map(
        new PairFunction<Tuple2<String, String>, String, Integer>() {
            @Override
            public
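In other words, the idea is roughly this (a Scala sketch of the same
count-per-date pattern; extractDate is a made-up helper for pulling the
date/time out of an event, and events is assumed to be an RDD of pairs):

  val eventsPerDate = events
    .map { case (id, payload) => (extractDate(payload), 1) }   // (date, 1) per event
    .reduceByKey(_ + _)                                         // sum counts per date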
Hey Lucas,
How many unique keys do you have when you do these aggregations? Also, when
you look in the web UI, can you tell how much in-memory storage is being
used overall by the events RDD and the casRDD?
- Patrick
On Mon, Oct 28, 2013 at 1:21 PM, Lucas Fernandes Brunialti
Hi Arun,
I had recent success getting a Spark project set up in Eclipse Juno.
Here are the notes that I wrote down for the rest of my team that you
may perhaps find useful:
Spark version 0.8.0 requires Scala version 2.9.3. This is a bit
inconvenient because Scala is now on version 2.10.3
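For reference, a minimal sbt build for that combination looks roughly like this
(using the incubator artifact coordinates; double-check against the 0.8.0 docs):

  scalaVersion := "2.9.3"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.0-incubating"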
Hey guys,
As a follow-up, I raised our target partition size to 600MB (up from
64MB), which split this report's 500GB of tiny S3 files into ~700
partitions, and everything ran much smoother.
In retrospect, this was the same issue we'd run into before, having too
many partitions, and had
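For anyone curious, the effect is roughly the same as coalescing after the
read, e.g. (path and numbers illustrative only):

  val raw = sc.textFile("s3n://bucket/report/*")   // placeholder path
  val fewerPartitions = raw.coalesce(700)          // ~600MB per partition for ~500GB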
1) when you say "Cascading is relatively agnostic about the distributed
topology underneath it" I take that as a hedge that suggests that while it
could be possible to run Spark underneath Cascading this is not something
commonly done or would necessarily be straightforward. Is this an unfair
It would be awesome if someone could edit these Eclipse instructions and
add them to the IDE Setup section of the Contributing to Spark wiki page:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
On Mon, Oct 28, 2013 at 2:30 PM, Philip Ogren philip.og...@oracle.com wrote:
And I didn't mean to skip over you, Koert. I'm just more familiar with
what Oscar said on the subject than with your opinion.
On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra m...@clearstorydata.com wrote:
Hmmm... I was unaware of this concept that Spark is for medium to large
datasets but not
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY
caching is the input to each reduce task. Those currently don't spill to disk.
The solution if datasets are large is to add more reduce tasks, whereas Hadoop
would run along with a small number of tasks that do lots
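E.g., a sketch of bumping the reduce-task count via the numPartitions
argument (the 2000 is arbitrary):

  val counts  = pairs.reduceByKey(_ + _, 2000)   // 2000 reduce tasks instead of the default
  val grouped = pairs.groupByKey(2000)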
By the way, the reason we have this goal is simple -- nobody wants to be
managing different compute engines for the same computation. For established
MapReduce users, it may be easy to write the same code on MR, but we have lots
of users who've never installed MR and don't want to manage it. So
Hi Ufuk,
Yes, we still write out data after these tasks in Spark 0.8, and it needs to be
written out before any stage that reads it can start. The main reason is
simplicity when there are faults, as well as more flexible scheduling (you
don't have to decide where each reduce task is in
Hello,
I have some questions about the files that Spark will create and use during
its running.
(1) I am running a Python program on Spark with an EC2 cluster. The data
comes from HDFS. I have run into the following error in the console
of the master node:
java.io.FileNotFoundException:
I am actually not familiar with what Oscar has said on this. Can you share
or point me to the conversation thread?
One of the places was this panel discussion:
http://www.meetup.com/hadoopsf/events/141368262/,
but it doesn't look like there is a recording of it available, so I guess
that's
Matei,
We have some jobs where even the input for a single key in a groupBy would
not fit in the task's memory. We rely on mapred to stream from disk to
disk as it reduces.
I think spark should be able to handle that situation to truly be able to
claim it can replace map-red (or not?).
Best,
Hey Koert,
Can you give me steps to reproduce this ?
On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers ko...@tresata.com wrote:
Matei,
We have some jobs where even the input for a single key in a groupBy would
not fit in the task's memory. We rely on mapred to stream from disk to
disk as