Hi,
I'm struggling a bit with something: I have several datasets
RDD[((Int,Int),Double)] that I want to merge.
I've tried with union+reduceByKey and cogroup+mapValues, but in all
cases it seems that if I don't force the computation of the RDD, the
final
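(For reference, a minimal sketch of the union+reduceByKey approach mentioned above; this is not from the original mail. It assumes the datasets are held in a Seq called rdds and that values sharing the same (Int, Int) key should be summed.)
// rdds: Seq[RDD[((Int, Int), Double)]] -- an assumed name, not from the thread
val merged = rdds.reduce(_ union _).reduceByKey(_ + _)
merged.count()  // nothing is actually computed until an action like this runs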
Thanks again Tathagata for your help
Best Regards
On Jan 20, 2014, at 19:11, Tathagata Das tathagata.das1...@gmail.com wrote:
Hi Eduardo,
You can do arbitrary stuff with the data in a DStream using the operation
foreachRDD.
yourDStream.foreachRDD(rdd => {
  // Get and print the first n elements (here n = 10) of each RDD
  rdd.take(10).foreach(println)
})
On Mon, Jan 20, 2014 at 11:05 PM, Ognen Duzlevski
og...@nengoiksvelzud.comwrote:
Thanks. I will try that, but your assumption is that something is failing
in an obvious way with a message. By the look of the spark-shell (it's just
frozen) I would say something is stuck. Will report back.
Given
Hi all,
I'd like the added jars on worker nodes (i.e. SparkContext.addJar()) to be
cleaned up on tear down. However, SparkContext.stop() doesn't seem to delete
them. What would be the best way to clear them? Or, is there an easy way to
add this functionality?
Mingyu
Hi,
You can call less expensive operations like first or take to trigger the
computation.
On Tue, Jan 21, 2014 at 2:32 PM, Guillaume Pitel guillaume.pi...@exensa.com
wrote:
Hi,
I'm struggling a bit with something: I have several datasets
RDD[((Int,Int),Double)] that I want to merge.
Many thanks for the replies.
The way I currently have my set up is as follows;
6 nodes running Hadoop with each node having approximately 5GB of data.
Launched a Spark Master (and Shark via ./shark) on one of the Hadoop nodes
and launched 5 worker Spark nodes on the remaining 5 Hadoop nodes.
So
How do I stop a streaming job using scc.stop, and how do I pass that into the
running code?
I want to schedule a streaming job (by specifying start and stop times) based
on user input. If the user wants to stop the streaming, the request should
call scc.stop on the running streaming job.
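(A rough sketch of one way this could be wired up; ssc and stopRequested() are illustrative placeholders, not names from this thread.)
// ssc is assumed to be your StreamingContext; stopRequested() is a hypothetical
// check of the user's request (e.g. a flag set by a REST call or a marker file).
ssc.start()
new Thread(new Runnable {
  def run() {
    while (!stopRequested()) Thread.sleep(1000)
    ssc.stop()  // shuts down the running streaming job
  }
}).start()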
Thanks. So you mean that first()
triggers the computation of the WHOLE RDD? That does not sound
right, I thought it was lazy.
Guillaume
Hi,
You can call less expensive operations like first or take
to trigger the computation.
Hi,
I am a newbie to Spark and am starting to set up my development environment.
The first thing I noticed is that I need to be connected to the internet while
compiling. *Is that right?*
*How can I set up my development environment on a closed network?*
Thanks,
--
Eran | CTO
What I did on the VPC in Amazon is allow all outgoing traffic from it to
the world, and allow all traffic to flow freely on ingress WITHIN my
subnet (so, for example, if your subnet is 10.10.0.0/24, you allow all
machines on the network to connect to each other on any port), but that's
about all
Hi,
We enabled JmxSink in metrics.properties, but we only see basic metrics like
memory usage. How can we introduce application-level metrics, such as
throughput (messages processed / sec)?
We tried to create a MetricRegistry and increment a counter in the map
function, but that doesn't help.
Or is
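(One alternative worth mentioning, sketched here as an assumption rather than something from this thread: count processed records with a Spark accumulator and derive the throughput on the driver.)
// Sketch only: input, transform() and the timing logic are illustrative placeholders.
val processed = sc.accumulator(0L)
val start = System.currentTimeMillis
val result = input.map { msg =>
  processed += 1L
  transform(msg)
}
result.count()  // accumulator values are only reliable after an action has run
val elapsedSec = (System.currentTimeMillis - start) / 1000.0
println("throughput: " + (processed.value / elapsedSec) + " msgs/sec")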
Hi Vipul
Have you tried looking at the HBase and Cassandra examples under the Spark
examples project? They use custom InputFormats and may provide guidance as
to how to go about using the relevant Protobuf InputFormat.
On Mon, Jan 20, 2014 at 11:48 PM, Vipul Pandey vipan...@gmail.com wrote:
Hi Vipul,
I use something like this to read from LZO-compressed text files; it may be
helpful:
import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
val sc = new SparkContext(sparkMaster,
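(For reference, a rough reconstruction of how that snippet might continue; sparkMaster is assumed to hold the master URL, and the app name and input path are placeholders rather than content from the original mail. newAPIHadoopFile is the standard way to plug in a custom InputFormat like this.)
import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

val sc = new SparkContext(sparkMaster, "appName")  // "appName" is a placeholder
val job = new Job()
val lines = sc.newAPIHadoopFile(
  "hdfs:///path/to/lzo/files",     // placeholder path
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  job.getConfiguration
).map(_._2.toString)               // keep only the text of each line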
Hi,
Sorry for the late reply, this kind of slipped under my radar.
On 01/13/2014 12:12 PM, Aureliano Buendia wrote:
On Mon, Jan 13, 2014 at 5:59 PM, Josh Rosen rosenvi...@gmail.com wrote:
If you'd like to use Spark with Docker, the AMPLab's Docker scripts might
be a nice starting point:
Surprisingly, this turned out to be more complicated than I expected.
I had the impression that this would be trivial in Spark. Am I missing
something here?
On Tue, Jan 21, 2014 at 5:42 AM, Aureliano Buendia buendia...@gmail.comwrote:
Hi,
It seems Spark does not support nested RDDs,
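(That is indeed the case: RDD operations cannot be nested inside other RDD operations. A common workaround, offered here only as a general sketch and not as something from this thread, is to combine the two datasets with cartesian or join instead of nesting.)
// rddA, rddB, keyedA and keyedB are placeholder names for the datasets involved.
val everyPair = rddA.cartesian(rddB)   // all (a, b) combinations in a single job
val joined = keyedA.join(keyedB)       // when both sides are (K, V) pair RDDs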
Hi,
The data is read from HDFS using textFile, and then a map function is
applied, as in the following code, to get the format right so it can be fed
into MLlib training algorithms.
val rddFile = sc.textFile("Some file on HDFS")
val rddData = rddFile.map(line => {
  val temp =
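(A sketch of one common way that map could end up, assuming comma-separated lines with the label in the first field; the format and the Array[Double]-based LabeledPoint are assumptions about the MLlib API of that era, not the poster's actual code.)
import org.apache.spark.mllib.regression.LabeledPoint

val rddData = rddFile.map(line => {
  val parts = line.split(',').map(_.toDouble)  // assumed CSV of doubles
  LabeledPoint(parts(0), parts.tail)           // label first, rest as features
})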
If you don’t cache the RDD, the computation will happen over and over each time
we scan through it. This is done to save memory in that case and because Spark
can’t know at the beginning whether you plan to access a dataset multiple
times. If you’d like to prevent this, use cache(), or maybe
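(For example, something along these lines keeps the parsed data in memory so repeated scans do not recompute it; rddData is just the name from the snippet above.)
rddData.cache()   // mark the RDD to be kept in memory once computed
rddData.count()   // the first action materializes and caches the partitions
rddData.first()   // later actions reuse the cached data instead of recomputing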
Hi Matei,
It does make sense that the computation will happen over and over each
time. If I understand correctly, do you mean that it will only compute the
transformation of one particular line and then discard it to save
memory?
Or will it create the full result of the map operation, and
Hi Tathagata-
Thanks for your help. I appreciate your suggestions. I am having trouble
following the code for mapPartitions.
According to the documentation I need to pass a function of type
Iterator[T], and thus I receive an error on the foreach.
var textFile = sc.textFile("input.txt")
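(For what it's worth, a small sketch of the shape mapPartitions expects; the per-partition line count is only an illustration, not the code from this thread.)
// mapPartitions takes Iterator[T] => Iterator[U]; the body must return an Iterator.
val linesPerPartition = textFile.mapPartitions(iter => {
  var n = 0
  iter.foreach(_ => n += 1)  // iterating is fine here, since iter is an Iterator[String]
  Iterator(n)                // wrap the result back into an Iterator
})
linesPerPartition.collect().foreach(println)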
Just FYI I got it working in standalone mode with no problems, so thanks for
the help Tathagata. I was not able to get it working on YARN, so I gave up.
Thanks,
Mike
On Fri, Jan 10, 2014 at 6:17 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:
Let me know if you have any trouble getting it
This is what docs/configuration.md says about the property:
Default number of tasks to use for distributed shuffle operations
(groupByKey, reduceByKey, etc) when not set by user.
If I set this property to, let's say, 4 - what does this mean? 4 tasks per
core, per worker,
Documentation suggestion:
Default number of tasks to use *across the cluster* for distributed shuffle
operations (groupByKey, reduceByKey, etc) when not set by user.
Ognen, would that have clarified it for you?
On Tue, Jan 21, 2014 at 3:35 PM, Matei Zaharia
Hello
I'm trying to run Shark 0.8.1-rc0 with Spark 0.8.1 but have had no luck.
I use Hadoop 2.2.0 and set SHARK_HADOOP_VERSION=2.2.0 for sbt/sbt package,
but I am not able to connect to the master.
I have Hadoop 2.2.0 running perfectly and also Spark 0.8.1 standalone.
The shell works perfectly,
On Tue, Jan 21, 2014 at 10:37 PM, Andrew Ash and...@andrewash.com wrote:
Documentation suggestion:
Default number of tasks to use *across the cluster* for distributed
shuffle operations (groupByKey, reduceByKey,
etc) when not set by user.
Ognen would that have clarified
https://github.com/apache/incubator-spark/pull/489
On Tue, Jan 21, 2014 at 3:41 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:
On Tue, Jan 21, 2014 at 10:37 PM, Andrew Ash and...@andrewash.com wrote:
Documentation suggestion:
Default number of tasks to use *across the cluster* for
Thanks Nick. I really appreciate your help. I understand the iterator and
have modified the code to return the iterator. What I am noticing is that
it returns:
import collection.mutable.HashMap
import collection.mutable.ListBuffer
def getArray(line: String): List[Int] = {
var a =
Hi Rajeev,
Did you get past this exception?
Thanks,
Vipul
On Dec 26, 2013, at 12:48 PM, Rajeev Srivastava raj...@silverline-da.com
wrote:
Hi Andrew,
Thanks for your example.
I used your command and I get the following errors from the worker (missing
codec on the worker, I guess)
How do
Hi Vipul,
Andrew Ash suggested an answer which I have yet to try.
Apparently his experiment worked for his LZO files. I don't think I will be
able to try his suggestions before Feb.
Do share if his solution works for you.
regards
Rajeev
Rajeev Srivastava
Silverline Design Inc
2118 Walsh ave,