On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav wrote:
> If your data is in HDFS and you are reading it as textFile, and each file
> is smaller than the block size, my understanding is that you would always
> get one partition per file.
>
> On Thursday, November 13, 2014, Daniel Siegmann wrote:
>
> The same goes for any other action I am trying to perform inside the map
> statement. I am failing to understand what I am doing wrong.
> Can anyone help with this?
>
> Thanks,
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
--
Daniel Siegmann
...every RDD:
val rdds = paths.map { path =>
  sc.textFile(path).map(myFunc)
}
val completeRdd = sc.union(rdds)
Does that make any sense?
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
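
For reference, a minimal self-contained sketch of that per-path union
pattern (the paths and myFunc below are hypothetical stand-ins):

import org.apache.spark.{SparkConf, SparkContext}

object UnionExample {
  // Placeholder transform standing in for myFunc above.
  def myFunc(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("union-example").setMaster("local[*]"))
    val paths = Seq("hdfs:///data/part1", "hdfs:///data/part2") // hypothetical
    // One RDD per path, then a single union over all of them.
    val rdds = paths.map(path => sc.textFile(path).map(myFunc))
    val completeRdd = sc.union(rdds)
    println(completeRdd.count())
    sc.stop()
  }
}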
> ...MacAddress to determine which machine is running the
> code.
> As far as I can tell a simple word count is running in one thread on one
> machine and the remainder of the cluster does nothing.
> This is consistent with tests where I write to stdout from functions and
> see little
I've never used Mesos, sorry.
On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis wrote:
> The cluster runs Mesos and I can see the tasks in the Mesos UI, but most
> are not doing much. Any hints about that UI?
>
> On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann <
> daniel.siegm...@velos.io> wrote:
...Accumulables seem to have some extra complications and overhead.
>
> So…
>
> What’s the real difference between an accumulator/accumulable and
> aggregating an RDD? When is one method of aggregation preferred over the
> other?
>
> Thanks,
>
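
For reference, a minimal sketch contrasting the two styles in question,
using the Spark 1.x accumulator API of this era (data is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object AggregationStyles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("agg-styles").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 100)

    // Style 1: aggregate the RDD itself; the result is an action's return value.
    val sum1 = nums.reduce(_ + _)

    // Style 2: an accumulator updated as a side effect of an action.
    // Updates made inside transformations may be re-applied on recomputation,
    // so accumulators are safest inside actions such as foreach.
    val acc = sc.accumulator(0)
    nums.foreach(n => acc += n)
    val sum2 = acc.value

    println(s"reduce: $sum1, accumulator: $sum2")
    sc.stop()
  }
}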
> Is there a mechanism similar to MR where we can ensure each
> partition is assigned some amount of data by size, by setting some block
> size parameter?
>
> On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann wrote:
>
>> On Thu, Nov 13, 2014 at 3:24 PM, Pala
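
For reference, a minimal sketch of the usual knobs here. Spark does not
take a target partition size directly; minPartitions and repartition are
the closest equivalents (the path below is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionControl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitions").setMaster("local[*]"))

    // Ask for at least 16 input splits; the actual count also depends on
    // the HDFS block size.
    val lines = sc.textFile("hdfs:///data/input", 16)

    // Or rebalance after reading, at the cost of a full shuffle.
    val rebalanced = lines.repartition(32)

    println(s"input: ${lines.partitions.length}, " +
      s"rebalanced: ${rebalanced.partitions.length}")
    sc.stop()
  }
}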
> ...originalRow)
> as map/reduce tasks such that rows with the same original string key get
> the same consecutive numeric key?
>
> Any hints?
>
> best,
> /Shahab
>
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
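
For reference, one common approach to consecutive numeric keys is
zipWithIndex over the distinct keys, followed by a join back to the rows.
A minimal sketch with illustrative data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

object NumericKeys {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("numeric-keys").setMaster("local[*]"))
    val rows = sc.parallelize(
      Seq(("apple", "row1"), ("banana", "row2"), ("apple", "row3")))

    // One consecutive Long id per distinct string key.
    val keyIds = rows.keys.distinct().zipWithIndex()

    // Re-key each row by its numeric id; rows sharing a string key share an id.
    val rekeyed = rows.join(keyIds).map { case (_, (row, id)) => (id, row) }

    rekeyed.collect().foreach(println)
    sc.stop()
  }
}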
...would be
to define my own equivalent of PairRDDFunctions which works with my class,
does type conversions to Tuple2, and delegates to PairRDDFunctions.
Does anyone know a better way? Anyone know if there will be a significant
performance penalty with that approach?
--
Daniel Siegmann, Software Dev
...at 7:45 PM, Michael Armbrust wrote:
> I think you should also be able to get away with casting it back and forth
> in this case using .asInstanceOf.
>
> On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann wrote:
>
>> I have a class which is a subclass of Tuple2...
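
For reference, a sketch of the delegation idea under discussion: an
implicit wrapper that converts a custom pair type to Tuple2 and delegates
to PairRDDFunctions. MyPair and the wrapper are hypothetical; note that
each delegated call adds a map conversion in both directions, which is
the overhead being asked about:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import scala.reflect.ClassTag

case class MyPair[K, V](key: K, value: V)

object MyPairFunctions {
  implicit class MyPairRDD[K: ClassTag, V: ClassTag](rdd: RDD[MyPair[K, V]]) {
    private def asTuples: RDD[(K, V)] = rdd.map(p => (p.key, p.value))

    // Delegate to PairRDDFunctions and convert back afterwards.
    def reduceByKey(f: (V, V) => V): RDD[MyPair[K, V]] =
      asTuples.reduceByKey(f).map { case (k, v) => MyPair(k, v) }
  }
}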
> Say I have two RDDs with the following values
>
> x = [(1, 3), (2, 4)]
>
> and
>
> y = [(3, 5), (4, 7)]
>
> and I want to have
>
> z = [(1, 3), (2, 4), (3, 5), (4, 7)]
>
> How can I achieve this? I know you can use outerJoin followed by map to
> achieve this, but is there a more direct way?
>
--
Daniel
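
For reference, the direct route is a plain RDD union; no join is needed.
A minimal sketch with the values from the question:

import org.apache.spark.{SparkConf, SparkContext}

object UnionPairs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("union-pairs").setMaster("local[*]"))
    val x = sc.parallelize(Seq((1, 3), (2, 4)))
    val y = sc.parallelize(Seq((3, 5), (4, 7)))

    val z = x.union(y) // or equivalently: x ++ y
    z.collect().foreach(println) // (1,3) (2,4) (3,5) (4,7)
    sc.stop()
  }
}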
I am trying to load a Parquet file which has a comma in its name. Yes, this
is a valid file name in HDFS. However, sqlContext.parquetFile interprets
this as a comma-separated list of parquet files.
Is there any way to escape the comma so it is treated as part of a single
file name?
--
Daniel
Thanks for the replies. Hopefully this will not be too difficult to fix.
Why not support multiple paths by overloading the parquetFile method to
take a collection of strings? That way no delimiter is needed at all.
On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao wrote:
> I’ve created a JIRA...
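
For reference, a sketch of the proposed overload written as a user-side
helper; parquetFiles is a hypothetical name, not part of the Spark SQL
API of this era:

import org.apache.spark.sql.SQLContext

object MultiPathParquet {
  // Read each literal path with the single-path API and union the results,
  // so no delimiter needs to be parsed out of the file names.
  def parquetFiles(sqlContext: SQLContext, paths: Seq[String]) =
    paths.map(sqlContext.parquetFile).reduce(_ unionAll _)
}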
...I get the compiler error "value join is not a member of
org.apache.spark.rdd.RDD[(K, Int)]".
The reason is probably obvious, but I don't have much Scala experience. Can
anyone explain what I'm doing wrong?
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
>> import org.apache.spark.rdd.RDD
>> import scala.reflect.ClassTag
>>
>> def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]):
>>     RDD[(K, Int)] = {
>>   rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
>> }
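
For reference, the likely cause in Spark of this era (pre-1.3): join is
defined on PairRDDFunctions, and the implicit conversion that adds it to
an RDD of pairs must be imported explicitly. A minimal compiling sketch:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // brings rddToPairRDDFunctions into scope
import scala.reflect.ClassTag

object JoinTest {
  def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] =
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
}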
The Spark context can be initialized wherever you like and passed around
just as any other object. Just don't try to create multiple contexts
against "local" (without stopping the previous one first), or you may get
ArrayStoreExceptions (I learned that one the hard way).
--
Daniel Siegmann,
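
For reference, a minimal sketch of that advice: one context created up
front, passed around as an argument, and stopped before any new one is
created (the word-count helper and path are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object ContextLifecycle {
  def wordCount(sc: SparkContext, path: String): Long =
    sc.textFile(path).flatMap(_.split("\\s+")).count()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ctx").setMaster("local[*]"))
    try {
      // Pass the context to anything that needs it, e.g.:
      // wordCount(sc, "hdfs:///data/input")
    } finally {
      sc.stop() // stop before ever creating another "local" context
    }
  }
}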
...it would be better to say I
need a way to ensure the accumulator value is computed exactly once for a
given RDD. Anyone know a way to do this? Or anything I might look into? Or
is this something that just isn't supported in Spark?
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
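
For reference, one workaround sketch: cache the RDD and update the
accumulator only inside an action, since the documented once-per-task
guarantee for accumulator updates applies to actions, not transformations
(data is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ExactlyOnceAccumulator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("acc-once").setMaster("local[*]"))
    val data = sc.parallelize(1 to 10).cache() // avoid recomputation

    val acc = sc.accumulator(0) // Spark 1.x accumulator API
    data.foreach(n => acc += n) // action: updates applied once per task

    println(acc.value)
    sc.stop()
  }
}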
The Spark 1.0.0 release notes state "Internal instrumentation has been
added to allow applications to monitor and instrument Spark jobs." Can
anyone point me to the docs for this?
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR