Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav wrote: > If your data is in HDFS and you are reading it as textFile and each file is > less than the block size, my understanding is it would always have one > partition per file. > > On Thursday, November 13, 2014, Daniel Siegmann > wrote:
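For reference, a minimal sketch (not from the thread; the path and app name are made up) that checks how many partitions sc.textFile actually produces for a directory of small HDFS files:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-count"))
        val rdd = sc.textFile("hdfs:///data/small-files/*")
        // With files smaller than the HDFS block size, this typically
        // reports one partition per file (one input split each).
        println(s"partitions = ${rdd.partitions.length}")
        sc.stop()
      }
    }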

Re: Accessing RDD within another RDD map

2014-11-13 Thread Daniel Siegmann
The same goes for any other action I am trying to perform inside the map > statement. I am failing to understand what I am doing wrong. > Can anyone help with this? > > Thanks, > Simone Franzini, PhD > > http://www.linkedin.com/in/simonefranzini > -- Daniel Siegmann
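For context, a sketch of the usual constraint behind this kind of failure: an RDD cannot be referenced inside another RDD's transformations, since those run on executors while RDDs exist only on the driver. One common workaround, assuming the inner RDD is small (the names below are illustrative, not from the thread), is to collect it and broadcast the result:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def resolveWithBroadcast(sc: SparkContext,
                             bigRdd: RDD[String],
                             smallRdd: RDD[(String, Int)]): RDD[(String, Int)] = {
      // Collect the small pair RDD on the driver and ship it as a broadcast.
      val lookup = sc.broadcast(smallRdd.collectAsMap())
      bigRdd.map { key =>
        // Executors read the broadcast value instead of touching an RDD.
        (key, lookup.value.getOrElse(key, -1))
      }
    }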

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
very RDD val rdds = paths.map { path => sc.textFile(path).map(myFunc) } val completeRdd = sc.union(rdds) Does that make any sense? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io
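A fleshed-out version of the approach quoted above, with myFunc and the input paths left as placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // One RDD per input path, then a single union over all of them.
    def loadAll(sc: SparkContext, paths: Seq[String])(myFunc: String => String): RDD[String] = {
      val rdds = paths.map(path => sc.textFile(path).map(myFunc))
      sc.union(rdds)
    }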

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Daniel Siegmann
g MacAddress to determine which machine is running the > code. > As far as I can tell a simple word count is running in one thread on one > machine and the remainder of the cluster does nothing. > This is consistent with tests where I write to stdout from functions and > see little
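As general background (not the thread's resolution): a stage only runs as many concurrent tasks as it has partitions, so one way to spread a word count across the cluster is to request more partitions when reading and shuffling. A sketch, assuming an existing SparkContext sc, the usual org.apache.spark.SparkContext._ import, and an illustrative input path:

    // Request at least 32 input splits when reading the file.
    val lines = sc.textFile("hdfs:///input/words.txt", 32)

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 32) // also control the shuffle's partition count

    println(s"tasks per stage ~= ${counts.partitions.length}")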

Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry. On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis wrote: > The cluster runs Mesos and I can see the tasks in the Mesos UI but most > are not doing much - any hints about that UI > > On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann < > daniel.s

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
Accumulables seem to have some extra complications and overhead. > > So… > > What’s the real difference between an accumulator/accumulable and > aggregating an RDD? When is one method of aggregation preferred over the > other? > > Thanks,
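To make the comparison concrete, here is a small RDD.aggregate sketch (an assumption about what "aggregating an RDD" might look like here; nums is a hypothetical RDD[Int]) that computes a sum and count in one pass:

    val (sum, count) = nums.aggregate((0L, 0L))(
      (acc, n) => (acc._1 + n, acc._2 + 1L),   // fold each value into a partition's accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge the per-partition accumulators
    )
    println(s"mean = ${sum.toDouble / count}")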

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
Is there a mechanism similar to MR where we can ensure each > partition is assigned some amount of data by size, by setting some block > size parameter? > > > > On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann > wrote: > >> On Thu, Nov 13, 2014 at 3:24 PM, Pala

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
inalRow) > as map/reduce tasks such that rows with the same original string key get the same > numeric consecutive key? > > Any hints? > > best, > /Shahab > -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io
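One possible approach, offered as an assumption rather than the thread's answer: assign each distinct string key a consecutive numeric id with zipWithIndex, then join that mapping back onto the rows. A sketch:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def withNumericIds(rows: RDD[(String, String)]): RDD[(Long, String)] = {
      // Each distinct string key gets a consecutive Long id.
      val ids: RDD[(String, Long)] = rows.keys.distinct().zipWithIndex()
      // Rows sharing the same string key end up with the same numeric key.
      rows.join(ids).map { case (_, (row, id)) => (id, row) }
    }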

PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
would be to define my own equivalent of PairRDDFunctions which works with my class, does type conversions to Tuple2, and delegates to PairRDDFunctions. Does anyone know a better way? Anyone know if there will be a significant performance penalty with that approach? -- Daniel Siegmann, Software Developer

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
at 7:45 PM, Michael Armbrust wrote: > I think you should also be able to get away with casting it back and forth > in this case using .asInstanceOf. > > On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann > wrote: > >> I have a class which is a subclass of Tuple2
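A sketch of what that cast might look like, using a hypothetical Tuple2 subclass (the class name and key/value types are made up, and the SparkContext._ import assumes the pre-1.3 implicit conversions):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Hypothetical Tuple2 subclass standing in for the one from the thread.
    class KeyedRecord(key: String, value: Int) extends Tuple2[String, Int](key, value)

    def sumByKey(records: RDD[KeyedRecord]): RDD[(String, Int)] = {
      // Cast to a plain pair RDD to pick up PairRDDFunctions, as suggested above.
      val pairs = records.asInstanceOf[RDD[(String, Int)]]
      pairs.reduceByKey(_ + _)
    }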

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
x = [(1, 3), (2, 4)] > > and > > y = [(3, 5), (4, 7)] > > and I want to have > > z = [(1, 3), (2, 4), (3, 5), (4, 7)] > > How can I achieve this? I know you can use outerJoin followed by map to > achieve this, but is there a more direct way for this. > -- Daniel Siegmann
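When the key sets are disjoint, one direct option is a plain union, sketched below as an assumption rather than the thread's actual reply (sc is an existing SparkContext):

    val x = sc.parallelize(Seq((1, 3), (2, 4)))
    val y = sc.parallelize(Seq((3, 5), (4, 7)))
    // No join needed: disjoint keys mean union already gives the combined pairs.
    val z = x.union(y) // [(1, 3), (2, 4), (3, 5), (4, 7)]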

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
e: >> >>> Say I have two RDDs with the following values >>> >>> x = [(1, 3), (2, 4)] >>> >>> and >>> >>> y = [(3, 5), (4, 7)] >>> >>> and I want to have >>> >>> z = [(1, 3),

Escape commas in file names

2014-12-23 Thread Daniel Siegmann
I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a comma-separated list of parquet files. Is there any way to escape the comma so it is treated as part of a single file name? -- Daniel

Re: Escape commas in file names

2014-12-26 Thread Daniel Siegmann
Thanks for the replies. Hopefully this will not be too difficult to fix. Why not support multiple paths by overloading the parquetFile method to take a collection of strings? That way we don't need an appropriate delimiter. On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao wrote: > I’ve created a ji

Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
compiler error "value join is not a member of org.apache.spark.rdd.RDD[(K, Int)]". The reason is probably obvious, but I don't have much Scala experience. Can anyone explain what I'm doing wrong? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR

Re: Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
_ >> import org.apache.spark.rdd.RDD >> import scala.reflect.ClassTag >> >> def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : >> RDD[(K, Int)] = { >> >> rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) } >> } >&

Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
The Spark context can be initialized wherever you like and passed around just as any other object. Just don't try to create multiple contexts against "local" (without stopping the previous one first), or you may get ArrayStoreExceptions (I learned that one the hard way). -- Daniel Siegmann,
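A sketch of that pattern, with illustrative names: create the context once, pass it to whatever needs it, and stop it before any other context is created against "local":

    import org.apache.spark.{SparkConf, SparkContext}

    object App {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local"))
        try {
          runJob(sc) // the context is passed around like any other object
        } finally {
          sc.stop()  // stop it before any other "local" context is created
        }
      }

      def runJob(sc: SparkContext): Unit = {
        println(sc.parallelize(1 to 10).count())
      }
    }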

Counting things only once

2014-05-16 Thread Daniel Siegmann
better to say I need a way to ensure the accumulator value is computed exactly once for a given RDD. Anyone know a way to do this? Or anything I might look into? Or is this something that just isn't supported in Spark? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning
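A sketch of the underlying concern, as general Spark 1.x behavior rather than anything stated in the thread: an accumulator updated inside a transformation is incremented again whenever that stage is recomputed, while a value returned by an action is computed fresh on each call. The SparkContext sc and the path are assumed:

    import org.apache.spark.SparkContext._ // Spark 1.x accumulator/pair-RDD implicits

    val processed = sc.accumulator(0L)
    val lengths = sc.textFile("hdfs:///input/data.txt").map { line =>
      processed += 1L // runs again whenever this RDD is recomputed
      line.length
    }
    lengths.count()
    lengths.count() // recomputes the map stage, so `processed` is roughly doubled

    // A value produced by an action (count, reduce, aggregate) is returned
    // freshly computed on every call, so it does not double count.
    val exact = lengths.count()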

Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Daniel Siegmann
The Spark 1.0.0 release notes state "Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs." Can anyone point me to the docs for this? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR
