I've played around with it.  The CSV file looks like it gives 130
partitions, which I'm assuming comes from the standard 64MB HDFS block
size.  I have increased the number of partitions and the number of tasks
for things like groupByKey.  Usually I start blowing up on "GC overhead
limit exceeded" or sometimes a heap OOM.  I recently tried throwing
coalesce with shuffle = true into the mix, thinking it would bring the
keys into the same partition.  E.g.,

    (fileA ++ fileB.map{case (k, v) => (k, Array(v))})
      .coalesce(fileA.partitions.length + fileB.partitions.length,
        shuffle = true)
      .groupBy...

(This should effectively imitate map-reduce.)  But I still see "GC
overhead limit exceeded" when I do that.
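
For what it's worth, my understanding is that the more direct way to
force co-location is an explicit HashPartitioner via partitionBy rather
than coalesce.  A rough sketch of what I mean (untested here; the
partition count of 256 is an arbitrary placeholder):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // PairRDDFunctions implicits

    val part = new HashPartitioner(256)  // placeholder partition count

    // Hash-partition both sides by key so matching keys land in the
    // same partition.
    val a = fileA.partitionBy(part)
    val b = fileB.partitionBy(part)

    // RDDs that share a partitioner can be joined without re-shuffling
    // either side.
    val joined = a.join(b)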

I've got a stock install with the number of cores and worker memory set
as mentioned below, but even something as simple as

    fileA.sortByKey().map{_ => 1}.reduce{_ + _}

blows up with "GC overhead limit exceeded" (as did .count in place of
the by-hand map/reduce count), whereas a plain

    fileA.count

works fine.  So it seems Spark can load the file as an RDD but not
actually manipulate it.
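
In case it's relevant, here's roughly how I understand the knobs on the
0.9 SparkConf side (the master URL and values below are placeholders,
not my actual settings):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values -- illustrative only, not my real config.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("join-toy-problem")
      .set("spark.executor.memory", "14g")     // headroom under 16g workers
      .set("spark.default.parallelism", "128") // more tasks per shuffle

    val sc = new SparkContext(conf)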




On Fri, Mar 28, 2014 at 3:04 AM, Sonal Goyal [via Apache Spark User List] <
ml-node+s1001560n3417...@n3.nabble.com> wrote:

> Have you tried setting the partitioning?
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Thu, Mar 27, 2014 at 10:04 AM, lannyripple <[hidden email]> wrote:
>
>> Hi all,
>>
>> I've got something that I think should be straightforward, but it's
>> not working, so I'm not getting it.
>>
>> I have an 8-node Spark 0.9.0 cluster that also runs HDFS.  Each worker
>> has 16g of memory and 8 cores.
>>
>> In HDFS I have a CSV file of 110M lines of 9 columns (e.g.,
>> [key,a,b,c...]).  I have another file of 25K lines containing keys,
>> some of which may appear in the CSV file.  (Yes, I know I should use
>> an RDBMS or Shark or something.  I'll get to that, but this is a toy
>> problem that I'm using to build some intuition with Spark.)
>>
>> Working on each file individually, Spark has no problem manipulating
>> either of them.  But if I try a join, or a union plus filter, I can't
>> seem to compute the join of the two files.  The code is along the
>> lines of
>>
>> val fileA = sc.textFile("hdfs://.../fileA_110M.csv")
>>   .map{_.split(",")}.keyBy{_(0)}
>> val fileB = sc.textFile("hdfs://.../fileB_25k.csv").keyBy{x => x}
>>
>> And trying things like fileA.join(fileB) gives me a heap OOM.  Trying
>>
>> (fileA ++ fileB.map{case (k, v) => (k, Array(v))})
>>   .groupBy{_._1}
>>   .filter{case (k, pairs) => pairs.exists{_._2.length == 1}}
>>
>> just causes Spark to freeze.  (In all of these attempts I use a final
>> .count to force the result.)
>>
>> I suspect I'm missing something fundamental about bringing the keyed
>> data together into the same partitions so it can be efficiently
>> joined, but I've given up for now.  If anyone can shed some light on
>> what I'm not understanding (beyond "No, really.  Use Shark."), it
>> would be most helpful.
>>



