Difference between MEMORY_ONLY and MEMORY_AND_DISK

2015-08-18 Thread Harsha HN
Hello Sparkers,

I would like to understand the difference between these storage levels for an
RDD portion that doesn't fit in memory.
It seems that in both storage levels, whatever portion doesn't fit in memory
will be spilled to disk. Is there any difference?
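
For reference, the Spark programming guide draws the distinction exactly here:
under MEMORY_ONLY, partitions that don't fit are not spilled at all; they are
dropped and recomputed from lineage each time they are needed, whereas
MEMORY_AND_DISK writes them to disk and reads them back. A minimal sketch of
applying each level (the input path is made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

JavaSparkContext sc = new JavaSparkContext(
    new SparkConf().setAppName("storage-levels").setMaster("local"));
JavaRDD<String> lines = sc.textFile("hdfs:///some/input"); // example path

// MEMORY_ONLY: partitions that don't fit in memory are simply not cached;
// they are recomputed from lineage each time they are needed.
lines.persist(StorageLevel.MEMORY_ONLY());

// MEMORY_AND_DISK: partitions that don't fit in memory are written to disk
// and read back from there, trading disk I/O for recomputation.
// lines.persist(StorageLevel.MEMORY_AND_DISK());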

Thanks,
Harsha


SPARK UI - Details post job processing

2014-09-25 Thread Harsha HN
Hi,

The details laid out in the Spark UI for a job in progress are really
interesting and very useful.
But they vanish once the job is done.
Is there a way to get the job details after processing?

I am looking for the Spark UI data, not the standard input, output, and error
info.
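
For what it's worth, Spark can keep these UI details around if event logging
is enabled; the history server then rebuilds the web UI for completed
applications from the logged events. A sketch of enabling it programmatically
(the log directory is an example path; point it wherever your history server
is configured to read):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Log the events the web UI is built from to a shared directory; the
// history server reconstructs the UI for finished applications from them.
SparkConf conf = new SparkConf()
    .setAppName("my-job")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///spark-events"); // example path
JavaSparkContext sc = new JavaSparkContext(conf);

The same two properties can also go in conf/spark-defaults.conf;
./sbin/start-history-server.sh then serves the UIs of completed applications,
on port 18080 by default.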

Thanks,
Harsha


Working on LZOP Files

2014-09-25 Thread Harsha HN
Hi,

Is anybody processing LZOP files in Spark?

We have a huge volume of LZOP files in HDFS to process through Spark. The
MapReduce framework automatically detects the file format and sends the
decompressed records to the mappers.
Is there any such support in Spark?
As of now I am manually downloading and decompressing the files before
processing.
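
In case it helps, here is a sketch assuming the hadoop-lzo library (which
provides com.hadoop.mapreduce.LzoTextInputFormat) and the native LZO codec
are installed on the cluster; the input format handles decompression itself,
so no manual download/decompress step is needed:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
import com.hadoop.mapreduce.LzoTextInputFormat;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lzo-read"));

// The input format decompresses each record itself, so the RDD sees plain text.
JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
    "hdfs:///data/logs/*.lzo", // example path
    LzoTextInputFormat.class, LongWritable.class, Text.class,
    sc.hadoopConfiguration());
JavaRDD<String> lines = records.map(
    new Function<Tuple2<LongWritable, Text>, String>() {
      @Override
      public String call(Tuple2<LongWritable, Text> kv) {
        return kv._2().toString();
      }
    });

Note that plain .lzo files are not splittable until indexed (hadoop-lzo ships
a DistributedLzoIndexer job for this); an unindexed file is read as a single
split.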

Thanks,
Harsha


PairRDD's lookup method Performance

2014-09-18 Thread Harsha HN
Hi All,

My question is about improving the performance of pairRDD's lookup method.
I went through the link below, where Tathagata Das explains creating a
HashMap over partitions using the mapPartitions method to get O(1) search
performance:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-RDD-over-hashmap-td893.html

How can this be done in Java? HashMap is not a supported return type for
any overloaded version of the mapPartitions methods.
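
One reading of the API: mapPartitions doesn't need HashMap to be a
"supported" return type, because its element type is an arbitrary generic
parameter; the function just has to return an Iterable of them. A sketch
against the Spark 1.x Java API (where FlatMapFunction.call returns an
Iterable), emitting exactly one HashMap per partition; the keys and values
are made up:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext(
    new SparkConf().setAppName("partition-lookup").setMaster("local"));
JavaPairRDD<String, String> pairs = sc.parallelizePairs(Arrays.asList(
    new Tuple2<String, String>("k1", "v1"),
    new Tuple2<String, String>("k2", "v2")));

// Build one HashMap per partition; the resulting RDD's element type is
// HashMap, which mapPartitions is happy to produce.
JavaRDD<HashMap<String, String>> mapsPerPartition = pairs.mapPartitions(
    new FlatMapFunction<Iterator<Tuple2<String, String>>, HashMap<String, String>>() {
      @Override
      public Iterable<HashMap<String, String>> call(Iterator<Tuple2<String, String>> it) {
        HashMap<String, String> index = new HashMap<String, String>();
        while (it.hasNext()) {
          Tuple2<String, String> kv = it.next();
          index.put(kv._1(), kv._2());
        }
        // Emit exactly one HashMap for this partition.
        return Collections.singletonList(index);
      }
    });

A lookup then becomes a map over mapsPerPartition that probes each
partition's HashMap in O(1) instead of scanning every element.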

Thanks and Regards,
Harsha


Re: Adjacency List representation in Spark

2014-09-18 Thread Harsha HN
Hi Andrew,

The only reason I avoided the GraphX approach is that I didn't see any
explanation on the Java side, nor any API documentation for Java.
Do you have any code sample using the GraphX API in Java?
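
Not from any official documentation, but GraphX can be driven from Java by
calling its Scala API directly and supplying the implicit ClassTags by hand.
A rough, untested sketch against the Spark 1.1 signature of Graph.apply, with
illustrative vertex and edge data:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.graphx.Edge;
import org.apache.spark.graphx.Graph;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class JavaGraphXSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("graphx-from-java").setMaster("local"));

    // Vertices are (id, attribute) pairs; GraphX vertex ids are Scala Longs,
    // which appear as Object in the Java view of the Scala API.
    JavaRDD<Tuple2<Object, String>> vertices = sc.parallelize(Arrays.asList(
        new Tuple2<Object, String>(1L, "a"),
        new Tuple2<Object, String>(2L, "b")));
    JavaRDD<Edge<Double>> edges = sc.parallelize(Arrays.asList(
        new Edge<Double>(1L, 2L, 0.5)));

    // The ClassTags that Scala supplies implicitly must be passed by hand.
    ClassTag<String> vTag = ClassTag$.MODULE$.apply(String.class);
    ClassTag<Double> eTag = ClassTag$.MODULE$.apply(Double.class);

    Graph<String, Double> graph = Graph.apply(
        vertices.rdd(), edges.rdd(), "missing", // default vertex attribute
        StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
        vTag, eTag);

    System.out.println(graph.vertices().count() + " vertices, "
        + graph.edges().count() + " edges");
  }
}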

Thanks,
Harsha

On Wed, Sep 17, 2014 at 10:44 PM, Andrew Ash  wrote:

> Hi Harsha,
>
> You could look through the GraphX source to see the approach taken there
> for ideas for your own implementation.  I'd recommend starting at
> https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385
> to see the storage technique.
>
> Why do you want to avoid using GraphX?
>
> Good luck!
> Andrew
>
> On Wed, Sep 17, 2014 at 6:43 AM, Harsha HN <99harsha.h@gmail.com>
> wrote:
>
>> Hello
>>
>> We are building an adjacency list to represent a graph. The vertices,
>> edges, and weights were extracted from HDFS files by a Spark job.
>> We expect the size of the adjacency list (hash map) to grow beyond 20 GB.
>> How can we represent this as an RDD so that it is distributed in nature?
>>
>> Basically we are trying to fit a HashMap (adjacency list) into a Spark
>> RDD. Is there any way other than GraphX?
>>
>> Thanks and Regards,
>> Harsha
>>
>
>


Adjacency List representation in Spark

2014-09-17 Thread Harsha HN
Hello

We are building an adjacency list to represent a graph. The vertices,
edges, and weights were extracted from HDFS files by a Spark job.
We expect the size of the adjacency list (hash map) to grow beyond 20 GB.
How can we represent this as an RDD so that it is distributed in nature?

Basically we are trying to fit a HashMap (adjacency list) into a Spark RDD.
Is there any way other than GraphX?
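
One possibility, sketched below with made-up ids and weights: keep the
adjacency list as a pair RDD keyed by source vertex, the distributed analogue
of HashMap<Long, List<Tuple2<Long, Double>>>, so the 20+ GB structure is
partitioned across the cluster instead of living in a single driver-side
HashMap:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext(
    new SparkConf().setAppName("adjacency").setMaster("local"));

// Each element is (srcVertex, (dstVertex, weight)).
JavaPairRDD<Long, Tuple2<Long, Double>> edges = sc.parallelizePairs(Arrays.asList(
    new Tuple2<Long, Tuple2<Long, Double>>(1L, new Tuple2<Long, Double>(2L, 0.5)),
    new Tuple2<Long, Tuple2<Long, Double>>(1L, new Tuple2<Long, Double>(3L, 1.5))));

// Group the out-edges of each vertex: (srcVertex -> all (dstVertex, weight)
// neighbours), spread across partitions like any other RDD.
JavaPairRDD<Long, Iterable<Tuple2<Long, Double>>> adjacency = edges.groupByKey();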

Thanks and Regards,
Harsha