Re: Gora Spark Backend Support (GORA-386) and Apache Crunch

Lewis John Mcgibbney Sat, 21 Mar 2015 13:43:40 -0700

Henry mentored Crunch through incubation... Maybe he can give you more
context.
For me, Gora is essentially an extremely easy storage abstraction
framework. I do not currently use the Query API, meaning that the analysis
of data is delegated to the Gora data store.
This is my current usage of the code base.


On Saturday, March 21, 2015, Furkan KAMACI <[email protected]> wrote:

> Hi Lewis,
>
> I am talking in the context of GORA-418 and GORA-386, so we can say GSoC. I've
> talked with Talat about the design of that implementation. I just wanted to
> check whether any other projects offer this kind of feature.
>
> Here is what I have in mind for Spark support in Apache Gora: developing a
> layer which abstracts the functionality of Spark, Tez, etc. (GORA-418). There
> will be an implementation for each of them (and Spark will be one of them:
> GORA-386).
>
> For example, you will write a word count example in Gora style, pick one
> of the implementations and run it (just as you store data in Solr or Mongo
> via Gora).
>
> When I checked Crunch I noticed that:
>
> "*Every Crunch job begins with a Pipeline instance that manages the
> execution lifecycle of your data pipeline. As of the 0.9.0 release, there
> are three implementations of the Pipeline interface:*
>
> *MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
> run locally or on a Hadoop cluster.*
> *MemPipeline: Executes a pipeline in-memory on the client.*
> *SparkPipeline: Executes a pipeline by running a series of Apache Spark
> jobs, either locally or on a Hadoop cluster.*"
>
> So, I am curious whether supporting Crunch may help us achieve what we want
> from Spark support in Gora. Actually, I am new to such projects; I want to
> learn what should be achieved with GORA-386 and not get lost because
> of overthinking :) As I see it, you could store your data in Gora style
> and run jobs in Gora style, with the flexibility of using
> either HDFS, Solr, MongoDB, etc. or MapReduce, Spark, Tez, etc.
>
> PS: I know there is a similar issue at Apache Gora for Cascading support:
> https://issues.apache.org/jira/browse/GORA-112
>
> Kind Regards,
> Furkan KAMACI
>
> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Hi Furkan,
>> In what context are we talking here?
>> GSoC or just development?
>> I am very keen to essentially work towards what we can release as Gora 1.0.
>> Thank you Furkan
>>
>>
>> On Saturday, March 21, 2015, Furkan KAMACI <[email protected]> wrote:
>>
>>> As you know, there is an issue for integrating Apache Spark with
>>> Apache Gora [1]. Apache Spark is a popular project and, in contrast to
>>> Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory
>>> primitives provide performance up to 100 times faster for certain
>>> applications [2]. There are also some alternatives to Apache Spark, e.g.
>>> Apache Tez [3].
>>>
>>> When implementing an integration for Spark, we should consider having
>>> an abstraction over such projects as part of the architectural design;
>>> there is a related issue for it [4].
>>>
>>> There is another Apache project, Apache Crunch [5], which aims to provide
>>> a framework for writing, testing, and running MapReduce pipelines.
>>> Its goal is to make pipelines that are composed of many user-defined
>>> functions simple to write, easy to test, and efficient to run. It is a
>>> high-level tool for writing data pipelines, as opposed to developing
>>> directly against the MapReduce, Spark, or Tez APIs [6].
>>>
>>> I would like to learn how Apache Crunch fits with creating a multi
>>> execution engine for Gora [4]. What benefits would we get from
>>> integrating Apache Gora with Apache Crunch, and what gaps would still
>>> remain, compared with developing a custom engine for our purpose?
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> [1] https://issues.apache.org/jira/browse/GORA-386
>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
>>> Shenker, Scott; Stoica, Ion (June 2013).
>>> [3] http://tez.apache.org/
>>> [4] https://issues.apache.org/jira/browse/GORA-418
>>> [5] https://crunch.apache.org/
>>> [6] https://crunch.apache.org/user-guide.html#motivation
>>>
>>
>>
>> --
>> *Lewis*
>>
>>
>

-- 
*Lewis*
