Hi Furkan,

Yes, you are right. In the code execution for Spark or Flink, Gora should be part of the data ingest and storage path.
So, is the idea to have the data store in Spark access Gora instead of the default store options?

- Henry

On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI <[email protected]> wrote:
> Hi Henry,
>
> So, as far as I understand, instead of wrapping Apache Spark within Gora
> with full functionality, I have to wrap its functionality of storing and
> accessing data. I mean, one will use Gora input/output formats, and at the
> backend they will be mapped to RDDs, so one will be able to run Map/Reduce
> via Apache Spark etc. over Gora.
>
> Kind Regards,
> Furkan KAMACI
>
> On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <[email protected]>
> wrote:
>>
>> Integration with Gora will mostly be in the data ingest portion of the
>> flow.
>>
>> Distributed processing frameworks like Spark or Flink already support
>> Hadoop input formats as data sources, so Gora should be usable directly
>> through the Gora input format.
>>
>> The interesting part is probably tighter integration, such as a custom
>> RDD or a custom data manager to store and get data from Gora directly.
>>
>> - Henry
>>
>> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
>> <[email protected]> wrote:
>> > Henry mentored Crunch through incubation... maybe he can give you more
>> > context.
>> > For me, Gora is essentially an extremely easy storage abstraction
>> > framework.
>> > I do not currently use the Query API, meaning that the analysis of
>> > data is delegated to the Gora data store.
>> > This is my current usage of the code base.
>> >
>> > On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
>> > wrote:
>> >>
>> >> Hi Lewis,
>> >>
>> >> I am talking in the context of GORA-418 and GORA-386 (we can say
>> >> GSoC). I've talked with Talat about the design of that implementation.
>> >> I just wanted to check whether any other projects have such a feature.
>> >>
>> >> Here is what is in my mind for Apache Gora's Spark support: developing
>> >> a layer which abstracts the functionality of Spark, Tez, etc.
>> >> (GORA-418). There will be implementations for each of them (and Spark
>> >> will be one of them: GORA-386).
>> >>
>> >> i.e. you will write a word count example Gora-style, use one of the
>> >> implementations, and run it (just like storing data at Solr or MongoDB
>> >> via Gora).
>> >>
>> >> When I checked Crunch, I realized that:
>> >>
>> >> "Every Crunch job begins with a Pipeline instance that manages the
>> >> execution lifecycle of your data pipeline. As of the 0.9.0 release,
>> >> there are three implementations of the Pipeline interface:
>> >>
>> >> MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
>> >> run locally or on a Hadoop cluster.
>> >> MemPipeline: Executes a pipeline in-memory on the client.
>> >> SparkPipeline: Executes a pipeline by running a series of Apache Spark
>> >> jobs, either locally or on a Hadoop cluster."
>> >>
>> >> So, I am curious whether supporting Crunch may help us achieve what we
>> >> want with Spark support in Gora. Actually, I am new to such projects; I
>> >> want to learn what should be achieved with GORA-386 and not get lost in
>> >> overthinking :) I see that you can store your data Gora-style and run
>> >> jobs Gora-style, but have the flexibility of using either HDFS, Solr,
>> >> MongoDB, etc. or MapReduce, Spark, Tez, etc.
>> >>
>> >> PS: I know there is a similar issue at Apache Gora for Cascading
>> >> support: https://issues.apache.org/jira/browse/GORA-112
>> >>
>> >> Kind Regards,
>> >> Furkan KAMACI
>> >>
>> >> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
>> >> <[email protected]> wrote:
>> >>>
>> >>> Hi Furkan,
>> >>> In what context are we talking here?
>> >>> GSoC or just development?
>> >>> I am very keen to essentially work towards what we can release as
>> >>> Gora 1.0.
>> >>> Thank you Furkan
>> >>>
>> >>> On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> As you know, there is an issue for integrating Apache Spark and
>> >>>> Apache Gora [1]. Apache Spark is a popular project and, in contrast
>> >>>> to Hadoop's two-stage disk-based MapReduce paradigm, Spark's
>> >>>> in-memory primitives provide performance up to 100 times faster for
>> >>>> certain applications [2]. There are also some alternatives to Apache
>> >>>> Spark, e.g. Apache Tez [3].
>> >>>>
>> >>>> When implementing an integration for Spark, we should consider
>> >>>> having an abstraction over such kinds of projects as an architectural
>> >>>> design, and there is a related issue for it: [4].
>> >>>>
>> >>>> There is another Apache project, named Apache Crunch [5], which aims
>> >>>> to provide a framework for writing, testing, and running MapReduce
>> >>>> pipelines. Its goal is to make pipelines that are composed of many
>> >>>> user-defined functions simple to write, easy to test, and efficient
>> >>>> to run. It is a high-level tool for writing data pipelines, as
>> >>>> opposed to developing against the MapReduce, Spark, or Tez APIs
>> >>>> directly [6].
>> >>>>
>> >>>> I would like to learn how Apache Crunch fits with creating a
>> >>>> multi-execution-engine layer for Gora [4]. What kind of benefits can
>> >>>> we get by integrating Apache Gora and Apache Crunch, and what gaps
>> >>>> would remain compared to developing a custom engine for our purpose?
>> >>>>
>> >>>> Kind Regards,
>> >>>> Furkan KAMACI
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/GORA-386
>> >>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
>> >>>> Shenker, Scott; Stoica, Ion (June 2013).
>> >>>> [3] http://tez.apache.org/
>> >>>> [4] https://issues.apache.org/jira/browse/GORA-418
>> >>>> [5] https://crunch.apache.org/
>> >>>> [6] https://crunch.apache.org/user-guide.html#motivation
>> >>>
>> >>> --
>> >>> Lewis
>> >
>> > --
>> > Lewis
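Henry's point above, that Spark already consumes Hadoop input formats and can therefore read through the Gora input format directly, might look roughly like the following. This is only a sketch under assumptions: it uses GoraInputFormat from Gora's mapreduce package, the WebPage bean from the Gora examples, and presumes a data store already configured via gora.properties; none of this code comes from the thread.

```java
import org.apache.gora.examples.generated.WebPage;  // assumed example bean
import org.apache.gora.mapreduce.GoraInputFormat;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GoraSparkSketch {
  public static void main(String[] args) throws Exception {
    Configuration hadoopConf = new Configuration();

    // Open whatever backend Gora is configured for (HBase, MongoDB, ...).
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, hadoopConf);

    // Describe the input via a Job so GoraInputFormat can pick it up.
    Job job = Job.getInstance(hadoopConf);
    GoraInputFormat.setInput(job, store.newQuery(), true);

    // Spark can read any Hadoop InputFormat; Gora rows become a pair RDD.
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("gora-spark-sketch").setMaster("local"));
    JavaPairRDD<String, WebPage> rows = sc.newAPIHadoopRDD(
        job.getConfiguration(), GoraInputFormat.class,
        String.class, WebPage.class);

    System.out.println("rows: " + rows.count());
    sc.stop();
  }
}
```

The "custom RDD" Henry mentions would hide the Job/InputFormat plumbing above behind a Gora-provided entry point.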
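The GORA-418 layering Furkan describes, an engine-neutral API with Spark, Tez, etc. as pluggable back ends, can be illustrated with a tiny self-contained sketch. Everything here is invented for illustration: the ExecutionEngine interface and InMemoryEngine class are hypothetical, mirroring Crunch's MemPipeline idea rather than any actual Gora API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of the GORA-418 idea: one engine-neutral API. */
interface ExecutionEngine {
  Map<String, Long> wordCount(List<String> lines);
}

/**
 * In-memory engine, analogous to Crunch's MemPipeline; a Spark- or
 * Tez-backed implementation would plug in behind the same interface.
 */
class InMemoryEngine implements ExecutionEngine {
  public Map<String, Long> wordCount(List<String> lines) {
    return lines.stream()
        .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
        .filter(word -> !word.isEmpty())
        .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
  }
}

public class EngineSketch {
  public static void main(String[] args) {
    ExecutionEngine engine = new InMemoryEngine();  // swap engines here
    Map<String, Long> counts = engine.wordCount(
        Arrays.asList("gora stores data", "spark processes data"));
    System.out.println(counts.get("data"));  // prints 2
  }
}
```

The word-count logic never mentions the engine, which is the flexibility Furkan describes: the same Gora-style job, run on whichever implementation is chosen.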
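The Crunch Pipeline description quoted above is the relevant comparison point for GORA-418: the engine choice is isolated in which Pipeline implementation you construct. A rough word-count sketch against Crunch's documented API (the class names are Crunch's; the input and output paths are made up, and this is not code from the thread):

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    // Swapping MRPipeline for MemPipeline or SparkPipeline changes the
    // execution engine without touching the pipeline logic below.
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

    PCollection<String> lines = pipeline.readTextFile("input.txt");  // made-up path
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, "output");  // made-up path
    pipeline.done();
  }
}
```

A Gora integration along these lines would mainly need Crunch sources/targets that read and write through Gora data stores instead of text files.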

