Re: Gora Spark Backend Support (GORA-386) and Apache Crunch

Furkan KAMACI Fri, 27 Mar 2015 10:22:08 -0700

Hi Henry,

I've submitted a proposal for Spark backend support. The key point is
that the Gora input format will gain an RDD transformation ability, so
one will be able to use it from either Hadoop Map/Reduce or Spark.
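
To illustrate the idea, here is a toy sketch. It is not Gora or Spark code
and not the GORA-386 design: every class and function name below
(ToyInputFormat, ToyRDD, to_rdd, run_map_reduce) is hypothetical, and the
"RDD" is an eager in-memory imitation. It only shows one record source being
consumed by a Map/Reduce-style runner and by an RDD-style chain.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]


class ToyInputFormat:
    """Hypothetical stand-in for a Gora input format: a source of (key, value) records."""

    def __init__(self, records: Dict[str, str]) -> None:
        self._records = dict(records)

    def records(self) -> Iterator[Record]:
        # Sorted only to make iteration order deterministic in this toy.
        return iter(sorted(self._records.items()))


class ToyRDD:
    """Hypothetical, eager imitation of an RDD; real Spark RDDs are lazy and distributed."""

    def __init__(self, items: Iterable) -> None:
        self._items = list(items)

    def flat_map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(out for item in self._items for out in fn(item))

    def reduce_by_key(self, fn: Callable) -> "ToyRDD":
        acc: dict = {}
        for key, value in self._items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return ToyRDD(acc.items())

    def collect_as_map(self) -> dict:
        return dict(self._items)


def to_rdd(source: ToyInputFormat) -> ToyRDD:
    """The 'RDD transformation ability': expose the same records as an RDD-like object."""
    return ToyRDD(source.records())


def run_map_reduce(source: ToyInputFormat, mapper: Callable, reducer: Callable) -> dict:
    """Toy Map/Reduce runner: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in source.records():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_map_reduce(source: ToyInputFormat) -> Dict[str, int]:
    # Word count written against the Map/Reduce-style runner.
    return run_map_reduce(
        source,
        mapper=lambda _key, text: [(word, 1) for word in text.split()],
        reducer=lambda _word, ones: sum(ones),
    )


def word_count_rdd(source: ToyInputFormat) -> Dict[str, int]:
    # The same word count written as an RDD-style chain over the same source.
    return (
        to_rdd(source)
        .flat_map(lambda record: [(word, 1) for word in record[1].split()])
        .reduce_by_key(lambda a, b: a + b)
        .collect_as_map()
    )


if __name__ == "__main__":
    source = ToyInputFormat({"doc1": "gora spark gora", "doc2": "spark tez"})
    print(word_count_map_reduce(source))
    print(word_count_rdd(source))
```

Both functions read from the same source object and should produce the same
counts; only the execution style differs, which is the flexibility described
above.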


On Tue, Mar 24, 2015 at 11:43 PM, Henry Saputra <[email protected]>
wrote:

> Hi Furkan,
>
> Yes, you are right. In the code execution for Spark or Flink, Gora
> should be part of the data ingest and storage.
>
> So, is the idea to make the data store in Spark access Gora instead
> of the default store options?
>
> - Henry
>
> On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI <[email protected]>
> wrote:
> > Hi Henry,
> >
> > So, as far as I understand, instead of wrapping Apache Spark within Gora
> > with full functionality, I have to wrap its functionality of storing and
> > accessing data. I mean one will use the Gora input/output format, and at
> > the backend it will be mapped to an RDD, so one will be able to run
> > Map/Reduce via Apache Spark etc. over Gora.
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <[email protected]>
> > wrote:
> >>
> >> Integration with Gora will mostly be in the data ingest portion of the
> >> flow.
> >>
> >> Distributed processing frameworks like Spark or Flink already
> >> support Hadoop input formats as data sources, so Gora should be able to
> >> be used directly with the Gora input format.
> >>
> >> The interesting portion is probably tighter integration, such as a custom
> >> RDD or a custom Data Manager to store and get data from Gora directly.
> >>
> >> - Henry
> >>
> >> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
> >> <[email protected]> wrote:
> >> > Henry mentored Crunch through incubation... Maybe he can give you more
> >> > context.
> >> > For me, Gora is essentially an extremely easy storage abstraction
> >> > framework.
> >> > I do not currently use the Query API, meaning that the analysis of data
> >> > is delegated to the Gora data store.
> >> > This is my current usage of the code base.
> >> >
> >> >
> >> > On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hi Lewis,
> >> >>
> >> >> I am talking in the context of GORA-418 and GORA-386, so we can say
> >> >> GSoC. I've talked with Talat about the design of that implementation.
> >> >> I just wanted to check whether any other projects have this kind of
> >> >> feature.
> >> >>
> >> >> Here is what I have in mind for Spark support in Apache Gora:
> >> >> developing a layer which abstracts the functionality of Spark, Tez,
> >> >> etc. (GORA-418). There will be an implementation for each of them
> >> >> (and Spark will be one of them: GORA-386).
> >> >>
> >> >> I.e. you will write a word count example in Gora style, pick one of
> >> >> the implementations, and run it (just like storing data in Solr or
> >> >> Mongo via Gora).
> >> >>
> >> >> When I checked Crunch, I noticed that:
> >> >>
> >> >> "Every Crunch job begins with a Pipeline instance that manages the
> >> >> execution lifecycle of your data pipeline. As of the 0.9.0 release,
> >> >> there are three implementations of the Pipeline interface:
> >> >>
> >> >> MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
> >> >> run locally or on a Hadoop cluster.
> >> >> MemPipeline: Executes a pipeline in-memory on the client.
> >> >> SparkPipeline: Executes a pipeline by running a series of Apache Spark
> >> >> jobs, either locally or on a Hadoop cluster."
> >> >>
> >> >> So, I am curious whether supporting Crunch may help us get what we
> >> >> want from Spark support in Gora. Actually, I am new to such projects;
> >> >> I want to learn what should be achieved with GORA-386 and not get
> >> >> lost because of overthinking :) As I see it, you can store your data
> >> >> Gora-style and run jobs Gora-style, while keeping the flexibility of
> >> >> using either HDFS, Solr, MongoDB, etc. or MapReduce, Spark, Tez, etc.
> >> >>
> >> >> PS: I know there is a similar issue in Apache Gora for Cascading
> >> >> support: https://issues.apache.org/jira/browse/GORA-112
> >> >>
> >> >> Kind Regards,
> >> >> Furkan KAMACI
> >> >>
> >> >> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
> >> >> <[email protected]> wrote:
> >> >>>
> >> >>> Hi Furkan,
> >> >>> In what context are we talking here?
> >> >>> GSoC or just development?
> >> >>> I am very keen to essentially work towards what we can release as
> >> >>> Gora 1.0.
> >> >>> Thank you Furkan
> >> >>>
> >> >>>
> >> >>> On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
> >> >>> wrote:
> >> >>>>
> >> >>>> As you know, there is an issue for integrating Apache Spark and
> >> >>>> Apache Gora [1]. Apache Spark is a popular project, and in contrast
> >> >>>> to Hadoop's two-stage disk-based MapReduce paradigm, Spark's
> >> >>>> in-memory primitives provide performance up to 100 times faster for
> >> >>>> certain applications [2]. There are also some alternatives to
> >> >>>> Apache Spark, e.g. Apache Tez [3].
> >> >>>>
> >> >>>> When implementing an integration for Spark, we should consider
> >> >>>> having an abstraction for such projects as part of the
> >> >>>> architectural design, and there is a related issue for it: [4].
> >> >>>>
> >> >>>> There is another Apache project, Apache Crunch [5], which aims to
> >> >>>> provide a framework for writing, testing, and running MapReduce
> >> >>>> pipelines. Its goal is to make pipelines that are composed of many
> >> >>>> user-defined functions simple to write, easy to test, and efficient
> >> >>>> to run. It is a high-level tool for writing data pipelines, as
> >> >>>> opposed to developing against the MapReduce, Spark, or Tez APIs
> >> >>>> directly [6].
> >> >>>>
> >> >>>> I would like to learn how Apache Crunch fits with creating a multi
> >> >>>> execution engine for Gora [4]. What kind of benefits can we get
> >> >>>> from integrating Apache Gora and Apache Crunch, and what kind of
> >> >>>> gaps would we still have, compared with developing a custom engine
> >> >>>> for our purpose?
> >> >>>>
> >> >>>> Kind Regards,
> >> >>>> Furkan KAMACI
> >> >>>>
> >> >>>> [1] https://issues.apache.org/jira/browse/GORA-386
> >> >>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
> >> >>>> Shenker, Scott; Stoica, Ion (June 2013).
> >> >>>> [3] http://tez.apache.org/
> >> >>>> [4] https://issues.apache.org/jira/browse/GORA-418
> >> >>>> [5] https://crunch.apache.org/
> >> >>>> [6] https://crunch.apache.org/user-guide.html#motivation
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Lewis
> >> >>>
> >> >>
> >> >
> >> >
> >> > --
> >> > Lewis
> >> >
> >
> >
>
