Hi Henry,

So, as far as I understand, instead of wrapping Apache Spark within Gora with full functionality, I only have to wrap its functionality for storing and accessing data. I mean, one will use Gora's input/output format, at the backend it will be mapped to an RDD, and one will be able to run MapReduce via Apache Spark etc. over Gora.
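To make the mapping concrete, here is a pseudocode-level sketch (not a working implementation) of what I imagine: Spark can already consume any Hadoop InputFormat, so Gora's GoraInputFormat could feed an RDD directly. The WebPage class, the tokenize() helper, and the elided configuration are assumptions for illustration only:

```java
// Sketch only -- the Gora query/data store wiring is elided, and the
// WebPage persistent class is just an example from the Gora tutorial.
SparkConf sparkConf = new SparkConf().setAppName("gora-wordcount");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Configuration hadoopConf = new Configuration();
// ... configure the Gora data store and query on hadoopConf here ...

// Spark can consume any Hadoop InputFormat, so GoraInputFormat yields an
// RDD of (key, persistent object) pairs without a custom RDD class.
JavaPairRDD<String, WebPage> pages =
    sc.newAPIHadoopRDD(hadoopConf, GoraInputFormat.class,
                       String.class, WebPage.class);

// From here, ordinary Spark transformations apply, e.g. a word count:
JavaPairRDD<String, Integer> counts = pages
    .flatMap(pair -> tokenize(pair._2()))      // tokenize() is hypothetical
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
```

If that is roughly right, the GORA-386 work would mostly be about making the configuration step above convenient, rather than about a custom RDD.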
Kind Regards,
Furkan KAMACI

On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <[email protected]> wrote:
> Integration with Gora will mostly be in the data ingest portion of the flow.
>
> Distributed processing frameworks like Spark or Flink already support
> Hadoop input formats as data sources, so Gora should be usable directly
> via the Gora input format.
>
> The interesting portion is probably tighter integration, such as a custom
> RDD or a custom Data Manager to store and get data from Gora directly.
>
> - Henry
>
> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
> > Henry mentored Crunch through incubation... Maybe he can tell you more
> > context.
> > For me, Gora is essentially an extremely easy storage abstraction
> > framework.
> > I do not currently use the Query API, meaning that the analysis of data
> > is delegated to the Gora data store.
> > This is my current usage of the code base.
> >
> >
> > On Saturday, March 21, 2015, Furkan KAMACI <[email protected]> wrote:
> >>
> >> Hi Lewis,
> >>
> >> I am talking in the context of GORA-418 and GORA-386; we can say GSoC.
> >> I've talked with Talat about the design of that implementation. I just
> >> wanted to check other projects to see whether any of them has such a
> >> feature.
> >>
> >> Here is what is in my mind for Apache Gora's Spark support: developing
> >> a layer which abstracts the functionality of Spark, Tez, etc.
> >> (GORA-418). There will be implementations for each of them (and Spark
> >> will be one of them: GORA-386).
> >>
> >> I.e., you will write a word count example in Gora style, use one of
> >> the implementations, and run it (just like storing data at Solr or
> >> Mongo via Gora).
> >>
> >> When I checked Crunch I realized that:
> >>
> >> "Every Crunch job begins with a Pipeline instance that manages the
> >> execution lifecycle of your data pipeline.
> >> As of the 0.9.0 release,
> >> there are three implementations of the Pipeline interface:
> >>
> >> MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
> >> run locally or on a Hadoop cluster.
> >> MemPipeline: Executes a pipeline in-memory on the client.
> >> SparkPipeline: Executes a pipeline by running a series of Apache Spark
> >> jobs, either locally or on a Hadoop cluster."
> >>
> >> So, I am curious: could supporting Crunch help us achieve what we want
> >> with Spark support in Gora? Actually, I am new to such projects; I want
> >> to learn what should be achieved with GORA-386 and not get lost because
> >> of overthinking :) I see that you can use Gora for storing your data in
> >> Gora style and running jobs in Gora style, while having the flexibility
> >> of using either HDFS, Solr, MongoDB, etc., or MapReduce, Spark, Tez,
> >> etc.
> >>
> >> PS: I know there is a similar issue at Apache Gora for Cascading
> >> support: https://issues.apache.org/jira/browse/GORA-112
> >>
> >> Kind Regards,
> >> Furkan KAMACI
> >>
> >> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
> >> <[email protected]> wrote:
> >>>
> >>> Hi Furkan,
> >>> In what context are we talking here?
> >>> GSoC or just development?
> >>> I am very keen to essentially work towards what we can release as
> >>> Gora 1.0.
> >>> Thank you Furkan
> >>>
> >>>
> >>> On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
> >>> wrote:
> >>>>
> >>>> As you know, there is an issue for integrating Apache Spark and
> >>>> Apache Gora [1]. Apache Spark is a popular project and, in contrast
> >>>> to Hadoop's two-stage disk-based MapReduce paradigm, Spark's
> >>>> in-memory primitives provide performance up to 100 times faster for
> >>>> certain applications [2]. There are also some alternatives to
> >>>> Apache Spark, e.g. Apache Tez [3].
> >>>>
> >>>> When implementing an integration for Spark, having an abstraction
> >>>> over such kinds of projects should be considered as an architectural
> >>>> design, and there is a related issue for it: [4].
> >>>>
> >>>> There is another Apache project, Apache Crunch [5], which aims to
> >>>> provide a framework for writing, testing, and running MapReduce
> >>>> pipelines. Its goal is to make pipelines that are composed of many
> >>>> user-defined functions simple to write, easy to test, and efficient
> >>>> to run. It is a high-level tool for writing data pipelines, as
> >>>> opposed to developing against the MapReduce, Spark, or Tez APIs
> >>>> directly [6].
> >>>>
> >>>> I would like to learn how Apache Crunch fits with creating a
> >>>> multi-execution engine for Gora [4]. What kind of benefits can we
> >>>> get from integrating Apache Gora and Apache Crunch, and what kind
> >>>> of gaps might we still have, compared to developing a custom engine
> >>>> for our purpose?
> >>>>
> >>>> Kind Regards,
> >>>> Furkan KAMACI
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/GORA-386
> >>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
> >>>> Shenker, Scott; Stoica, Ion (June 2013).
> >>>> [3] http://tez.apache.org/
> >>>> [4] https://issues.apache.org/jira/browse/GORA-418
> >>>> [5] https://crunch.apache.org/
> >>>> [6] https://crunch.apache.org/user-guide.html#motivation
> >>>
> >>>
> >>>
> >>> --
> >>> Lewis
> >>>
> >>
> >
> >
> > --
> > Lewis
> >
>

