Hi Henry,

I've submitted a proposal for Spark backend support. The key point is that the Gora input format will be transformable into an RDD, so one will be able to access the data through either Hadoop Map/Reduce or Spark.
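
To illustrate the key point, here is a minimal sketch in plain Python rather than real Gora or Spark code. Every name in it (`InMemoryInputFormat`, `ToyRDD`, `run_map_reduce`, `word_count_both_ways`) is hypothetical and is not part of Gora's or Spark's API; it assumes only that an input format can be iterated as key/value records. It shows one record source feeding both a Map/Reduce-style runner and an RDD-style chain of transformations.

```python
from collections import defaultdict


class InMemoryInputFormat:
    """Hypothetical stand-in for a Gora input format: yields (key, value) records."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def records(self):
        # Sorted only to make the iteration order deterministic.
        return iter(sorted(self._rows.items()))


def run_map_reduce(source, mapper, reducer):
    """Map/Reduce-style path: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in source.records():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, vs) for k, vs in groups.items()}


class ToyRDD:
    """Hypothetical RDD-like wrapper over the same records (not Spark's RDD class)."""

    def __init__(self, items):
        self._items = list(items)

    @classmethod
    def from_input_format(cls, source):
        # The "RDD transformation ability": build the collection from the input format.
        return cls(source.records())

    def flat_map(self, fn):
        return ToyRDD(out for item in self._items for out in fn(item))

    def reduce_by_key(self, fn):
        acc = {}
        for key, value in self._items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return ToyRDD(acc.items())

    def collect_as_map(self):
        return dict(self._items)


def word_count_both_ways(rows):
    """Run the same word count through both paths over one input format."""
    source = InMemoryInputFormat(rows)
    mr = run_map_reduce(
        source,
        mapper=lambda _key, text: [(w, 1) for w in text.split()],
        reducer=lambda _word, ones: sum(ones),
    )
    rdd = (
        ToyRDD.from_input_format(source)
        .flat_map(lambda kv: [(w, 1) for w in kv[1].split()])
        .reduce_by_key(lambda a, b: a + b)
        .collect_as_map()
    )
    return mr, rdd
```

Calling `word_count_both_ways({"doc1": "gora spark gora", "doc2": "spark tez"})` returns two dictionaries that should agree, since both paths read the same records. In a real integration the in-memory class would presumably be replaced by the actual Gora input format and the toy wrapper by a Spark RDD.
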
On Tue, Mar 24, 2015 at 11:43 PM, Henry Saputra <[email protected]> wrote:
> Hi Furkan,
>
> Yes, you are right. In the code execution for Spark or Flink, Gora
> should be part of the data ingest and storage.
>
> So, is the idea to make the data store in Spark access Gora instead
> of the default store options?
>
> - Henry
>
> On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI <[email protected]>
> wrote:
> > Hi Henry,
> >
> > So, as far as I understand, instead of wrapping Apache Spark within Gora
> > with full functionality, I have to wrap its functionality for storing and
> > accessing data. I mean one will use the Gora input/output format, and at
> > the backend it will be mapped to an RDD and will be able to run Map/Reduce
> > via Apache Spark etc. over Gora.
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <[email protected]>
> > wrote:
> >>
> >> Integration with Gora will mostly be in the data ingest portion of the
> >> flow.
> >>
> >> Distributed processing frameworks like Spark or Flink already
> >> support Hadoop input formats as data sources, so Gora should be able to
> >> be used directly with the Gora input format.
> >>
> >> The interesting portion is probably tighter integration, such as a custom
> >> RDD or custom Data Manager to store and get data from Gora directly.
> >>
> >> - Henry
> >>
> >> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
> >> <[email protected]> wrote:
> >> > Henry mentored Crunch through incubation... Maybe he can tell you more
> >> > context.
> >> > For me, Gora is essentially an extremely easy storage abstraction
> >> > framework.
> >> > I do not currently use the Query API meaning that the analysis of data
> >> > is delegated to Gora data store.
> >> > This is my current usage of the code base.
> >> >
> >> >
> >> > On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hi Lewis,
> >> >>
> >> >> I am talking in the context of GORA-418 and GORA-386; we can say GSoC.
> >> >> I've talked with Talat about the design of that implementation. I just
> >> >> wanted to check whether any other projects have such a feature.
> >> >>
> >> >> Here is what I have in mind for Spark support in Apache Gora:
> >> >> developing a layer which abstracts the functionality of Spark, Tez, etc.
> >> >> (GORA-418). There will be an implementation for each of them (and Spark
> >> >> will be one of them: GORA-386).
> >> >>
> >> >> I.e. you will write a word count example Gora-style, pick one of the
> >> >> implementations, and run it (just like storing data in Solr or Mongo via
> >> >> Gora).
> >> >>
> >> >> When I checked Crunch I found this:
> >> >>
> >> >> "Every Crunch job begins with a Pipeline instance that manages the
> >> >> execution lifecycle of your data pipeline. As of the 0.9.0 release,
> >> >> there are three implementations of the Pipeline interface:
> >> >>
> >> >> MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
> >> >> run locally or on a Hadoop cluster.
> >> >> MemPipeline: Executes a pipeline in-memory on the client.
> >> >> SparkPipeline: Executes a pipeline by running a series of Apache Spark
> >> >> jobs, either locally or on a Hadoop cluster."
> >> >>
> >> >> So, I am curious whether supporting Crunch may help us achieve what we
> >> >> want with Spark support in Gora. Actually, I am new to such projects; I
> >> >> want to learn what should be achieved with GORA-386 and not get lost
> >> >> because of overthinking :) I see that you can use Gora for storing your
> >> >> data Gora-style and running jobs Gora-style, while having the
> >> >> flexibility of using either HDFS, Solr, MongoDB, etc. or MapReduce,
> >> >> Spark, Tez, etc.
> >> >>
> >> >> PS: I know there is a similar issue at Apache Gora for Cascading
> >> >> support:
> >> >> https://issues.apache.org/jira/browse/GORA-112
> >> >>
> >> >> Kind Regards,
> >> >> Furkan KAMACI
> >> >>
> >> >> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
> >> >> <[email protected]> wrote:
> >> >>>
> >> >>> Hi Furkan,
> >> >>> In what context are we talking here?
> >> >>> GSoC or just development?
> >> >>> I am very keen to essentially work towards what we can release as Gora
> >> >>> 1.0.
> >> >>> Thank you Furkan
> >> >>>
> >> >>>
> >> >>> On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
> >> >>> wrote:
> >> >>>>
> >> >>>> As you know, there is an issue for integrating Apache Spark and
> >> >>>> Apache Gora [1]. Apache Spark is a popular project, and in contrast to
> >> >>>> Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory
> >> >>>> primitives provide performance up to 100 times faster for certain
> >> >>>> applications [2]. There are also some alternatives to Apache Spark,
> >> >>>> e.g. Apache Tez [3].
> >> >>>>
> >> >>>> When implementing an integration for Spark, we should consider
> >> >>>> having an abstraction for such projects as an architectural design;
> >> >>>> there is a related issue for it [4].
> >> >>>>
> >> >>>> There is another Apache project, Apache Crunch [5], which aims to
> >> >>>> provide a framework for writing, testing, and running MapReduce
> >> >>>> pipelines. Its goal is to make pipelines that are composed of many
> >> >>>> user-defined functions simple to write, easy to test, and efficient to
> >> >>>> run. It is a high-level tool for writing data pipelines, as opposed to
> >> >>>> developing against the MapReduce, Spark, or Tez APIs directly [6].
> >> >>>>
> >> >>>> I would like to learn how Apache Crunch fits with creating a
> >> >>>> multi-execution engine for Gora [4]. What kind of benefits can we get
> >> >>>> from integrating Apache Gora and Apache Crunch, and what kind of gaps
> >> >>>> would we still have, compared with developing a custom engine for our
> >> >>>> purpose?
> >> >>>>
> >> >>>> Kind Regards,
> >> >>>> Furkan KAMACI
> >> >>>>
> >> >>>> [1] https://issues.apache.org/jira/browse/GORA-386
> >> >>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
> >> >>>> Shenker, Scott; Stoica, Ion (June 2013).
> >> >>>> [3] http://tez.apache.org/
> >> >>>> [4] https://issues.apache.org/jira/browse/GORA-418
> >> >>>> [5] https://crunch.apache.org/
> >> >>>> [6] https://crunch.apache.org/user-guide.html#motivation
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Lewis
> >> >>>
> >> >>
> >> >
> >> >
> >> > --
> >> > Lewis
> >> >
> >
> >
>

