Hi Henry,

So, as far as I understand, instead of wrapping Apache Spark within Gora with full functionality, I only have to wrap its functionality for storing and accessing data. I mean, one will use Gora's input/output format, at the backend it will be mapped to an RDD, and one will be able to run MapReduce via Apache Spark etc. over Gora.
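To make the mapping concrete, here is a pseudocode-level sketch (not a working implementation) of what I imagine: Spark can already consume any Hadoop InputFormat, so Gora's GoraInputFormat could feed an RDD directly. The WebPage class, the tokenize() helper, and the elided configuration are assumptions for illustration only:

```java
// Sketch only -- the Gora query/data store wiring is elided, and the
// WebPage persistent class is just an example from the Gora tutorial.
SparkConf sparkConf = new SparkConf().setAppName("gora-wordcount");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Configuration hadoopConf = new Configuration();
// ... configure the Gora data store and query on hadoopConf here ...

// Spark can consume any Hadoop InputFormat, so GoraInputFormat yields an
// RDD of (key, persistent object) pairs without a custom RDD class.
JavaPairRDD<String, WebPage> pages =
    sc.newAPIHadoopRDD(hadoopConf, GoraInputFormat.class,
                       String.class, WebPage.class);

// From here, ordinary Spark transformations apply, e.g. a word count:
JavaPairRDD<String, Integer> counts = pages
    .flatMap(pair -> tokenize(pair._2()))      // tokenize() is hypothetical
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
```

If that is roughly right, the GORA-386 work would mostly be about making the configuration step above convenient, rather than about a custom RDD.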
Kind Regards,
Furkan KAMACI

On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <[email protected]> wrote:
> Integration with Gora will mostly be in the data ingest portion of the flow.
>
> Distributed processing frameworks like Spark or Flink already support
> Hadoop input formats as data sources, so Gora should be usable directly
> via the Gora input format.
>
> The interesting portion is probably tighter integration, such as a custom
> RDD or a custom Data Manager to store and get data from Gora directly.
>
> - Henry
>
> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
> > Henry mentored Crunch through incubation... Maybe he can tell you more
> > context.
> > For me, Gora is essentially an extremely easy storage abstraction
> > framework.
> > I do not currently use the Query API, meaning that the analysis of data
> > is delegated to the Gora data store.
> > This is my current usage of the code base.
> >
> >
> > On Saturday, March 21, 2015, Furkan KAMACI <[email protected]> wrote:
> >>
> >> Hi Lewis,
> >>
> >> I am talking in the context of GORA-418 and GORA-386; we can say GSoC.
> >> I've talked with Talat about the design of that implementation. I just
> >> wanted to check other projects to see whether any of them has such a
> >> feature.
> >>
> >> Here is what is in my mind for Apache Gora's Spark support: developing
> >> a layer which abstracts the functionality of Spark, Tez, etc.
> >> (GORA-418). There will be implementations for each of them (and Spark
> >> will be one of them: GORA-386).
> >>
> >> I.e., you will write a word count example in Gora style, use one of
> >> the implementations, and run it (just like storing data at Solr or
> >> Mongo via Gora).
> >>
> >> When I checked Crunch I realized that:
> >>
> >> "Every Crunch job begins with a Pipeline instance that manages the
> >> execution lifecycle of your data pipeline.
> >> As of the 0.9.0 release,
> >> there are three implementations of the Pipeline interface:
> >>
> >> MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
> >> run locally or on a Hadoop cluster.
> >> MemPipeline: Executes a pipeline in-memory on the client.
> >> SparkPipeline: Executes a pipeline by running a series of Apache Spark
> >> jobs, either locally or on a Hadoop cluster."
> >>
> >> So, I am curious: could supporting Crunch help us achieve what we want
> >> with Spark support in Gora? Actually, I am new to such projects; I want
> >> to learn what should be achieved with GORA-386 and not get lost because
> >> of overthinking :) I see that you can use Gora for storing your data in
> >> Gora style and running jobs in Gora style, while having the flexibility
> >> of using either HDFS, Solr, MongoDB, etc., or MapReduce, Spark, Tez,
> >> etc.
> >>
> >> PS: I know there is a similar issue at Apache Gora for Cascading
> >> support: https://issues.apache.org/jira/browse/GORA-112
> >>
> >> Kind Regards,
> >> Furkan KAMACI
> >>
> >> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
> >> <[email protected]> wrote:
> >>>
> >>> Hi Furkan,
> >>> In what context are we talking here?
> >>> GSoC or just development?
> >>> I am very keen to essentially work towards what we can release as
> >>> Gora 1.0.
> >>> Thank you Furkan
> >>>
> >>>
> >>> On Saturday, March 21, 2015, Furkan KAMACI <[email protected]>
> >>> wrote:
> >>>>
> >>>> As you know, there is an issue for integrating Apache Spark and
> >>>> Apache Gora [1]. Apache Spark is a popular project and, in contrast
> >>>> to Hadoop's two-stage disk-based MapReduce paradigm, Spark's
> >>>> in-memory primitives provide performance up to 100 times faster for
> >>>> certain applications [2]. There are also some alternatives to
> >>>> Apache Spark, e.g. Apache Tez [3].
> >>>>
> >>>> When implementing an integration for Spark, having an abstraction
> >>>> over such kinds of projects should be considered as an architectural
> >>>> design, and there is a related issue for it: [4].
> >>>>
> >>>> There is another Apache project, Apache Crunch [5], which aims to
> >>>> provide a framework for writing, testing, and running MapReduce
> >>>> pipelines. Its goal is to make pipelines that are composed of many
> >>>> user-defined functions simple to write, easy to test, and efficient
> >>>> to run. It is a high-level tool for writing data pipelines, as
> >>>> opposed to developing against the MapReduce, Spark, or Tez APIs
> >>>> directly [6].
> >>>>
> >>>> I would like to learn how Apache Crunch fits with creating a
> >>>> multi-execution engine for Gora [4]. What kind of benefits can we
> >>>> get from integrating Apache Gora and Apache Crunch, and what kind
> >>>> of gaps might we still have, compared to developing a custom engine
> >>>> for our purpose?
> >>>>
> >>>> Kind Regards,
> >>>> Furkan KAMACI
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/GORA-386
> >>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
> >>>> Shenker, Scott; Stoica, Ion (June 2013).
> >>>> [3] http://tez.apache.org/
> >>>> [4] https://issues.apache.org/jira/browse/GORA-418
> >>>> [5] https://crunch.apache.org/
> >>>> [6] https://crunch.apache.org/user-guide.html#motivation
> >>>
> >>>
> >>>
> >>> --
> >>> Lewis
> >>>
> >>
> >
> >
> > --
> > Lewis
> >
>

