Re: Gora Spark Backend Support (GORA-386) and Apache Crunch

2015-03-21 Thread Lewis John Mcgibbney
Hi Furkan,
In what context are we talking here?
GSoC or just development?
I am very keen to work towards what we can release as Gora 1.0.
Thank you Furkan

On Saturday, March 21, 2015, Furkan KAMACI furkankam...@gmail.com wrote:

 As you know, there is an issue for integrating Apache Spark with Apache
 Gora [1]. Apache Spark is a popular project and, in contrast to Hadoop's
 two-stage disk-based MapReduce paradigm, Spark's in-memory primitives
 provide performance up to 100 times faster for certain applications [2].
 There are also alternatives to Apache Spark, e.g. Apache Tez [3].

 When implementing the Spark integration, we should consider designing an
 abstraction that covers this whole class of execution engines. There is a
 related issue for that: [4].

 There is another Apache project, Apache Crunch [5], which provides a
 framework for writing, testing, and running MapReduce pipelines.
 Its goal is to make pipelines that are composed of many user-defined
 functions simple to write, easy to test, and efficient to run. It is a
 high-level tool for writing data pipelines, as opposed to developing
 directly against the MapReduce, Spark, or Tez APIs [6].

 I would like to learn how Apache Crunch fits with creating a multi-engine
 execution layer for Gora [4]. What benefits would we get from integrating
 Apache Gora with Apache Crunch, and what gaps would remain compared with
 developing a custom engine for our purpose?

 Kind Regards,
 Furkan KAMACI

 [1] https://issues.apache.org/jira/browse/GORA-386
 [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker,
 Scott; Stoica, Ion (June 2013).
 [3] http://tez.apache.org/
 [4] https://issues.apache.org/jira/browse/GORA-418
 [5] https://crunch.apache.org/
 [6] https://crunch.apache.org/user-guide.html#motivation



-- 
*Lewis*


Re: Gora Spark Backend Support (GORA-386) and Apache Crunch

2015-03-21 Thread Lewis John Mcgibbney
Henry mentored Crunch through incubation... Maybe he can give you more
context.
For me, Gora is essentially an extremely easy storage abstraction
framework. I do not currently use the Query API, meaning that the analysis
of data is delegated to the Gora data store.
This is my current usage of the code base.

On Saturday, March 21, 2015, Furkan KAMACI furkankam...@gmail.com wrote:

 Hi Lewis,

 I am talking in the context of GORA-418 and GORA-386, so we can say GSoC.
 I've talked with Talat about the design of that implementation. I just
 wanted to check whether any other projects already offer this kind of
 feature.

 Here is what I have in mind for Spark support in Apache Gora: developing a
 layer which abstracts the functionality of Spark, Tez, etc. (GORA-418).
 There will be an implementation for each of them (and Spark will be one of
 them: GORA-386).

 For example, you will write a word count example in Gora style, pick one
 of the implementations, and run it (just like storing data in Solr or
 MongoDB via Gora).
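 To make the layering idea concrete, here is a minimal plain-JDK sketch.
 Everything in it is hypothetical: ExecutionEngine, InMemoryEngine, and
 countByKey are names invented for illustration and are not part of Gora,
 Spark, Tez, or Crunch. The in-memory engine only stands in for a real
 Spark- or Tez-backed implementation; the point is that the word count
 itself never mentions a concrete engine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Engine-agnostic "user code": depends only on the abstraction below. */
class WordCountSketch {
    static Map<String, Long> wordCount(ExecutionEngine engine, List<String> lines) {
        // Split each line on whitespace and let the engine count the words.
        return engine.countByKey(lines, line -> Arrays.asList(line.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // Swapping the engine would be a one-line change here.
        ExecutionEngine engine = new InMemoryEngine();
        List<String> lines = Arrays.asList("gora spark gora", "tez spark gora");
        System.out.println(wordCount(engine, lines)); // prints {gora=3, spark=2, tez=1}
    }
}

/** Hypothetical abstraction layer (the GORA-418 idea); not a real Gora API. */
interface ExecutionEngine {
    /** Extracts zero or more keys from each record and counts each key. */
    <T, K extends Comparable<K>> Map<K, Long> countByKey(
            List<T> records, Function<T, List<K>> extractKeys);
}

/** Single-JVM stand-in for an engine-specific implementation (e.g. Spark). */
class InMemoryEngine implements ExecutionEngine {
    @Override
    public <T, K extends Comparable<K>> Map<K, Long> countByKey(
            List<T> records, Function<T, List<K>> extractKeys) {
        Map<K, Long> counts = new TreeMap<>(); // sorted keys give stable output
        for (T record : records) {
            for (K key : extractKeys.apply(record)) {
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

 A Spark-backed implementation would presumably implement the same
 interface on top of Spark's own primitives, which is where the real design
 work of GORA-386 would lie.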

 When I looked at Crunch, I found this:

 *Every Crunch job begins with a Pipeline instance that manages the
 execution lifecycle of your data pipeline. As of the 0.9.0 release, there
 are three implementations of the Pipeline interface:*

 *MRPipeline: Executes a pipeline as a series of MapReduce jobs that can
 run locally or on a Hadoop cluster.*
 *MemPipeline: Executes a pipeline in-memory on the client.*
 *SparkPipeline: Executes a pipeline by running a series of Apache Spark
 jobs, either locally or on a Hadoop cluster.*

 So I am curious whether supporting Crunch would help us achieve what we
 want from Spark support in Gora. Actually, I am new to such projects; I
 want to learn what should be achieved with GORA-386 and not get lost
 because of overthinking :) As I see it, you can store your data
 Gora-style and run jobs Gora-style while keeping the flexibility of using
 either HDFS, Solr, MongoDB, etc. or MapReduce, Spark, Tez, etc.

 PS: I know there is a similar issue at Apache Gora for Cascading support:
 https://issues.apache.org/jira/browse/GORA-112

 Kind Regards,
 Furkan KAMACI

-- 
*Lewis*