Re: Gora Spark Backend Support (GORA-386) and Apache Crunch

Hi Henry,

I've submitted a proposal for Spark backend support. The most important point is that the Gora input format will have an RDD transformation ability, so one will be able to access it from either Hadoop MapReduce or Spark.

On Tue, Mar 24, 2015 at 11:43 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Hi Furkan, yes, you are right. In the code execution for Spark or Flink, Gora should be part of the data ingest and storing. So, is the idea to make the data store in Spark access Gora instead of the default store options? - Henry

On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Henry, as far as I understand, instead of wrapping Apache Spark within Gora with full functionality, I have to wrap its functionality for storing and accessing data. I mean, one will use the Gora input/output format, and at the backend it will be mapped to an RDD, which will make it possible to run MapReduce via Apache Spark etc. over Gora. Kind Regards, Furkan KAMACI

On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Integration with Gora will mostly be in the data ingest portion of the flow. Distributed processing frameworks like Spark or Flink already support the Hadoop input format as a data source, so Gora should be usable directly with the Gora input format. The interesting portion is probably tighter integration, such as a custom RDD or a custom Data Manager to store and get data from Gora directly. - Henry

On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Henry mentored Crunch through incubation... maybe he can give you more context. For me, Gora is essentially an extremely easy storage abstraction framework. I do not currently use the Query API, meaning that the analysis of data is delegated to the Gora data store. This is my current usage of the code base.

On Saturday, March 21, 2015, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Lewis, I am talking in the context of GORA-418 and GORA-386; we can say GSoC. I've talked with Talat about the design of that implementation. I just wanted to check other projects to see whether any of them has such a feature. Here is what I have in mind for Apache Gora Spark support: developing a layer which abstracts the functionality of Spark, Tez, etc. (GORA-418). There will be an implementation for each of them (and Spark will be one of them: GORA-386), i.e. you will write a word count example Gora-style, pick one of the implementations, and run it (just as you store data at Solr or MongoDB via Gora).

When I checked Crunch, I realized that every Crunch job begins with a Pipeline instance that manages the execution lifecycle of your data pipeline. As of the 0.9.0 release, there are three implementations of the Pipeline interface:

MRPipeline: executes a pipeline as a series of MapReduce jobs that can run locally or on a Hadoop cluster.
MemPipeline: executes a pipeline in-memory on the client.
SparkPipeline: executes a pipeline by running a series of Apache Spark jobs, either locally or on a Hadoop cluster.

So, I am curious whether supporting Crunch may give us what we want from Spark support in Gora. Actually, I am new to such projects; I want to learn what should be achieved with GORA-386 and not get lost by overthinking :) I see that you can use Gora for storing your data Gora-style and running jobs Gora-style, while keeping the flexibility of using either HDFS, Solr, MongoDB, etc., or MapReduce, Spark, Tez, etc. PS: I know there is a similar issue at Apache Gora for Cascading support: https://issues.apache.org/jira/browse/GORA-112 Kind Regards, Furkan KAMACI

On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Furkan, in what context are we talking here? GSoC or just development? I am very keen to essentially work towards what we can release as Gora 1.0. Thank you, Furkan.

On Saturday, March 21, 2015, Furkan KAMACI furkankam...@gmail.com wrote:

As you know, there is an issue for integrating Apache Spark and Apache Gora [1]. Apache Spark is a popular project, and in contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications [2]. There are also some alternatives to Apache Spark, e.g. Apache Tez [3]. When implementing an integration for Spark, having an abstraction over such projects should be considered as an architectural design, and there is a related issue for it [4]. There is another Apache project, Apache Crunch [5], which aims to provide a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. It is a high-level tool for writing data pipelines, as opposed to developing directly against the MapReduce, Spark, or Tez APIs [6]. I would like to learn how Apache Crunch fits with creating a multi-execution-engine layer for Gora [4], what kind of benefits we can get by integrating Apache Gora and Apache Crunch, and what kind of gaps we might still have compared to developing a custom engine for our purpose. Kind Regards, Furkan KAMACI

[1] https://issues.apache.org/jira/browse/GORA-386
[2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013).
[3] http://tez.apache.org/
[4] https://issues.apache.org/jira/browse/GORA-418
[5] https://crunch.apache.org/
[6] https://crunch.apache.org/user-guide.html#motivation

-- Lewis
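The GORA-418 idea discussed above (write a job once Gora-style, then pick an execution engine) follows the same pattern as Crunch's Pipeline interface: user code targets an abstraction, and concrete implementations delegate to MapReduce, Spark, Tez, or an in-memory runner. A minimal sketch of that pattern in plain Java, with every name here (`ExecutionEngine`, `InMemoryEngine`, `wordCount`) being hypothetical illustration rather than actual Gora or Crunch API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical engine abstraction in the spirit of GORA-418: user code is
// written once against this interface; concrete implementations would
// delegate to Spark, Tez, or MapReduce.
interface ExecutionEngine {
    Map<String, Integer> wordCount(List<String> lines);
}

// In-memory implementation, analogous in role to Crunch's MemPipeline:
// runs entirely on the client, useful for local testing.
class InMemoryEngine implements ExecutionEngine {
    @Override
    public Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}

public class WordCountDemo {
    public static void main(String[] args) {
        // Swapping in a Spark-backed implementation here would be the
        // GORA-386 deliverable; the calling code would not change.
        ExecutionEngine engine = new InMemoryEngine();
        Map<String, Integer> counts =
            engine.wordCount(Arrays.asList("gora spark gora", "spark"));
        System.out.println(counts.get("gora") + " " + counts.get("spark"));
    }
}
```

A Spark-backed implementation of the same interface would presumably build on Henry's observation that Spark already consumes Hadoop input formats: Spark's `newAPIHadoopRDD` can read any `InputFormat`, so Gora's input format should plug in there, with tighter integration (a custom RDD) as the more interesting follow-up.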