Re: Gora Spark Backend Support (GORA-386) and Apache Crunch

Hi Henry,

I've submitted a proposal for Spark backend support. The most important point is that the Gora input format will have an RDD transformation ability, so one will be able to access it from either Hadoop MapReduce or Spark.

On Tue, Mar 24, 2015 at 11:43 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Hi Furkan, yes, you are right. In the code execution for Spark or Flink, Gora should be part of the data ingest and storing. So, is the idea to make the data store in Spark access Gora instead of the default store options? - Henry

On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Henry, as far as I understand, instead of wrapping Apache Spark within Gora with full functionality, I have to wrap its functionality for storing and accessing data. I mean, one will use the Gora input/output format, and at the backend it will be mapped to an RDD, which will make it possible to run MapReduce via Apache Spark etc. over Gora. Kind Regards, Furkan KAMACI

On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra henry.sapu...@gmail.com wrote:

Integration with Gora will mostly be in the data ingest portion of the flow. Distributed processing frameworks like Spark or Flink already support the Hadoop input format as a data source, so Gora should be usable directly with the Gora input format. The interesting portion is probably tighter integration, such as a custom RDD or a custom Data Manager to store and get data from Gora directly. - Henry

On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Henry mentored Crunch through incubation... maybe he can give you more context. For me, Gora is essentially an extremely easy storage abstraction framework. I do not currently use the Query API, meaning that the analysis of data is delegated to the Gora data store. This is my current usage of the code base.

On Saturday, March 21, 2015, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Lewis, I am talking in the context of GORA-418 and GORA-386; we can say GSoC. I've talked with Talat about the design of that implementation. I just wanted to check other projects to see whether any of them has such a feature. Here is what I have in mind for Apache Gora Spark support: developing a layer which abstracts the functionality of Spark, Tez, etc. (GORA-418). There will be an implementation for each of them (and Spark will be one of them: GORA-386), i.e. you will write a word count example Gora-style, pick one of the implementations, and run it (just as you store data at Solr or MongoDB via Gora).

When I checked Crunch, I realized that every Crunch job begins with a Pipeline instance that manages the execution lifecycle of your data pipeline. As of the 0.9.0 release, there are three implementations of the Pipeline interface:

MRPipeline: executes a pipeline as a series of MapReduce jobs that can run locally or on a Hadoop cluster.
MemPipeline: executes a pipeline in-memory on the client.
SparkPipeline: executes a pipeline by running a series of Apache Spark jobs, either locally or on a Hadoop cluster.

So, I am curious whether supporting Crunch may give us what we want from Spark support in Gora. Actually, I am new to such projects; I want to learn what should be achieved with GORA-386 and not get lost by overthinking :) I see that you can use Gora for storing your data Gora-style and running jobs Gora-style, while keeping the flexibility of using either HDFS, Solr, MongoDB, etc., or MapReduce, Spark, Tez, etc. PS: I know there is a similar issue at Apache Gora for Cascading support: https://issues.apache.org/jira/browse/GORA-112 Kind Regards, Furkan KAMACI

On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Furkan, in what context are we talking here? GSoC or just development? I am very keen to essentially work towards what we can release as Gora 1.0. Thank you, Furkan.

On Saturday, March 21, 2015, Furkan KAMACI furkankam...@gmail.com wrote:

As you know, there is an issue for integrating Apache Spark and Apache Gora [1]. Apache Spark is a popular project, and in contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications [2]. There are also some alternatives to Apache Spark, e.g. Apache Tez [3]. When implementing an integration for Spark, having an abstraction over such projects should be considered as an architectural design, and there is a related issue for it [4]. There is another Apache project, Apache Crunch [5], which aims to provide a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. It is a high-level tool for writing data pipelines, as opposed to developing directly against the MapReduce, Spark, or Tez APIs [6]. I would like to learn how Apache Crunch fits with creating a multi-execution-engine layer for Gora [4], what kind of benefits we can get by integrating Apache Gora and Apache Crunch, and what kind of gaps we might still have compared to developing a custom engine for our purpose. Kind Regards, Furkan KAMACI

[1] https://issues.apache.org/jira/browse/GORA-386
[2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013).
[3] http://tez.apache.org/
[4] https://issues.apache.org/jira/browse/GORA-418
[5] https://crunch.apache.org/
[6] https://crunch.apache.org/user-guide.html#motivation

-- Lewis
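The GORA-418 idea discussed above (write a job once Gora-style, then pick an execution engine) follows the same pattern as Crunch's Pipeline interface: user code targets an abstraction, and concrete implementations delegate to MapReduce, Spark, Tez, or an in-memory runner. A minimal sketch of that pattern in plain Java, with every name here (`ExecutionEngine`, `InMemoryEngine`, `wordCount`) being hypothetical illustration rather than actual Gora or Crunch API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical engine abstraction in the spirit of GORA-418: user code is
// written once against this interface; concrete implementations would
// delegate to Spark, Tez, or MapReduce.
interface ExecutionEngine {
    Map<String, Integer> wordCount(List<String> lines);
}

// In-memory implementation, analogous in role to Crunch's MemPipeline:
// runs entirely on the client, useful for local testing.
class InMemoryEngine implements ExecutionEngine {
    @Override
    public Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}

public class WordCountDemo {
    public static void main(String[] args) {
        // Swapping in a Spark-backed implementation here would be the
        // GORA-386 deliverable; the calling code would not change.
        ExecutionEngine engine = new InMemoryEngine();
        Map<String, Integer> counts =
            engine.wordCount(Arrays.asList("gora spark gora", "spark"));
        System.out.println(counts.get("gora") + " " + counts.get("spark"));
    }
}
```

A Spark-backed implementation of the same interface would presumably build on Henry's observation that Spark already consumes Hadoop input formats: Spark's `newAPIHadoopRDD` can read any `InputFormat`, so Gora's input format should plug in there, with tighter integration (a custom RDD) as the more interesting follow-up.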