Re: Python support
I created the following JIRAs:

https://issues.apache.org/jira/browse/APEXMALHAR-2260
https://issues.apache.org/jira/browse/APEXMALHAR-2261

On Wed, Sep 21, 2016 at 11:10 PM, Chinmay Kolhatkar wrote:

> I would like to help contribute to this feature.
Re: Python support
I would like to help contribute to this feature.

On Wed, Sep 21, 2016 at 12:26 AM, Sasha Parfenov wrote:

> +1 on both executing Python code in an operator and a high-level API for
> constructing pipelines in Python.
>
> There is a large user base of engineers and data scientists who use
> Python on a regular basis for crunching through big data. Providing them
> with a powerful new platform for big data processing, wrapped in a
> familiar language, will open Apex to a much broader user base and help
> grow the project.
>
> Given the potentially new user base of Python developers, it may make
> sense to prioritize the high-level API for pipeline construction. This
> will allow users to build simple applications with existing library
> operators, and we can get feedback on which areas they would like to see
> improved next: custom Python operator support or more built-in library
> operators.
>
> Thanks,
> Sasha
Re: Python support
+1 on both executing Python code in an operator and a high-level API for constructing pipelines in Python.

There is a large user base of engineers and data scientists who use Python on a regular basis for crunching through big data. Providing them with a powerful new platform for big data processing, wrapped in a familiar language, will open Apex to a much broader user base and help grow the project.

Given the potentially new user base of Python developers, it may make sense to prioritize the high-level API for pipeline construction. This will allow users to build simple applications with existing library operators, and we can get feedback on which areas they would like to see improved next: custom Python operator support or more built-in library operators.

Thanks,
Sasha

On Thu, Sep 15, 2016 at 2:06 PM, Thomas Weise wrote:

> Hi,
>
> Python (not Jython) seems to be a popular language and is frequently
> used for data analysis, especially where flexibility matters. It has a
> comprehensive library and is generally considered to have a low barrier
> to entry. I have also seen Python used in critical back-end components,
> although that's probably not very common?
>
> I think Python support could potentially expand the user base for Apex.
> There are 2 main areas that can be considered:
>
> 1) Support to execute Python code through an operator
> 2) A client API that lets users construct pipelines in Python
>
> The former can exist without the latter, and it would enable users to
> leverage existing code that otherwise would have to be rewritten in a
> JVM language. The engine could ship scripts/packages so they are
> automatically distributed on the cluster.
>
> A useful client API probably requires back-end support for lambda
> functions and more complex UDFs.
>
> It would be great to get some feedback, especially from those that have
> experience with Python, on how an integration could potentially open up
> new use cases for Apex.
>
> Thanks,
> Thomas
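To make the "high-level API for pipeline construction" idea a bit more concrete, here is a minimal sketch of what a fluent pipeline builder could look like on the Python side. Everything in it is an assumption made for illustration: the `Pipeline` class, its `map`/`filter`/`describe`/`run_local` methods, and the stage names are hypothetical and are not part of Apex or Malhar. The sketch only records a logical chain of stages and evaluates it in-process; how such a plan would be translated into an Apex DAG is left open.

```python
# Hypothetical sketch: none of these names exist in Apex or Malhar.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Pipeline:
    """Records a linear chain of named stages; it does not talk to a cluster."""
    stages: List[Tuple[str, str, Callable]] = field(default_factory=list)

    def map(self, fn: Callable, name: str) -> "Pipeline":
        # Remember a one-to-one transformation under the given stage name.
        self.stages.append((name, "map", fn))
        return self

    def filter(self, fn: Callable, name: str) -> "Pipeline":
        # Remember a predicate; tuples for which it is falsy are dropped.
        self.stages.append((name, "filter", fn))
        return self

    def describe(self) -> List[str]:
        """Return the logical plan as 'name:kind' strings, in order."""
        return [f"{name}:{kind}" for name, kind, _ in self.stages]

    def run_local(self, tuples):
        """Evaluate the chain in-process, e.g. to test the user logic."""
        out = list(tuples)
        for _, kind, fn in self.stages:
            if kind == "map":
                out = [fn(t) for t in out]
            else:
                out = [t for t in out if fn(t)]
        return out


pipeline = (
    Pipeline()
    .map(str.strip, name="trim")
    .filter(lambda line: line != "", name="dropEmpty")
    .map(str.upper, name="upper")
)
```

Calling `pipeline.describe()` yields the recorded plan, and `pipeline.run_local([" a ", "  ", "b"])` runs the same functions locally. Keeping plan construction separate from execution is one possible design; it would let the same Python description be handed to a Java-side launcher later.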
Re: Python support
+1 on this feature. We could use py4j, or communicate with a Python process through pipes, to run Python code from the JVM.

- Tushar.

On Fri, Sep 16, 2016 at 12:10 PM, Thomas Weise wrote:

> Jython is not a replacement for Python; it seems to be fairly limited. We
> would need the ability to run Python with all its libraries.
>
> Thomas
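As a rough illustration of the pipe-based option mentioned above, here is a minimal sketch of the Python side of such a bridge. It assumes a line-delimited JSON protocol (one tuple per line in, one result per line out), which is a choice made for this example and not something the thread specifies. The names `serve` and `word_count` are hypothetical. The JVM side, which would start the Python process and write to its stdin, is not shown.

```python
import io
import json
import sys


def serve(process, source=sys.stdin, sink=sys.stdout):
    """Read one JSON tuple per line, apply `process`, write one JSON result per line."""
    for line in source:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between tuples
        result = process(json.loads(line))
        sink.write(json.dumps(result) + "\n")
        sink.flush()  # flush per tuple so the parent process is not left waiting


def word_count(record):
    # Example user logic (hypothetical): count the words in a "text" field.
    return {"id": record["id"], "words": len(record["text"].split())}


# In-memory demonstration; a real worker would call serve(word_count)
# and exchange data with its parent process over stdin/stdout.
demo_in = io.StringIO('{"id": 1, "text": "hello apex world"}\n')
demo_out = io.StringIO()
serve(word_count, demo_in, demo_out)
```

Because the worker runs in a regular CPython process, the user function can import any installed library. The cost is serialization on every tuple, so batching several tuples per write would be worth measuring.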
Re: Python support
Jython is not a replacement for Python; it seems to be fairly limited. We would need the ability to run Python with all its libraries.

Thomas

On Thu, Sep 15, 2016 at 11:25 PM, David Yan wrote:

> At a very high level, we can build a Python framework in Apex by having a
> Python binding on our high-level API that generates Jython operators with
> the business logic written by users in Python, along with existing
> connectors.
>
> David
Re: Python support
At a very high level, we can build a Python framework in Apex by having a Python binding on our high-level API that generates Jython operators with the business logic written by users in Python, along with existing connectors.

David

On Sep 15, 2016 11:00 PM, "Chinmay Kolhatkar" wrote:

> Strongly +1 on this. One thing that shows this is useful for Apex is
> Hadoop Streaming, where Python is used to write map-reduce jobs. This
> would not only increase the reach in the development world but would also
> appeal to administrators who want to write an app, as they are usually
> familiar with Python.
>
> A few suggestions (in no specific order):
>
> 1. As part of supporting Python execution in operator code, we should
> allow the complete lifecycle of an operator to be specified from Python.
>
> 2. I would personally not worry about providing Python bindings for
> low-level Apex client APIs like addOperator, addStream, etc. If one has
> to use them, I think it's best to use the Java API, as most of the power
> of those low-level APIs can be leveraged there.
>
> 3. For client APIs, I would rather suggest we focus on high-level APIs
> like the Apex stream API (malhar-stream). We should provide a complete
> Python binding for them. Python is very useful when it comes to
> functional programming, and the Stream API provides exactly that.
>
> 4. Thinking at a very high level, I don't think we need any change in
> apex-core for this. This could be another project in Malhar itself. There
> are Python libraries like py4j, pyjnius, or JPype which allow access to
> Java objects from Python. Basically, we just need to establish the right
> bridge between the Java and Python VMs. We need to be thoughtful about
> performance, as these bridges across programming languages are costly.
>
> 5. We need to decide what code execution will look like. For example,
> should a py file be an alternative to Application.java in the package?
> This means the starting point is the Apex CLI, i.e. Java. Hence, instead
> of finding classes implementing StreamingApplication, apexcli needs to
> find the py file which defines the DAG. Or should the flow start with
> "__main__" of a Python file and end up in Java?
>
> 6. This might be too early, but it is important to emphasize that we need
> to plan for writing examples and documentation for the Python binding.
>
> -Chinmay.
Re: Python support
Strongly +1 on this. One thing that shows this is useful for Apex is Hadoop Streaming, where Python is used to write map-reduce jobs. This would not only increase the reach in the development world but would also appeal to administrators who want to write an app, as they are usually familiar with Python.

A few suggestions (in no specific order):

1. As part of supporting Python execution in operator code, we should allow the complete lifecycle of an operator to be specified from Python.

2. I would personally not worry about providing Python bindings for low-level Apex client APIs like addOperator, addStream, etc. If one has to use them, I think it's best to use the Java API, as most of the power of those low-level APIs can be leveraged there.

3. For client APIs, I would rather suggest we focus on high-level APIs like the Apex stream API (malhar-stream). We should provide a complete Python binding for them. Python is very useful when it comes to functional programming, and the Stream API provides exactly that.

4. Thinking at a very high level, I don't think we need any change in apex-core for this. This could be another project in Malhar itself. There are Python libraries like py4j, pyjnius, or JPype which allow access to Java objects from Python. Basically, we just need to establish the right bridge between the Java and Python VMs. We need to be thoughtful about performance, as these bridges across programming languages are costly.

5. We need to decide what code execution will look like. For example, should a py file be an alternative to Application.java in the package? This means the starting point is the Apex CLI, i.e. Java. Hence, instead of finding classes implementing StreamingApplication, apexcli needs to find the py file which defines the DAG. Or should the flow start with "__main__" of a Python file and end up in Java?

6. This might be too early, but it is important to emphasize that we need to plan for writing examples and documentation for the Python binding.

-Chinmay.

On Fri, Sep 16, 2016 at 2:36 AM, Thomas Weise wrote:

> Hi,
>
> Python (not Jython) seems to be a popular language and is frequently
> used for data analysis, especially where flexibility matters. It has a
> comprehensive library and is generally considered to have a low barrier
> to entry. I have also seen Python used in critical back-end components,
> although that's probably not very common?
>
> I think Python support could potentially expand the user base for Apex.
> There are 2 main areas that can be considered:
>
> 1) Support to execute Python code through an operator
> 2) A client API that lets users construct pipelines in Python
>
> The former can exist without the latter, and it would enable users to
> leverage existing code that otherwise would have to be rewritten in a
> JVM language. The engine could ship scripts/packages so they are
> automatically distributed on the cluster.
>
> A useful client API probably requires back-end support for lambda
> functions and more complex UDFs.
>
> It would be great to get some feedback, especially from those that have
> experience with Python, on how an integration could potentially open up
> new use cases for Apex.
>
> Thanks,
> Thomas
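To illustrate suggestion 1 above (specifying the complete operator lifecycle from Python), here is a minimal sketch of what a Python-side operator contract might look like. The callback names loosely mirror the Java operator callbacks (setup, beginWindow, endWindow, teardown) plus a per-tuple `process`; the `PythonOperator` base class, the `WindowedSum` example, and the `drive` function are all hypothetical and exist only for this example. `drive` is a stand-in for the engine and simply calls the callbacks in order.

```python
class PythonOperator:
    """Hypothetical base class; not an existing Apex or Malhar API."""

    def setup(self, context):
        pass

    def begin_window(self, window_id):
        pass

    def process(self, tup):
        raise NotImplementedError

    def end_window(self):
        pass

    def teardown(self):
        pass


class WindowedSum(PythonOperator):
    """Sums the tuples of each window and records one result per window."""

    def setup(self, context):
        self.emitted = []

    def begin_window(self, window_id):
        self.window_id = window_id
        self.total = 0

    def process(self, tup):
        self.total += tup

    def end_window(self):
        self.emitted.append((self.window_id, self.total))


def drive(operator, windows):
    """Stand-in for the engine: invoke the callbacks in lifecycle order."""
    operator.setup(context={})
    for window_id, tuples in enumerate(windows):
        operator.begin_window(window_id)
        for tup in tuples:
            operator.process(tup)
        operator.end_window()
    operator.teardown()
    return operator.emitted


result = drive(WindowedSum(), [[1, 2, 3], [10], []])
```

A bridge such as py4j or a pipe protocol would then only need to forward these five calls from the Java operator to the Python object, which keeps the cross-language surface small.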