Hi Wei-Chiu,

You can look at Dask [1]. It works with HDFS [2] and also integrates well with YARN [3].
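
For the original question about a simple MapReduce example in Python, here is a rough sketch of a word count in the Hadoop Streaming style, where the mapper and reducer are plain programs that read lines from stdin and write tab-separated key/value lines to stdout. The file name wc.py, the function names and the map/reduce command-line argument are my own choices for illustration, not something Hadoop prescribes.

```python
#!/usr/bin/env python3
"""Word count sketch for Hadoop Streaming (hypothetical file name: wc.py)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts per word.

    Assumes the input is already sorted by key, which is what the
    shuffle/sort phase (or a local `sort`) provides to the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Only act as a streaming script when called as `wc.py map` or `wc.py reduce`.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

You can try it locally without a cluster using `cat input.txt | python3 wc.py map | sort | python3 wc.py reduce`. On a cluster it would be passed to the hadoop-streaming jar through the `-files`, `-mapper`, `-reducer`, `-input` and `-output` options. The jar location depends on your installation, so check the Hadoop Streaming documentation for the exact command.
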

1 - https://dask.org
2 - http://docs.dask.org/en/latest/remote-data-services.html
3 - http://yarn.dask.org/en/latest/

Thanks,
Hari

On Sun, 16 Jun 2019, 23:31 Wei-Chiu Chuang, <weic...@apache.org> wrote:

> Thanks Artem,
> Looks interesting. I honestly didn't know what Hadoop Streaming API is
> used for.
> Here are more references:
> https://hadoop.apache.org/docs/r3.2.0/hadoop-streaming/HadoopStreaming.html
>
> I think it brings up another question: how do we treat Python as a first
> class citizen. Especially for data science use cases, Python is *the*
> language.
> For example, we have Java and C and (in Hadoop 3.2) C++ client for HDFS.
> But Hadoop does not ship a Python client.
> I see a number of Python libraries that support webhdfs. It's not clear to
> me how well they perform, and if they support more advanced features like
> encryption/Kerberos.
>
> NFS gateway is a possibility. Fuse-dfs is another option. But we know they
> don't work at scale, and the community seems to have lost the steam to
> improve NFS/fuse-dfs.
>
> Thoughts?
>
> On Sun, Jun 16, 2019 at 6:52 AM Artem Ervits <artemerv...@gmail.com>
> wrote:
>
>>
>> https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
>>
>> On Sun, Jun 16, 2019, 9:18 AM Mike IT Expert <mikeitexp...@gmail.com>
>> wrote:
>>
>>> Please let me know where I can find a good/simple example of mapreduce
>>> Python code running on Hadoop. Like tutorial or sth.
>>>
>>> Thank you