Hi, I have been looking at the code around the fetcher and noticed something interesting. The fetcher::fetch method depends on a hard-coded list of URL schemes. This works, but it is quite restrictive. Hadoop/HDFS, by contrast, is very flexible about where it can fetch from: it supports a large number of URL schemes out of the box, and more can be enabled just by adding configuration to conf/hdfs-site.xml and core-site.xml.
What I am proposing is that we refactor fetcher.cpp to prefer HDFS (via hdfs/hdfs.hpp) for all fetching whenever HADOOP_HOME is set and $HADOOP_HOME/bin/hadoop is available. This logic already exists, so we can simply reuse it. The fallback logic that uses net::download or a local file copy may be left in place for installations that do not have Hadoop configured. This means that if Hadoop is present we can directly fetch URLs such as tachyon://..., snackfs://..., cfs://..., ftp://..., s3://..., http://..., and file://... with no extra effort, which makes for a much better experience in terms of debugging and extensibility. A rough sketch of the ordering I have in mind is below.
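The sketch is only illustrative: the helper names (hadoopAvailable, fetchWithHadoop, fetchFallback) are made up, and the real change would go through the existing HDFS client in hdfs/hdfs.hpp rather than shelling out to the hadoop CLI directly. It just shows the decision order I am proposing.

    // Illustrative sketch only -- not the actual fetcher.cpp code. The real
    // implementation would reuse hdfs/hdfs.hpp instead of std::system().
    #include <cstdlib>
    #include <iostream>
    #include <string>

    #include <sys/stat.h>

    // True if HADOOP_HOME is set and $HADOOP_HOME/bin/hadoop exists.
    static bool hadoopAvailable()
    {
      const char* home = std::getenv("HADOOP_HOME");
      if (home == NULL || *home == '\0') {
        return false;
      }

      struct stat s;
      const std::string hadoop = std::string(home) + "/bin/hadoop";
      return ::stat(hadoop.c_str(), &s) == 0;
    }

    // Delegate the copy to the hadoop client, which resolves the scheme
    // (hdfs://, s3://, ftp://, file://, ...) from core-site.xml/hdfs-site.xml.
    static bool fetchWithHadoop(const std::string& uri, const std::string& dest)
    {
      const std::string cmd = std::string(std::getenv("HADOOP_HOME")) +
        "/bin/hadoop fs -copyToLocal '" + uri + "' '" + dest + "'";
      return std::system(cmd.c_str()) == 0;
    }

    // Stand-in for the existing net::download / local copy code paths.
    static bool fetchFallback(const std::string& uri, const std::string& dest)
    {
      std::cerr << "Hadoop not configured, falling back for " << uri << std::endl;
      return false;  // Existing download/copy logic would run here.
    }

    bool fetch(const std::string& uri, const std::string& dest)
    {
      // Prefer Hadoop for every URI; only fall back when it is unavailable.
      if (hadoopAvailable()) {
        return fetchWithHadoop(uri, dest);
      }
      return fetchFallback(uri, dest);
    }

    int main(int argc, char** argv)
    {
      if (argc != 3) {
        std::cerr << "Usage: fetch <uri> <destination>" << std::endl;
        return 1;
      }
      return fetch(argv[1], argv[2]) ? 0 : 1;
    }

With this ordering, supporting a new scheme becomes a configuration change on the Hadoop side (registering a filesystem implementation for the scheme in core-site.xml) rather than a code change in the fetcher.

What do others think about this?

- Ankur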

