Hi,

I have been looking at the code around the fetcher and noticed something 
interesting. The fetcher::fetch method depends on a hard-coded list of URL 
schemes. This works, but it is quite restrictive. Hadoop/HDFS, by contrast, 
is very flexible about the URLs it can fetch from: it supports a large 
number of URL types out of the box and can be extended with additional ones 
through configuration in conf/hdfs-site.xml and core-site.xml.

What I am proposing is that we refactor fetcher.cpp to prefer HDFS (via 
hdfs/hdfs.hpp) for all fetching whenever HADOOP_HOME is set and 
$HADOOP_HOME/bin/hadoop is available. The logic for detecting this already 
exists, so we can simply reuse it. The fallback to net::download or a local 
file copy can be left in place for installations that do not have Hadoop 
configured. With this change, if Hadoop is present we can directly fetch 
URLs such as tachyon://... snackfs:// ... cfs:// .... ftp://... s3://... 
http:// ... file:// with no extra effort. That makes for a much better 
experience in terms of both debugging and extensibility.
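
To make the idea concrete, here is a rough sketch of what the dispatch in 
fetcher.cpp could look like. This is only an illustration under some 
assumptions: it shells out to "hadoop fs -copyToLocal" rather than going 
through the hdfs/hdfs.hpp wrapper, and the helper names (hadoopAvailable, 
fetchWithHadoop, fallbackFetch) are made up for the sketch; the real change 
would reuse the existing HDFS and net::download / local copy code.

    #include <cstdlib>
    #include <string>
    #include <sys/stat.h>

    // Returns true if HADOOP_HOME is set and $HADOOP_HOME/bin/hadoop exists.
    static bool hadoopAvailable()
    {
      const char* home = std::getenv("HADOOP_HOME");
      if (home == nullptr || *home == '\0') {
        return false;
      }
      struct stat s;
      return ::stat((std::string(home) + "/bin/hadoop").c_str(), &s) == 0;
    }

    // Let the Hadoop client resolve the scheme (hdfs://, s3://, ftp://,
    // tachyon://, ...) based on its own core-site.xml / hdfs-site.xml
    // configuration. Sketch only: shells out to the Hadoop CLI.
    static int fetchWithHadoop(const std::string& uri, const std::string& dest)
    {
      const std::string command =
        std::string(std::getenv("HADOOP_HOME")) +
        "/bin/hadoop fs -copyToLocal '" + uri + "' '" + dest + "'";
      return std::system(command.c_str());
    }

    // Hypothetical stand-in for the existing net::download / local copy path.
    static int fallbackFetch(const std::string& uri, const std::string& dest)
    {
      (void) uri;
      (void) dest;
      return -1;
    }

    int fetch(const std::string& uri, const std::string& dest)
    {
      if (hadoopAvailable()) {
        if (fetchWithHadoop(uri, dest) == 0) {
          return 0;
        }
        // Fall through to the existing logic if Hadoop could not fetch it.
      }
      return fallbackFetch(uri, dest);
    }

The point is just that the scheme handling moves out of our code and into 
Hadoop's configuration, while installations without Hadoop keep the current 
behaviour.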

What do others think about this?

- Ankur
