Oops! I replied too early in the morning for me!
You're not confused about close vs closeAll.

You're confused by the fact that FileSystem is a sort of hybrid singleton: by 
default it hands out a shared, cached instance with performance and memory in 
mind, but it lets you force a new instance for, say, multithreaded programs. 
Notice the "newInstance" in your second snippet, which is not in your first.

This is a trade-off between performance and conceptual clarity, which is often 
the hardest part of API design. I think HDFS did pretty well here: I/O will 
always be the bottleneck, especially with rotational media.

Take care,
  -stu



----- Reply message -----
From: "Koert Kuipers" <[email protected]>
To: <[email protected]>
Subject: fs cache giving me headaches
Date: Mon, Aug 6, 2012 10:32 am
---------- Forwarded message ----------
From: "Koert Kuipers" <[email protected]>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches

To:  <[email protected]>
nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable writes 
code like this:

final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all sorts of 
weird errors where FileSystems in unrelated code (sometimes not even my code) 
started misbehaving and streams were unexpectedly shut. Then i realized that 
FileSystem uses a cache and close() closes it for everyone! Not pretty in my 
opinion, but i can live with it. So i checked other code and found that 
basically nobody closes FileSystems. Apparently the expected way of using 
FileSystems is to simply never close them. So i adopted this approach (which i 
think is really contrary to java conventions for a Closeable).
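
The shared-instance effect described above can be reproduced with a small toy 
model. This is only a sketch: ToyFs and SharedCloseDemo are hypothetical names 
that stand in for a cached get() and an uncached newInstance(), and nothing 
here is Hadoop's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: ToyFs is a hypothetical stand-in, not Hadoop's FileSystem.
final class ToyFs implements AutoCloseable {
    private static final Map<String, ToyFs> CACHE = new HashMap<>();
    private final String uri;
    private volatile boolean open = true;

    private ToyFs(String uri) { this.uri = uri; }

    // Like a cached get(): every caller asking for the same URI shares one object.
    static synchronized ToyFs get(String uri) {
        return CACHE.computeIfAbsent(uri, ToyFs::new);
    }

    // Like newInstance(): always a private object that never enters the cache.
    static ToyFs newInstance(String uri) {
        return new ToyFs(uri);
    }

    boolean isOpen() { return open; }

    @Override
    public void close() {
        synchronized (ToyFs.class) {
            open = false;
            // Drop the cache entry only if it still points at this object.
            CACHE.remove(uri, this);
        }
    }
}

final class SharedCloseDemo {
    public static void main(String[] args) {
        ToyFs mine = ToyFs.get("hdfs://nn");
        ToyFs theirs = ToyFs.get("hdfs://nn");        // unrelated code, same object
        ToyFs priv = ToyFs.newInstance("hdfs://nn");  // private, uncached object
        mine.close();
        System.out.println(theirs.isOpen()); // false: closed for everyone
        System.out.println(priv.isOpen());   // true: unaffected
    }
}
```

In this model, closing the handle obtained through get() also closes it for the 
unrelated caller, while the handle from newInstance() stays usable.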



Lately i started working on some code for a daemon/server where many 
FileSystem objects are created for different users (UGIs) that use the 
service. As it turns out other projects have run into trouble with the 
FileSystem cache in situations like this (for example, Scribe and Hoop). I 
imagine the cache can get very large and cause problems (i have not tested this 
myself).
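
To picture why a per-user cache can grow in a long-running service, here is a 
toy model, assuming the cache key combines the filesystem URI with the 
requesting user. PerUserCache, Key and PerUserCacheDemo are hypothetical names 
for illustration, not Hadoop classes, and the numbers are not measurements.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy model only: these names are hypothetical, not Hadoop classes.
final class PerUserCache {
    // Assumed key shape: filesystem URI plus the requesting user.
    static final class Key {
        final String uri;
        final String user;

        Key(String uri, String user) {
            this.uri = uri;
            this.user = user;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return uri.equals(other.uri) && user.equals(other.user);
        }

        @Override
        public int hashCode() {
            return Objects.hash(uri, user);
        }
    }

    private final Map<Key, Object> entries = new HashMap<>();

    // Returns the cached handle for (uri, user), creating it on first use.
    Object get(String uri, String user) {
        return entries.computeIfAbsent(new Key(uri, user), k -> new Object());
    }

    int size() { return entries.size(); }
}

final class PerUserCacheDemo {
    public static void main(String[] args) {
        PerUserCache cache = new PerUserCache();
        for (int i = 0; i < 1000; i++) {
            cache.get("hdfs://nn", "user" + i);  // one entry per distinct user
        }
        System.out.println(cache.size()); // 1000, and nothing is ever evicted
    }
}
```

Because nothing is ever closed or evicted in this model, the entry count only 
goes up as new users arrive.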



Looking at the code for Hoop i noticed they simply turned off the FileSystem 
cache and made sure to close every FileSystem. So here the suggested approach 
to deal with FileSystems seems to be:

// or FileSystem.get(conf), but with caching turned off in the conf
final FileSystem fs = FileSystem.newInstance(conf);
try {
    // do something with fs
} finally {
    fs.close();
}


This code bypasses the cache if i understand it correctly, avoiding any cache 
size limitations. However if i adopt this approach i basically cannot re-use 
any existing code or libraries that do not close FileSystems, splitting the 
codebase into two, which is pretty ugly. And this code is not efficient in 
situations where there are very few used FileSystem objects and a cache would 
improve performance, so the split works both ways.



In short, there is no single way to code with FileSystem that works in both 
situations! Ideally i would have liked fs.close() to do the right thing 
depending on the settings: if the cache is turned off it closes the FileSystem, 
and if it is turned on it's a no-op. That way i could always use 
FileSystem.get(conf) and always close my filesystems, and the code would be 
usable irrespective of whether the cache is turned on or off.
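
The close() behaviour wished for above could look roughly like this. It is a 
hypothetical sketch, not an existing Hadoop API: PoliteFs is an invented name, 
and the cacheEnabled flag stands in for the cache setting in the conf.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the wished-for semantics; not an existing Hadoop API.
final class PoliteFs implements AutoCloseable {
    private static final Map<String, PoliteFs> CACHE = new HashMap<>();
    private final boolean cached;
    private volatile boolean open = true;

    private PoliteFs(boolean cached) { this.cached = cached; }

    // cacheEnabled stands in for the "cache on/off" setting in the conf.
    static synchronized PoliteFs get(String uri, boolean cacheEnabled) {
        if (!cacheEnabled) {
            return new PoliteFs(false);  // private instance owned by the caller
        }
        return CACHE.computeIfAbsent(uri, u -> new PoliteFs(true));
    }

    boolean isOpen() { return open; }

    @Override
    public void close() {
        if (cached) {
            return;    // shared instance: closing is a no-op
        }
        open = false;  // private instance: really release it
    }
}

final class PoliteFsDemo {
    public static void main(String[] args) {
        PoliteFs shared = PoliteFs.get("hdfs://nn", true);
        try (PoliteFs fs = PoliteFs.get("hdfs://nn", true)) {
            // do something with fs
        }
        System.out.println(shared.isOpen()); // true: other holders are unaffected
    }
}
```

With these semantics the same get-then-close pattern would be safe whether the 
cache is on or off, at the cost that a cached instance is never released by its 
callers.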



Any insights or suggestions? Thanks!
