On Wed, 3 Jun 2009, Senthil Kumaran wrote:

> Hello Twisted Developers/Users,
>
> This is my first concurrent application design and my first trial with
> twisted. I have read the documentation and understand where twisted
> plays its part. Unfortunately, I could not directly relate it to my
> requirements and hence, could not go forward with designing and
> building on top using the examples as a reference.
>
> I need your guidance in helping me design an application.
>
> My Application Details:
>
> 1) I need to constantly monitor a particular directory for new files.
> 2) Whenever a new file is dropped; I read that file and get
> information on where to collect data from that is a) another machine b)
> machine2-different method c) database.
> 3) I collect data from those machines and store it.
>
> The data is huge and I need the three processes a, b, c to be
> non-blocking, and I can just do a function call like do_a(), do_b(),
> do_c() to perform them.
>
> For 1) to constantly monitor a particular directory for new files, I
> am doing something like this:
>
> while True:
>     check_for_new_files()
>

This is not an issue specifically related to Python or Twisted, but
there is a very serious synchronization issue that needs to be
addressed with this application design. (Trust me, I've seen this
issue come up dozens of times in over 30 years of experience...)

Creating a file and loading it with data is not an atomic operation.
It takes a significant amount of time, and if the process attempting
to read the file is faster than the process writing it, it won't see
all (or any of) the data. It can work great while testing and then
fall over the first time it is used in production, or it can work fine
for years before mysteriously breaking.

There are several ways to cope with this situation:

1) If the system allows you to create temporary invisible files and
   only makes the file visible when it is cleanly closed, you can use
   this method. However, this is often not portable. Not all operating
   systems, languages, FTP or SFTP servers, etc. support such a
   facility.

2) Create the file using a method that disallows reading of the file
   while it is still open by the creator. Make the reader process wait
   until it can get read access to the file before processing it.
   (Sometimes this can be done by making the reader process request
   exclusive write access to the file, even though it doesn't intend
   to write to it.) This is also not particularly portable, and may
   require the reading process to spin or wait-loop, either wasting
   resources or delaying processing by half the wait time on average.

3) Create the file in another directory and then move it to the target
   directory when it is complete. The reading process will only see it
   after the move is complete. However, such an operation isn't always
   atomic, or even possible. I think "mv" on most Unix systems is
   atomic if both directories are on the same file system, but if the
   directories are on different file systems, it copies the file and
   then deletes the original file.
   This could work fine for years and then break when someone decides
   to move directories around for some reason.

4) Create the file with a temporary file name, for example
   "foo-YYYYMMDD-SEQ.tmp", and then, after it is created and fully
   populated, rename it to "foo-YYYYMMDD-SEQ.dat". Make the reading
   process only look for files named "*.dat", ignoring the "*.tmp"
   files. I don't know of any operating system where renaming a file
   is not an atomic operation, but I suppose such might exist. There
   could conceivably be a small window when the file system has
   created a directory entry for the .dat filename, but hasn't yet
   linked the filename to the file. Though if this is possible, one
   could argue that this is an O/S bug and demand the O/S vendor fix
   it. (Or fix it yourself if it's a self-maintained O/S or file
   system...)

5) After creating the file, create a flag file (empty or with minimal,
   unimportant contents.) For example, if the data file is named
   "foo-YYYYMMDD-SEQ#.dat", after creating it, create a flag file
   named "foo-YYYYMMDD-SEQ#.flag". Have the reading process look only
   for flag files (they could even be in a separate directory to avoid
   clutter.) When a flag file appears, process the corresponding data
   file. This method is very portable and is *almost* bullet proof.
   The exceptions I have seen have almost all been when someone didn't
   understand the importance of the flag file and created it first.
   Aside from just doing it in the wrong order, I've seen cases where
   they triggered two parallel processes to create the files and the
   flag file, being much smaller, got created first; and cases where
   they created all the files in a local directory (on another
   system) and then FTP'ed them to the target system/directory using
   a wild-card file name, which unfortunately caused the flag file to
   get sent first.
   (It may have had an alphabetically earlier name than the data file,
   or the FTP client may have transferred files in a random order or
   one based on the inode or file ID or other non-obvious file
   attribute.) In these cases, the cure was to explicitly transfer the
   data file and then the flag file in the correct order. (We once
   encountered an issue where an FTP client may have been "optimizing"
   transfers either by doing them in parallel, or by sending small
   files first, and broke this scheme, but that was only a theory we
   had while trying to diagnose the problem, and may not have been
   what was actually happening.)

I don't know of any scheme that is absolutely foolproof unless you
control both the file creation and file reading sides of things, but
scheme 5 (flag files) seems to work best in practice.

Sorry I can't help with the Python/Twisted specifics, but I'm too much
of a newbie to be very useful with that.

> http://paste.pocoo.org/show/120824/
>
> My Question: Can this be designed in way that looking for new files is
> also asynchronous activity?
>
> What will be the deferred in this case?
>
> # my ideas:
>
> - I might define a deferred as, whenever the contents of the directory
> is not matching the previous contents, return the new file which was
> added.
> - I can then add a callback to read the newfile.
>
>
> Now, after reading the contents, I will have to do a non-blocking call
> to fetch data, either using fun_a, fun_b or fun_b. How should I
> associate this requirement to deferred/callback pattern?
>
> Any guidance would be helpful.
>
> Thanks,
> Senthil
>
>
> _______________________________________________
> Twisted-Python mailing list
> Twisted-Python@twistedmatrix.com
> http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
>

-- 
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539
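
Schemes 4 and 5 above could be sketched in Python roughly as follows.
This is only a minimal sketch under stated assumptions: the writer and
the reader share a single directory, and the function names
(publish_by_rename, publish_with_flag, ready_data_files) are
hypothetical names chosen for illustration, not part of Twisted or of
the original poster's code. Only standard-library calls are used.

```python
import os
from pathlib import Path


def publish_by_rename(directory, name, data):
    """Scheme 4 (sketch): write under a .tmp name, rename to .dat when done."""
    directory = Path(directory)
    tmp_path = directory / (name + ".tmp")
    final_path = directory / (name + ".dat")
    with open(tmp_path, "wb") as f:
        f.write(data)
    # The rename stays inside one directory, so it never turns into a
    # copy-then-delete across file systems.
    os.replace(tmp_path, final_path)
    return final_path


def publish_with_flag(directory, name, data):
    """Scheme 5 (sketch): write the data file fully, THEN create the flag."""
    directory = Path(directory)
    data_path = directory / (name + ".dat")
    flag_path = directory / (name + ".flag")
    with open(data_path, "wb") as f:
        f.write(data)
    # Order matters: the flag file is created only after the data file
    # has been written and closed.
    flag_path.touch()
    return data_path


def ready_data_files(directory):
    """Reader side of scheme 5: data files that have a matching flag file."""
    directory = Path(directory)
    return sorted(
        flag.with_suffix(".dat")
        for flag in directory.glob("*.flag")
        if flag.with_suffix(".dat").exists()
    )
```

A reader for scheme 4 would instead list "*.dat" and ignore "*.tmp".
In either case the reader would call its listing function each time
it polls the directory and would have to remember (or delete) what it
has already processed; that bookkeeping is left out of the sketch.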