Hey Tom,

On Mar 16, 2012, at 6:57 AM, Thomas Bennett wrote:

> Hi,
>
> I have a few questions about data transfer and thought I would roll it into
> one email:
>
> 1) Local and remote data transfer with the same file manager
> • I see that when configuring a cas-crawler, one specifies the data
> transfer factory by using --clientTransferer
> • However in etc/filemgr.properties the data transfer factory is
> specified with filemgr.datatransfer.factory.
> Does this mean that if I specify a local transfer factory I cannot use a
> crawler with a remote data transferer?

Basically it means that there are two ways to configure data transfer.

If you are using a Crawler, the crawler handles client-side transfer to the FM server. You can configure Local, Remote, or InPlace transfer at the moment, or roll your own client-side transfer and pass it via the crawler command line or config.

Local means that the source and dest file paths need to be visible from the crawler's machine (or at least "appear" that way: a common pattern here is to use a distributed file system like HDFS or GlusterFS to virtualize local disk and mount it at a global virtual root. That way, even though the data itself is distributed, to the Crawler and thus to LocalDataTransfer it looks like it's all on the same path).

Remote means that the dest path can live on a different host, and that the client will work with the file manager server to chunk and transfer that data (via XML-RPC) from the client to the server.

InPlace means that no data transfer will occur at all. The Data Transferers have an acute coupling with the Versioner scheme; case in point: if you are doing InPlaceTransfer, you need a Versioner that will handle file paths that don't change from src to dest.

If instead you configure a data transfer in filemgr.properties and don't use the crawler directly, but, e.g., use the XmlRpcFileManagerClient directly, you can tell the server (on the ingest(...) method) to handle all the file transfers for you. In that case the server needs a Data Transferer configured, and the above options apply, with the caveat that the FM server is now the "client" that is transferring the data to itself :)
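For concreteness, the two knobs end up looking something like this (the factory class names below are the stock ones from the filemgr datatransfer package as I remember them; double-check them against your deployment):

# server-side transfer, in etc/filemgr.properties
filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory

# client-side transfer, passed on the crawler launcher command line
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

The two settings are independent: --clientTransferer governs how the crawler pushes files to the FM, while filemgr.datatransfer.factory governs what the FM server uses when a client asks it to do the transfer itself.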
> I'm wanting to cater for a situation where files could be ingested locally as
> well as remotely using a single file manager. Is this possible?

Sure can. One way to do this is to write a Facade Java class, e.g., MultiTransferer, that can, say on a per-product-type basis, decide whether to delegate to LocalDataTransfer or RemoteDataTransfer. If written in a configurable way, that would be an awesome addition to the OODT code base. We could call it ProductTypeDelegatingDataTransfer.
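To make that concrete, here is a rough, untested sketch of what I have in mind. It assumes the DataTransfer interface roughly as it stands (setFileManagerUrl / transferProduct / retrieveProduct; adjust to whatever methods your version declares), and it hard-codes the product-type-to-transferer mapping that a properly configurable version would read from a properties file:

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory;
import org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class ProductTypeDelegatingDataTransfer implements DataTransfer {

  // product type name -> transferer; "RemoteRawData" is just an example name,
  // a configurable version would load this map from a properties file
  private final Map<String, DataTransfer> delegates = new HashMap<String, DataTransfer>();
  private final DataTransfer defaultTransfer = new LocalDataTransferFactory().createDataTransfer();

  public ProductTypeDelegatingDataTransfer() {
    delegates.put("RemoteRawData", new RemoteDataTransferFactory().createDataTransfer());
  }

  public void setFileManagerUrl(URL url) {
    // every delegate needs to know where the FM lives
    defaultTransfer.setFileManagerUrl(url);
    for (DataTransfer dt : delegates.values()) {
      dt.setFileManagerUrl(url);
    }
  }

  public void transferProduct(Product product) throws DataTransferException, IOException {
    delegateFor(product).transferProduct(product);
  }

  public void retrieveProduct(Product product, File directory) throws DataTransferException, IOException {
    delegateFor(product).retrieveProduct(product, directory);
  }

  // pick the transferer registered for this product's type, else fall back to local
  private DataTransfer delegateFor(Product product) {
    DataTransfer dt = delegates.get(product.getProductType().getName());
    return dt != null ? dt : defaultTransfer;
  }
}

You'd pair it with a small ProductTypeDelegatingDataTransferFactory (implementing DataTransferFactory) so it can be named via --clientTransferer or filemgr.datatransfer.factory just like the stock transferers.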
> 2) Copy an ingested product to a backup archive
>
> For backup (and access purposes), I'm wanting to ingest the product into an
> off-site archive (at our main engineering office) with its own separate
> catalogue.
> What is the recommended way of doing this?

One way to do it is to simply stand up a file manager (and catalog) at the remote site, and then do remote data transfer (and met transfer) to take care of it. As long as your XML-RPC ports are open, both the data and the metadata can be backed up using the same ingestion mechanisms. You could wire that up as a Workflow task that runs periodically, or as part of your standard ingest pipeline (e.g., a Crawler action that, on postIngestSuccess, backs up to the remote site by ingesting into the remote backup file manager). I'd be happy to help you down either path.
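If you went the Crawler action route, the action itself would be pretty small. A rough sketch, assuming the CrawlerAction/StdIngester APIs as I remember them (treat the exact signatures as approximate), with the backup FM URL hard-coded for illustration where a real one would expose it as a bean property in the crawler's action configuration:

import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.filemgr.ingest.StdIngester;
import org.apache.oodt.cas.metadata.Metadata;

public class BackupSiteIngestAction extends CrawlerAction {

  // placeholder: the backup site's file manager
  private String backupFmUrl = "http://backup.example.com:9000";

  public boolean performAction(File product, Metadata metadata)
      throws CrawlerActionException {
    try {
      // remote transfer, since the backup FM lives on another host
      StdIngester ingester = new StdIngester(
          "org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory");
      ingester.ingest(new URL(backupFmUrl), product, metadata);
      return true;
    } catch (Exception e) {
      throw new CrawlerActionException("Backup ingest failed: " + e.getMessage());
    }
  }
}

Register it in the crawler's action beans and attach it to the postIngestSuccess phase, and every successful local ingest gets mirrored to the backup site.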
> The way I currently do this is by replicating the files using rsync (but I'm
> then left with finding a way to update the catalogue). I was wondering if
> there was a neater (more OODT) solution?

I think a good solution might be to run a remote Backup File Manager and just ingest into it again. Another option would be to use the File Manager ExpImpCatalog tool to replicate the metadata out to your remote site, and then to rsync the files. That way you get files + met.

> I was thinking of perhaps using the functionality described in OODT-84 (Ability
> for File Manager to stage an ingested Product to one of its clients) and then
> having a second crawler on the backup archive which will then update its own
> catalogue.

+1, that would work too!

> I just thought I would ask the question in case anyone has tried something
> similar.

Let me know what you think of the above and we'll work it out!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++