Hey Tom,

On Mar 16, 2012, at 6:57 AM, Thomas Bennett wrote:

> Hi,
>
> I have a few questions about data transfer and thought I would roll it into
> one email:
>
> 1) Local and remote data transfer with the same file manager
> • I see that when configuring a cas-crawler, one specifies the data
> transfer factory by using --clientTransferer
> • However in etc/filemgr.properties the data transfer factory is
> specified with filemgr.datatransfer.factory.
> Does this mean that if I specify a local transfer factory I cannot use a
> crawler with a remote data transferer?

Basically it means that there are two ways to configure data transfer.

If you are using a Crawler, the crawler handles client-side transfer to the FM server. You can configure Local, Remote, or InPlace transfer at the moment, or roll your own client-side transfer and pass it via the crawler command line or config.

Local means that the source and dest file paths need to be visible from the crawler's machine (or at least "appear" that way: a common pattern here is to use a distributed file system like HDFS or GlusterFS to virtualize local disk and mount it at a global virtual root. That way, even though the data itself is distributed, to the Crawler and thus to LocalDataTransfer it looks like it's all on the same path).

Remote means that the dest path can live on a different host, and that the client will work with the file manager server to chunk and transfer that data (via XML-RPC) from the client to the server.

InPlace means that no data transfer will occur at all. The Data Transferers have an acute coupling with the Versioner scheme; case in point: if you are doing InPlaceTransfer, you need a Versioner that will handle file paths that don't change from src to dest.

If instead you configure a data transfer in filemgr.properties and don't use the crawler directly, but, e.g., use the XmlRpcFileManagerClient directly, you can tell the server (on the ingest(...) method) to handle all the file transfers for you. In that case the server needs a Data Transferer configured, and the above options apply, with the caveat that the FM server is now the "client" that is transferring the data to itself :)
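For concreteness, the two knobs end up looking something like this (the factory class names below are the stock ones from the filemgr datatransfer package as I remember them; double-check them against your deployment):

# server-side transfer, in etc/filemgr.properties
filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory

# client-side transfer, passed on the crawler launcher command line
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

The two settings are independent: --clientTransferer governs how the crawler pushes files to the FM, while filemgr.datatransfer.factory governs what the FM server uses when a client asks it to do the transfer itself.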
> I'm wanting to cater for a situation where files could be ingested locally as
> well as remotely using a single file manager. Is this possible?

Sure can. One way to do this is to write a Facade Java class, e.g., MultiTransferer, that can, say on a per-product-type basis, decide whether to delegate to LocalDataTransfer or RemoteDataTransfer. If written in a configurable way, that would be an awesome addition to the OODT code base. We could call it ProductTypeDelegatingDataTransfer.
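To make that concrete, here is a rough, untested sketch of what I have in mind. It assumes the DataTransfer interface roughly as it stands (setFileManagerUrl / transferProduct / retrieveProduct; adjust to whatever methods your version declares), and it hard-codes the product-type-to-transferer mapping that a properly configurable version would read from a properties file:

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory;
import org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class ProductTypeDelegatingDataTransfer implements DataTransfer {

  // product type name -> transferer; "RemoteRawData" is just an example name,
  // a configurable version would load this map from a properties file
  private final Map<String, DataTransfer> delegates = new HashMap<String, DataTransfer>();
  private final DataTransfer defaultTransfer = new LocalDataTransferFactory().createDataTransfer();

  public ProductTypeDelegatingDataTransfer() {
    delegates.put("RemoteRawData", new RemoteDataTransferFactory().createDataTransfer());
  }

  public void setFileManagerUrl(URL url) {
    // every delegate needs to know where the FM lives
    defaultTransfer.setFileManagerUrl(url);
    for (DataTransfer dt : delegates.values()) {
      dt.setFileManagerUrl(url);
    }
  }

  public void transferProduct(Product product) throws DataTransferException, IOException {
    delegateFor(product).transferProduct(product);
  }

  public void retrieveProduct(Product product, File directory) throws DataTransferException, IOException {
    delegateFor(product).retrieveProduct(product, directory);
  }

  // pick the transferer registered for this product's type, else fall back to local
  private DataTransfer delegateFor(Product product) {
    DataTransfer dt = delegates.get(product.getProductType().getName());
    return dt != null ? dt : defaultTransfer;
  }
}

You'd pair it with a small ProductTypeDelegatingDataTransferFactory (implementing DataTransferFactory) so it can be named via --clientTransferer or filemgr.datatransfer.factory just like the stock transferers.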
> 2) Copy an ingested product to a backup archive
>
> For backup (and access purposes), I'm wanting to ingest the product into an
> off-site archive (at our main engineering office) with its own separate
> catalogue.
> What is the recommended way of doing this?

One way to do it is to simply stand up a file manager (and catalog) at the remote site, and then do remote data transfer (and met transfer) to take care of it. As long as your XML-RPC ports are open, both the data and the metadata can be backed up using the same ingestion mechanisms. You could wire that up as a Workflow task that runs periodically, or as part of your standard ingest pipeline (e.g., a Crawler action that, on postIngestSuccess, backs up to the remote site by ingesting into the remote backup file manager). I'd be happy to help you down either path.
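If you went the Crawler action route, the action itself would be pretty small. A rough sketch, assuming the CrawlerAction/StdIngester APIs as I remember them (treat the exact signatures as approximate), with the backup FM URL hard-coded for illustration where a real one would expose it as a bean property in the crawler's action configuration:

import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.filemgr.ingest.StdIngester;
import org.apache.oodt.cas.metadata.Metadata;

public class BackupSiteIngestAction extends CrawlerAction {

  // placeholder: the backup site's file manager
  private String backupFmUrl = "http://backup.example.com:9000";

  public boolean performAction(File product, Metadata metadata)
      throws CrawlerActionException {
    try {
      // remote transfer, since the backup FM lives on another host
      StdIngester ingester = new StdIngester(
          "org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory");
      ingester.ingest(new URL(backupFmUrl), product, metadata);
      return true;
    } catch (Exception e) {
      throw new CrawlerActionException("Backup ingest failed: " + e.getMessage());
    }
  }
}

Register it in the crawler's action beans and attach it to the postIngestSuccess phase, and every successful local ingest gets mirrored to the backup site.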
> The way I currently do this is by replicating the files using rsync (but I'm
> then left with finding a way to update the catalogue). I was wondering if
> there was a neater (more OODT) solution?

I think a good solution might be to run a remote Backup File Manager and just ingest into it again. Another option would be to use the File Manager ExpImpCatalog tool to replicate the metadata out to your remote site, and then to rsync the files. That way you get files + met.

> I was thinking of perhaps using the functionality described in OODT-84 (Ability
> for File Manager to stage an ingested Product to one of its clients) and then
> having a second crawler on the backup archive which will then update its own
> catalogue.

+1, that would work too!

> I just thought I would ask the question in case anyone has tried something
> similar.

Let me know what you think of the above and we'll work it out!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++