Hey Tom,
AWESOME. I smell Wiki page :)
Read on below:
On Mar 19, 2012, at 8:18 PM, Thomas Bennett wrote:
>
> Versioner schemes
>
> The Data Transferers are tightly coupled to the Versioner scheme. Case in
> point: if you are doing InPlaceTransfer, you need a versioner that will
> handle file paths that don't change from src to dest.
>
> The Versioner is used to describe how a target directory is created for a
> file to archive, i.e., a directory structure where the data will be placed. So
> if I have an archive root at /var/kat/archive/data/ and I use a basic
> versioner it will archive a file called 1234567890.h5 at
> /var/kat/archive/data/1234567890.h5/1234567890.h5. So this would describe the
> destination for a local data transfer.
>
> I have the following versioner set in my policy/product-types.xml.
>
> policy/product-types.xml
> <versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
Ah, gotcha. You may consider using the MetadataBasedFileVersioner. It lets you
define a filePathSpec,
e.g., /[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]
and then versions, or "places," the resulting product files into that directory
structure.
To create the above, you would simply subclass the Versioner like so:
public class KATVersioner extends MetadataBasedFileVersioner {
    private String filePathSpec =
        "/[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]";

    public KATVersioner() {
        setFilePathSpec(filePathSpec);
    }
}
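For intuition, the filePathSpec expansion is just metadata-key substitution. Here is a self-contained sketch of the idea (my own illustration, not the actual MetadataBasedFileVersioner code -- the class name and helper are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathSpecDemo {

    // Expand [Key] placeholders in a file path spec using a metadata map,
    // roughly the way a metadata-based versioner lays out archive paths.
    static String expand(String spec, Map<String, String> met) {
        Matcher m = Pattern.compile("\\[([^\\]]+)\\]").matcher(spec);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String val = met.getOrDefault(m.group(1), "");
            m.appendReplacement(sb, Matcher.quoteReplacement(val));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> met = new HashMap<>();
        met.put("PrincipalInvestigator", "tbennett");
        met.put("Project", "KAT");
        met.put("AcquisitionDate", "2012-03-19");
        met.put("Filename", "1331871808.h5");
        System.out.println(expand(
            "/[PrincipalInvestigator]/[Project]/[AcquisitionDate]/[Filename]",
            met));
        // prints /tbennett/KAT/2012-03-19/1331871808.h5
    }
}
```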
You can even refer to keys that don't exist yet, and then dynamically generate
them (and their values) by overriding the createDataStoreReferences method:
@Override
public void createDataStoreReferences(Product product, Metadata met)
        throws VersioningException {
    // do work to generate acqdate (the AcquisitionDate value) here
    met.replaceMetadata("AcquisitionDate", acqdate);
    super.createDataStoreReferences(product, met);
}
>
> Just out of curiosity... why is this called a versioner?
Hehe, if it's weird in OODT, it most likely resulted from me :) I originally
saw
this as a great tool to "version", or allow for, multiple copies of a file on
disk, e.g., with different file (or directory-based) metadata to delineate the
versions. Over time it
really grew to be a
"URIGenerationScheme" or "ArchivePathGenerator". Those would be better names,
but Versioner
stuck, so here we are :)
>
> Using the File Manager as the client
>
> Configuring a data transfer in filemgr.properties, and then not using the
> crawler directly, but, e.g., using the XmlRpcFileManagerClient directly,
> you can tell the server (on the ingest(...) method) to handle all the file
> transfers for you. In that case, the server needs a
> Data Transferer configured, and the above properties apply, with the caveat
> that the FM server is now the "client" that is transferring
> the data to itself :)
>
> If I set the following property in the etc/filemgr.property file
>
> filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransfer
>
> I did a quick try of this today, trying an ingest on my localhost, (to avoid
> any sticky network issues) and I was able to perform an ingest.
>
> I see you can specify the data transfer factory to use, so I assume then that
> the filemgr.datatransfer.factory setting is just the default if none is
> specified on the command line. Is this true?
It's true: if you are doing server-based transfers (calling the
filemgr-client --ingestProduct operation directly, without specifying a data
transfer factory on the command line), yep.
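For example, this mirrors your command below, but drops the --clientTransfer/--dataTransfer flags so the FM server falls back to filemgr.datatransfer.factory from etc/filemgr.properties (paths and product names taken from your setup):

```shell
# Server-side transfer: no --clientTransfer/--dataTransfer, so the FM
# server performs the transfer using filemgr.datatransfer.factory
./filemgr-client --url http://localhost:9101 --operation --ingestProduct \
  --productName 1331871808.h5 --productStructure Flat \
  --productTypeName KatFile \
  --metadataFile /Users/thomas/1331871808.h5.met \
  --refs /Users/thomas/1331871808.h5
```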
>
> I ran a version of the command line client (my own version of filemgr-client
> with abs paths to the configuration files):
>
> cas-filemgr-client.sh --url http://localhost:9101 --operation --ingestProduct
> --refs /Users/thomas/1331871808.h5 --productStructure Flat --productTypeName
> KatFile --metadataFile /Users/thomas/1331871808.h5.met --productName
> 1331871808.h5 --clientTransfer --dataTransfer
> org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
>
> With the data factory also type spec'ed as:
>
> etc/filemgr.properties
> filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory
>
> And the versioner set as:
>
> policy/product-types.xml
> <versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>
>
> And it ingested the file. +1 for OODT!
WOOT!
>
> Local and remote transfers to the same filemgr
>
> One way to do this is to write a Facade java class, e.g., MultiTransferer,
> that can e.g., on a per-product type basis,
> decide whether to call and delegate to LocalDataTransfer or
> RemoteDataTransfer. If written in a configurable way, that would be
> an awesome addition to the OODT code base. We could call it
> ProductTypeDelegatingDataTransfer.
>
> I'm thinking I would prefer to have some crawlers specifying how files should
> be transferred. Is there any particular reason why this would not be a good
> idea - as long as the client specifies the transfer method to use?
Yeah this is totally acceptable -- you can simply tell the crawler which
TransferFactory to use. If you wanted the crawlers to sense it
automatically based on Product Type (which also has to be provided), then you
could use a method similar to the above.
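To make the facade idea concrete, here's a minimal, self-contained sketch of the delegation pattern. SimpleTransfer is a stand-in interface for illustration only, not the real org.apache.oodt.cas.filemgr.datatransfer.DataTransfer API, and the class/method names are mine:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the filemgr data transfer interface (illustration only).
interface SimpleTransfer {
    String transfer(String productType, String ref);
}

// Picks a transferer per product type, falling back to a default
// (e.g., local) when no mapping is registered.
public class ProductTypeDelegatingTransfer implements SimpleTransfer {
    private final Map<String, SimpleTransfer> byType = new HashMap<>();
    private final SimpleTransfer fallback;

    public ProductTypeDelegatingTransfer(SimpleTransfer fallback) {
        this.fallback = fallback;
    }

    public void register(String productType, SimpleTransfer t) {
        byType.put(productType, t);
    }

    @Override
    public String transfer(String productType, String ref) {
        // Delegate to the transferer configured for this product type.
        return byType.getOrDefault(productType, fallback)
                     .transfer(productType, ref);
    }

    public static void main(String[] args) {
        ProductTypeDelegatingTransfer d = new ProductTypeDelegatingTransfer(
            (type, ref) -> "local:" + ref);
        d.register("KatFile", (type, ref) -> "remote:" + ref);
        System.out.println(d.transfer("KatFile", "/data/1331871808.h5"));
        System.out.println(d.transfer("GenericFile", "/data/a.h5"));
    }
}
```

The real version would delegate to LocalDataTransfer/RemoteDataTransfer instances and read the type-to-transferer map from configuration rather than hard-coding it.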
>
> Getting the product to a second archive
>
> One way to do it is to simply stand up a file manager at the remote site and
> catalog, and then do remote data transfer (and met transfer) to take care of
> that.
> Then as long as your XML-RPC ports are open both the data and metadata can be
> backed up by simply doing the same ingestion mechanisms. You could
> wire that up as a Workflow task to run periodically, or as part of your std
> ingest pipeline (e.g., a Crawler action that on postIngestSuccess backs up to
> the remote
> site by ingesting into the remote backup file manager).
>
> Okay. Got it! I'll see if I can wire up both options!
AWESOME.
>
> I'd be happy to help you down either path.
>
> Thanks! Much appreciated.
>
> > I was thinking, perhaps using the functionality described in OODT-84
> > (Ability for File Manager to stage an ingested Product to one of its
> > clients) and then have a second crawler on the backup archive which will
> > then update its own catalogue.
>
> +1, that would work too!
>
> Once again, thanks for the input and advice - always informative ;)
Haha anytime dude. Great work!
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++