Hi Tom,

> I finally got around to getting my AutoDetectProductCrawler working. In
> response, Chris, I hope you don't mind that I've given some feedback about my
> experiences with the crawler on the wiki page that you created below. I hope
> that's okay. Please feel free to modify/add/revert as you wish.

Awesome! Thanks for the contribution, Tom. Wow, you really rocked that page!
Keep 'em comin'!

Cheers,
Chris

>
> Cheers,
> Tom
>
> On 4 June 2011 07:40, Mattmann, Chris A (388J)
> <[email protected]> wrote:
> Brian, I created a wiki page with your guidance below:
>
> https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>
> Others can feel free to jump on and contribute.
>
> Cheers,
> Chris
>
> On Jun 1, 2011, at 2:20 PM, holenoter wrote:
>
> > hey thomas,
> >
> > you are using StdProductCrawler which assumes a *.met file already exists
> > for each file (it has only one precondition, which is the existence of the
> > *.met file) . . . if you want a *.met file generated you will have to use
> > one of the other 2 crawlers. running: ./crawler_launcher -psc will give
> > you a list of supported crawlers. you can then run: ./crawler_launcher -h
> > -cid <crawler_id> where crawler id is one of the ids from the previous
> > command . . . unfortunately i don't think the other crawlers are documented
> > all that extensively . . . MetExtractorProductCrawler will use a single
> > extractor for all files . . .
> > AutoDetectProductCrawler requires a mapping
> > file to be filled out and mime-types defined
> >
> > * MetExtractorProductCrawler example configuration can be found in the
> > source:
> >   - allows you to specify how the crawler will run your extractor
> >     https://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/resources/examples/extern-config.xml
> >
> > * AutoDetectProductCrawler example configuration can be found in the source:
> >   - uses the same metadata extractor specification file (you will have one
> >     of these for each mime-type)
> >   - allows you to define your mime-types -- that is, give a mime-type for a
> >     given filename regular expression
> >     https://svn.apache.org/repos/asf/oodt/trunk/crawler/src/main/resources/examples/mimetypes.xml
> >
> >   - your file might look something like:
> >
> >     <mime-info>
> >       <mime-type type="product/hdf5">
> >         <glob pattern="*.h5"/>
> >       </mime-type>
> >     </mime-info>
> >
> >   - maps your mime-types to extractors
> >     https://svn.apache.org/repos/asf/oodt/trunk/crawler/src/main/resources/examples/mime-extractor-map.xml
> >
> > Hope this helps . . .
> > -brian
> >
> > On Jun 01, 2011, at 12:54 PM, Thomas Bennett <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I've successfully got the CmdLineIngester working with an
> >> ExternMetExtractor (written in python):
> >>
> >> However, when I try to launch the crawler I get a warning telling me that
> >> the preconditions for ingest have not been met. No .met file has been
> >> created.
> >>
> >> Two questions:
> >> 1) I'm just wondering if there is any configuration that I'm missing.
> >> 2) Where should I start hunting in the code or logs to find out why my met
> >> extractor was not run?
> >>
> >> Kind regards,
> >> Thomas
> >>
> >> For your reference, here is the command and output.
> >>
> >> bin$ ./crawler_launcher --crawlerId StdProductCrawler --productPath
> >> /usr/local/meerkat/data/staging/products/hdf5 --filemgrUrl
> >> http://localhost:9000 --failureDir /tmp --actionIds DeleteDataFile
> >> MoveDataFileToFailureDir Unique --metFileExtension met --clientTransferer
> >> org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
> >> --metExtractor org.apache.oodt.cas.metadata.extractors.ExternMetExtractor
> >> --metExtractorConfig
> >> /usr/local/meerkat/extractors/katextractor/katextractor.config
> >> http://localhost:9000
> >> StdProductCrawler
> >> Jun 1, 2011 9:48:07 PM org.apache.oodt.cas.crawl.ProductCrawler crawl
> >> INFO: Crawling /usr/local/meerkat/data/staging/products/hdf5
> >> Jun 1, 2011 9:48:07 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile
> >> INFO: Handling file
> >> /usr/local/meerkat/data/staging/products/hdf5/1263940095.h5
> >> Jun 1, 2011 9:48:07 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile
> >> WARNING: Failed to pass preconditions for ingest of product:
> >> [/usr/local/meerkat/data/staging/products/hdf5/1263940095.h5]
> >>
> >
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> --
> Thomas Bennett
>
> SKA South Africa
>
> Office : +2721 506 7341
> Mobile : +2779 523 7105
> Email : [email protected]
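
P.S. For anyone finding this thread later: the AutoDetectProductCrawler setup Brian describes comes down to two lookups, first from a filename glob to a mime-type, then from that mime-type to an extractor config. Here is a rough Python sketch of that idea only. It is not OODT code: the two dictionaries are hypothetical stand-ins for mimetypes.xml and mime-extractor-map.xml, the function names are made up for illustration, and the real crawler's precondition handling is not modelled at all.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for mimetypes.xml: filename glob -> mime-type.
MIME_GLOBS = {
    "*.h5": "product/hdf5",
}

# Hypothetical stand-in for mime-extractor-map.xml: mime-type -> extractor config.
# The path is the one from Thomas's command, reused purely as an example value.
EXTRACTOR_CONFIGS = {
    "product/hdf5": "/usr/local/meerkat/extractors/katextractor/katextractor.config",
}


def detect_mime_type(filename):
    """Return the mime-type of the first glob matching filename, or None."""
    for pattern, mime_type in MIME_GLOBS.items():
        if fnmatchcase(filename, pattern):
            return mime_type
    return None


def extractor_config_for(filename):
    """Return the extractor config for filename, or None if nothing is mapped."""
    return EXTRACTOR_CONFIGS.get(detect_mime_type(filename))


if __name__ == "__main__":
    for name in ("1263940095.h5", "notes.txt"):
        print(name, "->", detect_mime_type(name), "->", extractor_config_for(name))
```

A file with no matching glob ends up with no extractor in this sketch, so it pays to check that every product filename in the staging area is covered by some glob pattern.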
