Hi Tom,

Thanks. Comments below:

On Sep 22, 2011, at 2:30 PM, Thomas Bennett wrote:

> Hi,
> 
> I have a few questions about building queries for filemgr lucene catalogs and 
> I was thinking someone may be able to help me.
> 
> I've ingested some files into the catalog and am now using the command line 
> tools (and aliases - thanks Cameron!) to query the catalog.
> 
> I'm not too familiar with writing SQL queries, but I've been able to achieve 
> the following types of queries:
> 
> bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT 
> Observer,Description,Duration,ExperimentID FROM KatFile WHERE 
> Observer='jasper'" --sortBy Duration
> 
> Which returns:
> .....
> jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
> jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
> jasper,777b0f34-8224-11e0-a7a1-0060dd4721d8,Target track,80.9798858166
> 
> 
> bin$ ./query_tool --url http://localhost:9000 --lucene -query 
> 'Observer:sharmila'
> 
> Which returns:
> .......
> ba9b292e-e506-11e0-ad74-9f1c5e7f0611
> b93dbc0d-e506-11e0-ad74-9f1c5e7f0611
> b7e530ec-e506-11e0-ad74-9f1c5e7f0611
> b66ff60b-e506-11e0-ad74-9f1c5e7f0611
> afc6556a-e506-11e0-ad74-9f1c5e7f0611
> 
> 
> Questions:
>       • The SQL query does what I expect ;-) but with one problem - in what 
> order will I receive the data? I can't figure out an automatic way to find 
> out which column is which data.

Good question! It looks like it just prints the metadata in an arbitrary order, 
rather than the order you asked for it in the query. This is probably not a 
great thing to do, so can you file an issue and we can take a look at it?

>       • Is full SQL query syntax supported?

Nope, it's just a small subset. You can see what's supported here:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/util/SqlParser.html

Improvements welcome! :)

>       • The Lucene query returns the productID. Is there a class I can use 
> that will return something similar to the sql query? (Although I should look 
> at the code and find this out for myself - asking is free :-)

Heh, great question, but the answer is no. We didn't really standardize the 
output from these tools. I originally developed the QueryTool (which understood 
Lucene to begin with); later, Brian Foster added the SQL syntax to it, along 
with its associated response format.

Maybe we should open an issue (and an associated wiki page) on standardizing 
the output. Feel free to propose something and I'll be happy to join in 
(hopefully others will too).

>       • I've not yet tested any more complex SQL and Lucene queries - I was 
> just wondering if there was any useful info out there that would show me 
> some more funky example queries. So far I've found lucene tutorial and sql 
> quick ref. I'll tie this into OODT Filemgr User Guide once I've figured these 
> things out.

+1, that's the best place to start. We only support a limited subset of the 
Lucene syntax as well; see the following class:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/tools/CASAnalyzer.html
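For concreteness, here are a couple of slightly fancier Lucene-style queries in 
the same style as your examples (a sketch only: the field names come from your 
KatFile metadata above, and whether each construct is accepted depends on the 
CASAnalyzer subset just mentioned). The script just builds and prints the 
commands so you can eyeball them first:

```shell
# Sketch: field names come from the KatFile examples above; whether a given
# construct parses depends on the CASAnalyzer/QueryParser subset.
FM_URL="http://localhost:9000"
Q_BOOL='Observer:jasper AND Description:"Target track"'  # boolean AND + phrase
Q_REQ='+Observer:jasper -Observer:sharmila'              # required/prohibited terms
echo "./query_tool --url $FM_URL --lucene -query '$Q_BOOL'"
echo "./query_tool --url $FM_URL --lucene -query '$Q_REQ'"
```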

>       • I see the version of lucene being used is quite old (2.0.0 and the 
> latest ver is 2.9.1). Is there any reason why OODT is using this old version?

I would *love* to upgrade to 2.9.1 or 2.9.4.

Upgrading to 3.0 will break APIs for us, b/c Lucene changed to the 
ScoreCollector method for getting hits back (I believe in the 3.x series); 
however, we should be forward compatible with e.g. 2.9.4:

http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.9.4/

>       • Should I be spending the effort to use a different (i.e. sql 
> database) or are other OODT implementations using lucene?
> Thanks in advance for any help.

Great question.

Most folks use Lucene to begin with because it requires no external database or 
service; it just works out of the box. It also has a number of other 
advantages:

* Easy unit testing against your index
* You can copy around FM index directories and share them between machines
* You can test locally on your laptop by copying the FM index off of a server 
onto your laptop, and then spinning up a local FM from there. The file refs 
won't exist, but you can play around with the catalog and most other things 
work.
* You can open up the FM index in Luke (http://getopt.org/luke/) and then 
browse and query the index using the full Lucene syntax
* It's fairly scalable (up to 10s of millions of products). You can scale 
beyond that, but you have to get into index partitioning, backups, etc. Time 
queries also suffer token explosion at that scale (e.g., doing a range query 
for 2001-01-01T00:00:00.000Z to 2003-01-01T00:00:00.000Z will explode), mainly 
due to the SerDe format we used in the LuceneCatalog for storing CAS metadata 
and product information. This could be improved to scale beyond a few million 
products, but no one has invested the effort in that yet; people typically just 
use a SQL RDBMS and the DataSourceCatalog at that point.
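To make the token-explosion point concrete, the kind of time-range query that 
blows up looks like this (a sketch: CAS.ProductReceivedTime is the standard CAS 
receive-time field, but substitute whatever date field your catalog indexes; 
the command is printed rather than run):

```shell
# On a large LuceneCatalog, this kind of range query expands into a huge
# set of term tokens and can blow up, per the scalability note above.
FM_URL="http://localhost:9000"
RANGE='CAS.ProductReceivedTime:[2001-01-01T00:00:00.000Z TO 2003-01-01T00:00:00.000Z]'
echo "./query_tool --url $FM_URL --lucene -query '$RANGE'"
```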

To move your existing index to the DataSourceCatalog, there's a tool in FM that 
I wrote called ExpImpCatalog. You can find it here: http://s.apache.org/Xuq

To use the tool in an existing FM deployment, do the following:

1. Stand up a new FM that you are going to configure with your 
DataSourceCatalog. 
  - change the port to 9010
  - if your existing FM is in e.g., /usr/local/filemgr, put this new one in 
/usr/local/filemgr2
  - configure it with the DataSourceCatalog
  - set up your DB and bake in the parameters to the FM config
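For the "configure it with the DataSourceCatalog" step, the relevant knobs live 
in the new FM's etc/filemgr.properties. A sketch from memory (verify the 
property names against the filemgr.properties that ships with your FM; the JDBC 
URL, driver, and credentials here are placeholders):

```
# /usr/local/filemgr2/etc/filemgr.properties (values are placeholders)
filemgr.catalog.factory=org.apache.oodt.cas.filemgr.catalog.DataSourceCatalogFactory
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.url=jdbc:mysql://localhost/filemgr
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.user=oodt
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.pass=changeme
org.apache.oodt.cas.filemgr.catalog.datasource.jdbc.driver=com.mysql.jdbc.Driver
```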

2. Go into /usr/local/filemgr/bin (your existing, Lucene-based FM)
    - run java -Djava.ext.dirs=../lib 
org.apache.oodt.cas.filemgr.tools.ExpImpCatalog; you should see:

]$ java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog
ExpImpCatalog [options] 
--source <url>
--dest <url>
 --unique
[--types <comma separate list of product type names>]
[--sourceCatProps <file> --destCatProps <file>]

This tool works like the following: you give it either a combination of 
--source and --dest, OR a combination of --sourceCatProps and --destCatProps.

In the case of simply --source and --dest, it will import all of the source 
catalog into the dest catalog via XML-RPC, talking to your source FM URL and 
your dest FM URL. In the case of --sourceCatProps and --destCatProps, it will 
do the same thing, except it won't use XML-RPC as the transport layer: it will 
simply instantiate a copy of the source Catalog interface object and the dest 
Catalog interface object (in a single JVM), and import products and metadata 
one at a time from source to dest. I made the props-based portion of the tool 
to avoid transferring large met and product objects over XML-RPC, and to keep 
them within a JVM.

The --unique flag tells it not to import a source product ID into the dest 
catalog if that product ID already exists there. The --types parameter 
specifies a comma-separated list of product types to export from the source 
catalog into the dest catalog. If --types is omitted, all product types are 
assumed.
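Putting that together, a migration run for your setup might look like this (a 
sketch: the 9000/9010 ports and the KatFile type come from the examples above; 
the script prints the command with echo so you can sanity-check it before 
running it from /usr/local/filemgr/bin):

```shell
# Build the migration command; run it from /usr/local/filemgr/bin of the
# existing Lucene-based FM once the new DataSourceCatalog FM is up on 9010.
SRC="http://localhost:9000"   # source: existing Lucene-catalog FM
DST="http://localhost:9010"   # dest: new DataSourceCatalog FM
CMD="java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog --source $SRC --dest $DST --unique --types KatFile"
echo "$CMD"
```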

So, there is an easy way to migrate from an existing Lucene index FM catalog 
into any other Catalog fronted by the FM. Another thing people sometimes do, if 
they have the source data and the ingestion pipeline, is just blow away the 
Lucene (or whatever) catalog and re-ingest through the Crawler/FM/Curation 
pipeline into e.g. a new DataSourceCatalog that they configure their existing 
FM to use.

Hope that helps explain things. These would probably be good javadocs, plus 
Wiki pages for these tools and migration :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
