query for point in time

2011-09-15 Thread gary tam
Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for project,
started date, end date.

looks something like


John Smith web site bug fix   2010-01-01   2010-01-03
 unit testing  2010-01-04
2010-01-06
 QA support 2010-01-07
2010-01-12
 implementation   2010-01-13
 2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks


Re: query for point in time

2011-09-15 Thread Jonathan Rochkind
You didn't tell us what your schema looks like, what fields with what 
types are involved.


But similar to how you'd do it in your database, you need to find 
'documents' that have a start date before your date in question, and an 
end date after your date in question, to find the ones whose range 
includes your date in question.


Something like this:

q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

Of course, you need to add on your restriction to just documents about 
'John Smith', through another AND clause or an 'fq'.


But in general, if you've got a db with this info already, and this is 
all you need, why not just use the db?  Multi-hieararchy data like this 
is going to give you trouble in Solr eventually, you've got to arrange 
the solr indexes/schema to answer your questions, and eventually you're 
going to have two questions which require mutually incompatible schema 
to answer.


An rdbms is a great general purpose question answering tool for 
structured data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:

Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for project,
started date, end date.

looks something like


John Smith web site bug fix   2010-01-01   2010-01-03
  unit testing  2010-01-04
2010-01-06
  QA support 2010-01-07
2010-01-12
  implementation   2010-01-13
  2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks



Re: query for point in time

2011-09-15 Thread gary tam
Thanks for the reply.  We had the search within the database initially, but
it proven to be too slow.  With solr we have much better performance.

One more question, how could I find the most current job for each employee

My data looks like


John Smith  department A   web site bug fix   2010-01-01
2010-01-03
 unit testing
 2010-01-04   2010-01-06
 QA support
2010-01-07   2010-01-12
 implementation   2010-01-13
   2010-01-22

Jane Doe  department A  QA support 2010-01-01
2010-05-01
 implementation   2010-05-02
   2010-09-28

Joe Doe  department APHP development  2011-01-01
2011-08-31
 Java Development  2011-09-01
2011-09-15

I would like to return this as my search result

John Smith   department Aimplementation  2010-01-13
  2010-01-22
Jane Doe  department Aimplementation  2010-05-02
  2010-09-28
Joe Doedepartment AJava Development  2011-09-01
  2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 You didn't tell us what your schema looks like, what fields with what types
 are involved.

 But similar to how you'd do it in your database, you need to find
 'documents' that have a start date before your date in question, and an end
 date after your date in question, to find the ones whose range includes your
 date in question.

 Something like this:

 q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

 Of course, you need to add on your restriction to just documents about
 'John Smith', through another AND clause or an 'fq'.

 But in general, if you've got a db with this info already, and this is all
 you need, why not just use the db?  Multi-hieararchy data like this is going
 to give you trouble in Solr eventually, you've got to arrange the solr
 indexes/schema to answer your questions, and eventually you're going to have
 two questions which require mutually incompatible schema to answer.

 An rdbms is a great general purpose question answering tool for structured
 data.  lucene/Solr is a great indexing tool for text matching.


 On 9/15/2011 2:55 PM, gary tam wrote:

 Hi

 I have a scenario that I am not sure how to write the query for.

 Here is the scenario - have an employee record with multi value for
 project,
 started date, end date.

 looks something like


 John Smith web site bug fix   2010-01-01   2010-01-03
  unit testing  2010-01-04
 2010-01-06
  QA support 2010-01-07
 2010-01-12
  implementation   2010-01-13
  2010-01-22

 I want to find what project John Smith was working on 2010-01-05

 Is this possible or I have to back to my database ?


 Thanks




Re: query for point in time

2011-09-15 Thread Jonathan Rochkind

I think there's something wrong with your database then, but okay.

You still haven't said what your Solr schema looks like -- that list of 
values doesn't say what the solr field names or types are. I think this 
is maybe because you don't actually have a Solr database and have no 
idea how Solr works, you're just asking in theory? On the other hand, 
you just said you have better performance with solr -- I'm not sure how 
you were able to tell the performance of solr in answering these queries 
if you don't even know how to make them!


But, again, assuming your data is set up like i'm guessing it is, it's 
quite similar to what you'd do with an rdbms.


What does 'most current' mean? Can jobs be overlapping? To find the 
project with the latest start date for a given person, just limit to 
documents with that current person in a 'q' or 'fq', and then sort by 
start_date desc. Perhaps limit to 1 if you really only want one hit.  
Same principle as you would in an rdbms.


Again, this requires setting up your solr index in such a way to answer 
these sorts of questions. Each document in Solr will represent a 
person-project pair.  It'll have fields for person (or multiple fields, 
personID, personFirst, personLast, etc), project name, project start 
date, project end date.  This will make it easy/possible to answer 
questions like your examples with Solr, but will make it hard to answer 
many other sorts of questions -- unlike an rdbms, it is difficult to set 
up a Solr index that can flexibly answer just about any question you 
through at it, particularly when you have hieararchical or otherwise 
multi-entity data.


If you are interested, the standard Solr tutorial is pretty good: 
http://lucene.apache.org/solr/tutorial.html





On 9/15/2011 6:39 PM, gary tam wrote:

Thanks for the reply.  We had the search within the database initially, but
it proven to be too slow.  With solr we have much better performance.

One more question, how could I find the most current job for each employee

My data looks like


John Smith  department A   web site bug fix   2010-01-01
2010-01-03
  unit testing
  2010-01-04   2010-01-06
  QA support
2010-01-07   2010-01-12
  implementation   2010-01-13
2010-01-22

Jane Doe  department A  QA support 2010-01-01
2010-05-01
  implementation   2010-05-02
2010-09-28

Joe Doe  department APHP development  2011-01-01
2011-08-31
  Java Development  2011-09-01
 2011-09-15

I would like to return this as my search result

John Smith   department Aimplementation  2010-01-13
   2010-01-22
Jane Doe  department Aimplementation  2010-05-02
   2010-09-28
Joe Doedepartment AJava Development  2011-09-01
   2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:


You didn't tell us what your schema looks like, what fields with what types
are involved.

But similar to how you'd do it in your database, you need to find
'documents' that have a start date before your date in question, and an end
date after your date in question, to find the ones whose range includes your
date in question.

Something like this:

q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

Of course, you need to add on your restriction to just documents about
'John Smith', through another AND clause or an 'fq'.

But in general, if you've got a db with this info already, and this is all
you need, why not just use the db?  Multi-hieararchy data like this is going
to give you trouble in Solr eventually, you've got to arrange the solr
indexes/schema to answer your questions, and eventually you're going to have
two questions which require mutually incompatible schema to answer.

An rdbms is a great general purpose question answering tool for structured
data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:


Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for
project,
started date, end date.

looks something like


John Smith web site bug fix   2010-01-01   2010-01-03
  unit testing  2010-01-04
2010-01-06
  QA support 2010-01-07
2010-01-12
  implementation   2010-01-13
  2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks