Cheers, Geert-Jan, that's very helpful.
We won't always be searching with dates and we wouldn't want
duplicates to show up in the results, so your second suggestion looks
like a good workaround if I can't solve the actual problem. I didn't
know about FieldCollapsing, so I'll definitely keep it in mind.
Thanks
Mark
On 22 Jun 2010, at 3:44 pm, Geert-Jan Brits wrote:
Perhaps my answer is useless, because I don't have an answer to your
direct question, but:
You *might* want to consider whether your concept of a Solr document is
at the right level of granularity, i.e. the problem you posted could be
tackled (afaik) by defining a document as a 'sub-event' with only one
daterange.
So each event-doc you have now is replaced by several sub-event docs in
this proposed setup.
Additionally, each sub-event doc gets an additional field
'parent-eventid' which maps to something like an event-id (which you're
probably using). So several sub-event docs can point to the same
event-id.
Lastly, all sub-event docs belonging to a particular event carry all the
other fields that you may have stored in that particular event-doc.
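As a rough illustration of the split described above, here is a minimal
Python sketch. The field names 'id' and 'title' and the sub-event id
scheme are assumptions made up for the example; only 'daterange' and
'parent-eventid' come from this thread.

```python
from typing import Any, Dict, List


def split_into_sub_events(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one event-doc with several dateranges into one doc per daterange."""
    # Every field except the id and the dateranges is copied to each sub-event.
    shared = {k: v for k, v in event.items() if k not in ("id", "daterange")}
    sub_events = []
    for i, daterange in enumerate(event["daterange"]):
        doc = dict(shared)
        doc["id"] = f"{event['id']}-{i}"      # hypothetical sub-event id scheme
        doc["parent-eventid"] = event["id"]   # points back to the original event
        doc["daterange"] = daterange          # exactly one daterange per doc
        sub_events.append(doc)
    return sub_events


event = {
    "id": "news-42",                          # hypothetical event id
    "title": "Evening broadcast",             # hypothetical extra field
    "daterange": ["19820402,19820614", "1990,2000"],
}
sub_events = split_into_sub_events(event)
```

Each resulting doc has a single daterange, so the start and end values
can no longer be mixed up across ranges.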
Now you can query for events based on date ranges like you envisioned,
but instead of returning events you return sub-event docs. However,
since all data of the original event (except the multiple dateranges) is
available in the sub-event doc, this shouldn't really bother the client.
If you need to display all dates of an event (the only info missing from
the returned Solr doc) you could easily store them in an RDB and fetch
them using the defined parent-eventid.
The only caveat I see is that multiple sub-events with the same
'parent-eventid' might get returned for a particular query.
This depends, however, on the type of queries you envision, i.e.:
1) If you always issue queries with date-filters, and *assuming* that
sub-events of a particular event don't temporally overlap, you will
never get multiple sub-events returned.
2) If 1) doesn't hold, and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to only return the first sub-event per parent-eventid
that matches the rest of your query. (Note, however, that Field
Collapsing is a patch at the moment:
http://wiki.apache.org/solr/FieldCollapsing)
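To illustrate what collapsing on 'parent-eventid' amounts to, here is a
small Python sketch that keeps only the first sub-event per parent from
an already-ranked result list. It only models the effect on a list of
dicts. It is not how the Solr patch is implemented or invoked, and the
sample docs are made up.

```python
from typing import Any, Dict, List


def collapse_by_parent(docs: List[Dict[str, Any]],
                       key: str = "parent-eventid") -> List[Dict[str, Any]]:
    """Keep the first doc seen for each parent id, preserving result order."""
    seen = set()
    collapsed = []
    for doc in docs:
        parent = doc[key]
        if parent not in seen:
            seen.add(parent)
            collapsed.append(doc)
    return collapsed


# Hypothetical ranked results: two sub-events share the parent "news-42".
results = [
    {"id": "news-42-0", "parent-eventid": "news-42"},
    {"id": "news-7-1", "parent-eventid": "news-7"},
    {"id": "news-42-1", "parent-eventid": "news-42"},
]
collapsed = collapse_by_parent(results)
```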
Not sure if this helped you at all, but at the very least it was a nice
conceptual exercise ;-)
Cheers,
Geert-Jan
2010/6/22 Mark Allan <mark.al...@ed.ac.uk>
Hi all,
Firstly, I apologise for the length of this email but I need to describe
properly what I'm doing before I get to the problem!
I'm working on a project just now which requires the ability to store
and search on temporal coverage data, i.e. a field which specifies a
date range during which a certain event took place.
I hunted around for a few days and couldn't find anything which seemed
to fit, so I had a go at writing my own field type based on
solr.PointType. It's used as follows:
schema.xml
<fieldType name="temporal" class="solr.TemporalCoverage"
           dimension="2" subFieldSuffix="_i"/>
<field name="daterange" type="temporal" indexed="true"
       stored="true" multiValued="true"/>
data.xml
<add>
<doc>
...
<field name="daterange">1940,1945</field>
</doc>
</add>
Internally, this gets stored as:
<arr name="daterange"><str>1940,1945</str></arr>
<int name="daterange_0_i">19400000</int>
<int name="daterange_1_i">19450000</int>
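The subfield values above suggest that each endpoint is right-padded
with zeros to an eight-digit yyyymmdd integer. A minimal Python sketch
of that encoding follows. It is inferred only from the examples in this
mail and is not the actual field-type code: the function names are
hypothetical, and accepting a six-digit yyyymm form is an assumption.

```python
from typing import Tuple


def encode_endpoint(value: str) -> int:
    """Right-pad a date string (yyyy, yyyymm or yyyymmdd) with zeros to 8 digits."""
    value = value.strip()
    # Assumption: only 4-, 6- or 8-digit numeric strings are valid input.
    if not value.isdigit() or len(value) not in (4, 6, 8):
        raise ValueError(f"unsupported date format: {value!r}")
    return int(value.ljust(8, "0"))


def encode_range(daterange: str) -> Tuple[int, int]:
    """Split 'start,end' into the two integers kept in the _0_i and _1_i subfields."""
    start, end = daterange.split(",")
    return encode_endpoint(start), encode_endpoint(end)


encoded = encode_range("1940,1945")
```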
In due course, I'll declare the subfields as a proper date type, but in
the meantime this works absolutely fine. I can search for an individual
date and Solr will check (queryDate > daterange_0 AND queryDate <
daterange_1) and the correct documents are returned. My code also
allows the user to input a date range in the query but I won't
complicate matters with that just now!
The problem arises when a document has more than one "daterange" field
(imagine a news broadcast which covers a variety of topics and hence
time periods).
A document with two daterange fields
<doc>
...
<field name="daterange">19820402,19820614</field>
<field name="daterange">1990,2000</field>
</doc>
gets stored internally as
<arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
<arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
<arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>
In this situation, searching for 1985 should yield zero results as it is
contained within neither daterange; however, the above document is
returned in the result set. What Solr is doing is checking that the
queryDate (1985) is greater than *any* of the values in daterange_0 AND
that queryDate is less than *any* of the values in daterange_1.
How can I get Solr to respect the positions of each item in the
daterange_0 and _1 arrays? Ideally I'd like the search to use the
following logic, thus preventing the above document from being returned
in a search for 1985:
(queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
(queryDate > daterange_0[1] AND queryDate < daterange_1[1])
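The difference between the two behaviours can be reproduced outside
Solr. The Python sketch below only models the logic on the example
values from this mail. The function names are hypothetical, and the
query 1985 is assumed to be encoded as 19850000 in the same way as the
stored values.

```python
from typing import List


def matches_any_value(query: int, starts: List[int], ends: List[int]) -> bool:
    """Observed behaviour: each multi-valued subfield is tested on its own."""
    return any(query > s for s in starts) and any(query < e for e in ends)


def matches_pairwise(query: int, starts: List[int], ends: List[int]) -> bool:
    """Desired behaviour: start and end must come from the same position."""
    return any(s < query < e for s, e in zip(starts, ends))


starts = [19820402, 19900000]   # daterange_0_i values
ends = [19820614, 20000000]     # daterange_1_i values
query = 19850000                # 1985, assumed zero-padded like the stored values

false_positive = matches_any_value(query, starts, ends)
correct = matches_pairwise(query, starts, ends)
```

Here the independent check matches because 1985 is after the first
start and before the second end, while the pairwise check rejects the
document as intended.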
Someone else had a very similar problem recently on the mailing list
with a multiValued PointType field but the thread went cold without a
final solution.
While I could filter the results when they get back to my application
layer, it seems like it's not really the right place to do it.
Any help getting Solr to respect the positions of items in arrays would
be very gratefully received.
Many thanks,
Mark
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.