sorting by date (XML)

2004-04-27 Thread Michael Wechner
my XML files contain something like

<date>
  <year>2004</year><month>04</month><day>27</day>...
</date>

and I would like to sort by this date.

So I guess I need to modify the Documentparser and generate something like
a millisecond field and then sort by this, correct?
Has anyone done something like this yet?

Thanks

Michi

--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com  http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: sorting by date (XML)

2004-04-27 Thread Nader S. Henein

Here's my two cents on this:
Either way you will need to combine the date into one field, but if you use a
millisecond representation you will not be able to use the FLOAT sort type,
and you'll have to use the STRING sort type (slower), because the millisecond
representation is longer than FLOAT allows. So you have three options:

1) Use YYYYMMDD and sort by FLOAT type
2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing, then you can
just sort by DOC type (which is the DOC ID) and save yourself the pain
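
As a rough illustration of option 1, the sketch below packs year, month and day into a fixed-width key (assuming YYYYMMDD is the intended format). The class and method names are hypothetical and nothing here calls Lucene. It also shows one caveat worth checking before settling on a float-based sort: a Java float cannot represent every eight-digit integer exactly, so neighbouring days can collapse to the same value. If your Lucene version offers an integer or string sort type, that may be the safer choice; treat this as an assumption to verify against the SortField documentation.

```java
import java.util.Locale;

final class DateSortKey {
    /** Packs year, month and day into a fixed-width YYYYMMDD string. */
    static String toKey(int year, int month, int day) {
        return String.format(Locale.ROOT, "%04d%02d%02d", year, month, day);
    }

    public static void main(String[] args) {
        String a = toKey(2004, 4, 27);
        String b = toKey(2004, 4, 28);

        // Fixed-width keys compare correctly as plain strings.
        System.out.println(a + " sorts before " + b + ": " + (a.compareTo(b) < 0));

        // A float has a 24-bit significand, so odd integers above 16,777,216
        // are rounded to a neighbouring even value.
        float fa = (float) Integer.parseInt(a);
        float fb = (float) Integer.parseInt(b);
        System.out.println("same value as float: " + (fa == fb));
        System.out.println("same value as int:   " + (Integer.parseInt(a) == Integer.parseInt(b)));
    }
}
```

The string and int comparisons keep the two days distinct, while the float comparison does not.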

Hope this helps.

Nader Henein




Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Nader S. Henein wrote:

> Here's my two cents on this:
> Either way you will need to combine the date into one field, but if you use a
> millisecond representation you will not be able to use the FLOAT sort type,
> and you'll have to use the STRING sort type (slower), because the millisecond
> representation is longer than FLOAT allows. So you have three options:
>
> 1) Use YYYYMMDD and sort by FLOAT type

OK, I guess I will use the FLOAT type then.

> 2) Use the millisecond representation and sort by STRING type
> 3) If the date you're entering here is the date of indexing, then you can
> just sort by DOC type (which is the DOC ID) and save yourself the pain

Unfortunately this isn't possible.

Thanks a lot for your help

Michi



Re: sorting by date (XML)

2004-04-27 Thread Otis Gospodnetic
Beware of storing timestamps (DateFields, I guess) in Lucene, if you
intend to use range queries (xxx TO yyy).

Otis




Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Otis Gospodnetic wrote:

> Beware of storing timestamps (DateFields, I guess) in Lucene, if you
> intend to use range queries (xxx TO yyy).

Why?

We have attributes that contain iso8601 date strings, and when indexing:

Date date = isoConv.parse(value, new ParsePosition(0));
String dateString = DateField.dateToString(date);
doc.add(Field.Keyword(name, dateString));

then when searching:

String from = DateField.timeToString(searchFromDate);
String to = DateField.timeToString(searchToDate);
RangeQuery rq = new RangeQuery(new Term(searchKey, from),
   new Term(searchKey, to), true);

Is this not correct?

best,
-Rob



Re: sorting by date (XML)

2004-04-27 Thread Otis Gospodnetic
Because having small time units like milliseconds will result in the range
query expanding to a BooleanQuery with a large number of clauses, if you have
a lot of documents with unique timestamps.  Rounding the timestamp to
minutes, hours, or days can drastically reduce the number of unique
timestamps, hence resulting in fewer clauses.
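
To make the effect of rounding concrete, here is a small sketch in plain Java (no Lucene calls; the class and method names are made up for illustration). It generates millisecond timestamps 30 seconds apart and counts how many distinct values remain once each one is truncated to the start of its UTC day. The number of distinct values is a stand-in for the number of distinct indexed terms a range query would have to enumerate.

```java
import java.util.HashSet;
import java.util.Set;

final class TimestampRounding {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    /** Truncates a millisecond timestamp to the start of its UTC day. */
    static long roundToDay(long millis) {
        return Math.floorDiv(millis, MILLIS_PER_DAY) * MILLIS_PER_DAY;
    }

    /** Counts distinct values, optionally after rounding each one to its day. */
    static int distinctCount(long[] timestamps, boolean roundToDay) {
        Set<Long> seen = new HashSet<>();
        for (long t : timestamps) {
            seen.add(roundToDay ? roundToDay(t) : t);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // 10,000 timestamps, 30 seconds apart, starting at midnight UTC on 2004-04-27.
        long start = 1083024000000L;
        long[] timestamps = new long[10_000];
        for (int i = 0; i < timestamps.length; i++) {
            timestamps[i] = start + i * 30_000L;
        }
        System.out.println("distinct millisecond values: " + distinctCount(timestamps, false));
        System.out.println("distinct day values:         " + distinctCount(timestamps, true));
    }
}
```

With this sample data every raw timestamp is unique, while the rounded values collapse to a handful of days.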

Otis




Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Otis Gospodnetic wrote:

> Because having small time units like milliseconds will result in the range
> query expanding to a BooleanQuery with a large number of clauses, if you have
> a lot of documents with unique timestamps.  Rounding the timestamp to
> minutes, hours, or days can drastically reduce the number of unique
> timestamps, hence resulting in fewer clauses.

Cool, thanks. So DateField.dateToString is the best, most efficient way,
correct?



Re: sorting by date (XML)

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote:

> Otis Gospodnetic wrote:
>
>> Because having small time units like milliseconds will result in the range
>> query expanding to a BooleanQuery with a large number of clauses, if you have
>> a lot of documents with unique timestamps.  Rounding the timestamp to
>> minutes, hours, or days can drastically reduce the number of unique
>> timestamps, hence resulting in fewer clauses.
>
> Cool, thanks. So DateField.dateToString is the best, most efficient
> way, correct?

It all depends.  But if all you care about is year, month, day, it is
_not_ the most efficient.  DateField converts down to milliseconds, and
that is what Otis was referring to.

	Erik



Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Erik Hatcher wrote:

> On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote:
>
>> Otis Gospodnetic wrote:
>>
>>> Because having small time units like milliseconds will result in the range
>>> query expanding to a BooleanQuery with a large number of clauses, if you have
>>> a lot of documents with unique timestamps.  Rounding the timestamp to
>>> minutes, hours, or days can drastically reduce the number of unique
>>> timestamps, hence resulting in fewer clauses.
>>
>> Cool, thanks. So DateField.dateToString is the best, most efficient
>> way, correct?
>
> It all depends.  But if all you care about is year, month, day, it is
> _not_ the most efficient.  DateField converts down to milliseconds, and
> that is what Otis was referring to.

Oops, I meant to write DateField.timeToString, which I use when querying.
If I use DateField.dateToString when indexing but timeToString when
searching, is that a bad practice? I only need month, day and year. So
should I be indexing with timeToString?

How would you do it if the above is still a bad practice?

Sorry for the basic questions...

best,
-Rob



Re: sorting by date (XML)

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote:

> Oops, I meant to write DateField.timeToString, which I use when
> querying. If I use DateField.dateToString when indexing but
> timeToString when searching, is that a bad practice? I only need
> month, day and year. So should I be indexing with timeToString?
>
> How would you do it if the above is still a bad practice?
>
> Sorry for the basic questions...

No worries.  This is the type of thing that is a gotcha with dates,
and is a prime candidate for a wiki page (nudge, nudge)...

You should represent dates (at index and search time) using YYYYMMDD
format - it needs to be lexicographically ordered.  Forget DateField
and Field.Keyword(String,Date) altogether.

Some tricks are needed if you need to use QueryParser to translate
mm/dd/yyyy format to how you represent it, but it is quite simple
(subclass QueryParser, override getRangeQuery).
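
A minimal sketch of the translation step, assuming user input arrives as mm/dd/yyyy and the index stores yyyyMMdd strings. The class and method names are hypothetical, and the QueryParser override itself is left out because its exact signature depends on the Lucene version. The range check is only there to show that plain string comparison gives chronological order for this format.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class UserDateToIndexKey {
    private static final DateTimeFormatter USER_FORMAT = DateTimeFormatter.ofPattern("MM/dd/uuuu");
    private static final DateTimeFormatter INDEX_FORMAT = DateTimeFormatter.ofPattern("uuuuMMdd");

    /** Converts a user-facing mm/dd/yyyy date into the yyyyMMdd index form. */
    static String toIndexKey(String userDate) {
        return LocalDate.parse(userDate, USER_FORMAT).format(INDEX_FORMAT);
    }

    /** Inclusive range check using plain lexicographic string comparison. */
    static boolean inRange(String key, String from, String to) {
        return from.compareTo(key) <= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String from = toIndexKey("12/31/2003");
        String to = toIndexKey("05/01/2004");
        String doc = toIndexKey("04/27/2004");
        System.out.println(doc + " in [" + from + ", " + to + "]: " + inRange(doc, from, to));
    }
}
```

The same conversion would apply to both ends of a range before the query is built, so that index-time and search-time values share one format.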

	Erik



Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Erik Hatcher wrote:

> On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote:
>
>> Oops, I meant to write DateField.timeToString, which I use when
>> querying. If I use DateField.dateToString when indexing but
>> timeToString when searching, is that a bad practice? I only need
>> month, day and year. So should I be indexing with timeToString?
>>
>> How would you do it if the above is still a bad practice?
>>
>> Sorry for the basic questions...
>
> No worries.  This is the type of thing that is a gotcha with dates,
> and is a prime candidate for a wiki page (nudge, nudge)...
>
> You should represent dates (at index and search time) using YYYYMMDD
> format - it needs to be lexicographically ordered.  Forget DateField and
> Field.Keyword(String,Date) altogether.
>
> Some tricks are needed if you need to use QueryParser to translate
> mm/dd/yyyy format to how you represent it, but it is quite simple
> (subclass QueryParser, override getRangeQuery).

Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

This is perfect in my case since iso8601 is in the format:

2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I 
guess I should have been doing this from the beginning, don't know why I 
didn't...
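
For an ISO 8601 value like the one above, the day-granularity key can be derived by parsing the string and formatting only the date part. This is a sketch with hypothetical names, assuming the values carry no time zone suffix (as in the example); it does not touch Lucene.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class IsoToIndexKey {
    private static final DateTimeFormatter INDEX_FORMAT = DateTimeFormatter.ofPattern("uuuuMMdd");

    /** Turns an ISO 8601 local date-time such as 2004-04-27T01:23:33 into yyyyMMdd. */
    static String toIndexKey(String isoDateTime) {
        return LocalDateTime.parse(isoDateTime).format(INDEX_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(toIndexKey("2004-04-27T01:23:33"));
    }
}
```

Dropping the time of day is what keeps the number of distinct indexed values small when only year, month and day matter.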

best,
-Rob



Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Robert Koberg wrote:

> Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

I guess you mean

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Thanks as well

Michi

