sorting by date (XML)
My XML files contain something like

    <date><year>2004</year><month>04</month><day>27</day>...</date>

and I would like to sort by this date. So I guess I need to modify the Documentparser and generate something like a millisecond field and then sort by this, correct?

Has anyone done something like this yet?

Thanks

Michi

--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://cocoon.apache.org/lenya/
[EMAIL PROTECTED] [EMAIL PROTECTED]

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: sorting by date (XML)
Here's my two cents on this:

Either way you will need to combine the date into one field. If you use a millisecond representation, though, you will not be able to use the FLOAT sort type and you'll have to use the STRING sort (slower), because the millisecond representation is longer than FLOAT allows. So you have three options:

1) Use YYYYMMDD and sort by FLOAT type
2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing, then you can just sort by DOC type (which is the DOC ID) and save yourself the pain

Hope this helps.

Nader Henein

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 3:52 PM
To: Lucene Users List
Subject: sorting by date (XML)

> my XML files contain something like
> <date><year>2004</year><month>04</month><day>27</day>...</date>
> and I would like to sort by this date. So I guess I need to modify the Documentparser and generate something like a millisecond field and then sort by this, correct?
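Option 1 above can be sketched roughly as follows. This is only an illustration, not code from the thread: the class and method names (`DateSortKey`, `fromParts`, `collidesAsFloat`) are made up, and it assumes the year, month and day element values have already been pulled out of the XML as strings. The second method checks one caveat, assuming the FLOAT sort type parses terms as 32-bit floats: an eight-digit YYYYMMDD value lies above 2^24, where a 32-bit float can only represent even integers, so two adjacent days can end up with the same sort value.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class DateSortKey {
    private DateSortKey() {}

    /** Combines separate year, month and day values into one yyyyMMdd key. */
    static String fromParts(String year, String month, String day) {
        // LocalDate.of validates the combination (it rejects month 13, Feb 30, ...).
        LocalDate date = LocalDate.of(
                Integer.parseInt(year.trim()),
                Integer.parseInt(month.trim()),
                Integer.parseInt(day.trim()));
        // BASIC_ISO_DATE renders a LocalDate as zero-padded yyyyMMdd.
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** True if two keys become the same value once parsed as 32-bit floats. */
    static boolean collidesAsFloat(String keyA, String keyB) {
        return Float.parseFloat(keyA) == Float.parseFloat(keyB);
    }

    public static void main(String[] args) {
        System.out.println(fromParts("2004", "04", "27"));            // 20040427
        System.out.println(collidesAsFloat("20040427", "20040428")); // true
    }
}
```

If that loss of precision matters, sorting the same key as a string, or as an integer where the Lucene version in use offers such a sort type, keeps every day distinct.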
Re: sorting by date (XML)
Nader S. Henein wrote:

> 1) Use YYYYMMDD and sort by FLOAT type

OK, I guess I will take the FLOAT type then.

> 2) Use the millisecond representation and sort by STRING type
> 3) If the date you're entering here is the date of indexing, then you can just sort by DOC type (which is the DOC ID) and save yourself the pain

Unfortunately this isn't possible.

Thanks a lot for your help

Michi
Re: sorting by date (XML)
Beware of storing timestamps (DateFields, I guess) in Lucene, if you intend to use range queries (xxx TO yyy).

Otis
Re: sorting by date (XML)
Otis Gospodnetic wrote:

> Beware of storing timestamps (DateFields, I guess) in Lucene, if you intend to use range queries (xxx TO yyy).

Why? We have attributes that contain iso8601 date strings, and when indexing:

    Date date = isoConv.parse(value, new ParsePosition(0));
    String dateString = DateField.dateToString(date);
    doc.add(Field.Keyword(name, dateString));

then when searching:

    String from = DateField.timeToString(searchFromDate);
    String to = DateField.timeToString(searchToDate);
    RangeQuery rq = new RangeQuery(new Term(searchKey, from), new Term(searchKey, to), true);

Is this not correct?

best,
-Rob
Re: sorting by date (XML)
Because small time units like milliseconds will result in the range query expanding into a BooleanQuery with a large number of clauses, if you have a lot of documents with unique timestamps. Rounding the timestamp to minutes, hours, or days can drastically reduce the number of unique timestamps, hence resulting in fewer BooleanQuery clauses.

Otis
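The effect of rounding can be illustrated without Lucene at all. The sketch below only counts how many distinct values, that is how many distinct indexed terms, a set of timestamps produces at a given resolution. The class and method names (`TermCardinality`, `distinctTerms`) and the sample data are invented for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class TermCardinality {
    private TermCardinality() {}

    static final long HOUR_MS = 3_600_000L;
    static final long DAY_MS = 86_400_000L;

    /** Counts the distinct values left after rounding each timestamp down to the resolution. */
    static int distinctTerms(long[] timestampsMillis, long resolutionMillis) {
        Set<Long> terms = new HashSet<>();
        for (long t : timestampsMillis) {
            terms.add(Math.floorDiv(t, resolutionMillis));
        }
        return terms.size();
    }

    /** Hypothetical sample: 1000 documents, one every 61,001 ms, all within a single day. */
    static long[] sampleTimestamps() {
        long[] ts = new long[1000];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = i * 61_001L;
        }
        return ts;
    }

    public static void main(String[] args) {
        long[] ts = sampleTimestamps();
        System.out.println(distinctTerms(ts, 1L));      // 1000 terms at millisecond resolution
        System.out.println(distinctTerms(ts, HOUR_MS)); // 17 terms at hour resolution
        System.out.println(distinctTerms(ts, DAY_MS));  // 1 term at day resolution
    }
}
```

A range query covering that day would have to expand to every distinct term in the range, so the same 1000 documents cost 1000 clauses at millisecond resolution and a single clause at day resolution.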
Re: sorting by date (XML)
Otis Gospodnetic wrote:

> Rounding the timestamp to minutes, hours, or days can drastically reduce the number of unique timestamps, hence resulting in fewer BooleanQuery clauses.

Cool, thanks. So DateField.dateToString is the best, most efficient way, correct?
Re: sorting by date (XML)
On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote:

> Cool, thanks. So DateField.dateToString is the best, most efficient way, correct?

It all depends. But if all you care about is year, month, day, it is _not_ the most efficient. DateField converts down to milliseconds, and is what Otis was referring to.

Erik
Re: sorting by date (XML)
Erik Hatcher wrote:

> It all depends. But if all you care about is year, month, day, it is _not_ the most efficient. DateField converts down to milliseconds, and is what Otis was referring to.

Oops, I meant to write DateField.timeToString, which I use when querying. If I use DateField.dateToString when indexing but timeToString when searching, is that a bad practice? I only need month, day and year. So should I be indexing with timeToString? How would you do it if the above is still a bad practice?

Sorry for the basic questions...

best,
-Rob
Re: sorting by date (XML)
On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote:

> If I use DateField.dateToString when indexing but timeToString when searching, is that a bad practice? I only need month, day and year. So should I be indexing with timeToString? How would you do it if the above is still a bad practice?
>
> Sorry for the basic questions...

No worries. This is the type of thing that is a gotcha with dates, and is a prime candidate for a wiki page (nudge, nudge)...

You should represent dates (at index and search time) using YYYYMMDD format - it needs to be lexicographically ordered. Forget DateField and Field.Keyword(String,Date) altogether.

Some tricks are needed if you need to use QueryParser to translate mm/dd/yyyy format to how you represent it, but it is quite simple (subclass QueryParser, override getRangeQuery).

Erik
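The translation step Erik mentions might look something like the sketch below. It deliberately leaves out the QueryParser subclass itself, because the signature of getRangeQuery depends on the Lucene version in use. It shows only the conversion such an override would apply to each bound, and why the indexed form has to put the year first. The class and method names (`RangeBounds`, `toIndexForm`) are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class RangeBounds {
    private RangeBounds() {}

    // What the user types in a query, e.g. 04/27/2004.
    private static final DateTimeFormatter USER_FORM = DateTimeFormatter.ofPattern("MM/dd/uuuu");
    // What is stored in the index, e.g. 20040427.
    private static final DateTimeFormatter INDEX_FORM = DateTimeFormatter.ofPattern("uuuuMMdd");

    /** Converts an mm/dd/yyyy bound into the yyyyMMdd form used at index time. */
    static String toIndexForm(String userDate) {
        return LocalDate.parse(userDate, USER_FORM).format(INDEX_FORM);
    }

    public static void main(String[] args) {
        String earlier = "12/31/2003";
        String later = "01/01/2004";
        // Compared as raw strings, the earlier date sorts after the later one.
        System.out.println(earlier.compareTo(later) > 0);                           // true
        // In index form, string order and chronological order agree.
        System.out.println(toIndexForm(earlier).compareTo(toIndexForm(later)) < 0); // true
    }
}
```

A range query compares terms as strings, so both bounds have to go through the same conversion that was applied to the field at index time.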
Re: sorting by date (XML)
Erik Hatcher wrote:

> You should represent dates (at index and search time) using YYYYMMDD format - it needs to be lexicographically ordered. Forget DateField and Field.Keyword(String,Date) altogether.

Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

This is perfect in my case since iso8601 is in the format:

    2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I guess I should have been doing this from the beginning, don't know why I didn't...

best,
-Rob
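For ISO 8601 values like the one above, reducing a timestamp to a day-level key could be as small as this. It is a sketch only: the class and method names (`IsoToDay`, `toDayKey`) are invented, and it assumes the input has no time zone suffix, as in the sample value.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class IsoToDay {
    private IsoToDay() {}

    /** Turns an ISO 8601 local timestamp into a yyyyMMdd key, dropping the time of day. */
    static String toDayKey(String isoTimestamp) {
        return LocalDateTime.parse(isoTimestamp)          // default parser is ISO_LOCAL_DATE_TIME
                .toLocalDate()
                .format(DateTimeFormatter.BASIC_ISO_DATE); // zero-padded yyyyMMdd
    }

    public static void main(String[] args) {
        System.out.println(toDayKey("2004-04-27T01:23:33")); // 20040427
    }
}
```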
Re: sorting by date (XML)
Robert Koberg wrote:

> Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

I guess you mean

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Thanks as well

Michi