Re: sort by field length
You might be better off starting with the Lucene CheckIndex program. It walks all of the Lucene index data structures. I have done forensics by fiddling with the CheckIndex code.

On Thu, Aug 26, 2010 at 9:11 AM, Shawn Heisey wrote:
> On 5/24/2010 6:30 AM, Sascha Szott wrote:
>> Hi folks,
>>
>> is it possible to sort by field length without having to (redundantly)
>> save the length information in a separate index field? At first, I thought
>> to accomplish this using a function query, but I couldn't find an
>> appropriate one.
>
> I have a slightly different need related to this, though it may turn out
> that what Sascha wants is similar. I would like to understand my data
> better so I can improve my schema. I need to do some data mining that is
> (to my knowledge) difficult or impossible with the source database.
> Performance is irrelevant, as long as it finishes eventually. Completing
> in less than an hour would be nice.
>
> I would do this on a test system with much lower performance and memory
> (4GB) than my production servers, as a single index instead of multiple
> shards. When it finishes building, the entire test index is likely to be
> about 75GB.
>
> What I'm after is an output that would look very much like faceting, but I
> want it to show document counts associated with field length (for a simple
> string) and number of terms (for a tokenized field) instead of field value.
> Can Solr do that, and if so, what do I need to have enabled in the schema
> to get it? Would branch_3x be enough, or would trunk be better?
>
> Thanks,
> Shawn

--
Lance Norskog
goks...@gmail.com
Re: sort by field length
On 5/24/2010 6:30 AM, Sascha Szott wrote:
> Hi folks,
>
> is it possible to sort by field length without having to (redundantly)
> save the length information in a separate index field? At first, I thought
> to accomplish this using a function query, but I couldn't find an
> appropriate one.

I have a slightly different need related to this, though it may turn out that what Sascha wants is similar. I would like to understand my data better so I can improve my schema. I need to do some data mining that is (to my knowledge) difficult or impossible with the source database. Performance is irrelevant, as long as it finishes eventually. Completing in less than an hour would be nice.

I would do this on a test system with much lower performance and memory (4GB) than my production servers, as a single index instead of multiple shards. When it finishes building, the entire test index is likely to be about 75GB.

What I'm after is an output that would look very much like faceting, but I want it to show document counts associated with field length (for a simple string) and number of terms (for a tokenized field) instead of field value. Can Solr do that, and if so, what do I need to have enabled in the schema to get it? Would branch_3x be enough, or would trunk be better?

Thanks,
Shawn
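The faceting-like output described above (document counts per field length, or per number of terms) can be prototyped outside Solr before deciding what the schema needs. The following is only a rough sketch under stated assumptions: documents are plain Python dicts, the function name `length_histogram` is made up for illustration, and whitespace splitting stands in for whatever analyzer the real tokenized field uses.

```python
from collections import Counter


def length_histogram(docs, field, tokenized=False):
    """Return sorted (length, document count) pairs for one field.

    For a simple string field the length is the character count; for a
    tokenized field it is the number of whitespace-separated terms
    (an assumption: the real analyzer may tokenize differently).
    """
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field are not counted
        size = len(value.split()) if tokenized else len(value)
        counts[size] += 1
    return sorted(counts.items())


# Hypothetical sample documents, for illustration only.
docs = [
    {"id": "1", "title": "solr"},
    {"id": "2", "title": "lucene in action"},
    {"id": "3", "title": "sort"},
]

by_chars = length_histogram(docs, "title")
by_terms = length_histogram(docs, "title", tokenized=True)
```

Each pair reads like a facet entry: the first element is the length bucket and the second is the number of documents that fall into it.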
Re: sort by field length
Take a look at the scoring algorithm on the Wiki; it already takes this into account, albeit modified by how many times the term is mentioned in the field. So a field with 5 terms and one match will score higher than one with 10 terms and one match. Where it lands with 10 terms and 2 matches I leave as an exercise for the reader.

I really think you're reinventing the wheel here, and looking at the default scoring mechanism would be a good use of your time.

Best
Erick

On Wed, May 26, 2010 at 4:04 AM, Sascha Szott wrote:
> Hi Erick,
>
> It's not a real world use-case -- I just want to get a better understanding
> of the data I'm indexing (therefore, performance is negligible). In my
> current use case, you can think of the field length as an indicator of data
> quality (i.e., the longer the field content, the worse the quality is).
> Being able to sort the field data in order of decreasing length would allow
> me to investigate "exceptional" data items that are not appropriately
> handled by my curation process.
>
> Best,
> Sascha
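The "5 terms versus 10 terms" exercise in Erick's reply can be worked through numerically. This is a sketch, assuming the classic Lucene default similarity of that era, where the term-frequency factor is sqrt(freq) and the length norm is 1/sqrt(number of terms in the field). It deliberately ignores idf, boosts, coord, and the lossy single-byte encoding of norms in the index, so real scores will differ. The function names are illustrative, not Lucene API.

```python
import math


def tf(freq):
    # Term-frequency factor: grows with the square root of the match count.
    return math.sqrt(freq)


def length_norm(num_terms):
    # Length normalization: shorter fields get a larger factor.
    return 1.0 / math.sqrt(num_terms)


def field_weight(freq, num_terms):
    # Only the part of the score that depends on tf and field length.
    return tf(freq) * length_norm(num_terms)


five_terms_one_match = field_weight(1, 5)
ten_terms_one_match = field_weight(1, 10)
ten_terms_two_matches = field_weight(2, 10)
```

Under these assumptions, the 5-term field with one match outranks the 10-term field with one match, and the 10-term field with two matches comes out level with the 5-term field, since sqrt(2)/sqrt(10) equals 1/sqrt(5). This also shows why the score cannot be used to recover the field length on its own: term frequency and length are folded into one number.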
Re: sort by field length
Hi Erick,

Erick Erickson wrote:
> Ah, I may have misunderstood, I somehow got it in my mind
> you were talking about the length of each term (as in string length).
>
> But if you're looking at the field length as the count of terms, that's
> another question, sorry for the confusion...
>
> I have to ask, though, why you want to sort this way? The relevance
> calculations already factor in both term frequency and field length.
> What's the use-case for sorting by field length given the above?

It's not a real world use-case -- I just want to get a better understanding of the data I'm indexing (therefore, performance is negligible). In my current use case, you can think of the field length as an indicator of data quality (i.e., the longer the field content, the worse the quality is). Being able to sort the field data in order of decreasing length would allow me to investigate "exceptional" data items that are not appropriately handled by my curation process.

Best,
Sascha
Re: sort by field length
Ah, I may have misunderstood, I somehow got it in my mind you were talking about the length of each term (as in string length).

But if you're looking at the field length as the count of terms, that's another question, sorry for the confusion...

I have to ask, though, why you want to sort this way? The relevance calculations already factor in both term frequency and field length. What's the use-case for sorting by field length given the above?

Best
Erick

On Tue, May 25, 2010 at 3:40 AM, Sascha Szott wrote:
> Hi Erick,
>
> Good point, thank you for the clarification. I "thought" that Lucene
> internally stores the field length (e.g., in order to compute the relevance)
> and getting this information at query time requires only a simple lookup.
>
> -Sascha
Re: sort by field length
Hi Erick,

Erick Erickson wrote:
> Are you sure you want to recompute the length when sorting?
> It's the classic time/space tradeoff, but I'd suggest that when
> your index is big enough to make taking up some more space
> a problem, it's far too big to spend the cycles calculating each
> term length for sorting purposes considering you may be
> sorting all the terms in your index worst-case.

Good point, thank you for the clarification. I "thought" that Lucene internally stores the field length (e.g., in order to compute the relevance) and getting this information at query time requires only a simple lookup.

-Sascha
Re: sort by field length
Are you sure you want to recompute the length when sorting? It's the classic time/space tradeoff, but I'd suggest that when your index is big enough to make taking up some more space a problem, it's far too big to spend the cycles calculating each term length for sorting purposes, considering you may be sorting all the terms in your index worst-case.

But you could consider payloads for storing the length, although that would still be redundant...

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott wrote:
> Hi folks,
>
> is it possible to sort by field length without having to (redundantly) save
> the length information in a separate index field? At first, I thought to
> accomplish this using a function query, but I couldn't find an appropriate
> one.
>
> Thanks in advance,
> Sascha
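The "space" side of the tradeoff, storing the length redundantly at index time, can be sketched as a small preprocessing step run before documents are sent to the indexer. This is only an illustration: the helper `with_length_field` and the field name `title_length` are hypothetical, and the length here is a plain character count rather than an analyzed term count.

```python
def with_length_field(doc, field, length_field=None):
    """Return a copy of doc with an extra integer field holding len(doc[field])."""
    enriched = dict(doc)  # do not mutate the caller's document
    target = length_field or field + "_length"
    enriched[target] = len(doc.get(field, ""))
    return enriched


# Hypothetical sample documents, for illustration only.
docs = [
    {"id": "1", "title": "solr"},
    {"id": "2", "title": "lucene in action"},
    {"id": "3", "title": "sorting"},
]

enriched_docs = [with_length_field(d, "title") for d in docs]

# Local stand-in for what a sort on the stored length would return:
# longest titles first.
longest_first = sorted(enriched_docs, key=lambda d: d["title_length"], reverse=True)
ids_longest_first = [d["id"] for d in longest_first]
```

Once such a field is indexed as a sortable numeric type, a query could sort on it directly (for example `sort=title_length desc`, with the field name being an assumption), so nothing has to be recomputed at query time.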
sort by field length
Hi folks,

is it possible to sort by field length without having to (redundantly) save the length information in a separate index field? At first, I thought to accomplish this using a function query, but I couldn't find an appropriate one.

Thanks in advance,
Sascha