Re: sort by field length

2010-08-27 Thread Lance Norskog
You might be better off starting with the Lucene CheckIndex program.
It walks all of the Lucene index data structures. I have done
forensics by fiddling with the CheckIndex code.

On Thu, Aug 26, 2010 at 9:11 AM, Shawn Heisey s...@elyograg.org wrote:
  On 5/24/2010 6:30 AM, Sascha Szott wrote:

 Hi folks,

 is it possible to sort by field length without having to (redundantly)
 save the length information in a separate index field? At first, I thought
 to accomplish this using a function query, but I couldn't find an
 appropriate one.


 I have a slightly different need related to this, though it may turn out
 that what Sascha wants is similar.  I would like to understand my data
 better so I can improve my schema.  I need to do some data mining that is
 (to my knowledge) difficult or impossible with the source database.
  Performance is irrelevant, as long as it finishes eventually.  Completing
 in less than an hour would be nice.

 I would do this on a test system with much lower performance and memory
 (4GB) than my production servers, as a single index instead of multiple
 shards.  When it finishes building, the entire test index is likely to be
 about 75GB.

 What I'm after is an output that would look very much like faceting, but I
 want it to show document counts associated with field length (for a simple
 string) and number of terms (for a tokenized field) instead of field value.
  Can Solr do that, and if so, what do I need to have enabled in the schema
 to get it?  Would branch_3x be enough, or would trunk be better?

 Thanks,
 Shawn





-- 
Lance Norskog
goks...@gmail.com


Re: sort by field length

2010-08-26 Thread Shawn Heisey

 On 5/24/2010 6:30 AM, Sascha Szott wrote:

Hi folks,

is it possible to sort by field length without having to (redundantly) 
save the length information in a separate index field? At first, I 
thought to accomplish this using a function query, but I couldn't find 
an appropriate one.




I have a slightly different need related to this, though it may turn out 
that what Sascha wants is similar.  I would like to understand my data 
better so I can improve my schema.  I need to do some data mining that 
is (to my knowledge) difficult or impossible with the source database.  
Performance is irrelevant, as long as it finishes eventually.  
Completing in less than an hour would be nice.


I would do this on a test system with much lower performance and memory 
(4GB) than my production servers, as a single index instead of multiple 
shards.  When it finishes building, the entire test index is likely to 
be about 75GB.


What I'm after is an output that would look very much like faceting, but 
I want it to show document counts associated with field length (for a 
simple string) and number of terms (for a tokenized field) instead of 
field value.  Can Solr do that, and if so, what do I need to have 
enabled in the schema to get it?  Would branch_3x be enough, or would 
trunk be better?
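
A minimal sketch of the kind of report described above, assuming the documents have been exported from the index (or the source database) as plain Python dicts. It does not use any Solr or Lucene API. The field names `title` and `body` and the whitespace tokenizer are hypothetical stand-ins for the real schema and analyzer.

```python
from collections import Counter


def length_histogram(docs, field):
    """Count documents by the character length of a string field.

    Documents that lack the field are counted under length 0.
    """
    return Counter(len(doc.get(field, "")) for doc in docs)


def term_count_histogram(docs, field, tokenize=str.split):
    """Count documents by the number of tokens in a text field.

    `tokenize` defaults to whitespace splitting, which only approximates
    whatever analyzer the real schema applies.
    """
    return Counter(len(tokenize(doc.get(field, ""))) for doc in docs)


# Hypothetical sample documents; the field names are assumptions.
docs = [
    {"title": "abc", "body": "one two three"},
    {"title": "abcd", "body": "one two"},
    {"title": "xyz", "body": "four five six"},
]

# Print facet-like "length: count" lines, smallest length first.
for length, count in sorted(length_histogram(docs, "title").items()):
    print(f"{length}: {count}")
```

Because both functions consume any iterable in a single pass and keep only the counters, memory use grows with the number of distinct lengths rather than with the number of documents.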


Thanks,
Shawn



Re: sort by field length

2010-05-26 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

Ah, I may have misunderstood, I somehow got it in my mind
you were talking about the length of each term (as in string length).

But if you're looking at the field length as the count of terms, that's
another question, sorry for the confusion...

I have to ask, though, why you want to sort this way? The relevance
calculations already factor in both term frequency and field length. What's
the use-case for sorting by field length given the above?

It's not a real-world use case -- I just want to get a better 
understanding of the data I'm indexing (so performance is not a 
concern). In my current use case, you can think of the field length 
as an indicator of data quality (i.e., the longer the field content, the 
worse the quality is). Being able to sort the field data in order of 
decreasing length would allow me to investigate exceptional data items 
that are not appropriately handled by my curation process.
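
The inspection described here can be sketched outside of Solr, assuming the documents are available as Python dicts. The field name `abstract` and the helper `longest_values` are hypothetical and only illustrate the idea of listing the longest values first.

```python
import heapq


def longest_values(docs, field, n=10):
    """Return the n documents with the longest value in `field`, longest first.

    Documents that lack the field are treated as having an empty value.
    """
    return heapq.nlargest(n, docs, key=lambda doc: len(doc.get(field, "")))


# Hypothetical sample documents; the field names are assumptions.
docs = [
    {"id": 1, "abstract": "short"},
    {"id": 2, "abstract": "a much longer abstract than the rest"},
    {"id": 3, "abstract": "medium length"},
]

for doc in longest_values(docs, "abstract", n=2):
    print(doc["id"], len(doc["abstract"]))
```

`heapq.nlargest` keeps only `n` candidates in memory at a time, so it suits a one-off pass over a large export where only the outliers are of interest.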


Best,
Sascha



Re: sort by field length

2010-05-26 Thread Erick Erickson
Take a look at the scoring algorithm on the Wiki. It already takes
this into account, albeit modified by how many times the term
is mentioned in the field. So a field with 5 terms and one match
will score higher than one with 10 terms and one match. Where
it lands with 10 terms and 2 matches I leave as an exercise for
the reader.
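
For that exercise, a rough worked calculation, assuming the classic default similarity in which the term-frequency factor is sqrt(matches) and the length norm is 1/sqrt(number of terms in the field). It deliberately ignores idf, boosts, query normalization, and the lossy single-byte encoding of norms in the index, so real scores will differ. The function name is hypothetical.

```python
import math


def tf_times_length_norm(matches, field_terms):
    """Term-frequency factor times length norm: sqrt(matches) / sqrt(field_terms)."""
    return math.sqrt(matches) / math.sqrt(field_terms)


# The three cases from the message: (matches, terms in the field).
for matches, terms in [(1, 5), (1, 10), (2, 10)]:
    value = tf_times_length_norm(matches, terms)
    print(f"{terms} terms, {matches} match(es): {value:.3f}")
```

Under these assumptions the 5-term field with one match (about 0.447) beats the 10-term field with one match (about 0.316), and the 10-term field with two matches comes out level with the 5-term case, since sqrt(2)/sqrt(10) equals 1/sqrt(5).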

I really think you're reinventing the wheel here and looking at the
default scoring mechanism would be a good use of your time.

Best
Erick

On Wed, May 26, 2010 at 4:04 AM, Sascha Szott sz...@zib.de wrote:

 Hi Erick,

 Erick Erickson wrote:

 Ah, I may have misunderstood, I somehow got it in my mind
 you were talking about the length of each term (as in string length).

 But if you're looking at the field length as the count of terms, that's
 another question, sorry for the confusion...

 I have to ask, though, why you want to sort this way? The relevance
 calculations already factor in both term frequency and field length.
 What's the use-case for sorting by field length given the above?

 It's not a real-world use case -- I just want to get a better understanding
 of the data I'm indexing (so performance is not a concern). In my
 current use case, you can think of the field length as an indicator of data
 quality (i.e., the longer the field content, the worse the quality is).
 Being able to sort the field data in order of decreasing length would allow
 me to investigate exceptional data items that are not appropriately
 handled by my curation process.

 Best,
 Sascha



Re: sort by field length

2010-05-25 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.

Good point, thank you for the clarification. I thought that Lucene 
internally stores the field length (e.g., in order to compute the 
relevance) and getting this information at query time requires only a 
simple lookup.


-Sascha



Re: sort by field length

2010-05-25 Thread Erick Erickson
Ah, I may have misunderstood, I somehow got it in my mind
you were talking about the length of each term (as in string length).

But if you're looking at the field length as the count of terms, that's
another question, sorry for the confusion...

I have to ask, though, why you want to sort this way? The relevance
calculations already factor in both term frequency and field length. What's
the use-case for sorting by field length given the above?

Best
Erick

On Tue, May 25, 2010 at 3:40 AM, Sascha Szott sz...@zib.de wrote:

 Hi Erick,


 Erick Erickson wrote:

 Are you sure you want to recompute the length when sorting?
 It's the classic time/space tradeoff, but I'd suggest that when
 your index is big enough to make taking up some more space
 a problem, it's far too big to spend the cycles calculating each
 term length for sorting purposes considering you may be
 sorting all the terms in your index worst-case.

 Good point, thank you for the clarification. I thought that Lucene
 internally stores the field length (e.g., in order to compute the relevance)
 and getting this information at query time requires only a simple lookup.

 -Sascha



Re: sort by field length

2010-05-24 Thread Erick Erickson
Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.

But you could consider payloads for storing the length, although
that would still be redundant...
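
The space side of that tradeoff can be sketched as follows, assuming documents are prepared as Python dicts before being sent for indexing. The length is computed once and stored in its own field, so nothing has to be recomputed at sort time. The helper name and the `_length` and `_term_count` field suffixes are hypothetical, and the whitespace tokenizer only approximates a real analyzer.

```python
def add_length_fields(doc, field, tokenize=str.split):
    """Return a copy of `doc` with precomputed length fields for `field`.

    Adds `<field>_length` (characters) and `<field>_term_count` (tokens).
    The input document is left unmodified.
    """
    value = doc.get(field, "")
    enriched = dict(doc)
    enriched[f"{field}_length"] = len(value)
    enriched[f"{field}_term_count"] = len(tokenize(value))
    return enriched


# Hypothetical document; the field names are assumptions.
doc = {"id": 1, "title": "sort by field length"}
print(add_length_fields(doc, "title"))
```

A query could then sort on the stored value (for example with a `sort=title_length desc` parameter), provided the field is declared in the schema as a sortable numeric type.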

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott sz...@zib.de wrote:

 Hi folks,

 is it possible to sort by field length without having to (redundantly) save
 the length information in a separate index field? At first, I thought to
 accomplish this using a function query, but I couldn't find an appropriate
 one.

 Thanks in advance,
 Sascha