Re: sort by field length

2010-08-27 Thread Lance Norskog
You might be better off starting with the Lucene CheckIndex program.
It walks all of the Lucene index data structures. I have done
forensics by fiddling with the CheckIndex code.

On Thu, Aug 26, 2010 at 9:11 AM, Shawn Heisey  wrote:
>  On 5/24/2010 6:30 AM, Sascha Szott wrote:
>>
>> Hi folks,
>>
>> is it possible to sort by field length without having to (redundantly)
>> save the length information in a separate index field? At first, I thought
>> to accomplish this using a function query, but I couldn't find an
>> appropriate one.
>>
>
> I have a slightly different need related to this, though it may turn out
> that what Sascha wants is similar.  I would like to understand my data
> better so I can improve my schema.  I need to do some data mining that is
> (to my knowledge) difficult or impossible with the source database.
>  Performance is irrelevant, as long as it finishes eventually.  Completing
> in less than an hour would be nice.
>
> I would do this on a test system with much lower performance and memory
> (4GB) than my production servers, as a single index instead of multiple
> shards.  When it finishes building, the entire test index is likely to be
> about 75GB.
>
> What I'm after is an output that would look very much like faceting, but I
> want it to show document counts associated with field length (for a simple
> string) and number of terms (for a tokenized field) instead of field value.
>  Can Solr do that, and if so, what do I need to have enabled in the schema
> to get it?  Would branch_3x be enough, or would trunk be better?
>
> Thanks,
> Shawn
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: sort by field length

2010-08-26 Thread Shawn Heisey

 On 5/24/2010 6:30 AM, Sascha Szott wrote:

> Hi folks,
>
> is it possible to sort by field length without having to (redundantly)
> save the length information in a separate index field? At first, I
> thought to accomplish this using a function query, but I couldn't find
> an appropriate one.




I have a slightly different need related to this, though it may turn out 
that what Sascha wants is similar.  I would like to understand my data 
better so I can improve my schema.  I need to do some data mining that 
is (to my knowledge) difficult or impossible with the source database.  
Performance is irrelevant, as long as it finishes eventually.  
Completing in less than an hour would be nice.


I would do this on a test system with much lower performance and memory 
(4GB) than my production servers, as a single index instead of multiple 
shards.  When it finishes building, the entire test index is likely to 
be about 75GB.


What I'm after is an output that would look very much like faceting, but 
I want it to show document counts associated with field length (for a 
simple string) and number of terms (for a tokenized field) instead of 
field value.  Can Solr do that, and if so, what do I need to have 
enabled in the schema to get it?  Would branch_3x be enough, or would 
trunk be better?
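
A minimal offline sketch of the facet-like output described above, assuming the documents can be exported from the source as plain dictionaries. The function name, the sample documents, and the `title` field are hypothetical, and whitespace splitting only stands in for whatever analyzer the tokenized field really uses. It is not a Solr feature.

```python
from collections import Counter

def length_histogram(docs, field, tokenized=False):
    """Count documents per field length, like a facet on length instead of value."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # skip documents that lack the field
        # Whitespace splitting is only a stand-in for the real analyzer.
        length = len(value.split()) if tokenized else len(value)
        counts[length] += 1
    return counts

# Hypothetical sample documents; real data would come from the source export.
docs = [
    {"id": "1", "title": "solr"},
    {"id": "2", "title": "sort by field length"},
    {"id": "3", "title": "lucene index"},
    {"id": "4"},
]

for length, count in sorted(length_histogram(docs, "title", tokenized=True).items()):
    print(f"{length} terms: {count} docs")
```

Passing `tokenized=False` counts characters instead, which matches the simple string case.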


Thanks,
Shawn



Re: sort by field length

2010-05-26 Thread Erick Erickson
Take a look at the scoring algorithm on the Wiki; it already takes
field length into account, albeit modified by how many times the term
is mentioned in the field. So a field with 5 terms and one match
will score higher than one with 10 terms and one match. Where
it lands with 10 terms and 2 matches I leave as an exercise for
the reader.
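
As a rough illustration of the comparison above, here is a sketch that assumes the classic Lucene default similarity, where the term-frequency factor is sqrt(freq) and the length norm is 1/sqrt(number of terms). It ignores idf, boosts, coord, query normalization, and the lossy one-byte encoding of norms, so real scores will differ. The function names are made up for this example.

```python
import math

def tf(freq):
    # Term-frequency factor of the classic similarity: sqrt(freq)
    return math.sqrt(freq)

def length_norm(num_terms):
    # Field-length factor: 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

def field_factor(num_terms, matches):
    # Only the part of the score that depends on field length and matches
    return tf(matches) * length_norm(num_terms)

for num_terms, matches in [(5, 1), (10, 1), (10, 2)]:
    print(num_terms, matches, round(field_factor(num_terms, matches), 3))
```

Under these simplifications, 10 terms with 2 matches comes out level with 5 terms and one match, and both beat 10 terms with one match.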

I really think you're reinventing the wheel here and looking at the
default scoring mechanism would be a good use of your time.

Best
Erick

On Wed, May 26, 2010 at 4:04 AM, Sascha Szott  wrote:

> Hi Erick,
>
> Erick Erickson wrote:
>
>> Ah, I may have misunderstood, I somehow got it in my mind
>> you were talking about the length of each term (as in string length).
>>
>> But if you're looking at the field length as the count of terms, that's
>> another question, sorry for the confusion...
>>
>> I have to ask, though, why you want to sort this way? The relevance
>> calculations already factor in both term frequency and field length.
>> What's the use-case for sorting by field length given the above?
>>
> It's not a real-world use case -- I just want to get a better understanding
> of the data I'm indexing (therefore, performance is negligible). In my
> current use case, you can think of the field length as an indicator of data
> quality (i.e., the longer the field content, the worse the quality is).
> Being able to sort the field data in order of decreasing length would allow
> me to investigate "exceptional" data items that are not appropriately
> handled by my curation process.
>
> Best,
> Sascha
>
>
>
>> Best
>> Erick
>>
>> On Tue, May 25, 2010 at 3:40 AM, Sascha Szott  wrote:
>>
>>> Hi Erick,
>>>
>>> Erick Erickson wrote:
>>>
>>>> Are you sure you want to recompute the length when sorting?
>>>> It's the classic time/space tradeoff, but I'd suggest that when
>>>> your index is big enough to make taking up some more space
>>>> a problem, it's far too big to spend the cycles calculating each
>>>> term length for sorting purposes considering you may be
>>>> sorting all the terms in your index worst-case.
>>>>
>>> Good point, thank you for the clarification. I "thought" that Lucene
>>> internally stores the field length (e.g., in order to compute the
>>> relevance) and getting this information at query time requires only a
>>> simple lookup.
>>>
>>> -Sascha
>>>
>>>> But you could consider payloads for storing the length, although
>>>> that would still be redundant...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Mon, May 24, 2010 at 8:30 AM, Sascha Szott wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> is it possible to sort by field length without having to (redundantly)
>>>>> save the length information in a separate index field? At first, I
>>>>> thought to accomplish this using a function query, but I couldn't find
>>>>> an appropriate one.
>>>>>
>>>>> Thanks in advance,
>>>>> Sascha


Re: sort by field length

2010-05-26 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

> Ah, I may have misunderstood, I somehow got it in my mind
> you were talking about the length of each term (as in string length).
>
> But if you're looking at the field length as the count of terms, that's
> another question, sorry for the confusion...
>
> I have to ask, though, why you want to sort this way? The relevance
> calculations already factor in both term frequency and field length. What's
> the use-case for sorting by field length given the above?

It's not a real-world use case -- I just want to get a better
understanding of the data I'm indexing (therefore, performance is
negligible). In my current use case, you can think of the field length
as an indicator of data quality (i.e., the longer the field content, the
worse the quality is). Being able to sort the field data in order of
decreasing length would allow me to investigate "exceptional" data items
that are not appropriately handled by my curation process.
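
For a one-off inspection like this, the ordering can also be done outside the index. The following is only a sketch, assuming the records can be exported as dictionaries; `longest_values`, the sample records, and the `abstract` field are hypothetical names chosen for illustration.

```python
import heapq

def longest_values(docs, field, n=10):
    """Return the n documents with the longest value in `field`, longest first."""
    with_field = (d for d in docs if d.get(field) is not None)
    return heapq.nlargest(n, with_field, key=lambda d: len(d[field]))

# Hypothetical sample records standing in for the curated data.
docs = [
    {"id": "a", "abstract": "short"},
    {"id": "b", "abstract": "a much longer abstract that may need curation"},
    {"id": "c", "abstract": "medium length text"},
    {"id": "d"},
]

for doc in longest_values(docs, "abstract", n=2):
    print(doc["id"], len(doc["abstract"]))
```

`heapq.nlargest` keeps only `n` candidates in memory at a time, which helps when the export is large and only the extreme cases are of interest.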


Best,
Sascha



> Best
> Erick
>
> On Tue, May 25, 2010 at 3:40 AM, Sascha Szott wrote:
>
>> Hi Erick,
>>
>> Erick Erickson wrote:
>>
>>> Are you sure you want to recompute the length when sorting?
>>> It's the classic time/space tradeoff, but I'd suggest that when
>>> your index is big enough to make taking up some more space
>>> a problem, it's far too big to spend the cycles calculating each
>>> term length for sorting purposes considering you may be
>>> sorting all the terms in your index worst-case.
>>
>> Good point, thank you for the clarification. I "thought" that Lucene
>> internally stores the field length (e.g., in order to compute the relevance)
>> and getting this information at query time requires only a simple lookup.
>>
>> -Sascha
>>
>>> But you could consider payloads for storing the length, although
>>> that would still be redundant...
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, May 24, 2010 at 8:30 AM, Sascha Szott wrote:
>>>
>>>> Hi folks,
>>>>
>>>> is it possible to sort by field length without having to (redundantly)
>>>> save the length information in a separate index field? At first, I thought
>>>> to accomplish this using a function query, but I couldn't find an
>>>> appropriate one.
>>>>
>>>> Thanks in advance,
>>>> Sascha











Re: sort by field length

2010-05-25 Thread Erick Erickson
Ah, I may have misunderstood, I somehow got it in my mind
you were talking about the length of each term (as in string length).

But if you're looking at the field length as the count of terms, that's
another question, sorry for the confusion...

I have to ask, though, why you want to sort this way? The relevance
calculations already factor in both term frequency and field length. What's
the use-case for sorting by field length given the above?

Best
Erick

On Tue, May 25, 2010 at 3:40 AM, Sascha Szott  wrote:

> Hi Erick,
>
>
> Erick Erickson wrote:
>
>> Are you sure you want to recompute the length when sorting?
>> It's the classic time/space tradeoff, but I'd suggest that when
>> your index is big enough to make taking up some more space
>> a problem, it's far too big to spend the cycles calculating each
>> term length for sorting purposes considering you may be
>> sorting all the terms in your index worst-case.
>>
> Good point, thank you for the clarification. I "thought" that Lucene
> internally stores the field length (e.g., in order to compute the relevance)
> and getting this information at query time requires only a simple lookup.
>
> -Sascha
>
>
>
>> But you could consider payloads for storing the length, although
>> that would still be redundant...
>>
>> Best
>> Erick
>>
>> On Mon, May 24, 2010 at 8:30 AM, Sascha Szott  wrote:
>>
>>> Hi folks,
>>>
>>> is it possible to sort by field length without having to (redundantly)
>>> save the length information in a separate index field? At first, I thought
>>> to accomplish this using a function query, but I couldn't find an
>>> appropriate one.
>>>
>>> Thanks in advance,
>>> Sascha
>>>
>


Re: sort by field length

2010-05-25 Thread Sascha Szott

Hi Erick,

Erick Erickson wrote:

> Are you sure you want to recompute the length when sorting?
> It's the classic time/space tradeoff, but I'd suggest that when
> your index is big enough to make taking up some more space
> a problem, it's far too big to spend the cycles calculating each
> term length for sorting purposes considering you may be
> sorting all the terms in your index worst-case.

Good point, thank you for the clarification. I "thought" that Lucene
internally stores the field length (e.g., in order to compute the
relevance) and getting this information at query time requires only a
simple lookup.


-Sascha



> But you could consider payloads for storing the length, although
> that would still be redundant...
>
> Best
> Erick
>
> On Mon, May 24, 2010 at 8:30 AM, Sascha Szott wrote:
>
>> Hi folks,
>>
>> is it possible to sort by field length without having to (redundantly) save
>> the length information in a separate index field? At first, I thought to
>> accomplish this using a function query, but I couldn't find an appropriate
>> one.
>>
>> Thanks in advance,
>> Sascha






Re: sort by field length

2010-05-24 Thread Erick Erickson
Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes considering you may be
sorting all the terms in your index worst-case.

But you could consider payloads for storing the length, although
that would still be redundant...
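
To make the space side of that tradeoff concrete, here is a sketch of computing the length once, before the document is sent to the indexer, so that nothing has to be recomputed when sorting. The helper name and the `title_length` field are hypothetical; the schema would need a matching sortable integer field.

```python
def with_length_field(doc, field, length_field=None):
    """Return a copy of doc with an extra integer field holding len(doc[field])."""
    length_field = length_field or f"{field}_length"
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[length_field] = len(doc.get(field, ""))
    return enriched

# Hypothetical document as it might look just before indexing.
doc = {"id": "1", "title": "sort by field length"}
print(with_length_field(doc, "title"))
```

With such a field indexed, an ordinary `sort=title_length desc` request parameter would give the ordering, at the cost of one extra integer per document.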

Best
Erick

On Mon, May 24, 2010 at 8:30 AM, Sascha Szott  wrote:

> Hi folks,
>
> is it possible to sort by field length without having to (redundantly) save
> the length information in a separate index field? At first, I thought to
> accomplish this using a function query, but I couldn't find an appropriate
> one.
>
> Thanks in advance,
> Sascha
>
>


sort by field length

2010-05-24 Thread Sascha Szott

Hi folks,

is it possible to sort by field length without having to (redundantly) 
save the length information in a separate index field? At first, I
thought to accomplish this using a function query, but I couldn't find 
an appropriate one.


Thanks in advance,
Sascha