Synonym filter with support for phrases?

2009-04-22 Thread Dawid Weiss


Hello everyone,

I'm looking for feedback and thoughts on the following problem (it's more of a 
development problem than a user-centered one, so I hope the dev list is appropriate):


- a token stream is given,

- a set of synonyms is given, where synonyms are token sequences to be matched 
and token sequences to be added as synonyms.


An example to make things clearer (apologies for lame synonyms). Given a set of 
synonyms like this:


{new, york} -> {
    {big, apple}},

{restaurant} -> {
    {diner},
    {food, place},
    {full, belly}}

a token stream (I try to indicate positional information here):

0 | 1   | 2          | 3  | 4   | 5
a | new | restaurant | in | new | york

would be enriched to an index of (note overlapping tokens on the same 
positions):

0 | 1   | 2          | 3     | 4   | 5
a | new | restaurant | in    | new | york
  |     | diner      |       | big | apple
  |     | food       | place |     |
  |     | full       | belly |     |

The point is for phrase queries to work for synonyms and for the original text 
(of course multi-word synonyms longer than the original phrase would overlap 
with the text, but this shouldn't be much of a worry).
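
The enrichment in the table above can be sketched roughly as follows, in plain Java with no Lucene classes involved. This is only an illustration of the position layout, not the token filter mentioned in this mail; the names SynonymEnricher, add and enrich are made up for the example, and the in-memory dictionary is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: stacks synonym tokens on the positions of the matched phrase. */
final class SynonymEnricher {
    // Matched phrase -> synonym phrases to add (hypothetical in-memory dictionary).
    private final Map<List<String>, List<List<String>>> synonyms = new LinkedHashMap<>();

    void add(List<String> phrase, List<List<String>> alternatives) {
        synonyms.put(phrase, alternatives);
    }

    /** Returns, for each position, the original token followed by any synonym tokens. */
    List<List<String>> enrich(List<String> tokens) {
        List<List<String>> positions = new ArrayList<>();
        for (String t : tokens) {
            List<String> slot = new ArrayList<>();
            slot.add(t);
            positions.add(slot);
        }
        for (int start = 0; start < tokens.size(); start++) {
            for (Map.Entry<List<String>, List<List<String>>> e : synonyms.entrySet()) {
                List<String> phrase = e.getKey();
                int end = start + phrase.size();
                if (end > tokens.size() || !tokens.subList(start, end).equals(phrase)) {
                    continue;
                }
                for (List<String> alt : e.getValue()) {
                    for (int i = 0; i < alt.size(); i++) {
                        // Synonym tokens that would fall past the end of the text are dropped.
                        if (start + i < positions.size()) {
                            positions.get(start + i).add(alt.get(i));
                        }
                    }
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        SynonymEnricher enricher = new SynonymEnricher();
        enricher.add(Arrays.asList("new", "york"),
                Arrays.asList(Arrays.asList("big", "apple")));
        enricher.add(Arrays.asList("restaurant"), Arrays.asList(
                Arrays.asList("diner"),
                Arrays.asList("food", "place"),
                Arrays.asList("full", "belly")));
        List<List<String>> out =
                enricher.enrich(Arrays.asList("a", "new", "restaurant", "in", "new", "york"));
        for (int pos = 0; pos < out.size(); pos++) {
            System.out.println(pos + ": " + out.get(pos));
        }
    }
}
```

Each synonym starts on the first position of the phrase it replaces and simply runs on from there, which is why "place" and "belly" end up sharing position 3 with "in".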


In Lucene's current trunk there is a synonym filter, but its implementation 
is not really suitable for achieving the above. I wrote a token filter that 
implements the above functionality, but then I thought that synonyms must be 
something people deal with frequently, so my questions are:


a) are there any thoughts on how the above could be implemented using existing 
Lucene infrastructure (perhaps I missed something obvious),


b) if (a) is not applicable, would such a token filter constitute a useful 
addition to Lucene?


Dawid


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Synonym filter with support for phrases?

2009-04-22 Thread Earwin Burrfoot
 Hello everyone,

 I'm looking for feedback and thoughts on the following problem (it's more of
 development than user-centered problem, hope the dev list is appropriate):

 - a token stream is given,

 - a set of synonyms is given, where synonyms are token sequences to be
 matched and token sequences to be added as synonyms.

 An example to make things clearer (apologies for lame synonyms). Given a set
 of synonyms like this:

 {new, york} -> {
        {big, apple}},

 {restaurant} -> {
        {diner},
        {food, place},
        {full, belly}}

 a token stream (I try to indicate positional information here):

 0 | 1   | 2          | 3  | 4   | 5
 a | new | restaurant | in | new | york

 would be enriched to an index of (note overlapping tokens on the same
 positions):

 0 | 1   | 2          | 3     | 4   | 5
 a | new | restaurant | in    | new | york
  |     | diner      |       | big | apple
  |     | food       | place |     |
  |     | full       | belly |     |

 The point is for phrase queries to work for synonyms and for the original
 text (of course multi-word synonyms longer than the original phrase would
 overlap with the text, but this shouldn't be much of a worry).

 In the current Lucene's trunk there is a synonym filter, but its
 implementation is not really suitable for achieving the above. I wrote a
 token filter that implements the above functionality, but then I thought
 that synonyms would be something frequently dealt with so my questions are:

 a) are there any thoughts on how the above could be implemented using
 existing Lucene infrastructure (perhaps I missed something obvious),

 b) if (a) is not applicable, would such a token filter constitute a useful
 addition to Lucene?
Your synonyms will break if you try searching for phrases.
Building on your example, food place in new york will find nothing,
because 'place' and 'in' share the same position.
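
To make the failure concrete, here is a rough sketch of exact phrase matching over the enriched layout, in plain Java. It assumes the usual rule that consecutive phrase terms must sit on consecutive positions; PhraseCheck, matches and example are hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: exact phrase matching over tokens stacked on positions. */
final class PhraseCheck {
    /** True if, for some start, term i of the phrase occurs at position start + i. */
    static boolean matches(List<List<String>> positions, List<String> phrase) {
        for (int start = 0; start + phrase.size() <= positions.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                ok = positions.get(start + i).contains(phrase.get(i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** The enriched layout from the example: one list of terms per position. */
    static List<List<String>> example() {
        return Arrays.asList(
                Arrays.asList("a"),
                Arrays.asList("new"),
                Arrays.asList("restaurant", "diner", "food", "full"),
                Arrays.asList("in", "place", "belly"),
                Arrays.asList("new", "big"),
                Arrays.asList("york", "apple"));
    }

    public static void main(String[] args) {
        List<List<String>> index = example();
        System.out.println(matches(index, Arrays.asList("big", "apple")));
        System.out.println(matches(index, Arrays.asList("restaurant", "in", "new", "york")));
        System.out.println(matches(index, Arrays.asList("food", "place", "in", "new", "york")));
        System.out.println(matches(index, Arrays.asList("food", "place", "new", "york")));
    }
}
```

Under these assumptions "big apple" and "restaurant in new york" match, "food place in new york" does not, and "food place new york" does match even though "in" was skipped, so the layout produces false positives as well as misses.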

I've implemented multiword synonyms on my project, it works, but is
really hairy.
While building the index, I inject synonym group ids instead of actual
words, then I detect synonyms in queries and replace them with group
ids too. Hard part comes after that, you have to adjust
positionIncrements on syngroup id tokens, with respect to the longest
synonym contained in that group, then you have to treat overlapping
synonyms. When query rewrite is finished, I end up with a mixture of
Term/Phrase/MultiPhrase/SpanQueries :)
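
A very rough sketch of the group id idea, in plain Java. It only shows the first step (replacing known phrases with a shared id, both in documents and in queries) and leaves out the positionIncrement adjustments and the handling of overlapping synonyms described above. The class name, the SYN_n ids and the greedy longest-match strategy are assumptions made for the example, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: replaces known synonym phrases with a shared group id token. */
final class SynGroupRewriter {
    private final Map<List<String>, String> groupOf = new HashMap<>();
    private int maxLen = 1;

    void addGroup(String groupId, List<List<String>> phrases) {
        for (List<String> p : phrases) {
            groupOf.put(p, groupId);
            maxLen = Math.max(maxLen, p.size());
        }
    }

    /** Greedy longest-match rewrite; tokens outside any group pass through unchanged. */
    List<String> rewrite(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            String id = null;
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
                id = groupOf.get(tokens.subList(i, i + len));
                if (id != null) {
                    matched = len;
                    break;
                }
            }
            if (id != null) {
                out.add(id);
                i += matched;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SynGroupRewriter r = new SynGroupRewriter();
        r.addGroup("SYN_1", Arrays.asList(
                Arrays.asList("new", "york"), Arrays.asList("big", "apple")));
        r.addGroup("SYN_2", Arrays.asList(
                Arrays.asList("restaurant"), Arrays.asList("diner"),
                Arrays.asList("food", "place"), Arrays.asList("full", "belly")));
        List<String> doc = r.rewrite(Arrays.asList("a", "new", "restaurant", "in", "new", "york"));
        List<String> query = r.rewrite(Arrays.asList("food", "place", "in", "big", "apple"));
        System.out.println(doc);
        System.out.println(query);
        // The rewritten query is a contiguous run of the rewritten document.
        System.out.println(Collections.indexOfSubList(doc, query) >= 0);
    }
}
```

Because both sides collapse to the same ids, the phrase that failed against the stacked layout lines up here. The cost is everything this sketch skips.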

More correct approach is to index as-is and expand queries with actual
synonym phrases instead of ids, but then queries become really
humongous if you have any decent synonym dictionary (I have 20+ phrase
groups).

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Synonym filter with support for phrases?

2009-04-22 Thread Dawid Weiss



Your synonyms will break if you try searching for phrases.


Good point, I did write that filter, but I never actually got to searching for 
exact phrases in it (there was a very specific scenario and we used prefix 
queries which worked quite well).



Building on your example, food place in new york will find nothing,
because 'place' and 'in' share the same position.


You're right, but is it such a big problem in real life? What you're describing 
is searching for a phrase that spans both the synonym and the actual token 
sequence. What I had in mind was searching for phrases that are either just 
synonyms, or synonyms and text with an identical position layout (which is the 
case with single-word synonyms). I dare say this covers the majority of cases, 
although I have nothing to support this claim.



While building the index, I inject synonym group ids instead of actual
words, then I detect synonyms in queries and replace them with group
ids too. Hard part comes after that, you have to adjust
positionIncrements on syngroup id tokens, with respect to the longest

 [snip]

Yep, hairy ;)


More correct approach is to index as-is and expand queries with actual
synonym phrases instead of ids, but then queries become really
humongous if you have any decent synonym dictionary (I have 20+ phrase
groups).


Query expansion is not an option for me, unfortunately -- too many synonyms. It 
would be much better to do it once at indexing time and rely on this information 
from then on.


Thanks for sharing your thoughts, Кирилл.

Dawid




Re: Synonym filter with support for phrases?

2009-04-22 Thread Earwin Burrfoot
 Building on your example, food place in new york will find nothing,
 because 'place' and 'in' share the same position.
 You're right, but is it such a big problem in real life?

Well, everyone has his own requirements for the search quality. For us
it was a problem.
User enters a query, then refines it by adding new words, then
WHIZBANG! he suddenly sees 'Nothing was found', even though he knows
matching documents exist.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




New TokenStream API usage

2009-04-22 Thread Grant Ingersoll
Has anyone started using the new TokenStream/AttributeSource API?  I'm  
wondering how it is turning out in practice.


-Grant




Re: Synonym filter with support for phrases?

2009-04-22 Thread Dawid Weiss



Well, everyone has his own requirements for the search quality. For us
it was a problem.


The topic is subjective... I don't see this as a deterioration in search 
quality. Let me explain.


Your example concerns phrase queries, so somebody would have to keep adding 
terms to a phrase. My experience with open search queries (I had access to a 
larger slice of queries from Microsoft Live) is that phrases are a minority of 
all searches. In the most common case, people will look for a union of terms, 
and for these queries the solution I described would work just fine.


Another thing is that my use case for phrase synonyms is that people would 
look for exact synonym phrases, but rarely expand them to cover something 
beyond. Therefore a phrase big apple would find a synonym match (which is what 
I want), but longer phrases such as restaurants in the big apple would not 
(like you said). The big question is, of course, if somebody asking for that 
specific phrase would be interested in finding a document where this phrase does 
not occur in its exact form (but as a synonym).


We deviated off course with this conversation though. I see your point and I 
respect it.


Dawid




Re: Synonym filter with support for phrases?

2009-04-22 Thread Earwin Burrfoot
 Your example concerns phrase queries, so somebody would have to keep adding
 terms to a phrase. My experience with open search queries (I had access to a
 larger slice of queries from Microsoft Live) is that phrases are a minority
 of all searches. In the most common case, people will look for a union of
 terms, and for these queries the solution I described would work just fine.
We're a bit special. Most of our searches are ordered by date, so we
can't use relevance dependent on query term proximity, or whatever, to
boost good docs up. That has many consequences, and one of them is
that people use phrase queries a lot.

 Another thing is that my use case for phrase synonyms is that people would
 look for exact synonym phrases, but rarely expand them to cover something
 beyond.
We have a lot of synonyms that are more likely alternate forms rather
than synonyms, plus translations, plus abbrevs - using the same
engine. So guys looking for MSU CMC really want to get Московский
Государственный Университет, факультет ВМиК and his friends.

 We deviated off course with this conversation though. I see your point and I 
 respect it.
Hm? I just shared some experience. Will no longer steer away :)

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Synonym filter with support for phrases?

2009-04-22 Thread Dawid Weiss



engine. So guys looking for MSU CMC really want to get Московский
Государственный Университет, факультет ВМиК and his friends.


And? How often do they extend this particular phrase with further terms? It must 
be fun to have an index running concurrently on multi language synonyms, mixing 
the two.



We deviated off course with this conversation though. I see your point and I 
respect it.

Hm? I just shared some experience. Will no longer steer away :)


Oh, don't get me wrong, I appreciate you talking about your experiences -- the 
way you implemented synonyms is certainly interesting. I just didn't want this 
thread to turn into a discussion of what's right and wrong, because 
everything depends on the application. I'm wondering what other people did in 
similar situations, that's all.


Dawid




[jira] Assigned: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries

2009-04-22 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen reassigned LUCENE-1608:
---

Assignee: Doron Cohen

 CustomScoreQuery should support arbitrary Queries
 -

 Key: LUCENE-1608
 URL: https://issues.apache.org/jira/browse/LUCENE-1608
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Query/Scoring
Reporter: Steven Bethard
Assignee: Doron Cohen
Priority: Minor

 CustomScoreQuery only allows the secondary queries to be of type 
 ValueSourceQuery instead of allowing them to be any type of Query. As a 
 result, what you can do with CustomScoreQuery is pretty limited.
 It would be nice to extend CustomScoreQuery to allow arbitrary Query objects. 
 Most of the code should stay about the same, though a little more care would 
 need to be taken in CustomScorer.score() to use 0.0 when the sub-scorer does 
 not produce a score for the current document.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Create an index from known terms and frequencies

2009-04-22 Thread Johnny B21

Hi!
I want to create an index with Lucene, but I want to do it without having to
analyze the text,
since I already have the terms and term frequencies.
How can I create an index like that?
I am searching the Lucene source, but I can't find where the terms and
term frequencies are stored.
Please help me!
Thanks a lot,
John Boutsis

-- 
View this message in context: 
http://www.nabble.com/Create-an-index-from-known-terms-and-frequencies-tp23175684p23175684.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-04-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701626#action_12701626
 ] 

Earwin Burrfoot commented on LUCENE-1607:
-

I tried it out. Works a little bit better than simple cache (no stray interns 
must've paid off), doesn't degrade at all.
I'd like to change starter value to something 256-1024, it works way better for 
10-20 fields.

Why h  7? I understand that you're sacking collision-guilty bits, but why not 
exact amount that was used (have to store it?), or a whole byte or two?

 String.intern() faster alternative
 --

 Key: LUCENE-1607
 URL: https://issues.apache.org/jira/browse/LUCENE-1607
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Earwin Burrfoot
 Fix For: 2.9

 Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
 LUCENE-1607.patch, LUCENE-1607.patch


 By using our own interned string pool on top of default, String.intern() can 
 be greatly optimized.
 On my setup (java 6) this alternative runs ~15.8x faster for already interned 
 strings, and ~2.2x faster for 'new String(interned)'
 For java 5 and 4 speedup is lower, but still considerable.
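
The general idea of a pool layered on top of String.intern() might look roughly like the sketch below. This is not the attached patch: the class name CachedInterner, the fixed-size table and the overwrite-on-collision policy are assumptions made for illustration, and no timing claims are attached to it.

```java
/** Illustration only: a small fixed-size cache in front of String.intern(). */
final class CachedInterner {
    private final String[] cache;

    CachedInterner(int size) {
        // Round up to a power of two (capped at 2^30) so that (hash & mask) selects a slot.
        int n = 1;
        while (n < size && n < (1 << 30)) {
            n <<= 1;
        }
        cache = new String[n];
    }

    String intern(String s) {
        int slot = s.hashCode() & (cache.length - 1);
        String cached = cache[slot];
        if (cached != null && (cached == s || cached.equals(s))) {
            return cached;            // hit: no call into String.intern()
        }
        String interned = s.intern(); // miss: fall back to the JVM pool
        cache[slot] = interned;       // a colliding entry is simply overwritten
        return interned;
    }

    public static void main(String[] args) {
        CachedInterner interner = new CachedInterner(256);
        String a = interner.intern(new String("contents"));
        String b = interner.intern(new String("contents"));
        System.out.println(a == b); // both calls return the same instance
    }
}
```

Only strings that came back from String.intern() are ever stored, so a value returned from the cache is still the canonical JVM instance.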




Question around LOM | Lucene Ontology

2009-04-22 Thread Rangan Gupta
Hi

I am a newbie to Lucene, hence this question: how do I implement ontology-based 
search using Lucene (LOM)?
Pointers to any useful books, white papers, etc. that cover this in detail 
would be much appreciated.

Thanks
R



[jira] Commented: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries

2009-04-22 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701683#action_12701683
 ] 

Doron Cohen commented on LUCENE-1608:
-

I thought I had written a class exactly for this purpose but I was wrong - my 
class was different in that it had an actual value source, just that it was 
sparse - values for quite many docs were missing. It is similar in a way, but 
different since here the input is a query.

But I did promise... so I wrote a quick wrapper for a query to create a value 
source.
That value source can be used to create a value source query.

Although the patch coming soon is tested and all, I am not planning to 
commit it, because it is not clean. 

I would like to reorganize this package to take better care of this request and 
other related issues (like LUCENE-850), and to make it worthwhile for Solr to 
move to this package (last time I checked it wasn't). But this is a different 
issue...




[jira] Updated: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries

2009-04-22 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-1608:


Attachment: LUCENE-1608.patch

Patch for passing arbitrary queries to custom-score-query.
Not intended for committing.
See TestQueryWrapperValueSource for usage of this wrapper.
- Doron




Re: Spatial package plans

2009-04-22 Thread Wouter Heijke
The amount of replies and the state of the code make me think making my
own distance filter using a real GIS solution like geotools is the way to
go.
I wonder anyway if GIS code should be in any Lucene package..

Wouter

 Yeah it's hard coded to use miles, 5 years in the US gets to you..
 But the functionality doesn't change radius is double so you just need to
 convert km to miles
 for the DistanceQueryBuilder and just convert back from miles to km to
 display.
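
The conversion suggested above amounts to something like the following sketch. DistanceUnits and its methods are hypothetical helpers, not part of the spatial contrib code; the only fixed fact is the definition of the international mile as 1.609344 km.

```java
/** Illustration only: unit conversion around a miles-based distance API. */
final class DistanceUnits {
    // Exact by definition: 1 international mile = 1.609344 km.
    static final double KM_PER_MILE = 1.609344;

    static double kmToMiles(double km) {
        return km / KM_PER_MILE;
    }

    static double milesToKm(double miles) {
        return miles * KM_PER_MILE;
    }

    public static void main(String[] args) {
        double radiusKm = 10.0;
        double radiusMiles = kmToMiles(radiusKm); // what a miles-only filter would be given
        System.out.printf("%.3f km = %.3f miles%n", radiusKm, radiusMiles);
        double hitMiles = 3.2; // a distance as reported by the miles-based code
        System.out.printf("%.3f miles = %.3f km%n", hitMiles, milesToKm(hitMiles));
    }
}
```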

 On Mon, Apr 20, 2009 at 8:14 AM, Wouter Heijke whei...@xs4all.nl wrote:


 I'm working on local search functionality and am about to use the
 spatial
 code in contrib.
 I managed to have a proof of concept running using
 LatLongDistanceFilter.
 The only problem I have with this filter is that it is hardcoded to use
 Miles!

 Basically my question is what are the plans for the spatial code? Is it
 going to stay the way it is?

 Wouter







Re: Spatial package plans

2009-04-22 Thread patrick o'leary
Free world, help yourself :-)

On Wed, Apr 22, 2009 at 6:39 PM, Wouter Heijke whei...@xs4all.nl wrote:

 The amount of replies and the state of the code make me think making my
 own distance filter using a real GIS solution like geotools is the way to
 go.
 I wonder anyway if GIS code should be in any Lucene package..

 Wouter

  Yeah it's hard coded to use miles, 5 years in the US gets to you..
  But the functionality doesn't change radius is double so you just need to
  convert km to miles
  for the DistanceQueryBuilder and just convert back from miles to km to
  display.
 
  On Mon, Apr 20, 2009 at 8:14 AM, Wouter Heijke whei...@xs4all.nl
 wrote:
 
 
  I'm working on local search functionality and am about to use the
  spatial
  code in contrib.
  I managed to have a proof of concept running using
  LatLongDistanceFilter.
  The only problem I have with this filter is that it is hardcoded to use
  Miles!
 
  Basically my question is what are the plans for the spatial code? Is it
  going to stay the way it is?
 
  Wouter
 







[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present

2009-04-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701726#action_12701726
 ] 

Jason Rutherglen commented on LUCENE-1252:
--

When flexible indexing goes in, users will be able to put data
into the index that allow scorers to calculate a cheap score,
collect, then go through and calculate a presumably more
expensive score. 

Would it be good to implement this patch with this sort of more
general framework in mind? 

It seems like this could affect the HitCollector API as we'd
want a more generic way of representing scores than the
primitive float we assume now. Aren't we rewriting the
HitCollector APIs right now? Can we implement this change now?

 Avoid using positions when not all required terms are present
 -

 Key: LUCENE-1252
 URL: https://issues.apache.org/jira/browse/LUCENE-1252
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Paul Elschot
Priority: Minor

 In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, 
 currently next() and skipTo() will use position information even when other 
 parts of the query cannot match because some required terms are not present.
 This could be avoided by adding some methods to Scorer that relax the 
 postcondition of next() and skipTo() to something like all required terms 
 are present, but no position info was checked yet, and implementing these 
 methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and 
 SpanScorer/NearSpans.




Re: Future projects

2009-04-22 Thread Jason Rutherglen
Hey Michael,

You're in San Jose?  Feel free to come by one of these days on our pizza
days.

Also, can you post what you have of LUCENE-1231?  I got a lot more familiar
with IndexWriter internals with LUCENE-1516 and could take a good whack at
getting LUCENE-1231 integrated.

Cheers!

Jason

On Sun, Apr 12, 2009 at 3:28 PM, Michael Busch busch...@gmail.com wrote:

  On 4/4/09 4:42 AM, Michael McCandless wrote:

  As I recently mentioned on 1231 I'm looking into changing the Document and
 Field APIs. I've some rough prototype. I think we should also try to get it
 in before 2.9? On the other hand I don't want to block the 2.9 release with
 too much stuff.


  That'd be great -- I'd say post the rough prototype and let's iterate?




 OK. I'll attach it as a new Jira issue. It's not really integrated into
 anything (like DocumentsWriter), but I wrote some demo classes to show how I
 intend to use it.

 -Michael



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701751#action_12701751
 ] 

Jason Rutherglen commented on LUCENE-831:
-

I'm trying to figure out how to integrate Bobo faceting field
caches with this patch, I applied the patch, browsed the
ValueSource API and yeah, it's not what I expected. we can
return arrays, objects, or anything and your grandmother not
Grandma! But yeah we need to somehow support probably plain Java
objects rather than every primitive derivative? 

(In reference to Mark's second-to-last post) Bobo efficiently
calculates facets for multiple values per doc, which is
the same thing as multi value faceting? 

 by back compat with deletes, norms though.

Are norms and deletes implemented? These would just be byte
arrays in the current approach? If not how would they be
represented? It seems like for deleted docs we'd want the
BitVector returned from a ValueSource.get type of method?

M.M.: Updatability is tricky... ValueSource would maybe need a
startChanges() API, which would copy the array (copy-on-write)
if it's not already private

Hmm... Does this mean we'd replace the current IndexReader
method of performing updates on norms and deletes with this more
generic update mechanism?

It would be cool to get CSF going?

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to new API.




Re: Spatial package plans

2009-04-22 Thread Ryan McKinley
Patrick's original version of locallucene included geotools -- to make 
it Apache license compatible we took that out and made the distance 
calculations pluggable.


The hardcoded miles part should be changeable -- feel free to post any 
patches and we can make it a better solution.


best
ryan


On Apr 22, 2009, at 6:39 PM, Wouter Heijke wrote:

 The amount of replies and the state of the code make me think making my
 own distance filter using a real GIS solution like geotools is the way to
 go.
 I wonder anyway if GIS code should be in any Lucene package..

 Wouter

  Yeah it's hard coded to use miles, 5 years in the US gets to you..
  But the functionality doesn't change radius is double so you just need to
  convert km to miles
  for the DistanceQueryBuilder and just convert back from miles to km to
  display.

  On Mon, Apr 20, 2009 at 8:14 AM, Wouter Heijke whei...@xs4all.nl wrote:

   I'm working on local search functionality and am about to use the
   spatial
   code in contrib.
   I managed to have a proof of concept running using
   LatLongDistanceFilter.
   The only problem I have with this filter is that it is hardcoded to use
   Miles!

   Basically my question is what are the plans for the spatial code? Is it
   going to stay the way it is?

   Wouter







[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701768#action_12701768
 ] 

Jason Rutherglen commented on LUCENE-1539:
--

{quote}
I think it should mean delete XXX% of the remaining
undeleted docs?
{quote}

Yeah? Ok. So the deleteDocsByPercent method needs to somehow
take into account whether it's deleted before by adjusting the
doc nums it's deleting?

{quote}
I don't think we can relax that. This (single transaction
(writer) open at once) is a core assumption in Lucene.
{quote}

True, but that doesn't mean we have to stick with it, especially
internally. Hopefully we can move to a more componentized model
where someone could change this if they wanted. Perhaps in the
flexible indexing revamp?





 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
 LUCENE-1539.patch, sortBench2.py, sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




Greetings and questions about patches

2009-04-22 Thread Erick Erickson
Hi all:

I've been participating in the user list for some time, and I'd like
to start helping maintain/enhance the code. So I thought I'd start
with something small, mostly to get the process down. Unit tests
sure fit the bill it seems to me, less chance of introducing errors
through ignorance but a fine way to extend *my* understanding
of Lucene.

I managed to check out the code and run the unit tests, which
was amazingly easy. I even managed to get the project into
IntelliJ and connect the codestyle.xml file. Kudos for whoever
set up the checkout/build process, I was dreading spending
days setting this up, fortunately I didn't have to.

So I, with Chris's help, found the code coverage report and
chose something pretty straightforward to test, BitUtil since it
was nice and self-contained. As I said, I'm looking at understanding
the process rather than adding much value the first time.

Alas, even something as simple as BitUtil generates questions
that I'm asking mostly to understand what approach the veterans
prefer. I'll argue with y'all next year sometime <G>.

So, according to the coverage report, there are two methods that
are never executed by the unit tests (actually 4, 2 that operate on
ints and 2 that operate on longs), isPowerOfTwo and
nextHighestPowerOfTwo. nextHighestPowerOfTwo is especially
clever, had to get out a paper and pencil to really understand it.

Issues:
1 none of these methods is ever called. I commented them out
  and ran all the unit tests and all is well. Additionally, commenting
  out one of the other methods produces compile-time errors so I'm
  fairly sure I didn't do something completely stupid that just *looked*
  like it was OK. I grepped recursively and they're nowhere in the
  *.java files.
  1a What's the consensus about unused code? Take it out (my
     preference) along with leaving a comment on where it can
     be found (since it *is* clever code)? Leave it in because someone
     found some pretty neat algorithms that we may need sometime?
  1b I'm not entirely sure about the contrib area, but the contrib jars
     are all new so I assume ant clean test compiles them as well.

2 I don't agree with the behavior of nextHighestPowerOfTwo. Should
  I make changes if we decide to keep it?
  2a Why should it return the parameter passed in when it happens to be
     a perfect power of two? e.g. this passes:
       assertEquals(BitUtil.nextHighestPowerOfTwo(128L), 128);
     I'd expect this to actually return 256, given the name.
  2b Why should it ever return 0? There's no power of two that is
     zero. e.g. this passes:
       assertEquals(BitUtil.nextHighestPowerOfTwo(-1), 0);
     as does this: assertEquals(BitUtil.nextHighestPowerOfTwo(0), 0).
     *Assuming* that someone wants to use this sometime to, say, size
     an array they'd have to test against a return of 0.
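
For reference, the behaviour in 2a and 2b falls out of the common bit-smearing way of rounding up to a power of two. The sketch below is written from memory of that general trick and is not copied from BitUtil; the class name PowerOfTwo is made up for the example.

```java
/** Illustration only: the usual bit-smearing way to round up to a power of two. */
final class PowerOfTwo {
    static int nextHighestPowerOfTwo(int v) {
        v--;            // so that exact powers of two map to themselves
        v |= v >> 1;    // copy the highest set bit into every lower bit...
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v++;            // ...then add one, leaving a single set bit
        return v;
    }

    public static void main(String[] args) {
        int[] inputs = {-1, 0, 1, 100, 128, 129};
        for (int in : inputs) {
            System.out.println(in + " -> " + nextHighestPowerOfTwo(in));
        }
    }
}
```

With this formulation 128 stays 128 because of the initial decrement, and both 0 and -1 come out as 0 because every bit is set before the final increment wraps around.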


I'm fully aware that these are trivial issues in the grand scheme of things,
and I *really* don't want to waste much time hashing them over. I'll provide
a patch either way and go on to something slightly more complicated for
my next trick.

Best
Erick


Re: Greetings and questions about patches

2009-04-22 Thread Chris Miller

Issues:
1 none of these methods is ever called.


Note that Yonik's suggested patch for LUCENE-1607 contains the following 
code:


+  public SimpleStringInterner(int sz) {
+    cache = new String[BitUtil.nextHighestPowerOfTwo(sz)];
+  }

...so the int flavour of nextHighestPowerOfTwo() might be in use shortly! :-)




