Synonym filter with support for phrases?
Hello everyone,

I'm looking for feedback and thoughts on the following problem (it's more of a development problem than a user-centered one; I hope the dev list is appropriate):

- a token stream is given,
- a set of synonyms is given, where each synonym rule consists of a token sequence to be matched and the token sequences to be added as its synonyms.

An example to make things clearer (apologies for the lame synonyms). Given a set of synonyms like this:

{new, york} - { {big, apple} }
{restaurant} - { {diner}, {food, place}, {full, belly} }

a token stream (I try to indicate positional information here):

0 | 1   | 2          | 3  | 4   | 5
a | new | restaurant | in | new | york

would be enriched to an index of (note the overlapping tokens on the same positions):

0 | 1   | 2          | 3     | 4   | 5
a | new | restaurant | in    | new | york
  |     | diner      |       | big | apple
  |     | food       | place |     |
  |     | full       | belly |     |

The point is for phrase queries to work both for synonyms and for the original text (of course, multi-word synonyms longer than the original phrase would overlap with the text that follows, but this shouldn't be much of a worry).

Lucene's current trunk has a synonym filter, but its implementation is not really suitable for achieving the above. I wrote a token filter that implements this functionality, but then I thought that synonyms are something people deal with frequently, so my questions are:

a) are there any thoughts on how the above could be implemented using existing Lucene infrastructure (perhaps I missed something obvious),
b) if (a) is not applicable, would such a token filter constitute a useful addition to Lucene?

Dawid

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
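The enrichment Dawid describes can be sketched in plain Java. This is only an illustration under stated assumptions, not his filter and not a Lucene TokenFilter: the class name PhraseSynonymSketch, the expand method and the "term@position" output format are all made up for this example. Each synonym token is simply stacked on the position of the matched token it lines up with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: overlays multi-word synonyms on the positions of the
// original tokens, as in the table from the original message.
class PhraseSynonymSketch {

    // Returns entries of the form "term@position".
    static List<String> expand(List<String> tokens,
                               Map<List<String>, List<List<String>>> synonyms) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            // The original token always keeps its own position.
            out.add(tokens.get(i) + "@" + i);
            for (Map.Entry<List<String>, List<List<String>>> rule : synonyms.entrySet()) {
                List<String> match = rule.getKey();
                int end = i + match.size();
                // Does the sequence to be matched start at position i?
                if (end <= tokens.size() && tokens.subList(i, end).equals(match)) {
                    for (List<String> alternative : rule.getValue()) {
                        // Stack the j-th synonym token on position i + j.
                        for (int j = 0; j < alternative.size(); j++) {
                            out.add(alternative.get(j) + "@" + (i + j));
                        }
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "new", "restaurant", "in", "new", "york");
        Map<List<String>, List<List<String>>> synonyms = Map.of(
            List.of("new", "york"), List.of(List.of("big", "apple")),
            List.of("restaurant"), List.of(List.of("diner"),
                                           List.of("food", "place"),
                                           List.of("full", "belly")));
        System.out.println(expand(tokens, synonyms));
    }
}
```

Note that a synonym longer than the phrase it replaces spills onto the positions of the following text ("place" and "belly" land on the position of "in"), which is exactly the overlap mentioned in the message.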
Re: Synonym filter with support for phrases?
Your synonyms will break if you try searching for phrases. Building on your example, "food place in new york" will find nothing, because 'place' and 'in' share the same position.

I've implemented multi-word synonyms on my project. It works, but it is really hairy. While building the index, I inject synonym group ids instead of the actual words; then I detect synonyms in queries and replace them with group ids too. The hard part comes after that: you have to adjust positionIncrements on the syngroup id tokens with respect to the longest synonym contained in that group, and then you have to handle overlapping synonyms. When the query rewrite is finished, I end up with a mixture of Term/Phrase/MultiPhrase/SpanQueries :)

A more correct approach is to index as-is and expand queries with the actual synonym phrases instead of ids, but then queries become really humongous if you have any decent synonym dictionary (I have 20+ phrase groups).

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
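To make the phrase problem concrete, here is a small sketch that checks exact phrases against the stacked positions from Dawid's example. It is an assumption-laden toy, not Lucene's PhraseQuery: the class PhraseOverlapSketch, its INDEX table and the matchesPhrase helper are invented for illustration, and a phrase is taken to match only when its terms sit on consecutive positions.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: exact-phrase matching over the enriched example index.
class PhraseOverlapSketch {

    // Terms stored at each position (0..5) after synonym enrichment.
    static final List<Set<String>> INDEX = List.of(
        Set.of("a"),
        Set.of("new"),
        Set.of("restaurant", "diner", "food", "full"),
        Set.of("in", "place", "belly"),
        Set.of("new", "big"),
        Set.of("york", "apple"));

    // True if the terms occur on consecutive positions somewhere in INDEX.
    static boolean matchesPhrase(String... phrase) {
        for (int start = 0; start + phrase.length <= INDEX.size(); start++) {
            boolean ok = true;
            for (int k = 0; k < phrase.length && ok; k++) {
                ok = INDEX.get(start + k).contains(phrase[k]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase("big", "apple"));
        System.out.println(matchesPhrase("diner", "in", "big", "apple"));
        System.out.println(matchesPhrase("food", "place", "in", "new", "york"));
    }
}
```

In this toy index "big apple" and "diner in big apple" match, because single-word synonyms keep the original position layout, while "food place in new york" does not: 'place' and 'in' both sit on position 3, so 'in' is never found one step after 'place'.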
Re: Synonym filter with support for phrases?
> Your synonyms will break if you try searching for phrases.

Good point. I did write that filter, but I never actually got to searching for exact phrases with it (there was a very specific scenario and we used prefix queries, which worked quite well).

> Building on your example, food place in new york will find nothing, because 'place' and 'in' share the same position.

You're right, but is it such a big problem in real life? What you're describing is searching for a phrase that spans both the synonym and the actual token sequence. What I had in mind was searching for phrases that are either just synonyms, or synonyms and text with an identical position layout (which is the case with single-word synonyms). I dare say this covers the majority of cases, although I have nothing to support this claim.

> While building the index, I inject synonym group ids instead of actual words, then I detect synonyms in queries and replace them with group ids too. Hard part comes after that, you have to adjust positionIncrements on syngroup id tokens, with respect to the longest [snip]

Yep, hairy ;)

> More correct approach is to index as-is and expand queries with actual synonym phrases instead of ids, but then queries become really humongous if you have any decent synonym dictionary (I have 20+ phrase groups).

Query expansion is not an option for me, unfortunately: too many synonyms. It would be much better to do it once at indexing time and rely on this information from then on.

Thanks for sharing your thoughts, Кирилл.

Dawid
Re: Synonym filter with support for phrases?
> > Building on your example, food place in new york will find nothing, because 'place' and 'in' share the same position.
> You're right, but is it such a big problem in real life?

Well, everyone has his own requirements for the search quality. For us it was a problem. User enters a query, then refines it by adding new words, then WHIZBANG! he suddenly sees 'Nothing was found', even though he knows matching documents exist.

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
New TokenStream API usage
Has anyone started using the new TokenStream/AttributeSource API? I'm wondering how it is turning out in practice.

-Grant
Re: Synonym filter with support for phrases?
> Well, everyone has his own requirements for the search quality. For us it was a problem.

The topic is subjective... I don't see this as a deterioration in search quality. Let me explain.

Your example concerns phrase queries, so somebody would have to keep adding terms to a phrase. My experience with open search queries (I had access to a larger slice of queries from Microsoft Live) is that phrases are a minority of all searches. In the most common case, people will look for a union of terms, and for these queries the solution I described would work just fine.

Another thing is that my use case for phrase synonyms is that people would look for exact synonym phrases, but rarely expand them to cover something beyond. Therefore the phrase "big apple" would find a synonym match (which is what I want), but longer phrases such as "restaurants in the big apple" would not (like you said). The big question is, of course, whether somebody asking for that specific phrase would be interested in finding a document where the phrase does not occur in its exact form (but as a synonym).

We deviated off course with this conversation though. I see your point and I respect it.

Dawid
Re: Synonym filter with support for phrases?
> Your example concerns phrase queries, so somebody would have to keep adding terms to a phrase. My experience with open search queries (I had access to a larger slice of queries from Microsoft Live) is that phrases are a minority of all searches. In the most common case, people will look for a union of terms, and for these queries the solution I described would work just fine.

We're a bit special. Most of our searches are ordered by date, so we can't use relevance dependent on query term proximity, or whatever, to boost good docs up. That has many consequences, and one of them is that people use phrase queries a lot.

> Another thing is that my use case for phrase synonyms is that people would look for exact synonym phrases, but rarely expand them to cover something beyond.

We have a lot of synonyms that are really alternate forms rather than synonyms, plus translations, plus abbreviations, all using the same engine. So guys looking for MSU CMC really want to get Московский Государственный Университет, факультет ВМиК (Moscow State University, Faculty of Computational Mathematics and Cybernetics) and his friends.

> We deviated off course with this conversation though. I see your point and I respect it.

Hm? I just shared some experience. Will no longer steer away :)

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Synonym filter with support for phrases?
> engine. So guys looking for MSU CMC really want to get Московский Государственный Университет, факультет ВМиК and his friends.

And? How often do they extend this particular phrase with further terms? It must be fun to have an index running concurrently on multi-language synonyms, mixing the two.

> > We deviated off course with this conversation though. I see your point and I respect it.
> Hm? I just shared some experience. Will no longer steer away :)

Oh, don't get me wrong, I appreciate you talking about your experiences; the way you implemented synonyms is certainly interesting. I just didn't want this thread to become focused on a discussion of what's right and wrong, because everything depends on the application. I'm wondering what other people did in similar situations, that's all.

Dawid
[jira] Assigned: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries
[ https://issues.apache.org/jira/browse/LUCENE-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen reassigned LUCENE-1608:
----------------------------------
Assignee: Doron Cohen

CustomScoreQuery should support arbitrary Queries
-------------------------------------------------
Key: LUCENE-1608
URL: https://issues.apache.org/jira/browse/LUCENE-1608
Project: Lucene - Java
Issue Type: New Feature
Components: Query/Scoring
Reporter: Steven Bethard
Assignee: Doron Cohen
Priority: Minor

CustomScoreQuery only allows the secondary queries to be of type ValueSourceQuery instead of allowing them to be any type of Query. As a result, what you can do with CustomScoreQuery is pretty limited. It would be nice to extend CustomScoreQuery to allow arbitrary Query objects. Most of the code should stay about the same, though a little more care would need to be taken in CustomScorer.score() to use 0.0 when the sub-scorer does not produce a score for the current document.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
Create an index from known terms and frequencies
Hi!

I want to create an index with Lucene, but I want to do it without having to analyze the text, since I already have the terms and term frequencies. How can I create an index like that? I have been searching the Lucene source, but I can't find where the terms and term frequencies are stored. Please help me!

Thanks a lot,
John Boutsis

--
View this message in context: http://www.nabble.com/Create-an-index-from-known-terms-and-frequencies-tp23175684p23175684.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
[jira] Commented: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701626#action_12701626 ]

Earwin Burrfoot commented on LUCENE-1607:
-----------------------------------------

I tried it out. Works a little bit better than simple cache (no stray interns must've paid off), doesn't degrade at all. I'd like to change the starter value to something like 256-1024; it works way better for 10-20 fields.

Why h 7? I understand that you're sacking collision-guilty bits, but why not the exact amount that was used (have to store it?), or a whole byte or two?

String.intern() faster alternative
----------------------------------
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch

By using our own interned string pool on top of the default, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
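The general idea of a private pool layered on top of String.intern() can be sketched as follows. This is not the attached patch and makes no claim about its details: the class InternCacheSketch, its slot layout and its overwrite-on-collision policy are assumptions chosen to keep the example short, and thread-safety is not addressed.

```java
// Hypothetical sketch: a small fixed-size cache in front of String.intern().
class InternCacheSketch {

    private final String[] cache;

    InternCacheSketch(int size) {
        // Round the capacity up to a power of two so a bit mask can pick the slot.
        int capacity = 1;
        while (capacity < size) {
            capacity <<= 1;
        }
        cache = new String[capacity];
    }

    String intern(String s) {
        int slot = s.hashCode() & (cache.length - 1);
        String cached = cache[slot];
        if (cached == s) {
            return s;                  // already the interned instance
        }
        if (cached != null && cached.equals(s)) {
            return cached;             // same text: reuse the interned instance
        }
        String interned = s.intern();  // slow path: ask the JVM-wide pool
        cache[slot] = interned;        // on collision, simply overwrite the slot
        return interned;
    }

    public static void main(String[] args) {
        InternCacheSketch pool = new InternCacheSketch(1000);
        String a = pool.intern(new String("title"));
        String b = pool.intern(new String("title"));
        System.out.println(a == b);
    }
}
```

Only strings returned by String.intern() are ever stored, so every hit hands back the same instance the JVM pool would have returned. With a handful of field names the fast paths avoid the JVM-wide lookup almost every time.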
Question around LOM | Lucene Ontology
Hi,

I am a newbie to Lucene, hence this question about how to implement ontology-based search using Lucene (LOM). It would be useful to be pointed to any books, white papers, etc. that describe this in detail.

Thanks,
R
[jira] Commented: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries
[ https://issues.apache.org/jira/browse/LUCENE-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701683#action_12701683 ]

Doron Cohen commented on LUCENE-1608:
-------------------------------------

I thought I had written a class exactly for this purpose, but I was wrong: my class was different in that it had an actual value source, just a sparse one, with values missing for quite a few docs. It is similar in a way, but different, since here the input is a query.

But I did promise... so I wrote a quick wrapper for a query to create a value source. That value source can be used to create a value source query.

Although the patch coming soon is tested and all, I am not planning to commit it, because it is not clean. I would like to reorganize this package to take better care of this request and other related issues (like LUCENE-850), and to make it worthwhile for Solr to move to this package (last time I checked it wasn't). But this is a different issue...

CustomScoreQuery should support arbitrary Queries
-------------------------------------------------
Key: LUCENE-1608
URL: https://issues.apache.org/jira/browse/LUCENE-1608
[jira] Updated: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries
[ https://issues.apache.org/jira/browse/LUCENE-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1608:
-------------------------------
Attachment: LUCENE-1608.patch

Patch for passing arbitrary queries to custom-score-query. Not intended for committing. See TestQueryWrapperValueSource for usage of this wrapper.

- Doron

CustomScoreQuery should support arbitrary Queries
-------------------------------------------------
Key: LUCENE-1608
URL: https://issues.apache.org/jira/browse/LUCENE-1608
Attachments: LUCENE-1608.patch
Re: Spatial package plans
The number of replies and the state of the code make me think that writing my own distance filter using a real GIS solution like geotools is the way to go. I wonder anyway whether GIS code should be in any Lucene package..

Wouter

> Yeah it's hard coded to use miles, 5 years in the US gets to you.. But the functionality doesn't change: radius is a double, so you just need to convert km to miles for the DistanceQueryBuilder and convert back from miles to km for display.
>
> On Mon, Apr 20, 2009 at 8:14 AM, Wouter Heijke whei...@xs4all.nl wrote:
> > I'm working on local search functionality and am about to use the spatial code in contrib. I managed to have a proof of concept running using LatLongDistanceFilter. The only problem I have with this filter is that it is hardcoded to use miles! Basically my question is: what are the plans for the spatial code? Is it going to stay the way it is?
> >
> > Wouter
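The workaround suggested in the quoted reply (convert km to miles on the way in, miles to km on the way out) amounts to two one-line helpers. A minimal sketch, assuming the international mile of exactly 1.609344 km; the class and method names are made up, and no call into the spatial contrib API is shown because its exact signatures are not given in this thread.

```java
// Hypothetical helpers for using a miles-only API with metric input.
class DistanceUnitsSketch {

    // International mile, exact by definition.
    static final double KM_PER_MILE = 1.609344;

    // Convert a radius entered in kilometres before handing it to miles-based code.
    static double kmToMiles(double km) {
        return km / KM_PER_MILE;
    }

    // Convert a distance reported in miles back to kilometres for display.
    static double milesToKm(double miles) {
        return miles * KM_PER_MILE;
    }

    public static void main(String[] args) {
        double radiusKm = 25.0;
        double radiusMiles = kmToMiles(radiusKm);
        System.out.println(radiusKm + " km is about " + radiusMiles + " miles");
    }
}
```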
Re: Spatial package plans
Free world, help yourself :-)

On Wed, Apr 22, 2009 at 6:39 PM, Wouter Heijke whei...@xs4all.nl wrote:
> The amount of replies and the state of the code make me think making my own distance filter using a real GIS solution like geotools is the way to go. I wonder anyway if GIS code should be in any Lucene package..
>
> Wouter
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701726#action_12701726 ]

Jason Rutherglen commented on LUCENE-1252:
------------------------------------------

When flexible indexing goes in, users will be able to put data into the index that allows scorers to calculate a cheap score, collect, then go through and calculate a presumably more expensive score. Would it be good to implement this patch with this sort of more general framework in mind?

It seems like this could affect the HitCollector API, as we'd want a more generic way of representing scores than the primitive float we assume now. Aren't we rewriting the HitCollector APIs right now? Can we implement this change now?

Avoid using positions when not all required terms are present
--------------------------------------------------------------
Key: LUCENE-1252
URL: https://issues.apache.org/jira/browse/LUCENE-1252
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Paul Elschot
Priority: Minor

In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like "all required terms are present, but no position info was checked yet", and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans.
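The "check that all required terms are present before touching positions" idea can be illustrated outside Lucene with a toy matcher. This is only a sketch of the principle: TwoPhasePhraseSketch and its methods are hypothetical, a document is modelled as a plain list of tokens, and none of this reflects the actual Scorer API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch: cheap term-presence check first, position check second.
class TwoPhasePhraseSketch {

    // Phase 1: are all phrase terms present anywhere in the document?
    static boolean allTermsPresent(List<String> doc, List<String> phrase) {
        return new HashSet<>(doc).containsAll(phrase);
    }

    // Phase 2: do the terms occur consecutively, in order?
    static boolean positionsMatch(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    // The position check only runs for documents that pass the cheap check.
    static boolean matches(List<String> doc, List<String> phrase) {
        return allTermsPresent(doc, phrase) && positionsMatch(doc, phrase);
    }

    public static void main(String[] args) {
        List<String> doc = List.of("a", "new", "restaurant", "in", "new", "york");
        System.out.println(matches(doc, List.of("new", "york")));
        System.out.println(matches(doc, List.of("york", "new")));
        System.out.println(matches(doc, List.of("new", "jersey")));
    }
}
```

For a conjunction of several phrases, running phase 1 for every clause before phase 2 for any of them means a document missing a single required term is rejected without reading positions at all, which is the saving the issue is after.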
Re: Future projects
Hey Michael,

You're in San Jose? Feel free to come by on one of our pizza days.

Also, can you post what you have of LUCENE-1231? I got a lot more familiar with IndexWriter internals through LUCENE-1516 and could take a good whack at getting LUCENE-1231 integrated.

Cheers!
Jason

On Sun, Apr 12, 2009 at 3:28 PM, Michael Busch busch...@gmail.com wrote:
> On 4/4/09 4:42 AM, Michael McCandless wrote:
> > > As I recently mentioned on 1231 I'm looking into changing the Document and Field APIs. I've some rough prototype. I think we should also try to get it in before 2.9? On the other hand I don't want to block the 2.9 release with too much stuff.
> > That'd be great -- I'd say post the rough prototype and let's iterate?
> OK. I'll attach it as a new Jira issue. It's not really integrated into anything (like DocumentsWriter), but I wrote some demo classes to show how I intend to use it.
>
> -Michael
[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701751#action_12701751 ]

Jason Rutherglen commented on LUCENE-831:
-----------------------------------------

I'm trying to figure out how to integrate Bobo faceting field caches with this patch. I applied the patch, browsed the ValueSource API and yeah, it's not what I expected.

"we can return arrays, objects, or anything and your grandmother"

Not Grandma! But yeah, we probably need to somehow support plain Java objects rather than every primitive derivative?

(In reference to Mark's 2nd to last post) Bobo efficiently calculates facets for multiple values per doc, which is the same thing as multi-value faceting?

"by back compat with deletes, norms though."

Are norms and deletes implemented? These would just be byte arrays in the current approach? If not, how would they be represented? It seems like for deleted docs we'd want the BitVector returned from a ValueSource.get type of method?

M.M.: "Updatability is tricky... ValueSource would maybe need a startChanges() API, which would copy the array (copy-on-write) if it's not already private"

Hmm... Does this mean we'd replace the current IndexReader method of performing updates on norms and deletes with this more generic update mechanism?

It would be cool to get CSF going?

Complete overhaul of FieldCache API/Implementation
---------------------------------------------------
Key: LUCENE-831
URL: https://issues.apache.org/jira/browse/LUCENE-831
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 3.0
Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch

Motivation:
1) Completely overhaul the API/implementation of "FieldCache" type things...
   a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completely independent IndexReaders)
   b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
   c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
   d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed.
   e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders).
2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
Re: Spatial package plans
Patrick's original version of locallucene included geotools; to make it Apache license compatible we took that out and made the distance calculations pluggable.

The hardcoded miles part should be changeable. Feel free to post any patches and we can make it a better solution.

best
ryan

On Apr 22, 2009, at 6:39 PM, Wouter Heijke wrote:
> The amount of replies and the state of the code make me think making my own distance filter using a real GIS solution like geotools is the way to go. I wonder anyway if GIS code should be in any Lucene package..
>
> Wouter
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701768#action_12701768 ]

Jason Rutherglen commented on LUCENE-1539:
------------------------------------------

{quote}
I think it should mean delete XXX% of the remaining undeleted docs?
{quote}

Yeah? Ok. So the deleteDocsByPercent method needs to somehow take into account what has been deleted before, by adjusting the doc nums it's deleting?

{quote}
I don't think we can relax that. This (single transaction (writer) open at once) is a core assumption in Lucene.
{quote}

True, however that doesn't mean we have to stick with it, especially internally. Hopefully we can move to a more componentized model so that someone could change this if they wanted. Perhaps in the flexible indexing revamp?

Improve Benchmark
-----------------
Key: LUCENE-1539
URL: https://issues.apache.org/jira/browse/LUCENE-1539
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
Original Estimate: 336h
Remaining Estimate: 336h

Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java.
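One way to read "delete XXX% of the remaining undeleted docs" is sketched below with a java.util.BitSet standing in for the deleted-docs bits. Everything here is hypothetical: the class, the deleteByPercent signature and the choice to delete the lowest-numbered live docs are assumptions for illustration, not what the benchmark task does.

```java
import java.util.BitSet;

// Hypothetical sketch: delete a percentage of the docs that are still live.
class DeleteByPercentSketch {

    // Marks roughly `percent` percent of the live docs as deleted and returns
    // how many docs were newly deleted. Bits set in `deleted` must be < maxDoc.
    static int deleteByPercent(BitSet deleted, int maxDoc, double percent) {
        int live = maxDoc - deleted.cardinality();
        int toDelete = (int) Math.round(live * percent / 100.0);
        int done = 0;
        // Walk only over docs that are not deleted yet, skipping earlier deletions.
        for (int doc = deleted.nextClearBit(0);
             doc < maxDoc && done < toDelete;
             doc = deleted.nextClearBit(doc + 1)) {
            deleted.set(doc);
            done++;
        }
        return done;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        int firstRound = deleteByPercent(deleted, 100, 10.0);
        int secondRound = deleteByPercent(deleted, 100, 10.0);
        System.out.println(firstRound + " then " + secondRound + " docs deleted");
    }
}
```

Because the percentage is applied to the live count, repeated rounds delete progressively fewer docs instead of re-deleting ones that are already gone.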
Greetings and questions about patches
Hi all:

I've been participating in the user list for some time, and I'd like to start helping maintain/enhance the code. So I thought I'd start with something small, mostly to get the process down. Unit tests sure fit the bill, it seems to me: less chance of introducing errors through ignorance, but a fine way to extend *my* understanding of Lucene.

I managed to check out the code and run the unit tests, which was amazingly easy. I even managed to get the project into IntelliJ and connect the codestyle.xml file. Kudos to whoever set up the checkout/build process; I was dreading spending days setting this up, and fortunately I didn't have to.

So I, with Chris's help, found the code coverage report and chose something pretty straightforward to test: BitUtil, since it was nice and self-contained. As I said, I'm looking at understanding the process rather than adding much value the first time.

Alas, even something as simple as BitUtil generates questions that I'm asking mostly to understand what approach the veterans prefer. I'll argue with y'all next year sometime <G>.

So, according to the coverage report, there are two methods that are never executed by the unit tests (actually 4: 2 that operate on ints and 2 that operate on longs), isPowerOfTwo and nextHighestPowerOfTwo. nextHighestPowerOfTwo is especially clever; I had to get out a paper and pencil to really understand it.

Issues:

1. None of these methods is ever called. I commented them out and ran all the unit tests and all is well. Additionally, commenting out one of the other methods produces compile-time errors, so I'm fairly sure I didn't do something completely stupid that just *looked* like it was OK. I grepped recursively and they're nowhere in the *.java files.

1a. What's the consensus about unused code? Take it out (my preference), along with leaving a comment on where it can be found (since it *is* clever code)? Leave it in because someone found some pretty neat algorithms that we may need sometime?

1b. I'm not entirely sure about the contrib area, but the contrib jars are all new, so I assume "ant clean test" compiles them as well.

2. I don't agree with the behavior of nextHighestPowerOfTwo. Should I make changes if we decide to keep it?

2a. Why should it return the parameter passed in when it happens to be a perfect power of two? E.g. this passes: assertEquals(BitUtil.nextHighestPowerOfTwo(128L), 128); I'd expect this to actually return 256, given the name.

2b. Why should it ever return 0? There's no power of two that is zero. E.g. this passes: assertEquals(BitUtil.nextHighestPowerOfTwo(-1), 0); as does this: assertEquals(BitUtil.nextHighestPowerOfTwo(0), 0). *Assuming* that someone wants to use this sometime to, say, size an array, they'd have to test against a return of 0.

I'm fully aware that these are trivial issues in the grand scheme of things, and I *really* don't want to waste much time hashing them over. I'll provide a patch either way and go on to something slightly more complicated for my next trick.

Best
Erick
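For readers without the source at hand, the behaviour in 2a and 2b is what the classic "bit smearing" round-up produces. The sketch below is a standalone illustration, not a copy checked against BitUtil: the first method uses the well-known technique and reproduces the three results Erick quotes, and strictlyNextPowerOfTwo is a hypothetical variant showing the semantics he says he would expect from the name.

```java
// Hypothetical standalone sketch of the round-up-to-power-of-two trick.
class PowerOfTwoSketch {

    // Rounds v up to a power of two. An exact power of two is returned
    // unchanged, and 0 or negative input yields 0.
    static int nextHighestPowerOfTwo(int v) {
        v--;            // so that exact powers of two map to themselves
        v |= v >> 1;    // smear the highest set bit into every lower bit...
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v++;            // ...then add one to reach the next power of two
        return v;
    }

    // Variant that always returns a power of two strictly greater than v,
    // and never 0 (it overflows for v >= 2^30, which is not handled here).
    static int strictlyNextPowerOfTwo(int v) {
        if (v < 1) {
            return 1;
        }
        return Integer.highestOneBit(v) << 1;
    }

    public static void main(String[] args) {
        for (int v : new int[] {-1, 0, 1, 100, 128, 129}) {
            System.out.println(v + ": " + nextHighestPowerOfTwo(v)
                + " vs " + strictlyNextPowerOfTwo(v));
        }
    }
}
```

The zero result for 0 and -1 comes from wrap-around: after the decrement and the smearing every bit is set, and adding one wraps to 0. Code that sizes an array with the first method therefore has to guard against non-positive input itself.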
Re: Greetings and questions about patches
> Issues:
> 1 none of these methods is ever called.

Note that Yonik's suggested patch for LUCENE-1607 contains the following code:

+  public SimpleStringInterner(int sz) {
+    cache = new String[BitUtil.nextHighestPowerOfTwo(sz)];
+  }

...so the int flavour of nextHighestPowerOfTwo() might be in use shortly! :-)