RE: Custom tokenizer
Hi,

Extending an existing Analyzer is not useful, because it is just a factory that returns a TokenStream instance to consumers. If you want to change the Tokenizer of an existing Analyzer, just clone it and rewrite its createComponents() method; see the example in the Javadocs:
http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html

If you want to add additional TokenFilters to the chain, you can do this with AnalyzerWrapper (http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/AnalyzerWrapper.html), but this does not work with Tokenizers, because those are instantiated before the TokenFilters that depend on them, so changing the Tokenizer afterwards is impossible.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Vihari Piratla [mailto:viharipira...@gmail.com]
> Sent: Monday, January 12, 2015 8:51 AM
> To: java-user@lucene.apache.org
> Subject: Custom tokenizer
>
> Hi,
> I am trying to implement a custom tokenizer for my application and I have
> a few queries regarding the same.
> 1. Is there a way to provide an existing analyzer (say EnglishAnalyzer)
> the custom tokenizer and make it use this tokenizer instead of, say,
> StandardTokenizer?
> 2. Why are analyzers such as StandardAnalyzer and EnglishAnalyzer declared
> final? Because of this, I cannot extend them.
>
> Thank you.
> --
> V
Re: Custom tokenizer
Thanks for the reply.

Hmm, I understand. I know about AnalyzerWrapper, but that is not what I am looking for. I also know about cloning and overriding. I want my analyzer to behave exactly the same as EnglishAnalyzer, and right now I am copying the code from EnglishAnalyzer to mimic the behavior, which is a dirty solution. Is there any other proper solution to this problem?

Thank you.

On Mon, Jan 12, 2015 at 1:36 PM, Uwe Schindler wrote:
> If you want to change the Tokenizer of an existing Analyzer, just clone
> it and rewrite its createComponents() method [...]

--
V
RE: Custom tokenizer
> Thanks for the reply.
>
> Hmm, I understand.
> I know about AnalyzerWrapper, but that is not what I am looking for.
>
> I also know about cloning and overriding. I want my analyzer to behave
> exactly the same as EnglishAnalyzer, and right now I am copying the code
> from EnglishAnalyzer to mimic the behavior, which is a dirty solution.
> Is there any other proper solution to this problem?

No. Analyzers that are provided by Lucene have a configuration (combination of Tokenizers and Filters) that won't change unless the matchVersion differs (which is documented in the Javadocs). The reason for this: if you have indexed with a given analyzer, you must always use it unmodified when updating or searching the index, otherwise the results of those actions are undefined. So when upgrading Lucene, every Analyzer has to return exactly the same results; otherwise all users would need to rebuild their indexes even on minor version upgrades.

Also, see the Lucene Analyzers as "example" code. What counts here is the combination of Tokenizers and TokenFilters, which is freely configurable. The ones provided by Lucene are useful for common cases, but whenever you have custom requirements, you have to define your Analyzer *completely* yourself. This is also what Solr and Elasticsearch users do in their config files.

Uwe
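For illustration, a minimal sketch of the "define your Analyzer completely yourself" approach, assuming the Lucene 4.10 APIs referenced above. MyTokenizer is a hypothetical placeholder for the custom Tokenizer, and the filter chain shown merely approximates EnglishAnalyzer's defaults:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public final class MyEnglishAnalyzer extends Analyzer {
  private static final Version V = Version.LUCENE_4_10_3;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Swap in your own Tokenizer here (MyTokenizer is hypothetical):
    Tokenizer source = new MyTokenizer(reader);
    // Rebuild a filter chain modeled on EnglishAnalyzer's:
    TokenStream result = new StandardFilter(V, source);
    result = new EnglishPossessiveFilter(V, result);
    result = new LowerCaseFilter(V, result);
    result = new StopFilter(V, result, EnglishAnalyzer.getDefaultStopSet());
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}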
howto: handle temporal visibility of a document?
We have documents that are not always visible (visiblefrom-visibleto). To avoid having to ask each document's originating object whether it is currently visible (after the query has run), we'd like to put metadata into the documents so that visibility can be determined at query time (by the query itself or a query filter). Any suggestions on how to index and query this metadata?
Re: howto: handle temporal visibility of a document?
I'll add/start with my proposal ;)

Document meta fields:
+ visiblefrom [long]
+ visibleto [long]

Query or query filter (<now> stands for the current timestamp):

(*:* -visiblefrom:[* TO *] AND -visibleto:[* TO *]) OR
(*:* -visiblefrom:[* TO *] AND visibleto:[<now> TO *]) OR
(*:* -visibleto:[* TO *] AND visiblefrom:[* TO <now>]) OR
(visiblefrom:[* TO <now>] AND visibleto:[<now> TO *])
fill 'empty' facet-values, sampling, taxoreader
Hi all,

I'm building an application in which users can add arbitrary documents, and all fields are added as facets as well. This allows users to browse their documents by their own defined facets easily. However, when the number of documents gets very large, I switch to random-sampled facets to make sure the application stays responsive. By the nature of sampling, documents (and thus facet values) will be missed.

I let the user select the number of facet values they want to see for each facet. For example, the default is 10. If a facet contains values 1 to 20, the user will always see 10 values if all documents are returned in the search and no sampling is done. If sampling is done and the values are non-uniformly distributed, the user might end up with only 5 values instead of 10.

I want to 'fill' the 5 empty facet-value slots with existing facet values and an unknown facet count (?). The reason behind this is that such a value might exist in the result set, and for interaction purposes it is very nice if this value can be selected and added to the query, to quickly find out whether there are documents that also contain this facet value. It is even more useful if these facet values are sorted not by count but by label; the user can then quickly see whether there are documents that contain a certain value.

I can iterate over the ordinals via the TaxonomyReader and TaxonomyFacets (by leveraging the 'children'), but these ordinals might no longer be used in the documents. What would be a good approach to tackle this issue?
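For reference, a hedged sketch of the 'children' iteration mentioned above, assuming the Lucene 4.x facet module and an open TaxonomyReader; the dimension name "Author" is purely illustrative:

import java.io.IOException;
import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.ParallelTaxonomyArrays;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;

class FacetValueLister {
  // List every known child ordinal of one dimension, whether or not it still
  // occurs in any live document (so its count must be treated as unknown).
  static void listFacetValues(TaxonomyReader taxoReader) throws IOException {
    ParallelTaxonomyArrays arrays = taxoReader.getParallelTaxonomyArrays();
    int[] children = arrays.children();
    int[] siblings = arrays.siblings();
    int dimOrd = taxoReader.getOrdinal(new FacetLabel("Author"));
    for (int ord = children[dimOrd]; ord != TaxonomyReader.INVALID_ORDINAL; ord = siblings[ord]) {
      FacetLabel label = taxoReader.getPath(ord);
      System.out.println(label.components[label.length - 1]);
    }
  }
}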
MultiPhraseQuery:Rewrite to BooleanQuery
Hi folks!

I have a MultiPhraseQuery, for example, from the unit tests:

Directory indexStore = newDirectory();
RandomIndexWriter writer = new RandomIndexWriter(random(), indexStore);
add("blueberry chocolate pie", writer);
add("blueberry chocolate tart", writer);
IndexReader r = writer.getReader();
writer.close();
IndexSearcher searcher = newSearcher(r);
MultiPhraseQuery q = new MultiPhraseQuery();
q.add(new Term("body", "blueberry"));
q.add(new Term("body", "chocolate"));
q.add(new Term[] {new Term("body", "pie"), new Term("body", "tart")});
assertEquals(2, searcher.search(q, 1).totalHits);
r.close();
indexStore.close();

I need to know which phrase the query matched on. Explanation doesn't return that exact information, only that the document matched this query. So can I rewrite this query to a BooleanQuery, like:

BooleanQuery q = new BooleanQuery();
PhraseQuery pq1 = new PhraseQuery();
pq1.add(new Term("body", "blueberry"));
pq1.add(new Term("body", "chocolate"));
pq1.add(new Term("body", "pie"));
q.add(pq1, BooleanClause.Occur.SHOULD);
PhraseQuery pq2 = new PhraseQuery();
pq2.add(new Term("body", "blueberry"));
pq2.add(new Term("body", "chocolate"));
pq2.add(new Term("body", "tart"));
q.add(pq2, BooleanClause.Occur.SHOULD);

In this case I'll know exactly which query I have a match on. But the main question is: is this rewrite equivalent?

Thanks.

--
dennis yermakov
mailto: dem...@gmail.com
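A hedged note: the SHOULD-combined BooleanQuery matches the same set of documents as the original MultiPhraseQuery (each hit must contain at least one of the two phrases), but the scores will generally differ, so the rewrite is equivalent for matching rather than for ranking. To learn which variant matched, one sketch is to probe each PhraseQuery separately:

// Probe each phrase variant on its own; pq1/pq2 as defined above.
for (Query variant : new Query[] { pq1, pq2 }) {
  TopDocs hits = searcher.search(variant, 10);
  for (ScoreDoc sd : hits.scoreDocs) {
    System.out.println("doc " + sd.doc + " matched " + variant);
  }
}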
Re: Re: howto: handle temporal visibility of a document?
The basic idea seems sound, but I think you can simplify that query a bit. For one thing, the *:* clauses can be removed in a few places. Also, if you index an explicit null value you won't need them at all: for visiblefrom, if you don't have a from time, use 0; for visibleto, if you don't have a to time, use maxlong.

-Mike
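A minimal sketch of that suggestion, assuming Lucene 4.x numeric fields and the field names from this thread (the 0/Long.MAX_VALUE defaulting is the idea described above):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;

class VisibilityQueries {
  // Index time: fill in missing bounds so every document has both fields.
  static void addVisibility(Document doc, Long from, Long to) {
    doc.add(new LongField("visiblefrom", from != null ? from : 0L, Field.Store.NO));
    doc.add(new LongField("visibleto", to != null ? to : Long.MAX_VALUE, Field.Store.NO));
  }

  // Query time: a document is visible iff visiblefrom <= now <= visibleto.
  static BooleanQuery visibleAt(long now) {
    BooleanQuery q = new BooleanQuery();
    q.add(NumericRangeQuery.newLongRange("visiblefrom", null, now, true, true),
          BooleanClause.Occur.MUST);
    q.add(NumericRangeQuery.newLongRange("visibleto", now, null, true, true),
          BooleanClause.Occur.MUST);
    return q;
  }
}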
Re: Finding a match for an automaton against a FST
On Sat, Jan 10, 2015 at 8:23 AM, Olivier Binda wrote:
> On 01/10/2015 11:00 AM, Michael McCandless wrote:
>> On Fri, Jan 9, 2015 at 6:42 AM, Olivier Binda wrote:
>>> Hello.
>>>
>>> 1) What is the best way to check if an automaton (from a regex or a
>>> string with a wildcard) has at least 1 match against an FST (from a
>>> WFSTCompletionLookup)?
>>
>> You need to implement "intersect". We already have this method for
>> two automata (Operations.java); maybe you can start from that but
>> cut over to the FST APIs instead for the 2nd automaton?
>
> I looked a bit into this. This is complicated stuff :/

Sorry, yes it is. If you have any ideas to simplify the APIs that would be awesome :)

> I think I get what the nested loops in intersect() do: transitions
> consist of a two-dimensional array, and somehow those arrays are
> intersected. I don't understand yet why there is a .min and a .max for
> a transition (why not just a codepoint?).

Most automaton transitions cover a wide range of Unicode characters, so requiring a separate transition for each would be too costly (too many objects / too much RAM).

> FST and Automaton (and maybe the Lucene codec stuff) are 3 different
> implementations of finite state machines/transducers, right?

I think we have only 2 implementations (FST, Automaton).

> How does RegexpQuery (automaton) match against an index? Does it use
> intersect() internally? (If it does, maybe I could reuse that code too.)

RegexpQuery in core (NOT to be confused with the much slower, differing in name by only one letter, RegexQuery in sandbox) builds an Automaton and then uses the Terms.intersect API.

However, I would not look for inspiration from Terms.intersect: that implementation (in the block-tree terms dict) works with the terms dictionary data structures to perform a fast intersection, and that code is crazy complex.

Possibly a place to look for inspiration/poaching is FSTUtil.intersectPrefixPaths: that intersects an automaton with an FST. It's used by the fuzzy suggester...

Mike McCandless
http://blog.mikemccandless.com

>>> 2) Also, is there a simple/efficient way to find the lowest and the
>>> highest arcs of an FST that match against an automaton?
>>
>> Hmm, arcs leaving which state? The initial state? You could simply
>> walk all arcs leaving the initial state of the FST and check if the
>> automaton accepts them leaving its initial state (assuming the
>> automaton has no dead states)?
>>
>> Or, if you are already doing an intersection here, just save this
>> information as a side effect since you will have already computed it.
>
> Thanks for the tips, it helps.
> Olivier
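To make "walk all arcs leaving the initial state" concrete, a hedged sketch against the Lucene 4.10 FST and Automaton APIs; it assumes a deterministic automaton whose start state is 0, with labels in the same alphabet as the FST's arc labels:

import java.io.IOException;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.fst.FST;

class FirstArcCheck {
  // Enumerate the arcs leaving the FST's start state and test whether the
  // automaton can take each label from its initial state.
  static <T> void checkFirstArcs(FST<T> fst, Automaton automaton) throws IOException {
    FST.BytesReader in = fst.getBytesReader();
    FST.Arc<T> start = fst.getFirstArc(new FST.Arc<T>());
    FST.Arc<T> arc = fst.readFirstTargetArc(start, new FST.Arc<T>(), in);
    while (true) {
      int dest = automaton.step(0, arc.label);
      if (dest != -1) {
        // Compatible first transition; a full intersect() would recurse
        // from (dest, arc.target) down both structures.
      }
      if (arc.isLast()) {
        break;
      }
      fst.readNextArc(arc, in);
    }
  }
}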
Re: Re: howto: handle temporal visibility of a document?
Thx, I will simplify/optimize ;)
Re: Details on setting block parameters for Lucene41PostingsFormat
Thanks Mike,

> OK. It would be good to know where all your RAM is being consumed,
> and how much of that is really the terms index: it ought to be a very
> small part of it.

I made a bunch of heap dumps. I just watched with jconsole and ran jmap -histo when memory use got high. I've appended a bit more from the error trace and the top memory users from one of the heap dumps below. I tried to send a bunch of heap dumps to the mailing list but the message got rejected; I'll send them directly to you.

Tom

java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
        at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
        at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:659)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)

---
Top memory users from one of the heap dumps:

 num   #instances      #bytes  class name
   1:     1131932  2546933736  [B
   2:      308670   743033280  [I
   3:      696803   203038680  [C
   4:      383039    36771744  org.apache.lucene.codecs.lucene41.Lucene41PostingsWriter$IntBlockTermState
   5:     1089864    26156736  org.apache.lucene.util.AttributeSource$State
   6:      544870    26153760  org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl
   7:      687500    16500000  org.apache.lucene.util.BytesRef
   8:      135820     9779040  org.apache.lucene.util.fst.FST$Arc
   9:      382519     9180456  org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingTerm
  10:      382037916           org.apache.lucene.codecs.TermStats
  11:      544952     8719232  org.apache.lucene.util.BytesRefBuilder
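The trace above points at IndexWriter's in-memory indexing buffers (FreqProxPostingsArray.grow) rather than at the terms index itself. As a hedged illustration only, not a diagnosis of this particular heap: the high-water mark of those buffers is governed by the RAM buffer setting (Lucene 4.10 API), where analyzer and dir stand for an existing Analyzer and Directory:

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
iwc.setRAMBufferSizeMB(256.0); // flush in-memory postings to disk at ~256 MB
IndexWriter writer = new IndexWriter(dir, iwc);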
Re: Details on setting block parameters for Lucene41PostingsFormat
Thanks Mike,

Do you know how I can configure Solr to use the min=200 and max=398 block sizes you suggested? Or should I ask on the Solr list?

Tom

On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> The first int to Lucene41PostingsFormat is the min block size (default 25)
> and the second is the max (default 48) for the block tree terms dict.
>
> The max must be >= 2*(min-1).
>
> Since you were using 8X the default before, maybe try min=200 and
> max=398? However, block tree should have been more RAM efficient than
> 3.x's terms index... if you run CheckIndex with -verbose it will print
> additional details about the block structure of your terms indices...
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Fri, Jan 9, 2015 at 4:15 PM, Tom Burton-West wrote:
>> Hello all,
>>
>> We have over 3 billion unique terms in our indexes, and with Solr 3.x we
>> set the TermIndexInterval to about 8 times its default value in order to
>> index without OOMs.
>> (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again)
>>
>> We are now working with Solr 4 and running into memory issues and are
>> wondering if we need to do something analogous for Solr 4.
>>
>> The javadoc for IndexWriterConfig
>> (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29)
>> indicates that the Lucene 4.1 postings format has some parameters which
>> may be set:
>> "..To configure its parameters (the minimum and maximum size for a block),
>> you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
>> (https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29)"
>>
>> Is there documentation or discussion somewhere about how to determine
>> appropriate parameters, or some detail about what setting the
>> maxBlockSize and minBlockSize does?
>>
>> Tom Burton-West
>> http://www.hathitrust.org/blogs/large-scale-search
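One hedged way to do this at the Lucene level: the block sizes only matter at write time (the on-disk format is self-describing), so wrapping the default codec suffices. The class below and its name "BigBlock" are illustrative, and in Solr you would additionally have to expose such a codec through a codecFactory in solrconfig.xml:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene410.Lucene410Codec;

// The 4.10 default codec, except for the block-tree terms dictionary
// block sizes (min=200, max=398 as suggested above).
public class BigBlockCodec extends FilterCodec {
  private final PostingsFormat postings = new Lucene41PostingsFormat(200, 398);

  public BigBlockCodec() {
    super("BigBlock", new Lucene410Codec());
  }

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }
}

// Plain Lucene usage: indexWriterConfig.setCodec(new BigBlockCodec());
// Note: the codec name is recorded in each segment, so "BigBlock" must be
// registered via Java's SPI for those segments to be readable later.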
StoredField available in Collector.setNextReader
Hello,

I have tried to retrieve values stored via the StoredField type inside a Collector when its method setNextReader(AtomicReaderContext) is called. I used the following method from FieldCache, but do not get back any values:

FieldCache.DEFAULT.getTerms(indexReader, field, false);

Retrieving the values from the document itself during the call to Collector.collect(int) works fine, but this is much, much slower than getting all terms at once via the method above.

My question: is there a way to get binary content with similar performance to the concept described above, i.e. retrieving the field terms when the reader is set in a Collector?

Besides, the concept works fine for any stored field that is indexed, e.g. like in the following code snippet:

final FieldType fieldType = new FieldType();
{
  fieldType.setStored(true);
  fieldType.setIndexed(true); // need to index, otherwise no fast retrieval of terms in the collector is possible
  fieldType.setIndexOptions(IndexOptions.DOCS_ONLY);
  fieldType.setTokenized(false);
  fieldType.setOmitNorms(true);
  fieldType.freeze();
}
Field field = new Field(fieldName, fieldValue, fieldType); // fieldValue is of type String

But this does not allow me to store binary content (i.e. values in byte[] arrays) as is available for StoredField; the constructor expects field content of type String. I have tried to convert the content into base64-encoded strings, but the conversion from base64-encoded strings back to byte arrays is quite expensive for large indexes.

Thanks for your advice.

Best regards,
Josef
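A hedged suggestion: for column-wise access to per-document byte[] values, BinaryDocValuesField may fit better than StoredField plus FieldCache. A sketch against the Lucene 4.9+/4.10 APIs (the exact BinaryDocValues.get signature differs in earlier 4.x releases; "payload" is an illustrative field name):

import java.io.IOException;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.BytesRef;

// Index time: store the raw bytes as doc values.
doc.add(new BinaryDocValuesField("payload", new BytesRef(bytes)));

// In the Collector:
private BinaryDocValues values;

@Override
public void setNextReader(AtomicReaderContext context) throws IOException {
  values = context.reader().getBinaryDocValues("payload"); // null if the segment lacks the field
}

@Override
public void collect(int doc) throws IOException {
  BytesRef ref = values.get(doc); // per-document bytes without a stored-field seek
  // use ref.bytes[ref.offset .. ref.offset + ref.length)
}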
Problem with Custom FieldComparator
I'm migrating a web application from Lucene 2.4.1 to Lucene 2.9.4 (basically because of this bug: https://issues.apache.org/jira/browse/LUCENE-1304). I'm trying to migrate a custom sort field according to some examples I read, but I cannot make it work right.

I have a field with string values, and when I find a pattern I extract a number (priority). This priority is used for sorting the documents. The field has values like this:

"pub.generic1 pub.generic1.zonahome pub.generic1.zonahome.prio.1 pub.generic1.zonahome.lateral.slash.derecha pub.generic1.zonahome.lateral.slash.derecha.prio.1 pub.generic1.seccion.seccion1 pub.generic1.seccion.seccion1.prio.10"

This is the new comparator code:

public class HighTrafficSortComparator extends FieldComparatorSource {

  protected static final Log LOG = CmsLog.getLog(HighTrafficSortComparator.class);

  private List priorityPreffix;
  private boolean ascending = false;

  public HighTrafficSortComparator(String[] prefix, boolean ascending) {
    this.ascending = ascending;
    // here I build the prefix list
    // ...
  }

  public FieldComparator newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
    return new HighTrafficFieldComparator(numHits, fieldname);
  }

  class HighTrafficFieldComparator extends FieldComparator {

    String field;
    int[] docValues;
    int[] slotValues;
    int bottomValue;

    HighTrafficFieldComparator(int numHits, String fieldName) {
      slotValues = new int[numHits];
      field = fieldName;
    }

    public void copy(int slot, int doc) {
      slotValues[slot] = docValues[doc];
    }

    public int compare(int slot1, int slot2) {
      return slotValues[slot1] - slotValues[slot2];
    }

    public int compareBottom(int doc) {
      return bottomValue - docValues[doc];
    }

    public void setBottom(int bottom) {
      bottomValue = slotValues[bottom];
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
      docValues = FieldCache.DEFAULT.getInts(reader, field, new FieldCache.IntParser() {
        public final int parseInt(final String val) {
          return getPrioridad(val);
        }
      });
    }

    public Comparable value(int slot) {
      return new Integer(slotValues[slot]);
    }
  }

  private Integer getPrioridad(String text) {
    int prioridad = !ascending ? Integer.MAX_VALUE : Integer.MIN_VALUE;
    if (text != null) {
      String[] termstext = text.split(" ");
      for (String termtext : termstext) {
        int idx = termtext.indexOf(NoticiacontentExtrator.KEY_SEPARATOR + "prio" + NoticiacontentExtrator.VALUE_SEPARATOR);
        if (idx > -1) {
          // it is a priority
          String termPreffix = termtext.substring(0, idx);
          if (priorityPreffix.contains(termPreffix)) {
            // has the requested priority
            try {
              int prioridadTerm = Integer.parseInt(termtext.substring(idx + 6));
              if (!ascending && prioridadTerm < prioridad)
                prioridad = prioridadTerm;
              else if (ascending && prioridadTerm > prioridad)
                prioridad = prioridadTerm;
            } catch (NumberFormatException ex) {
            }
          }
        }
      }
    }
    return new Integer(prioridad);
  }
}

This is how I use this custom sort:

camposOrden = new SortField(luceneFieldName, new HighTrafficSortComparator(preffix, isAscending), isAscending);

When I run the query, the result is not sorted correctly, but I don't know what I'm doing wrong.
This is the old code, working correctly in Lucene 2.4.1:

public class HighTrafficSortComparator implements SortComparatorSource {

  private List priorityPreffix;

  public ScoreDocComparator newComparator(final IndexReader indexReader, final String fieldname) throws IOException {
    return new ScoreDocComparator() {

      private Map cachedScores = new HashMap();

      public int compare(ScoreDoc scoreDoc1, ScoreDoc scoreDoc2) {
        try {
          Integer priorityDoc1 = cachedScores.get(scoreDoc1.doc);
          Integer priorityDoc2 = cachedScores.get(scoreDoc2.doc);
          if (priorityDoc1 == null) {
            final Document doc1 = indexReader.document(scoreDoc1.doc);
            final String strVal1 = doc1.get(fieldname);
            priorityDoc1 = getPrioridad(strVal1);
            cachedScores.put(scoreDoc1.doc, priorityDoc1);
          }
          if (priorityDoc2 == null) {
            final Document doc2 = indexReader.document(scoreDoc2.doc);
            final String strVal2 = doc2.get(fieldname);
            priorityDoc2 = getPrioridad(strVal2);
            cachedScores.put(scoreDoc2.doc, priorityDoc2);
          }
          return priorityDoc1.compareTo(priorityDoc2);
        } catch (IOException e) {
          LOG.error("Cannot read doc", e);
        }
        return 0;
      }

      public Comparable sortValue(ScoreDoc scoreDoc) {
        try {
          Integer priorityDoc = cachedScores.get(scoreDoc.doc);
          if (priorityDoc == null) {
            final Document doc = indexReader.document(scoreDoc.doc);
            final String strVal = doc.get(fieldname);
            priorityDoc = getPrioridad(strVal);
          }
          return priorityDoc;
        } catch (IOException e) {
          LOG.error("Cannot read doc", e);
        }
        return 0;
      }

      public int sortType() {
        return SortField.CUSTOM;
      }

      private Integer getPrioridad(String text) {
        int prioridad = !ascending ? Integer.MAX_VALUE : Integer.MIN_VALUE;
        if (text != null) {
          String[] termstext = text.split(" ");
          for (String termtext : termstext) {
            int idx = termtext.indexOf(NoticiacontentExtrator.KEY_SEPARATOR + "prio" + NoticiacontentExtrator.VALUE_SEPARATOR);
            if
RE: RE: howto: handle temporal visibility of a document?
Reduced to (<now> and <maxlong> stand for the current timestamp and Long.MAX_VALUE):

(
  ( *:* -visiblefrom:[* TO *] AND -visibleto:[* TO *] ) OR
  ( -visiblefrom:[* TO *] AND visibleto:[<now> TO <maxlong>] ) OR
  ( -visibleto:[* TO *] AND visiblefrom:[0 TO <now>] ) OR
  ( visiblefrom:[0 TO <now>] AND visibleto:[<now> TO <maxlong>] )
)

> also if you index an explicit null value you won't need them at all

Could it then be reduced to

( -visiblefrom:[* TO *] AND visibleto:[<now> TO <maxlong>] ) OR
( -visibleto:[* TO *] AND visiblefrom:[0 TO <now>] ) OR
( visiblefrom:[0 TO <now>] AND visibleto:[<now> TO <maxlong>] )

?

Would I gain a lot more speed if I set visiblefrom to 0 and visibleto to <maxlong> when they are absent? The query would then be just:

visiblefrom:[0 TO <now>] AND visibleto:[<now> TO <maxlong>]

And a rather Solr'y question, which I nevertheless ask here: I intended to use this very query as a filter query (fq), but I guess it doesn't make sense because '<now>' changes at every call ;)
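On the filter-reuse question, a hedged idea: round "now" down to a coarse granularity so the same filter value recurs between calls and can be cached. A sketch with the Lucene 4.x queries module, using the field names from this thread and assuming the fields are indexed as numeric longs:

import org.apache.lucene.queries.BooleanFilter;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.NumericRangeFilter;

// Round "now" to the current minute so an identical filter is produced
// (and is cacheable) for a whole minute at a time.
long now = System.currentTimeMillis();
long rounded = now - (now % 60000L);

BooleanFilter visibleNow = new BooleanFilter();
visibleNow.add(NumericRangeFilter.newLongRange("visiblefrom", null, rounded, true, true),
               BooleanClause.Occur.MUST);
visibleNow.add(NumericRangeFilter.newLongRange("visibleto", rounded, null, true, true),
               BooleanClause.Occur.MUST);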