Help with Numeric Range
Hi all,

I'm new to Lucene, as well as Cassandra. I'm working on the Lucandra project to add some extra functionality. It hasn't been fully tested with range queries, so I've created some tests and contributed them. You can view my source here:

http://github.com/tnine/Lucandra/blob/master/test/lucandra/NumericRangeTests.java

First, is this a sensible test? I'm specifically testing the case of longs where I need millisecond precision on my searches.

Second, I see that numeric fields are built via terms. I think the issue lies in the encoding of these terms into bytes for the Cassandra keys. Can anyone point me to some documentation on numeric queries and terms, and how they are encoded at the byte level based on the precision?

Thanks,
Todd
RE: Help with Numeric Range
Hi Uwe,

Thank you for your help; it is greatly appreciated. Unfortunately, all of my tests fail except for RangeInclusive. I've changed the precision step to 6 as per your recommendation. I had it at the maximum to rule out the precision step as the cause of the test failures.

Essentially, all keys in Cassandra are UTF-8 keys. In Lucandra, the keys are constructed in the following way:

1. Get the token stream for the field. In this case it's a NumericTokenStream with (numeric,valSize=64,precisionStep=6).
2. For all tokens in the stream, create a UTF-8 string in the following format: \u
3. Set the term frequency to 1.

This gives us a list of tokens, prefixed with the field name and the delimiter. Then, for each term from above, we create a key of the format \u\u and write it to the TermInfo column family.

After debugging the implementation of the LucandraTermEnum, I can see that it is correctly returning values that should match my numeric range query. However, I never get the results in the TopDocs result set after they're handed back to the numeric range query object. Any ideas why this is happening?

Thanks,
Todd

On Wed, 2010-06-23 at 08:53 +0200, Uwe Schindler wrote:
> Hi Todd,
>
> I am not sure if I understand your problem correctly. I am not familiar with
> Lucandra/Cassandra at all, but if Lucandra implements the IndexWriter and
> IndexReader according to the documentation, numeric queries should work. A
> NumericField internally creates a TokenStream and "analyzes" the number into
> several tokens, which are somehow "half binary" (they are terms consisting of
> characters in the full 0..127 range for optimal UTF-8 compression with 3.x
> versions of Lucene). The exact encoding can be looked at in the NumericUtils
> class + javadocs.
>
> About your test case: the test looks good, so does it fail? If yes, where is
> the problem? You can also look into Lucene's test TestNumericRangeQuery64 for
> more examples. Or modify its @BeforeClass to instead build a Lucandra index.
>
> The test has one thing that is not intended to be done like that:
>
>     numeric = new NumericField("long", Integer.MAX_VALUE, Store.YES, true);
>
> You are using MAX_VALUE as the precision step; this would slow down all queries
> to the speed of old-style TermRangeQueries. It is always better to stick with
> the default of 4, which creates 64 bits / 4 precStep = 16 terms per value.
> Alternatively, for longs, 6 is a good precision step (see the NumericRangeQuery
> documentation). MAX_VALUE is only intended for fields that do not do numeric
> ranges but are, e.g., only sorted on. precisionStep is a performance tuning
> parameter; it has nothing to do with better/worse precision on terms or
> different query results. If you are using NumericRangeQuery with this large
> precStep, you are not using the numeric features at all, so your test should
> not behave differently from a conventional TermRangeQuery with padded terms.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
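Editorial note: the arithmetic in Uwe's reply (64 bits / precisionStep 4 = 16 terms per value) can be sketched in plain Java. This is only an illustration under assumptions: the class and method names are hypothetical, and the strings it prints are not the terms Lucene actually writes (the real byte encoding lives in NumericUtils). It only shows how one value is indexed at successively coarser shifts, one term per shift.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Lucandra.
final class PrecisionStepSketch {

    // One term is produced for every shift 0, step, 2*step, ... below valSize.
    static int termsPerValue(int valSize, int precisionStep) {
        int count = 0;
        for (long shift = 0; shift < valSize; shift += precisionStep) {
            count++;
        }
        return count;
    }

    // Human-readable stand-ins for the prefix terms of a 64-bit value.
    static List<String> prefixTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (long shift = 0; shift < 64; shift += precisionStep) {
            // Dropping the lowest 'shift' bits gives the coarser prefix.
            terms.add("shift=" + shift + " prefix=" + Long.toHexString(value >>> (int) shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(64, 4));
        System.out.println(termsPerValue(64, 6));
        System.out.println(termsPerValue(64, Integer.MAX_VALUE));
        prefixTerms(1284093337200L, 6).forEach(System.out::println);
    }
}
```

With a step of Integer.MAX_VALUE only the full-precision term remains, which is consistent with Uwe's remark that such a field behaves like a plain term range.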
Filtering times stored as longs in HEX
Hi all,

I'm using Lucandra to index notes in our system. Since we can't use numeric fields due to a bug in Cassandra (fixed in 0.7), I'm encoding all times as epoch values in hex, then storing the hex string. I have the following fields on my document:

createdDate
phoneNumber
email

I want to perform a query where the input is either a phone number or an email. The user also passes in an epoch timestamp (a long in milliseconds) and a count. I need to return all documents with a timestamp <= the given timestamp, up to the given count. I'm having some trouble building this query in my code. I never get any results, but I can see the data is written to the index properly. Here is my code:

http://pastie.org/private/xzvnntmyjzxgpjgctxftrq

Thanks,
Todd
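Editorial note: a range query over hex strings only works if string order matches numeric order, which requires fixed-width, zero-padded values. The sketch below is one way to get that in plain Java; it is an assumption about what a helper such as getHex could do, not the code behind the pastie link. The class and method names are hypothetical. Flipping the sign bit keeps negative values ordered before positive ones.

```java
// Hypothetical helper: fixed-width hex that sorts lexicographically in numeric order.
final class SortableHexSketch {

    // Flip the sign bit so negative longs sort first, then pad to 16 hex digits.
    static String encode(long value) {
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    // Inverse of encode.
    static long decode(String hex) {
        return Long.parseUnsignedLong(hex, 16) ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long created = 1284093337200L;
        System.out.println(encode(created));
        System.out.println(decode(encode(created)) == created);
    }
}
```

With such an encoding, "timestamp <= endTime" becomes a term range whose upper bound is encode(endTime); unpadded output from Long.toHexString would not sort correctly across different lengths.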
Help with numeric ranges and querying with sorting and counts
Hi all,

Now that the nasty bug in Cassandra has been fixed, I can use numeric fields in Lucandra for searching and sorting. I'm having a bit of an issue I could use a hand with. We're creating an SoS index. Each Document corresponds to an SoS. Every person contacted for the SoS will be indexed by their email address, their phone numbers, and the time the SoS was created. I have an API with the following functionality:

    public List getAll(String input, long endTime, int count) {

where the string is either a phone number or an email address, endTime is the epoch end time to seek up to, and count is the number of records to return. I basically want to perform the following search:

    (email:input OR phone:input) AND createTime:[Long.MIN_VALUE TO endTime]

and then return the last "count" values that are closest to the end time.

Here is how I'm creating my document:

    Document doc = new Document();
    DocumentUtils.setRowKey(doc, sos.getId().toString());
    doc.add(new Field(FIELD_IMEI, sos.getImeiNumber(), Store.NO, Index.NOT_ANALYZED));
    doc.add(new Field(FIELD_TRACKIDX, getHex(sos.getTrackIndexTime()), Store.NO, Index.NOT_ANALYZED));
    doc.add(new Field(FIELD_TIER, getHex(sos.getTier().getStoredValue()), Store.NO, Index.NOT_ANALYZED));
    doc.add(new NumericField(FIELD_CREATETIME).setLongValue(sos.getCreatedTime().getTime()));
    doc.add(new Field(FIELD_RESOLVED, getHex(sos.isResolved()), Store.NO, Index.NOT_ANALYZED));

    if (sos.getNotes() != null) {
      doc.add(new Field(FIELD_NOTES, sos.getNotes(), Store.NO, Index.ANALYZED));
    }

    if (sos.isResolved()) {
      doc.add(new NumericField(FIELD_RESOLVETIME).setLongValue(sos.getResolvedTime().getTime()));
    }

    for (ContactedPerson person : sos.getContactedPeople()) {
      doc.add(new Field(FIELD_EMAIL, person.getEmail(), Store.NO, Index.NOT_ANALYZED));
      for (Phone phone : person.getPhones()) {
        doc.add(new Field(FIELD_PHONE, getNumericString(phone.getNumber()), Store.NO, Index.NOT_ANALYZED));
      }
    }

Here is how I'm creating my query:

    BooleanQuery query = new BooleanQuery();

    BooleanQuery inputTerms = new BooleanQuery();
    inputTerms.add(new TermQuery(new Term(FIELD_EMAIL, input)), Occur.SHOULD);
    inputTerms.add(new TermQuery(new Term(FIELD_PHONE, getNumericString(input))), Occur.SHOULD);
    query.add(inputTerms, Occur.MUST);

    NumericRangeQuery time = NumericRangeQuery.newLongRange(FIELD_CREATETIME, null, endTime, true, true);
    query.add(time, BooleanClause.Occur.MUST);

And my sorter:

    SortField sort = new SortField(FIELD_CREATETIME, SortField.LONG, true);

Finally, here is some sample test data:

    SOS 1: createdTime = 1284093337200L
    SOS 2: createdTime = 1284093337200L + 1
    SOS 3: createdTime = 1284093337200L + 2

My search criteria are the following:

    input = "f...@bar.com" (each record has this email address on it)
    time = SOS 3 created time

I expect to get 3 records, but instead I'm only getting 1. It's something specific to this query, as I have similar queries that work properly for the numeric range and sorting. Any ideas what is wrong with my query?

Thanks,
Todd
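Editorial note: when a query like this returns an unexpected number of hits, it can help to state the intended result independently of Lucene. The plain-Java model below does that for the sample data in the message. Everything in it is a hypothetical stand-in: the Sos class, the example email address, and the omission of the getNumericString normalization are assumptions made for illustration. It is meant as a test oracle, not as a replacement for the index query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory model of "(email OR phone) AND createdTime <= endTime, newest first".
final class SosSearchModel {

    static final class Sos {
        final String id;
        final long createdTime;
        final List<String> emails = new ArrayList<>();
        final List<String> phones = new ArrayList<>();

        Sos(String id, long createdTime, String email) {
            this.id = id;
            this.createdTime = createdTime;
            this.emails.add(email);
        }
    }

    static List<Sos> getAll(List<Sos> all, String input, long endTime, int count) {
        return all.stream()
                .filter(s -> s.emails.contains(input) || s.phones.contains(input)) // SHOULD clauses
                .filter(s -> s.createdTime <= endTime)                             // open lower bound
                .sorted(Comparator.comparingLong((Sos s) -> s.createdTime).reversed())
                .limit(count)
                .collect(Collectors.toList());
    }

    // The three sample records from the message, with a made-up address.
    static List<Sos> sampleData() {
        long base = 1284093337200L;
        List<Sos> all = new ArrayList<>();
        all.add(new Sos("1", base, "someone@example.com"));
        all.add(new Sos("2", base + 1, "someone@example.com"));
        all.add(new Sos("3", base + 2, "someone@example.com"));
        return all;
    }

    public static void main(String[] args) {
        List<Sos> hits = getAll(sampleData(), "someone@example.com", 1284093337202L, 10);
        hits.forEach(s -> System.out.println(s.id + " " + s.createdTime));
    }
}
```

Comparing the ids from this model with the ids in the TopDocs result shows which documents the index-backed query is dropping.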
Numeric range query not returning results
Hi all,

I'm having some issues with numeric range queries not working as expected. My underlying storage medium is the Lucandra index reader and writer, so I'm not sure if this is an issue within Lucandra or with my usage of the numeric field. My numeric range tests that are copies of Uwe's pass in the Lucandra source, so I have a feeling it's my usage.

I have a simple test case with 5 people. I have a Date field, the LastLogin field. This date is converted to epoch milliseconds and stored in the index in the following way:

    NumericField numeric = new NumericField("LastLogin");
    numeric.setLongValue(fieldValue);
    doc.add(numeric);

I have the following 2 field values on 2 documents:

    1282197146L and 1282197946L

I then perform the following query:

    NumericRangeQuery rangeQuery = NumericRangeQuery.newLongRange("LastLogin", 1282197146L, 1282197146L, true, true);

    IndexReader reader = new IndexReader(columnFamily, getContext(conn));
    IndexSearcher searcher = new IndexSearcher(reader);

    TopDocs docs = searcher.search(query, maxResults);

    List documents = new ArrayList(docs.totalHits);

    Set fields = new HashSet();
    fields.add(IndexDocument.ROWKEY);
    fields.add(IndexDocument.IDSERIALIZED);

    SetBasedFieldSelector selector = new SetBasedFieldSelector(fields, null);

    for (ScoreDoc score : docs.scoreDocs) {
      documents.add(reader.document(score.doc, selector));
    }

    return documents;

I'm always getting 0 documents. I know this is incorrect; I can see the values getting written to Cassandra when I run it in debug mode. Is this an issue with the precision step, or an issue with the Lucandra index reader implementation?

Thanks,
Todd
RE: Numeric range query not returning results
Hi Uwe,

My example wasn't very clear, as I have a load of other code in my actual implementation and I was trying to cut it down for clarity. This is actually my indexing service for my Datanucleus Cassandra plugin, so I have a 1 to 1 relationship where a single document corresponds to a persistent object. I actually create 5 separate documents, and I would expect 3 of those to be returned.

I've ported your entire set of tests for 32 and 64 bit numeric ranges over, and it unfortunately appears that Lucandra is still very broken in terms of numeric ranges, even after the Cassandra encoding fix for the 7-bit shift into UTF-8 characters. I'll hopefully be able to solve the bugs in the next few days.

Thanks again for your help; it's always appreciated.

Todd

On Mon, 2010-10-04 at 07:55 +0200, Uwe Schindler wrote:
> This test works perfectly and returns 1 document:
>
>     public void testToddNine() throws Exception {
>       RAMDirectory directory = new RAMDirectory();
>       IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true, MaxFieldLength.UNLIMITED);
>       try {
>         Document doc = new Document();
>         doc.add(new NumericField("LastLogin").setLongValue(1282197146L));
>         writer.addDocument(doc);
>         doc = new Document();
>         doc.add(new NumericField("LastLogin").setLongValue(1282197946L));
>         writer.addDocument(doc);
>       } finally {
>         writer.close();
>       }
>
>       NumericRangeQuery rangeQuery =
>         NumericRangeQuery.newLongRange("LastLogin", 1282197146L, 1282197146L, true, true);
>
>       IndexReader reader = IndexReader.open(directory, true);
>       try {
>         IndexSearcher searcher = new IndexSearcher(reader);
>         TopDocs docs = searcher.search(rangeQuery, 1000);
>         assertEquals(1, docs.totalHits);
>       } finally {
>         reader.close();
>       }
>     }
>
> Maybe you have one of the following problems:
>
> - Are you executing the same query that you created? In your example code the
>   searcher executed “query”, but the range query was in the “rangeQuery”
>   variable.
>
> - Are you sure that your document is not returned, rather than that you are
>   missing some stored fields? E.g. the default NumericField ctor does not
>   add the field as “stored” to the document:
>
>     public NumericField(String name)
>
>   Creates a field for numeric values using the default precisionStep
>   NumericUtils.PRECISION_STEP_DEFAULT (4). The instance is not yet
>   initialized with a numeric value, before indexing a document
>   containing this field, set a value using the various set???Value()
>   methods. This constructor creates an indexed, but not stored field.
>
> Uwe
RE: Numeric range query not returning results
I've determined the problem. It's the same end bug as we experienced with the Cassandra encoding and the term enumeration not being properly returned. I've outlined the issues in this bug on Lucandra:

http://github.com/tjake/Lucandra/issues/#issue/40

As you can see, the LucandraTermEnum does not enumerate terms in the same way as the SegmentTermEnum does when using the RAMDirectory. Uwe, can you please elaborate on how NumericRangeQuery.next() expects the underlying TermEnum to iterate over terms? It appears that it attempts to skip to the min term at full precision, then uses the most significant bits in the trie to cover the majority of the data. How does it expect to enumerate terms for the upper bound? Is my example below correct?

Min = 60077f7e6814 (encoded as 32 bits shifted to UTF-8 bytes)
Max = 60077f7e7111 (encoded as 32 bits shifted to UTF-8 bytes)
Step = 8

Start
60077f7e6814
60077f7e68
60077f7e
60077f7e71
60077f7e7111
End

Thanks,
Todd
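Editorial note: the question above is about how a numeric range is broken into sub-ranges at different precisions. The sketch below is a simplified plain-Java illustration of that idea, loosely modelled on the approach in Lucene's NumericUtils.splitLongRange. It is written from memory under assumptions, handles non-negative bounds only, and is not Lucene's actual code; the class and method names are hypothetical. Each emitted entry is a sub-range that can be matched with terms at the given shift: full-precision edges at the lower and upper bound, coarser prefixes for the middle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of trie range splitting for non-negative longs.
final class TrieRangeSketch {

    // Each entry is {lowerBound, upperBound, shift}, with bounds as full-precision values.
    static List<long[]> split(long min, long max, int step) {
        if (step < 1) throw new IllegalArgumentException("step must be >= 1");
        List<long[]> out = new ArrayList<>();
        if (min < 0 || min > max) return out; // sketch limitation: non-negative ranges only
        for (int shift = 0; ; shift += step) {
            long diff = 1L << (shift + step);
            long mask = ((1L << step) - 1L) << shift;
            boolean hasLower = (min & mask) != 0L;   // ragged edge at the lower bound
            boolean hasUpper = (max & mask) != mask; // ragged edge at the upper bound
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max;
            if (shift + step >= 64 || nextMin > nextMax || wrapped) {
                add(out, min, max, shift); // nothing left for a coarser level
                break;
            }
            if (hasLower) add(out, min, min | mask, shift);
            if (hasUpper) add(out, max & ~mask, max, shift);
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    private static void add(List<long[]> out, long lo, long hi, int shift) {
        // Fill the bits below 'shift' so the upper bound covers the whole prefix block.
        out.add(new long[] {lo, hi | ((1L << shift) - 1L), shift});
    }

    // Total number of values covered by the sub-ranges.
    static long covered(List<long[]> ranges) {
        long total = 0;
        for (long[] r : ranges) total += r[1] - r[0] + 1;
        return total;
    }

    public static void main(String[] args) {
        for (long[] r : split(0x123L, 0x456L, 4)) {
            System.out.println(Long.toHexString(r[0]) + ".." + Long.toHexString(r[1]) + " shift=" + r[2]);
        }
    }
}
```

In this model the lower and upper edges are enumerated at the finest shift and the middle at coarser shifts, so a term enumeration has to return terms of every shift in a consistent sorted order for each sub-range to be found.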
escaping of queries with solr
Hi guys,

We're migrating from Lucene to Solr. We have a lot of existing code that creates queries in memory with Lucene. Below is an example of such a query:

    BooleanQuery query = new BooleanQuery();

    BooleanQuery inputTerms = new BooleanQuery();
    inputTerms.add(new TermQuery(new Term(FIELD_EMAIL, input)), Occur.SHOULD);

    String numeric = getNumericString(input);
    if (StringUtils.hasText(numeric)) {
      inputTerms.add(new TermQuery(new Term(FIELD_PHONE, numeric)), Occur.SHOULD);
    }

    query.add(inputTerms, Occur.MUST);

    NumericRangeQuery time = NumericRangeQuery.newLongRange(FIELD_CREATETIME, null, endTime, true, true);
    query.add(time, Occur.MUST);

    return getResults(query, count);

As per an old thread on the mailing list, I can generally call query.toString(), which will generate a query I can send to Solr. However, we're getting some strings such as this:

    +(email_string_multi_index:+641112 phone_string_multi_index:641112) +rslved_string_index:false +crttime_long_index:[* TO 1284093338200]

which will generate this error:

    Encountered " "+" "+ "" at line 1, column 27.

The cause is the "+641112". Is there an equivalent for building a query tree in the SolrJ client that will properly escape the input sequence?

Thanks,
Todd
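Editorial note: query.toString() does not escape term text, so user input such as "+641112" ends up in the string as query syntax. One common approach is to escape each raw term before assembling the query string. SolrJ ships ClientUtils.escapeQueryChars and Lucene ships QueryParser.escape for this; it is worth checking their behaviour against the version in use. The plain-Java sketch below only illustrates the idea with a hand-rolled escaper: the class name and the exact set of special characters are assumptions, not a copy of either library's implementation.

```java
// Hypothetical escaper illustrating backslash-escaping of query syntax characters.
final class QueryEscapeSketch {

    // Assumed set of characters with special meaning in the query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    static String escapeTerm(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds "field:escapedValue" for a single term clause.
    static String termClause(String field, String rawValue) {
        return field + ":" + escapeTerm(rawValue);
    }

    public static void main(String[] args) {
        String input = "+641112";
        System.out.println("+(" + termClause("email_string_multi_index", input)
                + " " + termClause("phone_string_multi_index", "641112") + ")");
    }
}
```

The design choice here is to escape at the point where raw user input becomes term text, rather than trying to repair the string produced by query.toString() afterwards.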