Help with Numeric Range

2010-06-22 Thread Todd Nine
Hi all,
  I'm new to Lucene, as well as Cassandra.  I'm working on the Lucandra
project to modify it to add some extra functionality.  It hasn't been
fully tested with range queries, so I've created some tests and
contributed them.  You can view my source here.

http://github.com/tnine/Lucandra/blob/master/test/lucandra/NumericRangeTests.java

First, is this a sensible test?  I'm specifically testing the case of
longs where I need millisecond precision on my searches. 


Second, I see that Numeric Fields are built via terms.  I think the
issue lies in the encoding of these terms into bytes for the Cassandra
keys.  Can anyone point me to some documentation on numeric queries and
terms, and how they are encoded at the byte level based on the
precision?

Thanks,
Todd


RE: Help with Numeric Range

2010-06-23 Thread Todd Nine
Hi Uwe,

  Thank you for your help, it is greatly appreciated.  Unfortunately, my
tests all fail except for RangeInclusive.  I've changed the step to be 6
as per your recommendation.  I had it at max to eliminate step precision
as the cause of the test failure.  Essentially, all keys in Cassandra
are UTF-8 keys.  In Lucandra, the keys are constructed in the
following way.

1. Get the token stream for the field.  In this case it's a
NumericTokenStream with (numeric,valSize=64,precisionStep=6)
2. For all tokens in the stream, create a UTF8 String in the following
format \u
3. Set the term frequency to 1

This gives us a list of tokens, prefixed with the field name and the
delimiter.  Then we do this:

for each term from above create a key of the format
\u\u and write it to TermInfo
column Family

After debugging the implementation of the LucandraTermEnum, it is
correctly returning values that should match my numeric range query.
However, I never get the results in the TopDocs result set after they're
handed back to the numeric range query object.  Any ideas why this is
happening?

Thanks,
Todd




On Wed, 2010-06-23 at 08:53 +0200, Uwe Schindler wrote:

> Hi Todd,
> 
> I am not sure if I understand your problem correctly. I am not familiar with 
> Lucandra/Cassandra at all, but if Lucandra implements the IndexWriter and 
> IndexReader according to the documentation, numeric queries should work. A 
> NumericField internally creates a TokenStream and "analyzes" the number into 
> several tokens, which are somewhat "half binary" (they are terms consisting of 
> characters in the full 0..127 range for optimal UTF-8 compression with 3.x 
> versions of Lucene). The exact encoding can be looked up in the NumericUtils 
> class and its javadocs.
> 
> About your testcase: The test looks good, so does it fail? If yes, where is 
> the problem? You can also look into Lucene's test TestNumericRangeQuery64 for 
> more examples. Or modify its @BeforeClass to instead build a Lucandra index. 
> 
> The test does one thing that is not intended to be done that way:
> numeric = new NumericField("long", Integer.MAX_VALUE, Store.YES, true);
> 
> You are using MAX_VALUE as the precision step, which would slow down all queries to 
> the speed of old-style TermRangeQueries. It is always better to stick with 
> the default of 4, which creates 64 bits / 4 precStep = 16 terms per value. 
> Alternatively, for longs, 6 is a good precision step (see the NumericRangeQuery 
> documentation). MAX_VALUE is only intended for fields that do not do numeric 
> ranges but are, for example, only sorted on. precisionStep is a performance tuning parameter; 
> it has nothing to do with better or worse precision on terms or with different query 
> results. If you are using NumericRangeQuery with this large precStep, you are 
> not using the numeric features at all, so your test should not behave 
> differently from a conventional TermRangeQuery with padded terms.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
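Editor's sketch: the reply above says that a numeric value is indexed as several terms made of 7-bit characters, one term per precision level. The plain-Java class below illustrates that idea without depending on Lucene. It is modeled on the scheme documented for NumericUtils in Lucene 3.x, but the class name, the method names and the 0x20 prefix offset are assumptions for illustration and should be checked against the NumericUtils javadocs before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

class NumericTermSketch {
    // Assumption: the first char of each term carries the shift, offset by 0x20.
    static final int SHIFT_START = 0x20;

    /** Encodes val with its lowest `shift` bits dropped, using 7 payload bits per char. */
    static String prefixCoded(long val, int shift) {
        if (shift < 0 || shift > 63) {
            throw new IllegalArgumentException("shift must be in 0..63");
        }
        int nChars = (63 - shift) / 7 + 1;
        char[] buf = new char[nChars + 1];
        buf[0] = (char) (SHIFT_START + shift);
        // Flip the sign bit so that unsigned order of the bits equals signed order of val.
        long bits = (val ^ Long.MIN_VALUE) >>> shift;
        for (int i = nChars; i >= 1; i--) {
            buf[i] = (char) (bits & 0x7f); // every char stays in 0..127
            bits >>>= 7;
        }
        return new String(buf);
    }

    /** One term per precision level: shift = 0, step, 2 * step, ... while below 64. */
    static List<String> termsFor(long val, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<String> terms = new ArrayList<>();
        // A long counter avoids int overflow when precisionStep is Integer.MAX_VALUE.
        for (long shift = 0; shift < 64; shift += precisionStep) {
            terms.add(prefixCoded(val, (int) shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        long millis = 1284093337200L;
        System.out.println(termsFor(millis, 4).size());                 // 16
        System.out.println(termsFor(millis, 6).size());                 // 11
        System.out.println(termsFor(millis, Integer.MAX_VALUE).size()); // 1
    }
}
```

With a precision step of Integer.MAX_VALUE only the full-precision term exists, which matches the remark that such a field behaves like a conventional padded-term range. A storage layer that turns these terms into keys has to keep every character and their order intact for range enumeration to work.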


Filtering times stored as longs in HEX

2010-09-16 Thread Todd Nine
Hi all,
  I'm using Lucandra to index notes in our system.  Since we can't use
numeric fields due to a bug in Cassandra (fixed in 0.7), I'm encoding
all times as epoch milliseconds in hex, then storing the hex string.  I
have the following fields on my document.

createdDate
phoneNumber
email


I want to perform a query where the input is either a phone number, or
an email.  The user also passes in an epoch timestamp (long in
milliseconds), and the count.  I need to return all documents with a
timestamp <= the given timestamp, and the maximum count.  I'm having
some trouble building this query in my code.  I never get any results,
but I can see the data is written to the index properly.  Here is my
code.

http://pastie.org/private/xzvnntmyjzxgpjgctxftrq



Thanks,
Todd
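Editor's sketch: the post does not show how the hex strings are produced, so the following is only one thing worth checking, not a diagnosis. If the "timestamp <= value" condition is evaluated as a lexicographic comparison of terms, the hex strings need a fixed width, otherwise shorter strings sort incorrectly. The helper name toSortableHex is hypothetical, and the sketch assumes non-negative epoch values.

```java
class SortableHex {
    /** Zero-padded, fixed-width hex. Order-preserving only for values >= 0. */
    static String toSortableHex(long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("negative values are not supported by this sketch");
        }
        return String.format("%016x", epochMillis);
    }

    public static void main(String[] args) {
        // Unpadded hex: "ff" sorts after "1000" although 255 < 4096.
        System.out.println(Long.toHexString(255L).compareTo(Long.toHexString(4096L)) < 0); // false

        // Padded hex keeps string order equal to numeric order.
        System.out.println(toSortableHex(255L).compareTo(toSortableHex(4096L)) < 0);       // true
    }
}
```

The same encoding has to be applied both when indexing and when building the bounds of the range, since a padded index term never matches an unpadded query bound.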






Help with numeric ranges and querying with sorting and counts

2010-09-19 Thread Todd Nine
Hi all,
  Now that the nasty bug in Cassandra has been fixed, I can use numeric
fields in my Lucandra for searching and sorting.  I'm having a bit of an
issue I could use a hand with.  We're creating an SoS index.  Each
Document corresponds to an SoS.  Every person contacted for the SoS will
be indexed by their email address, their phone numbers, and the time the
SoS was created.  I have an api with the following functionality.


public List getAll(String input, long endTime, int count) {


Where the string is either a phone number or an email address, the
endTime is the epoch end time to seek up to, and the count is the number
of records to return.  I basically want to perform the following search

(email: input OR phone: input) AND endtime:[Long.MIN_VALUE TO endTime]

Then return the last "count" values that are closest to the end time.
Here is how I'm creating my document.

Document doc = new Document();
DocumentUtils.setRowKey(doc, sos.getId().toString());

doc.add(new Field(FIELD_IMEI, sos.getImeiNumber(), Store.NO,
        Index.NOT_ANALYZED));

doc.add(new Field(FIELD_TRACKIDX, getHex(sos.getTrackIndexTime()),
        Store.NO, Index.NOT_ANALYZED));

doc.add(new Field(FIELD_TIER, getHex(sos.getTier().getStoredValue()),
        Store.NO, Index.NOT_ANALYZED));

doc.add(new NumericField(FIELD_CREATETIME)
        .setLongValue(sos.getCreatedTime().getTime()));

doc.add(new Field(FIELD_RESOLVED, getHex(sos.isResolved()), Store.NO,
        Index.NOT_ANALYZED));

if (sos.getNotes() != null) {
    doc.add(new Field(FIELD_NOTES, sos.getNotes(), Store.NO,
            Index.ANALYZED));
}

if (sos.isResolved()) {
    doc.add(new NumericField(FIELD_RESOLVETIME)
            .setLongValue(sos.getResolvedTime().getTime()));
}

for (ContactedPerson person : sos.getContactedPeople()) {
    doc.add(new Field(FIELD_EMAIL, person.getEmail(), Store.NO,
            Index.NOT_ANALYZED));

    for (Phone phone : person.getPhones()) {
        doc.add(new Field(FIELD_PHONE, getNumericString(phone.getNumber()),
                Store.NO, Index.NOT_ANALYZED));
    }
}


Here is how I'm creating my query.




BooleanQuery query = new BooleanQuery();

BooleanQuery inputTerms = new BooleanQuery();

inputTerms.add(new TermQuery(new Term(FIELD_EMAIL, input)),
        Occur.SHOULD);

inputTerms.add(new TermQuery(new Term(FIELD_PHONE, getNumericString(input))),
        Occur.SHOULD);

query.add(inputTerms, Occur.MUST);

NumericRangeQuery time = NumericRangeQuery.newLongRange(
        FIELD_CREATETIME, null, endTime, true, true);

query.add(time, BooleanClause.Occur.MUST);

And my sorter:

SortField sort = new SortField(FIELD_CREATETIME, SortField.LONG, true);



Finally here is some sample test data.

SOS 1: createdTime = 1284093337200L
SOS 2: createdTime = 1284093337200L + 1;
SOS 3: createdTime = 1284093337200L + 2;

My search criteria are as follows.

input = "f...@bar.com" (each record has this email address on it)
time = SOS 3 created time.

I expect to get 3 records, but instead I'm only getting 1.  It's
something specific to this query, as I have similar queries that work
properly for the numeric range and sorting.  Any ideas what is wrong
with my query?

Thanks,
Todd
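Editor's sketch: when a query returns fewer hits than expected, it can help to pin down the intended semantics in plain Java and use that as a test oracle against the index. The class below does not use Lucene at all. It simply restates "(email OR phone) AND createdTime <= endTime, newest first, at most count results" over an in-memory list. All names (SosOracle, Sos, the sample address) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SosOracle {
    /** Minimal stand-in for an indexed SoS record (one contact per record for brevity). */
    static final class Sos {
        final String email;
        final String phone;
        final long createdTime;

        Sos(String email, String phone, long createdTime) {
            this.email = email;
            this.phone = phone;
            this.createdTime = createdTime;
        }
    }

    /** Records whose email or phone equals input and whose createdTime <= endTime, newest first. */
    static List<Sos> getAll(List<Sos> all, String input, long endTime, int count) {
        List<Sos> hits = new ArrayList<>();
        for (Sos s : all) {
            boolean contactMatches = input.equals(s.email) || input.equals(s.phone);
            if (contactMatches && s.createdTime <= endTime) {
                hits.add(s);
            }
        }
        // Descending by creation time, mirroring a reversed sort on the time field.
        hits.sort(Comparator.comparingLong((Sos s) -> s.createdTime).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(count, hits.size())));
    }

    public static void main(String[] args) {
        long base = 1284093337200L;
        List<Sos> all = new ArrayList<>();
        all.add(new Sos("user@example.com", "641112", base));
        all.add(new Sos("user@example.com", "641112", base + 1));
        all.add(new Sos("user@example.com", "641112", base + 2));
        System.out.println(getAll(all, "user@example.com", base + 2, 10).size()); // 3
    }
}
```

Comparing the oracle's result with the TopDocs returned for the same data shows whether the range clause, the contact clause, or the sort is dropping documents.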



Numeric range query not returning results

2010-10-03 Thread Todd Nine
Hi all,
  I'm having some issues with Numeric Range queries not working as
expected.  My underlying storage medium is the Lucandra index reader and
writer, so I'm not sure if this is an issue within Lucandra or with my
usage of numeric field.  My numeric range tests that are copies of Uwe's
pass in the Lucandra source, so I have a feeling it's my usage.  I have
a simple test case, with 5 people.  I have a Date field, the LastLogin
field.  This date is converted to epoch milliseconds, and stored in the
index in the following way.

NumericField numeric = new NumericField("LastLogin");
numeric.setLongValue(fieldValue);
doc.add(numeric);

Where I have the following 2 field values on 2 documents.

1282197146L and 1282197946L

I then perform the following query.

NumericRangeQuery rangeQuery = NumericRangeQuery.newLongRange(
        "LastLogin", 1282197146L, 1282197146L, true, true);

IndexReader reader = new IndexReader(columnFamily, getContext(conn));
IndexSearcher searcher = new IndexSearcher(reader);

TopDocs docs = searcher.search(query, maxResults);

List documents = new ArrayList(docs.totalHits);

Set fields = new HashSet();
fields.add(IndexDocument.ROWKEY);
fields.add(IndexDocument.IDSERIALIZED);

SetBasedFieldSelector selector = new SetBasedFieldSelector(fields, null);

for (ScoreDoc score : docs.scoreDocs) {
    documents.add(reader.document(score.doc, selector));
}

return documents;

I'm always getting 0 documents.  I know this is incorrect, because I can
see the values getting written to Cassandra when I run it in debug mode.  Is
this an issue with precision step, or an issue with the Lucandra index
reader implementation?

Thanks,
Todd


RE: Numeric range query not returning results

2010-10-04 Thread Todd Nine
Hi Uwe,
  My example wasn't very clear, as I have a load of other code in my
actual implementation and I was trying to cut it down for clarity.  This
is actually my indexing service for my Datanucleus Cassandra plugin, so
I have a 1 to 1 relationship where a single document corresponds to a
Persistent object.  I actually create 5 separate documents, and I would
expect 3 of those to be returned.  I've ported your entire set of tests
for 32 and 64 bit numeric range tests over, and it unfortunately appears
that Lucandra is still very broken in terms of numeric ranges even after
the Cassandra encoding fix for the 7bit shift into UTF 8 characters.
I'll hopefully be able to solve the bugs in the next few days.  Thanks
again for your help, it's always appreciated.

Todd




On Mon, 2010-10-04 at 07:55 +0200, Uwe Schindler wrote:
> This test works perfectly and returns 1 document:
> 
>  
> 
>   public void testToddNine() throws Exception {
>     RAMDirectory directory = new RAMDirectory();
>     IndexWriter writer = new IndexWriter(directory,
>         new WhitespaceAnalyzer(), true, MaxFieldLength.UNLIMITED);
>     try {
>       Document doc = new Document();
>       doc.add(new NumericField("LastLogin").setLongValue(1282197146L));
>       writer.addDocument(doc);
>
>       doc = new Document();
>       doc.add(new NumericField("LastLogin").setLongValue(1282197946L));
>       writer.addDocument(doc);
>     } finally {
>       writer.close();
>     }
>
>     NumericRangeQuery rangeQuery = NumericRangeQuery.newLongRange(
>         "LastLogin", 1282197146L, 1282197146L, true, true);
>
>     IndexReader reader = IndexReader.open(directory, true);
>     try {
>       IndexSearcher searcher = new IndexSearcher(reader);
>       TopDocs docs = searcher.search(rangeQuery, 1000);
>       assertEquals(1, docs.totalHits);
>     } finally {
>       reader.close();
>     }
>   }
> 
>  
> 
> Maybe you have one of the following problems:
> 
> -  Are you executing the same query that you created? In your
> example code the searcher executed “query”, but the range query
> variable was named “rangeQuery”.
> 
> -  Are you sure that your document is not returned? Perhaps you are
> only missing some stored fields. For example, the default NumericField
> constructor does not create the field as “stored” in the document:
> 
>  
> 
> public NumericField(String name)
> 
> Creates a field for numeric values using the default precisionStep
> NumericUtils.PRECISION_STEP_DEFAULT (4). The instance is not yet
> initialized with a numeric value, before indexing a document
> containing this field, set a value using the various set???Value()
> methods. This constructor creates an indexed, but not stored field.
> 
>  
> 
> Uwe
> 
>  
> 
> -
> 
> Uwe Schindler
> 
> H.-H.-Meier-Allee 63, D-28213 Bremen
> 
> http://www.thetaphi.de
> 
> eMail: u...@thetaphi.de
> 
>  
> 
>  
> 

RE: Numeric range query not returning results

2010-10-06 Thread Todd Nine
I've determined the problem.  It's the same end bug as we experienced
with the Cassandra encoding and the term enumeration not being properly
returned.  I've outlined the issues in this bug on Lucandra.

http://github.com/tjake/Lucandra/issues/#issue/40

As you can see, the LucandraTermEnum does not enumerate terms in the
same way as the SegmentTermEnum does when using a RAMDirectory.  Uwe,
can you please elaborate on how NumericRangeQuery.next() expects the
underlying TermEnum to iterate over terms?  It appears that it attempts
to skip to the min term with the full trie, then uses the most
significant bits in the trie to get the majority of the data.  How does
it expect to enumerate terms for the upper bound?  Is my example below
correct?


Min  = 60077f7e6814 (encoded as 32 bits shifted to UTF 8 bytes)
Max = 60077f7e7111 (encoded as 32 bits shifted to UTF 8 bytes)

Step = 8

Start

60077f7e6814
60077f7e68
60077f7e
60077f7e71
60077f7e7111

End.

Thanks,
Todd
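Editor's sketch: the question above is how a trie-based range is broken into sub-ranges at different precisions. The plain-Java class below illustrates the general splitting idea: the ragged lower and upper edges are matched at the current precision, and the aligned middle is pushed up to the next, coarser precision. It is a simplified sketch written for this note, restricted to non-negative bounds, and is not Lucene's source. The class and method names are hypothetical, and the hex values in the post are not reused because they appear to be encoded terms rather than raw numbers.

```java
import java.util.ArrayList;
import java.util.List;

class TrieRangeSketch {
    /** All values v with lo <= v <= hi, to be matched by terms at the given shift. */
    static final class SubRange {
        final long lo;
        final long hi;
        final int shift;

        SubRange(long lo, long hi, int shift) {
            this.lo = lo;
            this.hi = hi;
            this.shift = shift;
        }
    }

    static List<SubRange> split(long min, long max, int step) {
        if (step < 1 || step > 63 || min < 0) {
            throw new IllegalArgumentException("sketch supports step 1..63 and min >= 0");
        }
        List<SubRange> out = new ArrayList<>();
        if (min > max) {
            return out;
        }
        for (int shift = 0; ; shift += step) {
            long diff = 1L << (shift + step);         // size of one block at the next level
            long mask = ((1L << step) - 1L) << shift; // bits handled at this level
            long low = (1L << shift) - 1L;            // bits already dropped at this level
            boolean hasLower = (min & mask) != 0L;    // min is not aligned to the next level
            boolean hasUpper = (max & mask) != mask;  // max does not end a next-level block
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean wrapped = nextMin < min || nextMax > max;
            if (shift + step >= 64 || nextMin > nextMax || wrapped) {
                // No coarser level left or needed: emit what remains and stop.
                out.add(new SubRange(min, max | low, shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(min, min | mask | low, shift));
            }
            if (hasUpper) {
                out.add(new SubRange(max & ~mask, max | low, shift));
            }
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    public static void main(String[] args) {
        for (SubRange r : split(0x6814L, 0x7111L, 8)) {
            System.out.println(Long.toHexString(r.lo) + ".." + Long.toHexString(r.hi)
                    + " at shift " + r.shift);
        }
    }
}
```

For min = 0x6814, max = 0x7111 and a step of 8, this sketch yields two full-precision edge ranges (0x6814..0x68ff and 0x7100..0x7111) and one range at shift 8 covering 0x6900..0x70ff. Each sub-range is then a contiguous run of terms sharing one shift prefix, so a term enumeration has to return terms in sorted order within each prefix and support seeking to the first term of each sub-range.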




escaping of queries with solr

2011-02-10 Thread Todd Nine
Hi guys,
  We're migrating from Lucene to Solr.  We have a lot of existing code
that created queries in memory with Lucene.  Below is an example of such
a query.


BooleanQuery query = new BooleanQuery();

BooleanQuery inputTerms = new BooleanQuery();
inputTerms.add(new TermQuery(new Term(FIELD_EMAIL, input)),
Occur.SHOULD);

String numeric = getNumericString(input);

if (StringUtils.hasText(numeric)) {
inputTerms.add(new TermQuery(new Term(FIELD_PHONE, 
numeric)),
Occur.SHOULD);
}

query.add(inputTerms, Occur.MUST);

NumericRangeQuery time = NumericRangeQuery.newLongRange(
FIELD_CREATETIME, null, endTime, true, true);

query.add(time, Occur.MUST);

return getResults(query, count);


As per an old thread in the mailing list, I can generally call
query.toString(), which will generate a query I can send to Solr.
However, we're getting some strings such as this.

+(email_string_multi_index:+641112
phone_string_multi_index:641112) +rslved_string_index:false
+crttime_long_index:[* TO 1284093338200]':

This generates the following error:

Encountered " "+" "+ "" at line 1, column 27.

The cause is the "+641112".  

Is there an equivalent for building a query tree in the SolrJ client
that will properly escape the input sequence?

Thanks,
Todd
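Editor's sketch: the error comes from an unescaped "+" inside a term value once the query is flattened to a string. One common approach is to escape each user-supplied value before it is placed into the query string. As far as I know, SolrJ provides ClientUtils.escapeQueryChars and Lucene provides QueryParser.escape for this purpose. The dependency-free class below only illustrates the idea. Its name and its list of special characters are assumptions and may not match either library exactly.

```java
class QueryEscapeSketch {
    // Assumed set of query-syntax characters; verify against the parser in use.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    /** Prefixes every special or whitespace character with a backslash. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The field name is taken from the failing query in the post.
        System.out.println("email_string_multi_index:" + escape("+641112"));
        // prints email_string_multi_index:\+641112
    }
}
```

Escaping has to be applied to the raw values only, not to the finished query string, because the operators that structure the query (the leading "+" on clauses, the brackets of the range) must stay unescaped.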