RE: external values source

2013-04-22 Thread Maciej Liżewski
Hi Timothy,

Thank you for your answer - it is really helpful. Just to clarify: when using 
a ValueSource, the flow is something like this:
- the user sends a query
- Solr calls the ValueSource to prepare values for every document (I guess this 
part is cached in the ExternalFileField implementation)
- Solr runs the query

Is the above flow valid in every use case of ValueSource? There are no 
pre-calculated values, etc. (just asking to make it clear)? What caching 
scenario is recommended here to make sure you don't end up with a different 
cached entry for every query (I think I would follow the example of 
ExternalFileField)?

Another thing is that in most cases the array of values created in this process 
is rather sparse, so I was wondering whether there is some other way to store 
them in association with the document index.
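
To make the caching question concrete, here is a minimal sketch of what I have 
in mind, in plain Java with no Solr classes. It assumes values are cached once 
per index reader (not per query) and stored sparsely in a map keyed by doc 
number instead of a dense array. PerReaderValueCache, valueFor and the loader 
function are hypothetical names for illustration only, not existing Solr API.

```java
import java.util.Map;
import java.util.Objects;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Hypothetical helper: one sparse value table per index reader, shared by all queries. */
final class PerReaderValueCache {
    // Weak keys: an entry can be collected once its reader key is no longer referenced.
    // The loaded map must not hold a strong reference to its own key.
    private final Map<Object, Map<Integer, Float>> cache = new WeakHashMap<>();
    private final Function<Object, Map<Integer, Float>> loader;

    PerReaderValueCache(Function<Object, Map<Integer, Float>> loader) {
        this.loader = Objects.requireNonNull(loader);
    }

    /** Returns the external value for doc, loading the reader's table on first use. */
    synchronized float valueFor(Object readerKey, int doc, float defaultValue) {
        Map<Integer, Float> values = cache.get(readerKey);
        if (values == null) {
            // Assumption: the loader never returns null (an empty map means "no values").
            values = Objects.requireNonNull(loader.apply(readerKey));
            cache.put(readerKey, values);
        }
        Float v = values.get(doc);
        return v == null ? defaultValue : v; // sparse: most docs have no entry
    }
}
```

With something like this, two queries against the same reader would share one 
table, and only documents that actually have an external value take up memory.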

--
Maciej Liżewski


-Original Message-
From: Timothy Potter [mailto:thelabd...@gmail.com] 
Sent: Saturday, April 20, 2013 2:02 AM
To: solr-user@lucene.apache.org
Subject: Re: external values source

Hi Maciek,

I think a custom ValueSource is definitely what you want because you need to 
compute some derived value based on an indexed field and some external value.

The trick is figuring how to make the lookup to the external data very, very 
fast. Here's a rough sketch of what we do:

We have a table in a database that holds a numeric value per user and 
organization, which can be read with a query such as:

select num from table where userId='bob' and orgId=123 (similar to what you 
stated in question #4)

On the Solr side, documents are indexed with a user_id_s field, which is half 
of what I need to do my lookup. The orgId is determined by the Solr client at 
query construction time, so it is passed to my custom ValueSource (aka 
function) in the query. In our app, users can be associated with many different 
orgIds and these associations change frequently, so we can't index them.

To do the lookup to the database, we have a custom ValueSource, something like: 
dbLookup(user_id_s, 123)

(note: user_id_s is the name of the field holding my userID values in the index 
and 123 is the orgId)

Behind the scenes, the ValueSource will have access to the user_id_s field 
values using FieldCache, something like:

    final BinaryDocValues dv =
        FieldCache.DEFAULT.getTerms(reader.reader(), "user_id_s");

This gives us fast access to the user_id_s value for any given doc (question #1 
above). So now we can return an IntDocValues instance by doing:

    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext reader)
            throws IOException {
        final BytesRef br = new BytesRef();
        final BinaryDocValues dv =
            FieldCache.DEFAULT.getTerms(reader.reader(), fieldName);
        return new IntDocValues(this) {
            @Override
            public int intVal(int doc) {
                dv.get(doc, br);
                if (br.length == 0)
                    return 0; // doc has no value in the field

                // the indexed userID for doc
                final String user_id_s = br.utf8ToString();
                int val = 0;
                // todo: do custom lookup with orgID and user_id_s to compute
                // int value for doc
                return val;
            }
        };
        ...
    }

In this code, fieldName is set in the constructor (not shown) by parsing it out 
of the parameters, something like:

    this.fieldName =
        ((org.apache.solr.schema.StrFieldSource)source).getField();

The user_id_s field comes into your ValueSource as a StrFieldSource (or 
whatever type you use) ... here is how the ValueSource gets constructed at 
query time:

    public class MyValueSourceParser extends ValueSourceParser {
        public void init(NamedList namedList) {}

        public ValueSource parse(FunctionQParser fqp) throws SyntaxError {
            return new MyValueSource(fqp.parseValueSource(), fqp.parseArg());
        }
    }

There is one instance of your ValueSourceParser created per core. The parse 
method gets called for every query that uses the ValueSource.

At query time, I might use the ValueSource to return this computed value in my 
fl list, such as:

fl=id,looked_up:dbLookup(user_id_s,123),...

Or to sort by:

sort=dbLookup(user_id_s,123) desc

The data in our table doesn't change that frequently, so we export it to a flat 
file in S3; our custom ValueSource downloads the file from S3 and transforms it 
into an in-memory HashMap for fast lookups. We thought about just issuing a 
query to load the data from the db directly, but we have many nodes, the query 
is expensive and the result set is large, so we didn't want to hammer our 
database with N Solr nodes querying for the same data at roughly the same time. 
So we do it once and post the compressed results to a shared location. The data 
in the table is sparse compared to the number of documents and userIds we have.

We simply poll S3 for changes every few minutes, which is good enough for us. 
This happens from many nodes in a large Solr Cloud cluster running in EC2, so 
S3 works well for us as a distribution mechanism.
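
To illustrate the in-memory lookup, here is a minimal sketch in plain Java. It 
is not our production code: ExternalLookupTable, reload and lookup are 
hypothetical names, and the "userId,orgId,num" line format is only an 
assumption about what the exported flat file could look like. The S3 download 
and the polling are left out; the idea is that whatever polls for changes calls 
reload with the lines of the new file.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory table for (userId, orgId) -> num, rebuilt per export. */
final class ExternalLookupTable {
    // volatile: query threads always see a complete map, never a half-built one.
    private volatile Map<String, Integer> table = Collections.emptyMap();

    // orgId is numeric, so putting it first keeps the key unambiguous
    // even if a userId contains ':'.
    private static String key(String userId, int orgId) {
        return orgId + ":" + userId;
    }

    /** Parses lines of the assumed form "userId,orgId,num" and swaps in the new table. */
    void reload(Iterable<String> lines) {
        Map<String, Integer> fresh = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length != 3) {
                continue; // skip blank or malformed rows
            }
            int orgId = Integer.parseInt(cols[1].trim());
            int num = Integer.parseInt(cols[2].trim());
            fresh.put(key(cols[0].trim(), orgId), num);
        }
        table = fresh; // single reference swap; old map is garbage collected
    }

    /** Returns 0 for pairs missing from the sparse table. */
    int lookup(String userId, int orgId) {
        Integer v = table.get(key(userId, orgId));
        return v == null ? 0 : v;
    }
}
```

The todo in intVal would then become something like 
val = lookupTable.lookup(user_id_s, orgId), with lookupTable shared by all 
queries on the node. Building the new map off to the side and swapping the 
reference means queries never block while a fresh export is being parsed.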

external values source

2013-04-19 Thread Maciej Liżewski
I need some explanation on how ValueSource and related classes work.

There is already ExternalFileField, and an example of how to load data from a
database
(http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html).

But they all fetch ALL data into memory, which may consume a large amount of
it. Also, documents are referenced by the 'doc' integer value.

 

My questions:

1) Does the 'doc' value point to a document in the whole index? If so, how do
I get the value of one of that document's fields (for example, a field named
'id')?

2) Is it possible to create a ValueSource, FieldType (or similar interface
that provides external data for sorting and for query results) which works
only on some subset of documents and uses the external source's capabilities
to fetch document-related data?

3) How does it all behave (memory consumption, hashtable access speed, etc.)
when there are a lot of documents in the index (tens of millions, for
example)?

4) Are there any other examples of loading external data from a database (I
want a numerical 'rate' from a simple table with two columns: 'document
unique key' string, 'rate' integer/float) that are real-life examples rather
than just proofs of concept?

 

Any help and hints appreciated

TIA

 

--

Maciek



merging query results with data from other source

2013-04-05 Thread Maciej Liżewski
OK, my case is like this: I have a Solr index with some documents that must
be left intact. I also need to store some data related to those documents
somewhere else (it can be an SQL database or another Solr core).

In other words, I need some data stored independently of the main Solr index
(for example tagging, user rating, etc.), but I then need to use that data in
queries against the Solr index.

Now, what do I need to extend/replace to be able to:

1) filter Solr queries with such remote data (I can fetch the IDs of
documents that should be listed and I need to intersect them somehow with
the query results)?

2) somehow extend the returned results (the documents themselves, or an
additional section in the response similar to the highlighter) and provide
related data (from the external source) with the selected documents?
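
For point 1, the simplest thing I can think of is sketched below: turn the
externally fetched IDs into a filter query (fq) so that Solr does the
intersection itself. This is only a sketch under assumptions: IdFilterBuilder
and buildFilter are hypothetical names, the unique key field is assumed to be
called 'id', and a long ID list would presumably run into the
maxBooleanClauses limit, so it can only work for small sets.

```java
import java.util.Collection;
import java.util.StringJoiner;

/** Hypothetical helper: builds an fq value from document IDs fetched elsewhere. */
final class IdFilterBuilder {
    /** Returns e.g. id:("a" OR "b"); each ID is quoted so it is matched literally. */
    static String buildFilter(String field, Collection<String> ids) {
        if (ids.isEmpty()) {
            // No IDs means "match nothing"; leave that decision to the caller.
            throw new IllegalArgumentException("ids must not be empty");
        }
        StringJoiner joiner = new StringJoiner(" OR ", field + ":(", ")");
        for (String id : ids) {
            // Escape backslashes first, then double quotes, inside the quoted term.
            String escaped = id.replace("\\", "\\\\").replace("\"", "\\\"");
            joiner.add("\"" + escaped + "\"");
        }
        return joiner.toString();
    }
}
```

The resulting string would be sent as an fq parameter next to the normal q,
which leaves the index itself untouched.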

 

Any help appreciated.

 

--

Maciej Liżewski