Re: adding modes to the add command

2007-01-11 Thread Ryan McKinley

On 1/11/07, Erik Hatcher [EMAIL PROTECTED] wrote:


On Jan 11, 2007, at 1:29 AM, Ryan McKinley wrote:
 mode=add or replace fields
 mode=add fields
 mode=add distinct fields

 The reason i ask is that i would like to frequently update a few
 fields without having to know anything about the other fields.

If you can implement a way to do this efficiently with Lucene, you
will be my (and many others') hero!



I'm still new to this, so *please* don't mistake my ambition for hubris!

I understand lucene needs to add all document fields at the same time.
I don't have any magic idea to change that.  BUT it seems like solr
is a good place to offer a syntax that lets users feel like
they are just updating a field rather than loading all fields and
reindexing.


In Lucene, updating a document is really a delete
followed by an add.  You will need to add the complete document, as
there are no "update only a field" semantics in Lucene.

If all fields are stored, the implementation could simply pull them
all into memory on the Solr side and add the document as if it had
been sent entirely by the client.  But what happens for unstored
fields?



for the unstored fields, is it possible to read the tokens (and all
info) and then write them back directly?  Does lucene let you do this
directly?  or would i need to write a Tokenizer that takes the old
list of tokens and re-tokenizes them?

I guess this would require a slightly different DocumentBuilder for
'updated' fields where you would skip the analyzers defined by the
schemaField.


Re: adding modes to the add command

2007-01-11 Thread J.J. Larrea
At 6:43 AM -0500 1/11/07, Erik Hatcher wrote:
If all fields are stored, the implementation could simply pull them all into 
memory on the Solr side and add the document as if it had been sent entirely 
by the client.  But what happens for unstored fields?

I'll observe that Luke has a Reconstruct and Edit function which displays the 
indexed values for each field for the selected Document when stored values 
aren't available... it iterates the entire inverted index and intersects each 
term position vector with the target Document ID via TermPositions.skipTo(id).

While that would be too slow to do on a per-update basis, it might be feasible 
for an update function if it cached a list of partially defined Documents and 
only at the end (at closing or whenever the list grew past a defined maximum) 
did a bulk intersection to find indexed values which are not overridden with 
new values, with just a single traversal of the index in Term then updated 
DocID order.  Once done the reconstructed Documents could be added and the 
prior versions deleted.
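
To make the batched reconstruction idea concrete, here is a minimal Python sketch. It does not use Lucene at all: the "index" is a toy dictionary mapping a (field, term) pair to its postings (document ID to list of positions), and `reconstruct` is a hypothetical name. It only illustrates walking the terms once and collecting, for every document in the pending batch, the indexed terms and their positions.

```python
def reconstruct(index, doc_ids):
    """Rebuild indexed tokens for a batch of documents in one pass.

    index:   toy inverted index, {(field, term): {doc_id: [positions]}}
             (an assumed stand-in for a real term dictionary and postings)
    doc_ids: IDs of the partially defined documents awaiting an update
    Returns {doc_id: {field: [(position, term), ...]}} sorted by position.
    """
    targets = set(doc_ids)
    rebuilt = {doc_id: {} for doc_id in targets}
    # Single traversal of the index in term order.
    for field, term in sorted(index):
        postings = index[(field, term)]
        # Visit only the target documents, in document ID order.
        for doc_id in sorted(d for d in postings if d in targets):
            for position in postings[doc_id]:
                rebuilt[doc_id].setdefault(field, []).append((position, term))
    # Put each field's tokens back into their original position order.
    for fields in rebuilt.values():
        for tokens in fields.values():
            tokens.sort()
    return rebuilt
```

The cost of the traversal is paid once per batch rather than once per updated document, which is the point of deferring the work until the list is flushed.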

The roadblocks come up when re-adding the indexed values to the index: while 
the updater can create a new untokenized unstored Field for each indexed value 
so it is literally re-added, in that case there is no way to externally specify 
the position offset to match the original.  DocumentWriter and the classes it 
relies on are package-private and final, so no way to interpose there.  But an 
effective hack might be to set the reconstructed Fields to tokenized but 
specify for those fields a special Analyzer which acts like Keyword Analyzer 
but looks up the position offset in a table created by the update mechanism and 
returns it with the token.  A little convoluted but probably doable if someone 
had the time and inclination.
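
A rough sketch of the bookkeeping such a position-aware analyzer would need, written in Python rather than against the Lucene API. Lucene token streams carry a position increment relative to the previous token (with the first token at position 0 when its increment is 1), so a table of absolute positions has to be converted to increments before the tokens are replayed. The function name and the table layout are assumptions for illustration only.

```python
def to_position_increments(tokens):
    """Convert absolute positions into per-token position increments.

    tokens: iterable of (position, term) pairs recovered from the index
    Returns a list of (term, increment) in replay order. Terms sharing a
    position get an increment of 0 after the first one.
    """
    replay = []
    previous = -1  # so that a first token at position 0 gets increment 1
    for position, term in sorted(tokens):
        replay.append((term, position - previous))
        previous = position
    return replay
```

For example, tokens at positions 0, 2 and 2 come back with increments 1, 2 and 0, which preserves the gap and the stacked term.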

- J.J.



Re: adding modes to the add command

2007-01-11 Thread Yonik Seeley

On 1/11/07, J.J. Larrea [EMAIL PROTECTED] wrote:

I'll observe that Luke has a Reconstruct and Edit function which displays the 
indexed values for each field for the selected Document when stored values aren't 
available... it iterates the entire inverted index and intersects each term position 
vector with the target Document ID via TermPositions.skipTo(id).


Right, and that's *very* slow for a large index.  IMO, it would be
better to add a restriction that if you want the ability to update
fields on an existing document, then all the fields must be stored.
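
As a sketch of what that restriction could look like, the Python below refuses a field update unless every field in a toy schema is marked as stored, and otherwise performs the delete-then-add described earlier in the thread. The `store` and `schema` dictionaries and the `update_fields` name are hypothetical and are not Solr's actual classes.

```python
def update_fields(store, schema, doc_id, changes):
    """Update some fields of a stored document via delete followed by add.

    store:   {doc_id: {field: value}}, a toy stand-in for stored fields
    schema:  {field: {"stored": bool}}, a toy stand-in for the schema
    changes: {field: new_value}
    Raises ValueError if any schema field is not stored, and KeyError if
    the document does not exist.
    """
    unstored = sorted(
        name for name, opts in schema.items() if not opts.get("stored", False)
    )
    if unstored:
        raise ValueError("cannot update, unstored fields: " + ", ".join(unstored))
    doc = dict(store.pop(doc_id))  # delete the old version
    doc.update(changes)            # apply the field-level changes
    store[doc_id] = doc            # add the complete replacement document
    return doc
```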

One might be able to do some magic with ParallelIndex, but that too
comes with its own set of problems, namely keeping the indices in
sync.

-Yonik


adding modes to the add command

2007-01-10 Thread Ryan McKinley

How do you all feel about adding various modes to the add command?

Something like:
mode=add or replace document (default, the current behavior)
mode=add or replace fields
mode=add fields
mode=add distinct fields

The reason i ask is that i would like to frequently update a few
fields without having to know anything about the other fields.
Ideally it would be from a command like this:

<addFromSQL
 mode="add or replace fields"
 connection="jdbc:mysql://localhost/nblmc?username=xxx&amp;password=xxx"
 driver="com.mysql.jdbc.Driver"
 multifieldSeperator="\n">
 SELECT * FROM my_stats_table
</addFromSQL>

(I'll save the discussion of addFromSQL ... for the next email)

To give you examples of how i think these modes would behave,
consider you have a database with the following document:

<doc>
 <field name="id">XYZ</field>
 <field name="count">10</field>
 <field name="cat">A</field>
 <field name="cat">B</field>
 <field name="cat">C</field>
</doc>



AFTER:
<add mode="add or replace fields">
<doc>
 <field name="id">XYZ</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>
</add>

You would have:
<doc>
 <field name="id">XYZ</field>
 <field name="count">10</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>

--

AFTER:
<add mode="add fields">
<doc>
 <field name="id">XYZ</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>
</add>

You would have:
<doc>
 <field name="id">XYZ</field>
 <field name="count">10</field>
 <field name="cat">A</field>
 <field name="cat">B</field>
 <field name="cat">C</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>

-

AFTER:
<add mode="add distinct fields">
<doc>
 <field name="id">XYZ</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>
</add>

You would have:
<doc>
 <field name="id">XYZ</field>
 <field name="count">10</field>
 <field name="cat">A</field>
 <field name="cat">B</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>

I *think* it should even give the same result if it were given:
<add mode="add distinct fields">
<doc>
 <field name="id">XYZ</field>
 <field name="cat">C</field>
 <field name="cat">C</field>
 <field name="cat">C</field>
 <field name="cat">C</field>
 <field name="cat">D</field>
</doc>
</add>
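
The behavior of the proposed modes can be summarized in a short Python sketch. This is not Solr code: documents are modeled as dictionaries from field name to a list of values, `merge_fields` is a hypothetical name, and treating the `id` field as a key that is never duplicated is an assumption drawn from the examples above.

```python
MODES = (
    "add or replace document",
    "add or replace fields",
    "add fields",
    "add distinct fields",
)

def merge_fields(existing, update, mode, key="id"):
    """Merge `update` into `existing` according to the proposed mode.

    Both documents map a field name to a list of values. Returns a new
    document and leaves the inputs unmodified.
    """
    if mode not in MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    if mode == "add or replace document":
        # Current behavior: the incoming document wins entirely.
        return {name: list(values) for name, values in update.items()}
    merged = {name: list(values) for name, values in existing.items()}
    for name, values in update.items():
        if name == key:
            continue  # assumed: the key only identifies the document
        if mode == "add or replace fields":
            merged[name] = list(values)
        elif mode == "add fields":
            merged.setdefault(name, []).extend(values)
        else:  # "add distinct fields"
            current = merged.setdefault(name, [])
            for value in values:
                if value not in current:
                    current.append(value)
    return merged
```

With the XYZ document above and an update carrying cat values C and D, the three field-level modes yield the cat lists [C, D], [A, B, C, C, D] and [A, B, C, D] respectively.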

---

Although I am suggesting this for my immediate need (update from SQL),
it seems like this would also be useful for anyone building a
javascript 'tagging' library.  The client could use the add distinct
fields mode without worrying about the rest of the document.

Essentially, this would ask the server to load and merge documents.
If it were added, i think it would go in
DirectUpdateHandler.addDoc(AddUpdateCommand) [line 312] before
checking dupes and overwrite stuff.  To accurately load the 'current'
document, it would have to first check the updateHandler.searcher (in
case the document is in the updating index) then check the 'real'
index.  BUT, checking the DirectUpdateHandler.searcher every time you
add a document would not be great because the searcher is closed each
time.
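
The lookup order described here (pending updates first, then the 'real' index) can be sketched in a few lines of Python. `ToyUpdateHandler` is a hypothetical stand-in that keeps both sets of documents in dictionaries; it says nothing about how DirectUpdateHandler actually manages its searcher.

```python
class ToyUpdateHandler:
    """Toy model of an update handler with pending and committed documents."""

    def __init__(self):
        self.committed = {}  # id -> doc, what the 'real' index would return
        self.pending = {}    # id -> doc, added but not yet committed

    def lookup(self, doc_id):
        """Return the newest version of a document, or None if unknown."""
        if doc_id in self.pending:
            return self.pending[doc_id]    # the updating index wins
        return self.committed.get(doc_id)  # fall back to the 'real' index

    def add(self, doc):
        """Queue a complete document; it replaces any pending version."""
        self.pending[doc["id"]] = doc

    def commit(self):
        """Make all pending documents visible in the committed index."""
        self.committed.update(self.pending)
        self.pending.clear()
```

Keeping pending documents in a map keyed by ID is one way to avoid reopening a searcher on every add, at the cost of holding them in memory until the commit.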

Thoughts?

ryan


Re: adding modes to the add command

2007-01-10 Thread Bertrand Delacretaz

On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote:

How do you all feel about adding various modes to the add command?

Something like:
 mode=add or replace document (default, the current behavior)
 mode=add or replace fields
 mode=add fields
 mode=add distinct fields...


Although this is useful as you explain, I like the current simplicity
of the Solr HTTP/XML interface very much.

The more options we add, the harder it becomes to understand and test
the interface.

So, IMHO, it would be good for this functionality to be provided in a
plugin, disabled by default. IIUC the modifications to the Solr core
are minimal, so these can be included in the core, but (again IMHO, this is
debatable for sure) the public interface to this should be provided by
a special plugin.

-Bertrand