Re: adding modes to the add command
On 1/11/07, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Jan 11, 2007, at 1:29 AM, Ryan McKinley wrote:
>> mode=add or replace fields
>> mode=add fields
>> mode=add distinct fields
>>
>> The reason I ask is that I would like to frequently update a few fields without having to know anything about the other fields.
>
> If you can implement a way to do this efficiently with Lucene, you will be my (and many others') hero!

I'm still new to this, so *please* don't mistake my ambition for hubris! I understand Lucene needs to add all document fields at the same time. I don't have any magic idea to change that. BUT it seems like Solr is a good place to offer a syntax that lets users feel like they are just updating a field rather than loading all fields and reindexing.

> In Lucene, updating a document is really a delete followed by an add. You will need to add the complete document, as there are no "update only a field" semantics in Lucene.
>
> If all fields are stored, the implementation could simply pull them all into memory on the Solr side and add the document as if it had been sent entirely by the client. But what happens for unstored fields?

For the unstored fields, is it possible to read the tokens (and all info) and then write them back directly? Does Lucene let you do this directly? Or would I need to write a Tokenizer that takes the old list of tokens and re-tokenizes them? I guess this would require a slightly different DocumentBuilder for 'updated' fields, where you would skip the analyzers defined by the schemaField.
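To make the stored versus unstored concern concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene or Solr code: `ToyIndex` and all of its methods are hypothetical names invented for illustration. It assumes an update is emulated as "reload the stored values, overlay the changes, delete, re-add", and shows that a field which was indexed but not stored silently drops out of the rebuilt document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only (hypothetical class, not a Lucene or Solr API).
final class ToyIndex {
    private final Set<String> storedFieldNames;
    // Values that can be read back, per document id.
    private final Map<String, Map<String, String>> storedValues = new HashMap<>();
    // Names of the fields that were indexed (searchable), per document id.
    private final Map<String, Set<String>> searchableFields = new HashMap<>();

    ToyIndex(Set<String> storedFieldNames) {
        this.storedFieldNames = new HashSet<>(storedFieldNames);
    }

    /** Adds a complete document; any earlier version with the same id is deleted first. */
    void addDocument(String id, Map<String, String> fields) {
        storedValues.remove(id);      // the "delete" half of an update
        searchableFields.remove(id);
        Map<String, String> stored = new HashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (storedFieldNames.contains(e.getKey())) {
                stored.put(e.getKey(), e.getValue());
            }
        }
        storedValues.put(id, stored); // the "add" half
        searchableFields.put(id, new HashSet<>(fields.keySet()));
    }

    /** Emulates a field update by rebuilding the document from stored values only. */
    void updateFields(String id, Map<String, String> changes) {
        Map<String, String> rebuilt =
                new HashMap<>(storedValues.getOrDefault(id, new HashMap<>()));
        rebuilt.putAll(changes);
        addDocument(id, rebuilt);
    }

    boolean isSearchable(String id, String field) {
        Set<String> names = searchableFields.get(id);
        return names != null && names.contains(field);
    }

    String storedValue(String id, String field) {
        Map<String, String> stored = storedValues.get(id);
        return stored == null ? null : stored.get(field);
    }
}
```

In this model, a document added with a stored `count` field and an unstored `body` field loses `body` as soon as `updateFields` changes `count`, which is the gap the question about re-reading tokens is trying to close.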
Re: adding modes to the add command
At 6:43 AM -0500 1/11/07, Erik Hatcher wrote:
> If all fields are stored, the implementation could simply pull them all into memory on the Solr side and add the document as if it had been sent entirely by the client. But what happens for unstored fields?

I'll observe that Luke has a Reconstruct and Edit function which displays the indexed values for each field for the selected Document when stored values aren't available... it iterates the entire inverted index and intersects each term position vector with the target Document ID via TermPositions.skipTo(id).

While that would be too slow to do on a per-update basis, it might be feasible for an update function if it cached a list of partially defined Documents and only at the end (at closing, or whenever the list grew past a defined maximum) did a bulk intersection to find indexed values which are not overridden with new values, with just a single traversal of the index in Term, then updated DocID, order. Once done, the reconstructed Documents could be added and the prior versions deleted.

The roadblocks come up when re-adding the indexed values to the index: while the updater can create a new untokenized, unstored Field for each indexed value so it is literally re-added, in that case there is no way to externally specify the position offset to match the original. DocumentWriter and the classes it relies on are package-private and final, so there is no way to interpose there. But an effective hack might be to set the reconstructed Fields to tokenized, but specify for those fields a special Analyzer which acts like KeywordAnalyzer but looks up the position offset in a table created by the update mechanism and returns it with the token.

A little convoluted, but probably doable if someone had the time and inclination.

- J.J.
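The bulk intersection described above can be sketched with plain Java collections. This is a toy model under stated assumptions, not Lucene code: `BulkReconstructor` is a hypothetical name, the inverted index for one field is modeled as term -> docId -> positions, and a map lookup stands in for `TermPositions.skipTo(id)`. The point is only the traversal shape: one pass over the terms, checking each term's postings against the whole set of pending documents.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Lucene, Luke, or Solr.
final class BulkReconstructor {
    private BulkReconstructor() {}

    /**
     * Rebuilds the token sequence of every pending document in a single pass
     * over the terms of one field.
     *
     * @param postings      term -> (docId -> positions of that term in the doc)
     * @param pendingDocIds ids of the partially defined documents awaiting update
     * @return docId -> (position -> term), ordered by position
     */
    static Map<Integer, SortedMap<Integer, String>> reconstruct(
            SortedMap<String, SortedMap<Integer, List<Integer>>> postings,
            SortedSet<Integer> pendingDocIds) {
        Map<Integer, SortedMap<Integer, String>> rebuilt = new TreeMap<>();
        // Outer loop: terms in sorted order, visited exactly once.
        for (Map.Entry<String, SortedMap<Integer, List<Integer>>> term : postings.entrySet()) {
            SortedMap<Integer, List<Integer>> docs = term.getValue();
            // Inner loop: pending doc ids in ascending order (the skipTo analogue).
            for (Integer docId : pendingDocIds) {
                List<Integer> positions = docs.get(docId);
                if (positions == null) {
                    continue; // this term does not occur in that document
                }
                for (Integer position : positions) {
                    rebuilt.computeIfAbsent(docId, k -> new TreeMap<>())
                           .put(position, term.getKey());
                }
            }
        }
        return rebuilt;
    }
}
```

The position -> term table this returns is roughly the lookup table the special Analyzer in the last paragraph would need in order to re-emit each token at its original position.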
Re: adding modes to the add command
On 1/11/07, J.J. Larrea [EMAIL PROTECTED] wrote:
> I'll observe that Luke has a Reconstruct and Edit function which displays the indexed values for each field for the selected Document when stored values aren't available... it iterates the entire inverted index and intersects each term position vector with the target Document ID via TermPositions.skipTo(id).

Right, and that's *very* slow for a large index.

IMO, it would be better to add a restriction that if you want the ability to update fields on an existing document, then all the fields must be stored.

One might be able to do some magic with ParallelIndex, but that too comes with its own set of problems, namely keeping the indices in sync.

-Yonik
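The "all fields must be stored" restriction could be enforced with a check along these lines. This is only a sketch: `PartialUpdateGuard` is a hypothetical class, and the schema is modeled as a plain map from field name to its stored flag rather than through any real Solr schema API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical guard for illustration; the schema is modeled as name -> stored flag.
final class PartialUpdateGuard {
    private PartialUpdateGuard() {}

    /** Returns the names of all fields that are not stored, sorted alphabetically. */
    static List<String> unstoredFields(Map<String, Boolean> storedByField) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : storedByField.entrySet()) {
            if (!Boolean.TRUE.equals(e.getValue())) {
                missing.add(e.getKey());
            }
        }
        Collections.sort(missing);
        return missing;
    }

    /** Rejects field-level updates unless every field in the schema is stored. */
    static void requireAllStored(Map<String, Boolean> storedByField) {
        List<String> missing = unstoredFields(storedByField);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Field updates require stored=true on: " + missing);
        }
    }
}
```

Failing fast like this keeps the update path simple: the old document can always be rebuilt from stored values, at the cost of a larger index.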
adding modes to the add command
How do you all feel about adding various modes to the add command? Something like:

  mode=add or replace document (default, the current behavior)
  mode=add or replace fields
  mode=add fields
  mode=add distinct fields

The reason I ask is that I would like to frequently update a few fields without having to know anything about the other fields. Ideally it would be from a command like this:

  <addFromSQL mode="add or replace fields"
      connection="jdbc:mysql://localhost/nblmc?username=xxx&amp;password=xxx"
      driver="com.mysql.jdbc.Driver"
      multifieldSeperator="\n">
    SELECT * FROM my_stats_table
  </addFromSQL>

(I'll save the discussion of addFromSQL ... for the next email)

To give you examples of how I think these modes would behave, consider you have a database with the following document:

  <doc>
    <field name="id">XYZ</field>
    <field name="count">10</field>
    <field name="cat">A</field>
    <field name="cat">B</field>
    <field name="cat">C</field>
  </doc>

AFTER:

  <add mode="add or replace fields">
    <doc>
      <field name="id">XYZ</field>
      <field name="cat">C</field>
      <field name="cat">D</field>
    </doc>
  </add>

You would have:

  <doc>
    <field name="id">XYZ</field>
    <field name="count">10</field>
    <field name="cat">C</field>
    <field name="cat">D</field>
  </doc>

-----

AFTER:

  <add mode="add fields">
    <doc>
      <field name="id">XYZ</field>
      <field name="cat">C</field>
      <field name="cat">D</field>
    </doc>
  </add>

You would have:

  <doc>
    <field name="id">XYZ</field>
    <field name="count">10</field>
    <field name="cat">A</field>
    <field name="cat">B</field>
    <field name="cat">C</field>
    <field name="cat">C</field>
    <field name="cat">D</field>
  </doc>

-----

AFTER:

  <add mode="add distinct fields">
    <doc>
      <field name="id">XYZ</field>
      <field name="cat">C</field>
      <field name="cat">D</field>
    </doc>
  </add>

You would have:

  <doc>
    <field name="id">XYZ</field>
    <field name="count">10</field>
    <field name="cat">A</field>
    <field name="cat">B</field>
    <field name="cat">C</field>
    <field name="cat">D</field>
  </doc>

I *think* it should even give the same result if it were given:

  <add mode="add distinct fields">
    <doc>
      <field name="id">XYZ</field>
      <field name="cat">C</field>
      <field name="cat">C</field>
      <field name="cat">C</field>
      <field name="cat">C</field>
      <field name="cat">D</field>
    </doc>
  </add>

-----

Although I am suggesting this for my immediate need (update from SQL), it seems like this would also be useful for anyone building a JavaScript 'tagging' library. The client could use the add distinct fields mode without worrying about the rest of the document.

Essentially, this would ask the server to load and merge documents. If it were added, I think it would go in DirectUpdateHandler.addDoc(AddUpdateCommand) [line 312], before checking dupes and overwrite stuff.

To accurately load the 'current' document, it would have to first check the updateHandler.searcher (in case the document is in the updating index), then check the 'real' index. BUT, checking the DirectUpdateHandler.searcher every time you add a document would not be great, because the searcher is closed each time.

Thoughts?

ryan
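The merge semantics of the four proposed modes can be written down as a small function. This is a sketch in plain Java and not existing Solr code: `AddMode`, `FieldMerger`, and the `uniqueKey` parameter are hypothetical names, documents are modeled as field name -> list of values, and it assumes the unique key field (here `id`) is always replaced rather than appended, since the examples never show a duplicated `id`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names for the four proposed modes; nothing here exists in Solr.
enum AddMode { REPLACE_DOCUMENT, REPLACE_FIELDS, ADD_FIELDS, ADD_DISTINCT_FIELDS }

final class FieldMerger {
    private FieldMerger() {}

    /**
     * Merges an incoming (possibly partial) document into the existing one.
     * The field named by uniqueKey is always replaced, never appended (an assumption).
     */
    static Map<String, List<String>> merge(Map<String, List<String>> existing,
                                           Map<String, List<String>> incoming,
                                           AddMode mode,
                                           String uniqueKey) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (mode != AddMode.REPLACE_DOCUMENT) {
            // Every mode except full replacement starts from a copy of the old document.
            for (Map.Entry<String, List<String>> e : existing.entrySet()) {
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        for (Map.Entry<String, List<String>> e : incoming.entrySet()) {
            String name = e.getKey();
            boolean replace = mode == AddMode.REPLACE_DOCUMENT
                    || mode == AddMode.REPLACE_FIELDS
                    || name.equals(uniqueKey);
            if (replace) {
                result.put(name, new ArrayList<>()); // drop the old values of this field
            }
            List<String> values = result.computeIfAbsent(name, k -> new ArrayList<>());
            for (String value : e.getValue()) {
                if (mode == AddMode.ADD_DISTINCT_FIELDS && values.contains(value)) {
                    continue; // skip values already present, including repeats in the input
                }
                values.add(value);
            }
        }
        return result;
    }
}
```

Applied to the XYZ document above with incoming `cat` values C and D, this yields `cat` = C, D for the replace fields mode, A, B, C, C, D for add fields, and A, B, C, D for add distinct fields, with `count` left untouched in all three.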
Re: adding modes to the add command
On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote:
> How do you all feel about adding various modes to the add command? Something like:
>
> mode=add or replace document (default, the current behavior)
> mode=add or replace fields
> mode=add fields
> mode=add distinct fields...

Although this is useful, as you explain, I like the current simplicity of the Solr HTTP/XML interface very much. The more options we add, the harder it becomes to understand and test the interface.

So, IMHO, it would be good for this functionality to be provided in a plugin, disabled by default.

IIUC the modifications to the Solr core are minimal, so these can be included in the core, but (again IMHO, this is debatable for sure) the public interface to this should be provided by a special plugin.

-Bertrand