[
https://issues.apache.org/jira/browse/SOLR-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Noble Paul updated SOLR-828:
----------------------------
Description:
This is the same as SOLR-139. A new issue is opened so that the {{UpdateProcessor}}
approach is highlighted and we can easily focus on that solution.
The new {{UpdateProcessor}} (called {{UpdateableIndexProcessor}}) must be
inserted before {{RunUpdateProcessor}} (a rough wiring sketch follows the list below).
* The {{UpdateProcessor}} must add an update method.
* {{AddUpdateCommand}} gets a new boolean field {{append}}. If {{append=true}},
values of multivalued fields are appended; otherwise the old values are removed
and the new ones are added.
* The schema must have a {{<uniqueKey>}}.
* {{UpdateableIndexProcessor}} registers {{postCommit}}/{{postOptimize}} listeners.
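A minimal wiring sketch of how such a processor could plug into the existing {{UpdateRequestProcessor}} chain. The class names and the {{req}} field below are assumptions based on this description, not existing Solr code:
{code:java}
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical sketch: the factory is listed before RunUpdateProcessorFactory in the
// updateRequestProcessorChain of solrconfig.xml, so every update command passes through
// UpdateableIndexProcessor before reaching RunUpdateProcessor.
public class UpdateableIndexProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateableIndexProcessor(req, next);
  }
}

class UpdateableIndexProcessor extends UpdateRequestProcessor {
  private final SolrQueryRequest req;

  UpdateableIndexProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
    super(next);   // 'next' ends up being RunUpdateProcessor
    this.req = req;
    // the processAdd/processDelete/processCommit overrides are sketched in the sections below
  }
}
{code}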
h1. Implementation
{{UpdateableIndexProcessor}} uses a DB (JDBC / Berkeley DB Java Edition?) to store the
data. Each document will be a row in the DB, with the uniqueKey of the document
used as the primary key. The data will be written as a BLOB into a DB
column, in NamedListCodec serialized format. The NamedListCodec in its current
form is inefficient, but it is possible to enhance it (SOLR-810).
The schema of the table would be:
* DATA : LONGVARBINARY : the NamedListCodec-serialized document data
* COMMITTED : BOOLEAN
* BOOST : DOUBLE
* FIELD_BOOSTS : VARBINARY : NamedListCodec-serialized boosts of each field
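For illustration, a plain-JDBC sketch of creating such a table. The table/column names, the id column width, and the exact SQL types are assumptions (they would vary with the chosen database), and {{jdbcUrl}} is a hypothetical parameter:
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class BackupStoreSetup {
  /** Creates the backup table described above; jdbcUrl points at the embedded DB. */
  public static void createTable(String jdbcUrl) throws SQLException {
    try (Connection con = DriverManager.getConnection(jdbcUrl);
         Statement st = con.createStatement()) {
      st.executeUpdate(
          "CREATE TABLE backup_docs (" +
          "  id           VARCHAR(256) PRIMARY KEY," + // the document's uniqueKey
          "  data         LONGVARBINARY," +            // NamedListCodec-serialized document
          "  committed    BOOLEAN," +                  // false until seen in the main index
          "  boost        DOUBLE," +                   // document boost
          "  field_boosts VARBINARY(8192)" +           // NamedListCodec-serialized per-field boosts
          ")");
    }
  }
}
{code}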
h1. Implementation of various methods
h2. {{processAdd()}}
{{UpdateableIndexProcessor}} writes the serialized document to the DB
(COMMITTED=false), then calls {{processAdd()}} on the next {{UpdateProcessor}}.
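A hedged sketch of what {{processAdd()}} could look like in the JDBC variant. Here {{con}} (the processor's JDBC connection), {{uniqueKeyField}} and {{serialize(...)}} (standing in for NamedListCodec marshalling) are assumed fields/helpers of the processor, not existing API:
{code:java}
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.solrDoc;
  String id = doc.getFieldValue(uniqueKeyField).toString();
  // delete-then-insert keeps the sketch portable; a real implementation might use an upsert
  try (PreparedStatement del = con.prepareStatement("DELETE FROM backup_docs WHERE id = ?");
       PreparedStatement ins = con.prepareStatement(
           "INSERT INTO backup_docs (id, data, committed, boost) VALUES (?, ?, FALSE, ?)")) {
    del.setString(1, id);
    del.executeUpdate();
    ins.setString(1, id);
    ins.setBytes(2, serialize(doc));          // hypothetical NamedListCodec serialization helper
    ins.setDouble(3, doc.getDocumentBoost()); // document boost column
    ins.executeUpdate();
  } catch (SQLException e) {
    throw new IOException(e);
  }
  super.processAdd(cmd);                      // continue the chain (eventually RunUpdateProcessor)
}
{code}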
h2. {{processDelete()}}
For a delete-by-query, {{UpdateableIndexProcessor}} gets the Searcher from the core,
finds the documents which match the query and deletes them from the data table. If it
is a delete-by-id, the document with that id is deleted from the data table. Then it
calls the next {{UpdateProcessor}}.
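A sketch of the delete-by-id branch; the delete-by-query branch would additionally have to run the query against the main index to collect matching ids and is left out here. {{con}} is the assumed JDBC connection from the earlier sketches:
{code:java}
@Override
public void processDelete(DeleteUpdateCommand cmd) throws IOException {
  if (cmd.id != null) {
    // delete-by-id: remove the single matching row from the data table
    try (PreparedStatement ps = con.prepareStatement("DELETE FROM backup_docs WHERE id = ?")) {
      ps.setString(1, cmd.id);
      ps.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  } else {
    // delete-by-query: resolve the query against the main index to find the ids,
    // then delete those rows from the data table (omitted in this sketch)
  }
  super.processDelete(cmd);   // forward to the next UpdateProcessor
}
{code}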
h2. {{processCommit()}}
Call the next {{UpdateProcessor}}.
h2. On {{postCommit}}/{{postOptimize}}
{{UpdateableIndexProcessor}} gets all the documents from the data table which
have COMMITTED=false. If a document is present in the main index it is marked
as COMMITTED=true, else it is deleted, because a deleteByQuery would have
deleted it.
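A sketch of that reconciliation pass as a method on the processor; {{existsInMainIndex(id)}} is a hypothetical helper that would look the id up via a searcher on the main index:
{code:java}
/** Runs after commit/optimize: reconcile uncommitted rows against the main index. */
void reconcileAfterCommit() throws SQLException, IOException {
  try (PreparedStatement select = con.prepareStatement(
           "SELECT id FROM backup_docs WHERE committed = FALSE");
       PreparedStatement markCommitted = con.prepareStatement(
           "UPDATE backup_docs SET committed = TRUE WHERE id = ?");
       PreparedStatement remove = con.prepareStatement(
           "DELETE FROM backup_docs WHERE id = ?");
       ResultSet rs = select.executeQuery()) {
    while (rs.next()) {
      String id = rs.getString(1);
      if (existsInMainIndex(id)) {      // hypothetical lookup against the main index
        markCommitted.setString(1, id); // the doc made it into the index: keep it and mark it
        markCommitted.executeUpdate();
      } else {
        remove.setString(1, id);        // never committed: a deleteByQuery must have removed it
        remove.executeUpdate();
      }
    }
  }
}
{code}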
h2. {{processUpdate()}}
{{UpdateableIndexProcessor}} checks for the document first in the data table. If it is
present, the document is read. If it is not present, the missing fields are read
from the main index, and the backup document is prepared.
The single-valued fields are used from the incoming document (if present); the
others are filled from the backup doc. If append=true, all the multivalued values
from the backup document are added to the incoming document; otherwise the values
from the backup document are not used if they are also present in the incoming document.
{{processAdd()}} is called on the next {{UpdateProcessor}}.
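A sketch of the field-merging rule described above, using the regular {{SolrInputDocument}} API. {{isMultiValued(name)}} is a hypothetical helper that would consult the schema ({{SchemaField#multiValued()}}):
{code:java}
/** Merge the backup document into the incoming one according to the append flag. */
SolrInputDocument merge(SolrInputDocument incoming, SolrInputDocument backup, boolean append) {
  SolrInputDocument merged = new SolrInputDocument();
  // start from the incoming document: incoming values always take part
  for (String name : incoming.getFieldNames()) {
    for (Object v : incoming.getFieldValues(name)) {
      merged.addField(name, v);
    }
  }
  for (String name : backup.getFieldNames()) {
    boolean inIncoming = incoming.getFieldValues(name) != null;
    if (!inIncoming) {
      // field missing in the incoming doc: fill it from the backup doc
      for (Object v : backup.getFieldValues(name)) {
        merged.addField(name, v);
      }
    } else if (append && isMultiValued(name)) {
      // append mode: backup values of multivalued fields are kept as well
      for (Object v : backup.getFieldValues(name)) {
        merged.addField(name, v);
      }
    }
    // otherwise the incoming values replace the backup values
  }
  return merged;
}
{code}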
h2. New {{BackupIndexRequestHandler}} registered automatically at {{/backup}}
This exposes the data present in the backup indexes. The user must be able to
get any document by id by invoking {{/backup?id=<value>}} (multiple id values
can be sent, e.g. {{id=1&id=2&id=4}}). This helps the user to query the backup index
and construct the new doc if he wishes to do so. The
{{BackupIndexRequestHandler}} does a commit on *temp.backup.index*. It first
searches the *temp.backup.index* with the id. If the document is not found,
then it searches the *backup.index*. If it finds the document(s), they are returned.
h2. Next steps
The datastore can be optimized by not storing the stored fields in the DB. That
can be another iteration.
was:
This is the same as SOLR-139. A new issue is opened so that the {{UpdateProcessor}}
approach is highlighted and we can easily focus on that solution.
The new {{UpdateProcessor}} (called {{UpdateableIndexProcessor}}) must be
inserted before {{RunUpdateProcessor}}.
* The {{UpdateProcessor}} must add an update method.
* {{AddUpdateCommand}} gets a new boolean field {{append}}. If {{append=true}},
values of multivalued fields are appended; otherwise the old values are removed
and the new ones are added.
* The schema must have a {{<uniqueKey>}}.
* {{UpdateableIndexProcessor}} registers {{postCommit}}/{{postOptimize}} listeners.
h1. Implementation
{{UpdateableIndexProcessor}} maintains two separate Lucene indexes for the backup:
* *temp.backup.index* : This index stores (not indexed) all the fields in the
document (except the uniqueKey, which is stored and indexed).
* *backup.index* : This index stores (not indexed) all the fields (except the
uniqueKey, which is stored and indexed) which are not stored in the main index,
plus the fields which are targets of copyField.
h1. Implementation of various methods
h2. {{processAdd()}}
{{UpdateableIndexProcessor}} writes the document to *temp.backup.index*, then calls
the next {{UpdateProcessor}}.
h2. {{processDelete()}}
For a delete-by-query, {{UpdateableIndexProcessor}} gets the Searcher from the core,
finds the documents which match the query and deletes them from *backup.index*. If it
is a delete-by-id, the document with that id is deleted from *temp.backup.index*. Then
it calls the next {{UpdateProcessor}}.
h2. {{processCommit()}}
Call the next {{UpdateProcessor}}.
h2. On {{postCommit}}/{{postOptimize}}
{{UpdateableIndexProcessor}} commits the *temp.backup.index*, then reads all the
documents from the *temp.backup.index* one by one. If a document is present
in the main index it is copied to *backup.index*, else it is thrown away,
because a deleteByQuery would have deleted it. Finally it commits the
*backup.index*. *temp.backup.index* is destroyed after that. A new
*temp.backup.index* is recreated when new documents are added.
h2. {{processUpdate()}}
{{UpdateableIndexProcessor}} commits the *temp.backup.index* and checks for the
document first in *temp.backup.index*. If it is present, the document is read. If
it is not present, it checks in *backup.index*. If it is present there, it gets the
searcher from the main index, reads all the missing fields from there, and the
backup document is prepared.
The single-valued fields are used from the incoming document (if present); the
others are filled from the backup doc. If append=true, all the multivalued values
from the backup document are added to the incoming document; otherwise the values
from the backup document are not used if they are also present in the incoming document.
h2. New {{BackupIndexRequestHandler}} registered automatically at {{/backup}}
This exposes the data present in the backup indexes. The user must be able to
get any document by id by invoking {{/backup?id=<value>}} (multiple id values
can be sent, e.g. {{id=1&id=2&id=4}}). This helps the user to query the backup index
and construct the new doc if he wishes to do so. The
{{BackupIndexRequestHandler}} does a commit on *temp.backup.index*. It first
searches the *temp.backup.index* with the id. If the document is not found,
then it searches the *backup.index*. If it finds the document(s), they are returned.
Issue Type: New Feature (was: Improvement)
The old approach is more work compared to the DB approach, and it was not good for
very fast updates/commits.
> A RequestProcessor to support updates
> -------------------------------------
>
> Key: SOLR-828
> URL: https://issues.apache.org/jira/browse/SOLR-828
> Project: Solr
> Issue Type: New Feature
> Reporter: Noble Paul
> Fix For: 1.4
>
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.