Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex
Hi Erick,

thanks a lot! This looks like a good idea: our queries with the changeable fields fit the join idea from https://issues.apache.org/jira/browse/SOLR-2272 because
- we do not need relevance ranking
- we can separate them into a conjunction of a query on the changeable fields and a query on our other, stable fields

So we can use something like
q=stablefields:query1
fq={!join from=changeable_fields_doc_id to=stable_fields_doc_id}changeablefields:query2

The only drawback compared to the ParallelReader solution is that our stored fields and term vectors will be split across two Lucene documents, which is OK in our use case.

Best regards
Karsten

in context: http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

Original Message
Date: Wed, 3 Aug 2011 22:11:08 -0400
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex

[Erick's reply quoted here; it appears in full later in this thread]
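[Editor's sketch] Assembling such a request could look roughly like the following. The host/core URL and all field names are illustrative placeholders taken from the example query above, not from a real schema, and this assumes Solr 4.x join support per SOLR-2272:

```java
// Sketch only: builds the join request discussed above.
// The Solr URL and all field names are placeholders.
import java.net.URLEncoder;

public class JoinQuerySketch {

    static String buildUrl(String stableQuery, String changeableQuery)
            throws Exception {
        // The main query hits only the stable fields; the filter query
        // joins in the separately rebuilt core holding the two
        // changeable link fields.
        String fq = "{!join from=changeable_fields_doc_id"
                  + " to=stable_fields_doc_id}" + changeableQuery;
        return "http://localhost:8983/solr/select"
             + "?q=" + URLEncoder.encode(stableQuery, "UTF-8")
             + "&fq=" + URLEncoder.encode(fq, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUrl("stablefields:query1",
                                    "changeablefields:query2"));
    }
}
```

Note that because relevance ranking is not needed here, putting the join into fq (a filter query) rather than q also lets Solr cache the join result.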
Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex
How are these fields used? Because if they're not used for searching, you could put them in their own core and rebuild that index at your whim, then query that core when you need the relationship information. If you have a DB backing your system, you could perhaps store the info there and query that (but I like the second core better <g>).

But if you could use a separate index just for the relationships, you wouldn't have to deal with the slow re-indexing of all the docs...

Best
Erick

On Mon, Aug 1, 2011 at 4:12 AM, karsten-s...@gmx.de wrote:
[original question quoted here; it appears in full later in this thread]
Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex
Hi Erick,

our two changeable fields are used for linking between documents at the application level. From Lucene's point of view they are just two searchable fields, with a stored term vector for one of them. Our queries will use one of these fields plus a couple of the stable fields. So the question is really about updating two fields in an existing Lucene index with more than fifty other fields.

Best regards
Karsten

P.S. about our linking between documents:
Our two fields are called outgoingLinks and possibleIncomingLinks. Our source documents have an abstract and a couple of metadata fields. We use regular expressions to find outgoing links in the abstract, i.e. a couple of words which indicate
1. that the author made a reference (like "in my previous work published as 'Very important Article' in Nature 2010, 12 page 7")
2. that this reference contains metadata identifying another document
Each of these links is transformed into a special key (2010NaturNr12Page7). On the other side, we transform the metadata into all possible keys. This key generation grows with our knowledge of possible link patterns. For the Lucene indexer this is a black box: there is a service which produces the outgoing and possibleIncoming keys from our source (XML) documents, and these keys must be searchable in Lucene/Solr.

P.P.S. in context: http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

Original Message
Date: Wed, 3 Aug 2011 09:57:03 -0400
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex

[Erick's reply and the original question quoted here; both appear in full elsewhere in this thread]
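[Editor's sketch] The key extraction described in the P.S. could look roughly like the following toy example. The citation pattern and the key format are invented for illustration only; the real service is, as stated, a black box with many more patterns:

```java
// Toy sketch of link-key extraction: turns a citation fragment like
// "in Nature 2010, 12 page 7" into a fused searchable key. The real
// pattern catalogue and key format are not public; this is illustrative.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutgoingLinkKeys {

    // Hypothetical pattern: "in <Journal> <year>, <issue> page <page>"
    private static final Pattern CITATION = Pattern.compile(
            "in (\\p{Lu}\\w+) (\\d{4}), (\\d+) page (\\d+)");

    public static List<String> extractKeys(String abstractText) {
        List<String> keys = new ArrayList<String>();
        Matcher m = CITATION.matcher(abstractText);
        while (m.find()) {
            // Fuse year, journal, issue and page into one key term,
            // which is then indexed in the outgoingLinks field.
            keys.add(m.group(2) + m.group(1) + "Nr" + m.group(3)
                     + "Page" + m.group(4));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(extractKeys(
                "my previous work published as 'Very important Article' "
                + "in Nature 2010, 12 page 7"));
    }
}
```

The same key generator would run on the metadata side to produce the possibleIncomingLinks terms, so that a match between the two fields establishes a link.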
Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex
Hmmm, the only thing that comes to mind is the join feature being added to Solr 4.x, but I confess I'm not entirely familiar with that functionality, so I can't tell if it really solves your problem. Other than that I'm out of ideas, but then again it's late and I'm tired, so maybe I'm not being very creative <g>...

Best
Erick

On Aug 3, 2011 11:40 AM, karsten-s...@gmx.de wrote:
Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex
Hi lucene/solr-folk,

Issue:
Our documents are stable except for two fields which are used for linking between the docs. So we would like to update these two fields in a batch once a month (possibly once a week). We cannot reindex all docs once a month, because we use XeLDA in some fields for stemming (morphological analysis), and XeLDA is slow. We have 14 million docs (less than 100 GByte for the main index, plus 3 GByte for these two changeable fields). In the next half year we will be migrating our search engine from Verity K2 to Solr, so we could wait for Solr 4.0 (btw, any news about http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html ?).

Solution?
Our issue is exactly the purpose of ParallelReader. But Solr does not support ParallelReader (for a good reason: http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624 ). So I see two possible ways to solve our issue:

1. Wait for the new parallel incremental indexing ( https://issues.apache.org/jira/browse/LUCENE-1879 ) and hope that Solr will integrate it.
Pro:
- nothing to do for us except waiting.
Contra:
- I did not find anything of the (old) patch in the current trunk.

2. Change the Lucene index below/without Solr in a batch:
a) Each month, generate a new index containing only our two changed fields (e.g. with DIH).
b) Use FilterIndex and ParallelReader to mock a correct index.
c) "Merge" this mock index into a new index (via IndexWriter.addIndexes(IndexReader...)).
Pro:
- The patch for https://issues.apache.org/jira/browse/LUCENE-1812 should be a good example of how to do this.
Contra:
- The relation between docId and document insertion order is not a guaranteed feature of DIH (e.g. we will have to split the main index to ensure that no merge occurs in/after DIH).
- To run this batch, Solr has to be stopped and restarted.
- Even if we know that our two fields should change only for a subset of the docs, we nevertheless have to reindex these two fields for all the docs.

Any comments, hints or tips? Is there a third (better) way to solve our issue? Is there already a working example of the second solution? Will LUCENE-1879 (parallel incremental indexing) be part of Solr 4.0?

Best regards
Karsten
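[Editor's sketch] Steps b) and c) of solution 2 can be sketched against the Lucene 3.x API roughly as follows. Directory paths are placeholders; the sketch assumes both indexes contain exactly the same documents in exactly the same order (the fragile precondition listed under Contra), and that the stable index no longer carries old copies of the two link fields, which is what the FilterIndexReader step from LUCENE-1812 would achieve:

```java
// Sketch only: view the big stable index and the small rebuilt index
// with the two link fields side by side via ParallelReader, then write
// the combined view out as one new index. Paths are placeholders.
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class RewriteParallelIndex {
    public static void main(String[] args) throws Exception {
        // Both readers must have identical doc counts and ordering:
        // ParallelReader matches documents purely by docId.
        ParallelReader parallel = new ParallelReader();
        parallel.add(IndexReader.open(
                FSDirectory.open(new File("stable-index"))));
        parallel.add(IndexReader.open(
                FSDirectory.open(new File("link-fields-index"))));

        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_33,
                new StandardAnalyzer(Version.LUCENE_33));
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("merged-index")), cfg);
        // addIndexes copies the combined docs into the new index.
        writer.addIndexes(parallel);
        writer.close();
        parallel.close();
    }
}
```

Solr would then be pointed at merged-index on restart, which is why the batch requires the stop/restart listed under Contra.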