Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-04 Thread karsten-solr
Hi Erick,

thanks a lot!
This looks like a good idea:
Our queries on the changeable fields fit the join idea from
https://issues.apache.org/jira/browse/SOLR-2272
because
 - we do not need relevance ranking
 - we can separate each query into a conjunction of a query on the changeable 
fields and a query on our other, stable fields.
So we can use something like
q=stablefields:query1&fq={!join from=changeable_fields_doc_id 
to=stable_fields_doc_id}changeablefields:query2

The only drawback compared to the ParallelReader solution is that our stored 
fields and term vectors will be split across two Lucene docs, which is OK in 
our use case.
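
Spelled out as a full (unencoded) request, the combined query could look as 
follows; the {!join} local-params syntax is the one from SOLR-2272, the 
host/port and handler are just illustrative, and a real client would still 
URL-encode the braces:

```
http://localhost:8983/solr/select
  ?q=stablefields:query1
  &fq={!join from=changeable_fields_doc_id to=stable_fields_doc_id}changeablefields:query2
```

The fq runs query2 against the docs carrying the changeable fields and maps 
their from-field values onto the to-field of the stable docs, so only stable 
docs with a matching join key pass the filter.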

Best regards
  Karsten

in context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html



Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-03 Thread Erick Erickson
How are these fields used? Because if they're not used for searching, you could
put them in their own core and rebuild that index at your whim, then query that
core when you need the relationship information.

If you have a DB backing your system, you could perhaps store the info there
and query that (but I like the second core better <g>)..

But if you could use a separate index just for the relationships, you wouldn't
have to deal with the slow re-indexing of all the docs...

Best
Erick




Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-03 Thread karsten-solr
Hi Erick,

our two changeable fields are used for linking between documents at the 
application level.
From Lucene's point of view they are just two searchable fields, with a stored 
term vector for one of them.
Our queries will use one of these fields together with a couple of the 
stable fields.

So the question is really about updating two fields in an existing Lucene index 
with more than fifty other fields.

Best regards
  Karsten

P.S. about our linking between documents:
Our two fields are called outgoingLinks and possibleIncomingLinks.

Our source documents have an abstract and a couple of metadata fields.
We use regular expressions to find outgoing links in the abstract. This 
means a couple of words which indicate 
 1. that the author made a reference (like "in my previous work, published as 
'Very important Article' in Nature 2010, 12 page 7")
 2. that this reference contains metadata pointing to another document.

Each of these links is transformed into a special key (2010NaturNr12Page7).
On the other side, we transform the metadata into all possible keys.
This key generation grows with our knowledge of possible link patterns.
For the Lucene indexer this is a black box: there is a service which produces 
the keys for outgoingLinks and possibleIncomingLinks from our source (XML) 
documents, and these keys must be searchable in Lucene/Solr.
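
To make the idea concrete: the real key-generation service is a black box to 
us, but the pattern-to-key transformation could be sketched roughly like this. 
The citation pattern, class name, and key layout are made up for illustration 
(the journal name is kept as matched, while the sample key in this thread 
spells it "Natur"):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the real key-generation service is a black box.
// The citation pattern below is invented to mirror the example
// "... in Nature 2010, 12 page 7" from the mail.
public class LinkKeyGenerator {

    // Matches citations of the form "in <journal> <year>, <issue> page <page>".
    private static final Pattern CITATION = Pattern.compile(
            "\\bin\\s+(\\w+)\\s+(\\d{4}),\\s*(\\d+)\\s+page\\s+(\\d+)");

    // Builds a key like "2010NatureNr12Page7", or null if no citation is found.
    public static String keyFor(String abstractText) {
        Matcher m = CITATION.matcher(abstractText);
        if (!m.find()) {
            return null;
        }
        String journal = m.group(1);
        String year = m.group(2);
        String issue = m.group(3);
        String page = m.group(4);
        return year + journal + "Nr" + issue + "Page" + page;
    }
}
```

The same key layout would then be produced from the target document's metadata, 
so that outgoingLinks and possibleIncomingLinks meet on identical terms.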

P.P.S. in Context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html



Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-03 Thread Erick Erickson
Hmmm, the only thing that comes to mind is the join feature being added to
Solr 4.x, but I confess I'm not entirely familiar with that functionality, so I
can't tell if it really solves your problem.

Other than that I'm out of ideas, but then again it's late and I'm tired, so
maybe I'm not being very creative <g>...

Best
Erick
On Aug 3, 2011 11:40 AM, karsten-s...@gmx.de wrote:


Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-01 Thread karsten-solr
Hi lucene/solr-folk,

Issue:
Our documents are stable except for two fields which are used for linking 
between the docs, so we would like to update these two fields in a batch once a 
month (possibly once a week). 
We cannot reindex all docs once a month, because we are using XeLDA in some 
fields for stemming (morphological analysis), and XeLDA is slow. We have 14 
million docs (less than 100 GByte main index and 3 GByte for these two 
changeable fields).
In the next half year we will migrate our search engine from Verity K2 to 
Solr, so we could wait for Solr 4.0 
(
btw any news about
http://lucene.472066.n3.nabble.com/Release-schedule-Lucene-4-td2256958.html
?
).

Solution?

Our issue is exactly the purpose of ParallelReader. 
But Solr does not support ParallelReader (for a good reason:
http://lucene.472066.n3.nabble.com/Vertical-Partitioning-advice-td494623.html#a494624
).
So I see two possible ways to solve our issue:
1. Wait for the new parallel incremental indexing 
( 
https://issues.apache.org/jira/browse/LUCENE-1879
) and hope that Solr will integrate it.
Pro: 
 - nothing to do for us except waiting.
Contra: 
 - I did not find anything of the (old) patch in current trunk.

2. Change the Lucene index below/without Solr in a batch:
   a) Each month, generate a new index containing only our two changed fields 
      (e.g. with DIH)
   b) Use FilterIndex and ParallelReader to mock a correct index
   c) "Merge" this mock index into a new index
      (via IndexWriter.addIndexes(IndexReader...))
Pro: 
 - The patch for https://issues.apache.org/jira/browse/LUCENE-1812
   should be a good example of how to do this.
Contra: 
 - The relation between docId and document insertion order is not a guaranteed 
feature of DIH (e.g. we will have to split the main index to ensure that no 
merge occurs in/after DIH).
 - To run this batch, Solr has to be stopped and restarted. 
 - Even though we know that our two fields should change only for a subset of 
the docs, we nevertheless have to reindex these two fields for all the docs.
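
Steps (b) and (c) might be sketched roughly as below against the Lucene 3.x 
API. The directory paths, version constant, and analyzer choice are 
assumptions, and the FilterIndex(Reader) masking of the old copies of the two 
fields in the main index is left out; both source indexes must have identical 
doc counts and doc order, or ParallelReader lines up the wrong documents:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Rough sketch: combine the stable index and the freshly built two-field
// index via ParallelReader, then write the combined view out as one new
// physical index with addIndexes(IndexReader...).
public class MergeParallel {
    public static void main(String[] args) throws Exception {
        Directory stableDir = FSDirectory.open(new File("index-stable"));
        Directory changedDir = FSDirectory.open(new File("index-changed-fields"));
        Directory mergedDir = FSDirectory.open(new File("index-merged"));

        // ParallelReader glues the two indexes together field-wise;
        // doc N of stableDir is paired with doc N of changedDir.
        ParallelReader parallel = new ParallelReader();
        parallel.add(IndexReader.open(stableDir));
        parallel.add(IndexReader.open(changedDir));

        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
        IndexWriter writer = new IndexWriter(mergedDir, cfg);
        try {
            // Copies the combined view into mergedDir as a normal index.
            writer.addIndexes(parallel);
        } finally {
            writer.close();
            parallel.close();
        }
    }
}
```

Solr would then be pointed at the merged directory on restart; this is exactly 
the batch window in which Solr has to be stopped.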

Any comments, hints or tips?
Is there a third (better) way to solve our issue?
Is there already a working example of the second solution?
Will LUCENE-1879 (parallel incremental indexing) be part of Solr 4.0?

Best regards
  Karsten