[ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846166#action_12846166 ]

Ted Dunning commented on SOLR-1375:
-----------------------------------

Sorry to comment late here, but when indexing in Hadoop, it is really nice to 
avoid any central dependency.  It is also nice to focus the map-side join on 
items likely to match.  Thirdly, reduce-side indexing is typically really 
important.

The conclusions from these three considerations vary by duplication rate.  
Using reduce-side indexing gets rid of most of the problems of duplicate 
versions of a single document (with the same sort key) since the reducer can 
scan to see whether it has the final copy handy before adding a document to the 
index.

There remain the problems of not indexing documents that already exist in the 
index and of generating a deletion list that can assist in applying the index 
update.  The former problem is usually the more severe one because it isn't 
unusual for data sources to simply include a full dump of all documents and 
assume that the consumer will figure out which are new or updated.  Here you 
would like to index only the new and modified documents.

My own preference is to avoid the complication of the Bloom-filter map-side 
join and simply export a very simple list of stub documents that correspond to 
the documents in the index.  These stub documents should be much smaller than 
the average document (unless you are indexing tweets), which makes passing 
around great masses of stub documents not such a problem, since Hadoop 
shuffle, copy, and sort times are all dominated by Lucene indexing times.  
Passing stub documents allows the reducer to simply iterate through all 
documents with the same key, keeping the latest version or any stub that is 
encountered.  For documents without a stub, normal indexing can be done, with 
the slight addition of exporting a list of stub documents for the new 
additions.
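
The reduce-side pass described above can be sketched roughly as follows. This 
is a minimal illustration with a hypothetical Record type, assuming the simple 
full-dump case where the presence of a stub means the document is already in 
the index unchanged; it is not code from this patch.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of stub-based reduce-side deduplication. All values sharing
// one document key arrive at the same reducer; a "stub" is a tiny
// record exported from the existing index for that key.
public class StubReducer {
    public static final class Record {
        final boolean stub;   // true if exported from the existing index
        final long version;   // sort key / version of the document
        final String body;    // document content (empty for stubs)
        public Record(boolean stub, long version, String body) {
            this.stub = stub; this.version = version; this.body = body;
        }
    }

    /**
     * Returns the record to index, or empty if a stub shows the
     * document already exists in the index.
     */
    public static Optional<Record> reduce(List<Record> values) {
        // A stub among the values means the doc is already indexed.
        if (values.stream().anyMatch(r -> r.stub)) {
            return Optional.empty();
        }
        // Otherwise index only the latest version seen for this key.
        return values.stream()
                     .max(Comparator.comparingLong(r -> r.version));
    }
}
```

Handling modified documents would additionally compare the incoming version 
(or checksum) against the stub's, rather than skipping on any stub.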

The same thing could be done with a map-side join, but the trade-off is that 
you then need considerably more memory for the mapper to store the entire 
bitmap in memory, as opposed to needing (somewhat) more time to pass the stub 
documents around.  How that trade-off plays out in the real world isn't 
clear.  My personal preference is to keep heap space small, since the time 
cost is pretty minimal for me.
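
For scale, the mapper-memory side of that trade-off can be estimated with the 
standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and 
k = (m/n) ln 2 hash functions.  The sketch below is only a back-of-envelope 
calculation; the 100M-document count and 1% false-positive rate are 
illustrative assumptions, not figures from this issue.

```java
// Back-of-envelope Bloom filter sizing for the map-side join.
public class BloomSizing {
    /** Bits required to hold n elements at false-positive rate p. */
    public static long bitsNeeded(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p)
                                / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions for a filter of mBits bits. */
    public static int optimalHashes(long n, long mBits) {
        return Math.max(1, (int) Math.round((double) mBits / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 100_000_000L;   // assumed document count
        double p = 0.01;         // assumed 1% false-positive rate
        long m = bitsNeeded(n, p);
        // Roughly ~9.6 bits per element, i.e. on the order of 100 MB
        // of mapper heap for 100M documents.
        System.out.printf("%,d bits (~%d MB), k=%d%n",
                m, m / 8 / (1024 * 1024), optimalHashes(n, m));
    }
}
```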

This problem also turns up in our PDF conversion pipeline, where we keep 
checksums of each PDF that has already been converted to viewable form.  In 
that case, the ratio of real document size to stub size is even more 
pronounced.


> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, 
> SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read-only probabilistic set. It's useful
> for verifying that a key exists in a set, though it can return
> false positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster, which (when using the term dictionary or
> a query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files are
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to set up
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
