[ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851637#action_12851637 ]
Jason Rutherglen commented on SOLR-1375:
----------------------------------------

{quote}Doesn't this hint at some of this stuff (haven't looked at the patch) really needing to live in Lucene index segment files merging land?{quote}

Adding this to Lucene is outside the scope of what I require, and I don't have time to do it unless it's going to be committed.

> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch,
>                      SOLR-1375.patch, SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read-only probabilistic set. It's useful
> for verifying that a key exists in a set, though it can return
> false positives. http://en.wikipedia.org/wiki/Bloom_filter
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster. Doing that check with the term dictionary
> or a query is too slow: it takes longer than the indexing itself.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (I just realized we'll
> need to add in the core name.)
> * When new segments are created and commit is called, a new
> bloom filter is generated from a given field (default: id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files are
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner rather than later, after testing this version).
> I held off because the central server requires syncing the Solr
> servers' files (which is like reverse replication).
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work. I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to set up
> multiple Tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases
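The per-segment scheme described in the issue can be sketched roughly as below. This is a minimal illustration in plain Java, not the code in the attached patch: the patch uses Hadoop's BloomFilter class and writes .blm files, whereas this sketch keeps a hand-rolled filter per segment in memory. The names SegmentBloomSketch, SimpleBloom, onCommit, onMergedAway and segmentsPossiblyContaining are hypothetical, as are the filter size and hash count.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not the SOLR-1375 patch or Hadoop's BloomFilter. */
public class SegmentBloomSketch {

    /** A tiny bloom filter over string keys, using double hashing to derive k bit positions. */
    static final class SimpleBloom {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloom(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        /** FNV-1a style hash with a caller-supplied seed. */
        private static int hash(byte[] data, int seed) {
            int h = seed;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }

        /** Position of the i-th bit for this key. */
        private int index(byte[] data, int i) {
            int h1 = hash(data, 0x811C9DC5);
            int h2 = hash(data, 0x01000193) | 1; // force odd so the step is never zero
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(data, i));
            }
        }

        /** False means definitely absent; true means present or a false positive. */
        boolean mightContain(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(data, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    // One filter per segment name, standing in for the per-segment .blm files.
    private final Map<String, SimpleBloom> filtersBySegment = new LinkedHashMap<>();

    /** On commit, build a filter for a new segment from its term values (e.g. the id field). */
    public void onCommit(String segmentName, Iterable<String> terms) {
        SimpleBloom filter = new SimpleBloom(1 << 16, 4); // assumed sizing
        for (String term : terms) {
            filter.add(term);
        }
        filtersBySegment.put(segmentName, filter);
    }

    /** When a segment is merged away, drop its filter as well. */
    public void onMergedAway(String segmentName) {
        filtersBySegment.remove(segmentName);
    }

    /** Segments whose filter reports a possible match for the term. */
    public List<String> segmentsPossiblyContaining(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, SimpleBloom> e : filtersBySegment.entrySet()) {
            if (e.getValue().mightContain(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SegmentBloomSketch shard = new SegmentBloomSketch();
        shard.onCommit("_0", Arrays.asList("doc1", "doc2"));
        shard.onCommit("_1", Arrays.asList("doc3"));
        System.out.println("doc1 may be in: " + shard.segmentsPossiblyContaining("doc1"));
        shard.onMergedAway("_0");
        System.out.println("after merge, doc1 may be in: " + shard.segmentsPossiblyContaining("doc1"));
    }
}
```

A term that was added to a segment is always reported for that segment, since bloom filters have no false negatives. A term that was never added may still be reported occasionally, so a hit would need to be confirmed against the index before a document is treated as a duplicate.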