Re: CLucene incubation - call for a mentor
On 10/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi Ben,
>
> I can't volunteer, but you may want to check with Garrett Rooney. He stopped work on lucene4c, so he may be interested in helping you with moving CLucene under Apache Lucene.

I'd love to, except that I'm already rather overextended at this point, and don't see myself being able to devote the required time to it.

-garrett
Re: [PROPOSAL] index server project
Hi, The major goal is scale, right? A distributed server provides more oomph than a single-node server can. Another important goal from my point of view would be index management, like index updates during production. Stefan
Re: CLucene incubation - call for a mentor
Hi Ben,

I can't volunteer, but you may want to check with Garrett Rooney. He stopped work on lucene4c, so he may be interested in helping you with moving CLucene under Apache Lucene.

Otis

----- Original Message -----
From: Ben van Klinken <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Saturday, October 14, 2006 3:20:10 AM
Subject: CLucene incubation - call for a mentor

Hi,

I am one of the developers of CLucene, a C++ port of Lucene. A long while back, CLucene was invited to join the ASF incubation program under Lucene. For various reasons this hasn't happened yet. But CLucene has still been happily progressing and interest in the project continues to increase - many open source projects (such as ht://dig and strigi) as well as many companies use CLucene.

CLucene would of course do much better if we were part of the big happy family of Lucene and its sub-projects. However, I believe our main obstacle to this is the absence of an ASF mentor.

So basically I'm asking this: would Apache Lucene still like to have us? If yes, would anyone be interested, or know of someone interested, in being our mentor?

Look forward to a response,
Ben
Re: [Fwd: [PROPOSAL] index server project]
Damn Y! mail shortcut. The link to the project is in my Lucene group: http://www.simpy.com/group/363

Otis

----- Original Message -----
From: Alexandru Popescu <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure this is (somehow) related, but I think I have noticed some project on a Sun contest (it was the big prize winner). I cannot retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.

On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Doug,
>
> we discussed the need for such a tool several times internally and
> developed some workarounds for nutch, so I would definitely be
> interested in contributing to such a project.
> Having a separate project that depends on hadoop would be the best
> case for our use cases.
>
> Best,
> Stefan
>
> On 18.10.2006 at 23:35, Doug Cutting wrote:
>
> > FYI, I just pitched a new project you might be interested in on
> > [EMAIL PROTECTED]. Dunno if you subscribe to that list, so I'm
> > spamming you. If it sounds interesting, please reply there. My
> > management at Y! is interested in this, so I'm 'in'.
> >
> > Doug
> >
> > -------- Original Message --------
> > Subject: [PROPOSAL] index server project
> > Date: Wed, 18 Oct 2006 14:17:30 -0700
> > From: Doug Cutting <[EMAIL PROTECTED]>
> > Reply-To: general@lucene.apache.org
> > To: general@lucene.apache.org
> >
> > It seems that Nutch and Solr would benefit from a shared index serving
> > infrastructure. Other Lucene-based projects might also benefit from
> > this. So perhaps we should start a new project to build such a thing.
> > This could start either in java/contrib, or as a separate sub-project,
> > depending on interest.
> >
> > Here are some quick ideas about how this might work.
> >
> > An RPC mechanism would be used to communicate between nodes (probably
> > Hadoop's). The system would be configured with a single master node
> > that keeps track of where indexes are located, and a number of slave
> > nodes that would maintain, search and replicate indexes. Clients would
> > talk to the master to find out which indexes to search or update, then
> > they'll talk directly to slaves to perform searches and updates.
> >
> > Following is an outline of how this might look.
> >
> > We assume that, within an index, a file with a given name is written
> > only once. Index versions are sets of files, and a new version of an
> > index is likely to share most files with the prior version. Versions
> > are numbered. An index server should keep old versions of each index
> > for a while, not immediately removing old files.
> >
> > public class IndexVersion {
> >   String Id;   // unique name of the index
> >   int version; // the version of the index
> > }
> >
> > public class IndexLocation {
> >   IndexVersion indexVersion;
> >   InetSocketAddress location;
> > }
> >
> > public interface ClientToMasterProtocol {
> >   IndexLocation[] getSearchableIndexes();
> >   IndexLocation getUpdateableIndex(String id);
> > }
> >
> > public interface ClientToSlaveProtocol {
> >   // normal update
> >   void addDocument(String index, Document doc);
> >   int[] removeDocuments(String index, Term term);
> >   void commitVersion(String index);
> >
> >   // batch update
> >   void addIndex(String index, IndexLocation indexToAdd);
> >
> >   // search
> >   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> > }
> >
> > public interface SlaveToMasterProtocol {
> >   // sends currently searchable indexes
> >   // receives updated indexes that we should replicate/update
> >   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> > }
> >
> > public interface SlaveToSlaveProtocol {
> >   String[] getFileSet(IndexVersion indexVersion);
> >   byte[] getFileContent(IndexVersion indexVersion, String file);
> >   // based on experience in Hadoop, we probably wouldn't really use
> >   // RPC to send file content, but rather HTTP.
> > }
> >
> > The master thus maintains the set of indexes that are available for
> > search, keeps track of which slave should handle changes to an index and
> > initiates index synchronization between slaves. The master can be
> > configured to replicate indexes a specified number of times.
> >
> > The client library can cache the current set of searchable indexes and
> > periodically refresh it. Searches are broadcast to one index with each
> > id and return merged results. The client will load-balance both
> > searches and updates.
> >
> > Deletions could be broadcast to all slaves. That would probably be fast
> > enough. Alternately, indexes could be partitioned by a hash of each
> > document's unique id, permitting deletions to be routed to the
> > appropriate slave.
> >
> > Does this make sense? Does it sound like it would be useful to Solr?
> > To Nutch? To others? Who would be interested and able to work on it?
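The proposal assumes that a file with a given name is written only once and that a new index version shares most files with the prior one. Under that assumption, a slave replicating a new version would only need the files it does not already hold. The following is a minimal Java sketch of that comparison, not part of the proposal: the class FileSetDiff, the method filesToFetch and the file names are hypothetical, and the String[] argument stands in for whatever getFileSet would return.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the proposed protocols.
final class FileSetDiff {

    /**
     * Returns the names in remoteFileSet that are absent from localFiles.
     * Because files are assumed to be write-once, a name that is already
     * present locally is taken to have identical content and is skipped.
     */
    static Set<String> filesToFetch(String[] remoteFileSet, Set<String> localFiles) {
        Set<String> missing = new LinkedHashSet<>(Arrays.asList(remoteFileSet));
        missing.removeAll(localFiles);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up file names for illustration only.
        Set<String> local = new LinkedHashSet<>(Arrays.asList("seg1", "seg2"));
        String[] newVersion = {"seg1", "seg2", "seg3"};
        System.out.println(filesToFetch(newVersion, local)); // prints [seg3]
    }
}
```

Only the names returned here would then be requested from the peer, by whatever transport ends up carrying file content.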
Re: [Fwd: [PROPOSAL] index server project]
That's distributed indexed, built on top of Sun Grid. The project won a $50K prize. - Original Message From: Alexandru Popescu <[EMAIL PROTECTED]> To: general@lucene.apache.org Sent: Thursday, October 19, 2006 10:19:00 AM Subject: Re: [Fwd: [PROPOSAL] index server project] I am not sure this is (somehow) related, but I think I have noticed some project on a Sun contest (it was the big prize winner). I cannot retrieve it now, but hopefully somebody else will. ./alex -- .w( the_mindstorm )p.
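The index server proposal quoted earlier in this thread suggests partitioning indexes by a hash of each document's unique id, so that a deletion can be routed to one slave instead of being broadcast to all of them. Below is a minimal Java sketch of such routing. It is an illustration only: the class DeletionRouter and the method partitionFor are hypothetical, and the choice of String.hashCode with Math.floorMod is an assumption, since the proposal does not name a hash function.

```java
// Hypothetical helper, not part of the proposed protocols.
final class DeletionRouter {

    /**
     * Maps a document's unique id to a partition number in [0, partitionCount).
     * Math.floorMod keeps the result non-negative even when hashCode() is negative.
     */
    static int partitionFor(String documentId, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        return Math.floorMod(documentId.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        // The same id always maps to the same partition, so an add and a
        // later delete for that id reach the same slave.
        System.out.println(partitionFor("doc-42", 4));
    }
}
```

One consequence of this simple scheme is that changing the number of partitions remaps most ids, so documents would have to be redistributed if the partition count changed.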