Re: CLucene incubation - call for a mentor

2006-10-20 Thread Garrett Rooney

On 10/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Hi Ben,

I can't volunteer, but you may want to check with Garrett Rooney.  He
stopped work on lucene4c, so he may be interested in helping you with
moving CLucene under Apache Lucene.


I'd love to, except that I'm already rather overextended at this
point, and don't see myself being able to devote the required time to
it.

-garrett


Re: [PROPOSAL] index server project

2006-10-20 Thread Stefan Groschupf

Hi,

The major goal is scale, right? A distributed server provides more
oomph than a single-node server can.


Another important goal from my point of view would be index  
management, like index updates during production.


Stefan 


Re: CLucene incubation - call for a mentor

2006-10-20 Thread Otis Gospodnetic
Hi Ben,

I can't volunteer, but you may want to check with Garrett Rooney.  He stopped 
work on lucene4c, so he may be interested in helping you with moving CLucene 
under Apache Lucene.

Otis

- Original Message 
From: Ben van Klinken <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Saturday, October 14, 2006 3:20:10 AM
Subject: CLucene incubation - call for a mentor

Hi,

I am one of the developers of CLucene, a C++ port of Lucene.

A long while back, CLucene was invited to join the ASF incubation
program under Lucene. For various reasons this hasn't happened yet. But
CLucene has still been happily progressing and interest in the project
continues to increase - many open source projects (such as ht://dig
and strigi) as well as many companies use CLucene.

CLucene would of course do much better if we were part of the big
happy family of Lucene and its sub-projects. However, I believe our
main obstacle to this is the absence of an ASF mentor.

So basically I'm asking this: would Apache Lucene still like to have
us? If yes, would anyone be interested, or know of someone interested
in being our mentor?

Look forward to a response,

Ben





Re: [Fwd: [PROPOSAL] index server project]

2006-10-20 Thread Otis Gospodnetic
Damn Y! mail shortcut.
The link to the project is in my Lucene group:  http://www.simpy.com/group/363

Otis

- Original Message 
From: Alexandru Popescu <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure whether this is related, but I think I noticed a
project in a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else can.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Doug,
>
> we discussed the need for such a tool several times internally and
> developed some workarounds for Nutch, so I would definitely be
> interested in contributing to such a project.
> Having a separate project that depends on Hadoop would be the best
> fit for our use cases.
>
> Best,
> Stefan
>
>
>
> Am 18.10.2006 um 23:35 schrieb Doug Cutting:
>
> > FYI, I just pitched a new project you might be interested in on
> > [EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm
> > spamming you.  If it sounds interesting, please reply there.  My
> > management at Y! is interested in this, so I'm 'in'.
> >
> > Doug
> >
> >  Original Message 
> > Subject: [PROPOSAL] index server project
> > Date: Wed, 18 Oct 2006 14:17:30 -0700
> > From: Doug Cutting <[EMAIL PROTECTED]>
> > Reply-To: general@lucene.apache.org
> > To: general@lucene.apache.org
> >
> > It seems that Nutch and Solr would benefit from a shared index serving
> > infrastructure.  Other Lucene-based projects might also benefit from
> > this.  So perhaps we should start a new project to build such a thing.
> > This could start either in java/contrib, or as a separate sub-project,
> > depending on interest.
> >
> > Here are some quick ideas about how this might work.
> >
> > An RPC mechanism would be used to communicate between nodes (probably
> > Hadoop's).  The system would be configured with a single master node
> > that keeps track of where indexes are located, and a number of slave
> > nodes that would maintain, search, and replicate indexes.  Clients
> > would talk to the master to find out which indexes to search or
> > update, then talk directly to slaves to perform searches and updates.
> >
> > Following is an outline of how this might look.
> >
> > We assume that, within an index, a file with a given name is written
> > only once.  Index versions are sets of files, and a new version of an
> > index is likely to share most files with the prior version.  Versions
> > are numbered.  An index server should keep old versions of each index
> > for a while, not immediately removing old files.
> >
> > public class IndexVersion {
> >   String id;   // unique name of the index
> >   int version; // the version of the index
> > }
> >
> > public class IndexLocation {
> >   IndexVersion indexVersion;
> >   InetSocketAddress location;
> > }
> >
> > public interface ClientToMasterProtocol {
> >   IndexLocation[] getSearchableIndexes();
> >   IndexLocation getUpdateableIndex(String id);
> > }
> >
> > public interface ClientToSlaveProtocol {
> >   // normal update
> >   void addDocument(String index, Document doc);
> >   int[] removeDocuments(String index, Term term);
> >   void commitVersion(String index);
> >
> >   // batch update
> >   void addIndex(String index, IndexLocation indexToAdd);
> >
> >   // search
> >   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> > }
> >
> > public interface SlaveToMasterProtocol {
> >   // sends currently searchable indexes
> >   // receives updated indexes that we should replicate/update
> >   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> > }
> >
> > public interface SlaveToSlaveProtocol {
> >   String[] getFileSet(IndexVersion indexVersion);
> >   byte[] getFileContent(IndexVersion indexVersion, String file);
> >   // based on experience in Hadoop, we probably wouldn't really use
> >   // RPC to send file content, but rather HTTP.
> > }
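
Editor's note: as a sketch of how the slave-to-slave protocol above might be
used under the write-once assumption, here is one way a slave could plan
replication of a new index version. Because a file name is written only once
per index, any file already held locally can be reused; only the missing
ones would need to be fetched via getFileSet/getFileContent. The class and
method names below are hypothetical, not part of the proposal.

```java
import java.util.*;

// Sketch: compute which files of a new index version a slave still needs.
// Write-once file names mean the set difference is exactly the transfer plan.
class SyncPlanner {
    static List<String> filesToFetch(Set<String> localFiles, String[] remoteFileSet) {
        List<String> missing = new ArrayList<>();
        for (String file : remoteFileSet) {
            // Files already present locally are shared with the prior version
            // and can be reused without any network transfer.
            if (!localFiles.contains(file)) missing.add(file);
        }
        return missing;
    }
}
```

With a local set of {_1.cfs, segments_1} and a remote file set of
{_1.cfs, _2.cfs, segments_2}, only _2.cfs and segments_2 would be fetched.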
> >
> > The master thus maintains the set of indexes that are available for
> > search, keeps track of which slave should handle changes to an
> > index and
> > initiates index synchronization between slaves.  The master can be
> > configured to replicate indexes a specified number of times.
> >
> > The client library can cache the current set of searchable indexes and
> > periodically refresh it.  Searches are broadcast to one index with
> > each
> > id and return merged results.  The client will load-balance both
> > searches and updates.
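
Editor's note: the merge step described above could look roughly like the
following. The Hit class is a hypothetical stand-in for whatever
SearchResults would carry; only the merge-and-sort of per-index results into
a single top-n list is shown.

```java
import java.util.*;

// Sketch: merge result lists from one searched index per id into one
// top-n list ordered by descending score.
class ResultMerger {
    static class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    static List<Hit> merge(List<List<Hit>> perIndexResults, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexResults) all.addAll(hits);
        all.sort((a, b) -> Float.compare(b.score, a.score)); // highest score first
        return all.subList(0, Math.min(n, all.size()));
    }
}
```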
> >
> > Deletions could be broadcast to all slaves.  That would probably be
> > fast
> > enough.  Alternately, indexes could be partitioned by a hash of each
> > document's unique id, permitting deletions to be routed to the
> > appropriate slave.
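
Editor's note: the hash-partitioned alternative could be as simple as the
following. The class and method names are hypothetical, and it assumes
partitions are numbered 0 to numSlaves-1.

```java
// Sketch: route a deletion to a partition by hashing the document's
// unique id, so only one slave needs to handle it.
class DeletionRouter {
    static int slaveFor(String uniqueId, int numSlaves) {
        // Math.floorMod keeps the result in [0, numSlaves) even when
        // hashCode() is negative.
        return Math.floorMod(uniqueId.hashCode(), numSlaves);
    }
}
```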
> >
> > Does this make sense?  Does it sound like it would be useful to Solr?
> > To Nutch?  To others?  Who would be interested and able to work on it?
> >
> > D

Re: [Fwd: [PROPOSAL] index server project]

2006-10-20 Thread Otis Gospodnetic
That's the distributed index built on top of Sun Grid.  The project won a
$50K prize.

