Re: [PROPOSAL] index server project

2006-10-19 Thread Yonik Seeley

On 10/19/06, Steven Parkes <[EMAIL PROTECTED]> wrote:

You mention partitioning of indexes, though mostly around delete. What
about scalability of corpus size?


Definitely in scope.  Solr already scales search volume by putting
multiple searchers behind a load balancer, all getting their index from
a master.  The problem comes when an index is too big to get decent
latency for a single query, and that's when you need to partition the
index into "shards", to use Google's terminology.


Would partitioning be effective for
that, too?


Yes, to a certain extent.  At some point you run into network
bandwidth issues if you go deep into rankings.


What about scalability of ingest rate?


As it relates to indexing, I think Nutch already has that base covered.


What are you thinking, in terms of size? Is this a 10 node thing?


I'm personally interested in perhaps 10 to 20 index shards, with
multiple replicas of each shard for HA and query load scalability.


A 1000 node thing? More? Bigger is cool, but raises a lot of issues.


Should be possible, but I won't personally be looking for that.  I
think scaling effectively will be partially in the hands of the client
and how it chooses to merge results from shards.
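
For illustration, a minimal sketch of such a client-side merge
(ScoredDoc and ShardMerger are made-up names, not an existing Solr or
Lucene API), assuming each shard returns its top hits already sorted
by descending score:

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class ScoredDoc {
  final String shard; // which shard the hit came from
  final int doc;      // shard-local document number
  final float score;  // relevance score
  ScoredDoc(String shard, int doc, float score) {
    this.shard = shard; this.doc = doc; this.score = score;
  }
}

class ShardMerger {
  // Merge each shard's sorted top-n hits into a single global top-n.
  static List<ScoredDoc> merge(List<List<ScoredDoc>> perShard, int n) {
    PriorityQueue<ScoredDoc> heap =
        new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
    for (List<ScoredDoc> hits : perShard) {
      heap.addAll(hits); // each inner list is one shard's top n
    }
    List<ScoredDoc> merged = new ArrayList<>(n);
    while (!heap.isEmpty() && merged.size() < n) {
      merged.add(heap.poll()); // take the globally highest-scoring hits
    }
    return merged;
  }
}

The deeper into the rankings a query must go, the more hits each shard
has to ship over the network, which is the bandwidth issue mentioned
above.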


How dynamic? Can nodes come and go?


Unplanned: yes.  HA is key for me, personally.
Planned (adding capacity gracefully): it would be nice.  I actually
hadn't planned on it for Solr.


Are you going to assume homogeneity of
nodes?


Hardware homogeneity?  That might be out of scope... I'd start off
without worrying about it in any case.


What about add/modify/delete to search visibility latency? Close to
batch/once-a-day or real-time?


Anywhere in between, I'd think.  "Real-time" latencies of minutes or
longer are normally fine.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


RE: [PROPOSAL] index server project

2006-10-19 Thread Steven Parkes
I like the idea. I'm trying to figure out, in broad strokes, the
overarching goals. Forgive me if this is obvious; I just want to be
clear.

The major goal is scale, right? A distributed server provides more oomph
than a single-node server can.

There are a number of dimensions in scale.

You mention replication of indexes, so scalability of search volume is
in scope, right?

You mention partitioning of indexes, though mostly around delete. What
about scalability of corpus size? Would partitioning be effective for
that, too?

What about scalability of ingest rate?

What are you thinking, in terms of size? Is this a 10 node thing? A 1000
node thing? More? Bigger is cool, but raises a lot of issues. How
dynamic? Can nodes come and go? Are you going to assume homogeneity of
nodes?

What about add/modify/delete to search visibility latency? Close to
batch/once-a-day or real-time?

I think it's definitely something people want. Actually, I think we
could answer these questions in different ways, and for every answer
we'd find people who would want it. But they would probably be
different people.


Re: [Fwd: [PROPOSAL] index server project]

2006-10-19 Thread Alexandru Popescu

I am not sure whether this is related, but I think I noticed a similar
project in a Sun contest (it was the grand prize winner). I cannot
find it now, but hopefully somebody else can.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

Hi Doug,

we discussed the need for such a tool several times internally and
developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
Having a separate project that depends on Hadoop would be the best
fit for our use cases.

Best,
Stefan


Re: [Fwd: [PROPOSAL] index server project]

2006-10-19 Thread Stefan Groschupf

Hi Doug,

we discussed the need for such a tool several times internally and
developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
Having a separate project that depends on Hadoop would be the best
fit for our use cases.


Best,
Stefan



Am 18.10.2006 um 23:35 schrieb Doug Cutting:

FYI, I just pitched a new project you might be interested in on  
[EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm  
spamming you.  If it sounds interesting, please reply there.  My  
management at Y! is interested in this, so I'm 'in'.


Doug

 Original Message 
Subject: [PROPOSAL] index server project
Date: Wed, 18 Oct 2006 14:17:30 -0700
From: Doug Cutting <[EMAIL PROTECTED]>
Reply-To: general@lucene.apache.org
To: general@lucene.apache.org

It seems that Nutch and Solr would benefit from a shared index serving
infrastructure.  Other Lucene-based projects might also benefit from
this.  So perhaps we should start a new project to build such a thing.
This could start either in java/contrib, or as a separate sub-project,
depending on interest.

Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably
Hadoop's).  The system would be configured with a single master node
that keeps track of where indexes are located, and a number of slave
nodes that would maintain, search and replicate indexes.  Clients would
talk to the master to find out which indexes to search or update, then
talk directly to slaves to perform searches and updates.

Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written
only once.  Index versions are sets of files, and a new version of an
index is likely to share most files with the prior version.  Versions
are numbered.  An index server should keep old versions of each index
for a while, not immediately removing old files.

public class IndexVersion {
  String id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}
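
As a rough sketch of how a slave might use this protocol to pull a new
index version from a peer (localFiles and writeFile are hypothetical
helpers; real code would also verify checksums and handle failures):

import java.io.IOException;
import java.util.Set;

abstract class Replicator {
  // Hypothetical helpers: what files we already hold, and how we store one.
  abstract Set<String> localFiles(IndexVersion version);
  abstract void writeFile(IndexVersion version, String file, byte[] content)
      throws IOException;

  // Copy only the files of a new version that we don't already have;
  // successive versions share most files, so this is usually cheap.
  void replicate(SlaveToSlaveProtocol peer, IndexVersion version)
      throws IOException {
    Set<String> have = localFiles(version);
    for (String file : peer.getFileSet(version)) {
      if (!have.contains(file)) {
        writeFile(version, file, peer.getFileContent(version, file));
      }
    }
  }
}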

The master thus maintains the set of indexes that are available for
search, keeps track of which slave should handle changes to an index,
and initiates index synchronization between slaves.  The master can be
configured to replicate indexes a specified number of times.

The client library can cache the current set of searchable indexes and
periodically refresh it.  Searches are broadcast to one index with each
id and return merged results.  The client will load-balance both
searches and updates.
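
A sketch of what that broadcast-and-merge might look like on top of the
interfaces above (connect and merge are hypothetical helpers, and a real
client would cache the location list and load-balance across replicas
rather than always picking the first):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

abstract class SearchClient {
  // Hypothetical helpers: open a connection to a slave, and merge
  // per-shard results into a single global top n (e.g., by score).
  abstract ClientToSlaveProtocol connect(java.net.InetSocketAddress addr);
  abstract SearchResults merge(List<SearchResults> partial, int n);

  SearchResults search(ClientToMasterProtocol master,
                       Query query, Sort sort, int n) {
    // Pick one replica per index id.
    Map<String, IndexLocation> onePerId = new HashMap<>();
    for (IndexLocation loc : master.getSearchableIndexes()) {
      onePerId.putIfAbsent(loc.indexVersion.id, loc);
    }
    // Broadcast the query to the chosen replicas and merge the results.
    List<SearchResults> partial = new ArrayList<>();
    for (IndexLocation loc : onePerId.values()) {
      ClientToSlaveProtocol slave = connect(loc.location);
      partial.add(slave.search(loc.indexVersion, query, sort, n));
    }
    return merge(partial, n);
  }
}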

Deletions could be broadcast to all slaves.  That would probably be fast
enough.  Alternately, indexes could be partitioned by a hash of each
document's unique id, permitting deletions to be routed to the
appropriate slave.
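
For the hash-partitioned alternative, the routing can be as simple as
this sketch (uniqueId and numShards are illustrative names):

class ShardRouter {
  // Route a document's add/delete to the slave that owns its hash bucket.
  static int shardFor(String uniqueId, int numShards) {
    // Mask off the sign bit so the result is always a valid shard index.
    return (uniqueId.hashCode() & Integer.MAX_VALUE) % numShards;
  }
}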

Does this make sense?  Does it sound like it would be useful to Solr?
To Nutch?  To others?  Who would be interested and able to work on it?

Doug



~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com


Re: OT: super fast MySQL full-text searching.

2006-10-19 Thread Monsur Hossain

If you're interested, we use the following pattern to do incremental
updates between a database and a Lucene index.

1) Add a field to the database table you wish to index called
"DateUpdated".  Update this date whenever a field in the table is
changed.
2) Create a new database table to store the ID of any item that is
deleted from the table above.  I'll refer to this as the "Deleted"
table.
3) Have an indexer application that runs every X minutes and does the
following (a rough sketch appears after this list):
   a) Load all the items from the "Deleted" table and remove them
from the Lucene index.
   b) Load all the items from the main table with a "DateUpdated"
date greater than the last time the indexer application ran.  Delete
these items from the Lucene index, and then reinsert them with the
newer data.
   c) Purge all the items from the "Deleted" table.
   d) Save the date of the last "DateUpdated" item you processed, and
use this date to load items the next time the indexer application
runs.
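
A rough sketch of steps (a) through (d) with JDBC and Lucene's
IndexWriter (note: this uses today's Lucene API, which postdates this
thread; the "items" table layout and the buildDoc/saveWatermark helpers
are illustrative, not a drop-in implementation):

import java.sql.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

abstract class IncrementalIndexer {
  abstract Document buildDoc(ResultSet row) throws SQLException;
  abstract void saveWatermark(Timestamp newest); // persisted for next run

  void run(Connection db, IndexWriter writer, Timestamp lastRun)
      throws Exception {
    // (a) remove deleted items from the Lucene index
    try (Statement st = db.createStatement();
         ResultSet rs = st.executeQuery("SELECT id FROM Deleted")) {
      while (rs.next()) {
        writer.deleteDocuments(new Term("id", rs.getString("id")));
      }
    }
    // (b) re-index rows changed since the last run (delete + reinsert)
    Timestamp newest = lastRun;
    try (PreparedStatement ps = db.prepareStatement(
             "SELECT * FROM items WHERE DateUpdated > ? ORDER BY DateUpdated")) {
      ps.setTimestamp(1, lastRun);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // updateDocument atomically deletes the old copy and adds the new
          writer.updateDocument(new Term("id", rs.getString("id")),
                                buildDoc(rs));
          newest = rs.getTimestamp("DateUpdated");
        }
      }
    }
    writer.commit();
    // (c) purge the processed deletions
    try (Statement st = db.createStatement()) {
      st.executeUpdate("DELETE FROM Deleted");
    }
    // (d) save the newest DateUpdated we saw as the next run's watermark
    saveWatermark(newest);
  }
}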

This is an oversimplification, since you need to consider failover,
etc., and there may be other factors that dictate your search indexing
rules.  But it gives you a general idea.  I'd be curious to
hear/discuss other solutions to this.

Monsur



On 10/19/06, Scott <[EMAIL PROTECTED]> wrote:

I have tried Senna, an embeddable full-text search engine.
It is embedded into MySQL.

http://qwik.jp/senna/

I inserted 1,000,000 documents using INSERT INTO statements,
and I can search them using SELECT * FROM table
 WHERE MATCH(field_name) AGAINST('search-words').
It is based on SQL, so it is easy to use and supports incremental
updates.

I haven't run benchmark tests yet, but it's not slow.

I think Lucene needs to support incremental updates in the future.

--
Scott



OT: super fast MySQL full-text searching.

2006-10-19 Thread Scott
I have tried Senna, an embeddable full-text search engine.
It is embedded into MySQL.

http://qwik.jp/senna/

I inserted 1,000,000 documents using INSERT INTO statements,
and I can search them using SELECT * FROM table
 WHERE MATCH(field_name) AGAINST('search-words').
It is based on SQL, so it is easy to use and supports incremental
updates.
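
For reference, a minimal JDBC sketch of that query (the table name
"docs" and the connection details are placeholders; this assumes Senna
exposes MySQL's standard MATCH ... AGAINST syntax, as the example above
suggests):

import java.sql.*;

public class FulltextQuery {
  public static void main(String[] args) throws SQLException {
    try (Connection c = DriverManager.getConnection(
             "jdbc:mysql://localhost/test", "user", "pass");
         PreparedStatement ps = c.prepareStatement(
             "SELECT * FROM docs WHERE MATCH(field_name) AGAINST(?)")) {
      ps.setString(1, "search-words");
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1)); // first column of each hit
        }
      }
    }
  }
}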

I haven't run benchmark tests yet, but it's not slow.

I think Lucene needs to support incremental updates in the future.

-- 
Scott