Re: How can i make a distribute search on Solr?

2007-09-20 Thread David Welton
 Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?
 eg,

 1) get the job (a query)
 2) map it to workers ( servers that provide search results from their own
 indexing)
 3) wait for the results from all workers that reply within acceptable 
 timeframe.
 4) comb through the lot of  results from all workers, reduce them according to
 your own biz rules (eg, remove dupes, sort them by quality / priority... here 
 possibly relying on the original parameters of the query in 1)
 5) return the reduced results to the frontend.

That seems to be how Sphinx works:

http://www.sphinxsearch.com/doc.html#distributed

Of course, the details of this are far over my head for either system,
so I don't really know if that's a sensible way of doing things or
not.

Ciao,
-- 
David N. Welton
http://www.welton.it/davidw/


Re: How can i make a distribute search on Solr?

2007-09-20 Thread Yonik Seeley
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
 Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?

Not really... you could force a *lot* of different problems into
map-reduce (that's sort of the point... being able to automatically
parallelize a lot of different problems).  It really isn't the best
fit though, and would end up being much slower than a custom job.

Then there is the issue that the way map-reduce is implemented (like
hadoop) is also tuned for longer running batch jobs on huge data
(temporary files are used, external sorts, initial input, final output
is via files, etc).  Check out the google map-reduce paper - they
don't use it for their search side either.


Things are already progressing in the distributed search area:
https://issues.apache.org/jira/browse/SOLR-303
Hopefully I'll have time to dig into it more myself in a few weeks.

-Yonik


Re: How can i make a distribute search on Solr?

2007-09-20 Thread Norberto Meijome
On Thu, 20 Sep 2007 09:58:17 +0200
David Welton [EMAIL PROTECTED] wrote:

 That seems to be how Sphinx works:
 
 http://www.sphinxsearch.com/doc.html#distributed
 
 Of course, the details of this are far over my head for either system,
 so I don't really know if that's a sensible way of doing things or
 not.

Thanks for the pointer. It does seem that it's pretty much what I had in
mind... but it doesn't seem to be based on Lucene (which I particularly like,
especially for the community...) ...

cheers,

_
{Beto|Norberto|Numard} Meijome

The freethinking of one age is the common sense of the next.
   Matthew Arnold

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: How can i make a distribute search on Solr?

2007-09-20 Thread Norberto Meijome
On Thu, 20 Sep 2007 09:53:46 -0400
Yonik Seeley [EMAIL PROTECTED] wrote:

 On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
  Maybe I got this wrong...but isn't this what mapreduce is meant to deal 
  with?
 
 Not really... you could force a *lot* of different problems into
 map-reduce (that's sort of the point... being able to automatically
 parallelize a lot of different problems).  It really isn't the best
 fit though, and would end up being much slower than a custom job.

Good point. I wondered whether the whole sorting/whatever wasn't going to make
it far slower than something custom. I don't care about mapreduce in particular,
only about the effect: n indexers / searchers each fulfilling their part of the
overall search results.

 Then there is the issue that the way map-reduce is implemented (like
 hadoop) is also tuned for longer running batch jobs on huge data
 (temporary files are used, external sorts, initial input, final output
 is via files, etc).  

I see, didn't know this.

 Check out the google map-reduce paper - they
 don't use it for their search side either.

yeah, need to  :) 

 Things are already progressing in the distributed search area:
 https://issues.apache.org/jira/browse/SOLR-303
 Hopefully I'll have time to dig into it more myself in a few weeks.

excellent , thanks 
_
{Beto|Norberto|Numard} Meijome

He uses statistics as a drunken man uses lamp-posts ... for support rather
than illumination. Andrew Lang (1844-1912)



RE: How can i make a distribute search on Solr?

2007-09-19 Thread Jarvis
Thanks for your reply,
I need Federated Search. You mean this is not yet
supported out of the box? In that case, what can
Collection Distribution be used for?

Jarvis

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 19, 2007 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: How can i make a distribute search on Solr?

 
 So it means that distributed search is not a basic component in Solr
project.
 

I think you just need load balancing.  Solr is not a load balancer, you 
need to find something that works for you and configure that elsewhere. 
Solr works fine without persistent connections, so even simple round-robin 
DNS works fine.

Depending on your usage/loads/requirements it may or may not make sense 
to have your master DB in the mix.

Stu is referring to Federated Search - where each index has some of the 
data and results are combined before they are returned.  This is not yet 
supported out of the box

ryan



Re: How can i make a distribute search on Solr?

2007-09-19 Thread Norberto Meijome
On Wed, 19 Sep 2007 01:46:53 -0400
Ryan McKinley [EMAIL PROTECTED] wrote:

 Stu is referring to Federated Search - where each index has some of the 
 data and results are combined before they are returned.  This is not yet 
 supported out of the box

Maybe this is related. How does this compare to the map-reduce functionality in 
Nutch/Hadoop ? 
cheers,
B

_
{Beto|Norberto|Numard} Meijome

With sufficient thrust, pigs fly just fine. However, this is not necessarily a 
good idea. 
It is hard to be sure where they are going to land, and it could be dangerous 
sitting under them as they fly overhead.
   [RFC1925 - section 2, subsection 3]



Re: How can i make a distribute search on Solr?

2007-09-19 Thread Yonik Seeley
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
 On Wed, 19 Sep 2007 01:46:53 -0400
 Ryan McKinley [EMAIL PROTECTED] wrote:

  Stu is referring to Federated Search - where each index has some of the

It really should be Distributed Search I think (my mistake... I
started out calling it Federated).  I think Federated search is more
about combining search results from different data sources.

  data and results are combined before they are returned.  This is not yet
  supported out of the box

 Maybe this is related. How does this compare to the map-reduce functionality 
 in Nutch/Hadoop ?

map-reduce is more for batch jobs.  Nutch only uses map-reduce for
parallel indexing, not searching.

-Yonik


RE: How can i make a distribute search on Solr?

2007-09-19 Thread Jarvis
Nutch has two ways to make a distributed query: through HDFS (the Hadoop file
system), or through the RPC calls in the
org.apache.nutch.searcher.DistributedSearch class.

But I think neither is good enough.

If we use HDFS to serve users' queries, stability is a problem. We must do
all the crawling, indexing and querying on HDFS and use mapreduce. Can we
trust hadoop all the time? :)

If we use the RPC call in nutch, the index has to be separated manually.
We will receive duplicate results if duplicate documents are indexed on
different servers. Data updates and single-server failures are also hard
to deal with.

Thanks,
Jarvis


-Original Message-
From: Stu Hood [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 19, 2007 10:37 PM
To: solr-user@lucene.apache.org
Subject: Re: How can i make a distribute search on Solr?

Nutch implements federated search separately from their index generation.

My understanding is that MapReduce jobs generate the indexes (Nutch calls
them segments) from raw data that has been downloaded, and then make them
available to be searched via remote procedure calls. Queries never pass
through MapReduce in any shape or form, only the raw data and indexes.

If you take a look at the org.apache.nutch.searcher.DistributedSearch
class, specifically the #Client.search method, you can see how they handle
the actual federation of results.
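
For readers who want a feel for what "federation of results" involves, here
is a rough Python sketch. It is not Nutch's code and does not use any Nutch
or Solr API: the function name merge_ranked and the (score, doc_id) tuple
layout are made up for illustration. It assumes each server has already
sorted its own hits by descending score, and simply merges those lists into
one global top-k list.

```python
import heapq
from itertools import islice


def merge_ranked(shard_hits, k=10):
    """Merge per-shard hit lists into one global top-k list.

    Each element of `shard_hits` is a list of (score, doc_id) pairs that
    the shard has already sorted by descending score (an assumption of
    this sketch, not something taken from Nutch).
    """
    # heapq.merge streams the inputs lazily; reverse=True expects each
    # input to be ordered from largest to smallest key.
    merged = heapq.merge(*shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))
```

One caveat: scores computed against separate indexes are not necessarily
comparable, since each index has its own term statistics, so a real
implementation probably needs more than a plain merge.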

Thanks,
Stu





Re: How can i make a distribute search on Solr?

2007-09-19 Thread Norberto Meijome
On Wed, 19 Sep 2007 10:29:54 -0400
Yonik Seeley [EMAIL PROTECTED] wrote:

  Maybe this is related. How does this compare to the map-reduce 
  functionality in Nutch/Hadoop ?  
 
 map-reduce is more for batch jobs.  Nutch only uses map-reduce for
 parallel indexing, not searching.

I see... so in nutch all nodes have all the data indexed?

Thanks,
_
{Beto|Norberto|Numard} Meijome...heading to read about nutch/hadoop

Imagination is more important than knowledge.
  Albert Einstein, On Science



RE: How can i make a distribute search on Solr?

2007-09-19 Thread Jarvis
I think the index data that is stored in HDFS and generated by the map-reduce
jobs is what gets used for searching in NUTCH-0.9.

You can see the code in the org.apache.nutch.searcher.NutchBean class. :)

Jarvis



Re: How can i make a distribute search on Solr?

2007-09-19 Thread Norberto Meijome
On Thu, 20 Sep 2007 09:37:51 +0800
Jarvis [EMAIL PROTECTED] wrote:

 If we use the RPC call in nutch .
Hi,
I wasn't suggesting to use nutch in solr...I'm only a young grasshopper in this
league to be suggesting architecture stuff :) but i imagine there's nothing
wrong with using what they've built if it addresses solr's needs.

  Manually separate the index is required .

hmm i imagine this really depends on the application. In my case, this
separation of which docs go where happens @ a completely different layer.

 We will receive reduplicate result if there is reduplicate index document on
 different servers. 

Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?
eg, 

1) get the job (a query)
2) map it to workers ( servers that provide search results from their own
indexing)
3) wait for the results from all workers that reply within acceptable timeframe.
4) comb through the lot of  results from all workers, reduce them according to
your own biz rules (eg, remove dupes, sort them by quality / priority... here 
possibly relying on the original parameters of the query in 1)
5) return the reduced results to the frontend.
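
The five steps above can be sketched in a few lines of Python. This is only
an illustration under my own assumptions, not how Solr, Nutch or Hadoop do
it: the name scatter_gather is made up, each "worker" is a plain callable
standing in for a remote search server, and hits are (doc_id, score) pairs.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def scatter_gather(query, workers, timeout=0.5, limit=10):
    """Send `query` to every worker, then merge whatever came back in time.

    `workers` is a list of callables (hypothetical stand-ins for search
    servers); each takes the query and returns (doc_id, score) pairs.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(workers)))
    futures = [pool.submit(worker, query) for worker in workers]  # step 2
    done, _late = wait(futures, timeout=timeout)                  # step 3
    pool.shutdown(wait=False)  # do not block on workers that are still busy

    best = {}  # step 4: drop duplicates, keeping the highest score per doc
    for future in done:
        if future.exception() is not None:
            continue  # a failed worker contributes nothing
        for doc_id, score in future.result():
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score

    # Sort by descending score, breaking ties by doc id.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]                                          # step 5
```

Workers that miss the deadline are simply ignored, so a slow or dead server
costs some results but not the whole response.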

 And also the data updating and single server's error is
 hard to deal with.

this really depends on your infrastructure + design. 

Having the indexing , searching and providing of results in different layers
should make for some interesting design options...

If each searcher (or wherever the index resides) is really a small cluster of
servers , the issue of data safety / server error is addressed @ that point.
You can also have repeated data across indexes (again, independent indexes) and
that's a more ... randomised :) way of keeping the docs safe... For example,
IIRC, googleFS keeps copies of each file in 3 servers or more...
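
As a toy illustration of the "keep each doc on several servers" idea, here
is a Python sketch. It is an assumption-laden example, not how GoogleFS or
HDFS place replicas: the function name replica_servers and the
hash-then-take-neighbours scheme are my own simplification.

```python
import hashlib


def replica_servers(doc_id, servers, copies=3):
    """Pick `copies` distinct servers for a document, deterministically."""
    if copies > len(servers):
        raise ValueError("not enough servers for the requested copies")
    # Hash the id so documents spread roughly evenly over the servers.
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(servers)
    # Take consecutive servers, wrapping around the end of the list.
    return [servers[(start + i) % len(servers)] for i in range(copies)]
```

Because the placement depends only on the doc id and the server list, both
the indexing layer and the search layer can compute it independently.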

cheers,
B
_
{Beto|Norberto|Numard} Meijome

He uses statistics as a drunken man uses lamp-posts ... for support rather
than illumination. Andrew Lang (1844-1912)



RE: How can i make a distribute search on Solr?

2007-09-19 Thread Jarvis
HI,
What you describe is already done by hadoop, which handles hardware failure,
data replication and so on.
If we want to implement such a system ourselves with Solr but without
HDFS, I think it is very complex work. :)
I just want to know whether there is an existing component that can do
distributed search based on Solr.

Thanks 
Jarvis.



Re: How can i make a distribute search on Solr?

2007-09-19 Thread Mike Klaas

On 19-Sep-07, at 7:21 PM, Jarvis wrote:


HI,
What you say is done by hadoop that support Hardware Failure、Data
Replication and some else .
If we want to implement such a good system by ourselves without HDFS
but Solr , it's a very very complex work I think. :)
I just want to know whether there is a component existed can do the
distributed search based on Solr.


https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel


regards,
-Mike

Re: How can i make a distribute search on Solr?

2007-09-19 Thread Norberto Meijome
On Thu, 20 Sep 2007 10:02:08 +0800
Jarvis [EMAIL PROTECTED] wrote:

 You can see the code in org.apache.nutch.searcher.NutchBean class . :)

thx for the pointer.

_
{Beto|Norberto|Numard} Meijome

In order to avoid being called a flirt, she always yielded easily.
  Charles, Count Talleyrand



Re: How can i make a distribute search on Solr?

2007-09-19 Thread Norberto Meijome
On Thu, 20 Sep 2007 10:21:39 +0800
Jarvis [EMAIL PROTECTED] wrote:

   What you say is done by hadoop that support Hardware Failure、Data
 Replication and some else . 
   If we want to implement such a good system by ourselves without HDFS
 but Solr , it's a very very complex work I think. :) 
   I just want to know whether there is a component existed can do the
 distributed search based on Solr.

Thanks for the info.

Risking starting up  a flame war (which is not my intention :) ), what
design reasons / features are there in Solr but not in hadoop/nutch that
would make it compelling to use solr instead of h/n ? 

I know each case is different; the feeling I got from a shortish read into
h/n was that it is geared towards webpage indexing, crawling, etc.  But
possibly I'm missing something...

Whereas Solr is, from my point of view, far more flexible. In which case, maybe
porting HDFS into Solr would add all these clustering / map-reduce options...

thanks for your time and insights :)
B
_
{Beto|Norberto|Numard} Meijome

Windows caters to everyone as though they are idiots. UNIX makes no such
assumption. It assumes you know what you are doing, and presents the challenge
of figuring it out for yourself if you don't.



Re: How can i make a distribute search on Solr?

2007-09-19 Thread Venkatraman S
Along similar lines :

assuming that i have 2 indexes in the same box  , say at :
/home/abc/data/index1 and  /home/abc/data/index2,
and i want the results from both the indexes when i do a search - then how
should this be 'optimally' designed - basically these are different Solr
homes and i want the results to be clearly demarcated as coming from 2
different sources.
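
One simple way to picture this, sketched in Python under my own assumptions
(this is not a Solr feature, and the name search_all is made up): treat each
Solr home as a separate search function, query them one after the other, and
tag every hit with the label of the index it came from. In practice each
callable would wrap an HTTP request to that Solr instance.

```python
def search_all(query, sources):
    """Query each named source and tag every hit with its origin.

    `sources` maps a label (e.g. "index1") to a callable that takes the
    query and returns a list of hits; both are hypothetical stand-ins.
    """
    tagged = []
    for label, search in sources.items():
        for hit in search(query):
            tagged.append({"source": label, "hit": hit})
    return tagged
```

Keeping the source label on each hit means the frontend can either show two
separate result lists or interleave them while still marking the origin.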

-Venkat



RE: How can i make a distribute search on Solr?

2007-09-18 Thread Jarvis
Helpful information. 

So it means that distributed search is not a basic component in Solr project.

Thanks & Best Regards.

Jarvis .


-Original Message-
From: Stu Hood [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 19, 2007 12:55 PM
To: solr-user@lucene.apache.org
Subject: RE: How can i make a distribute search on Solr?

There are two federated/distributed search implementations that are still a few 
weeks away from maturity:
https://issues.apache.org/jira/browse/SOLR-255
https://issues.apache.org/jira/browse/SOLR-303
Any help in testing them would definitely be appreciated.
BUT, if you decide to roll your own, take a look at the following wiki page for 
details on the complexity of the task:
http://wiki.apache.org/solr/FederatedSearch
Good luck!


Thanks,
Stu


-Original Message-
From: 过佳
Sent: Wednesday, September 19, 2007 12:24am
To: solr-user@lucene.apache.org
Subject: How can i make a distribute search on Solr?

Hi everyone,

I have successfully set up Collection Distribution on two Linux servers, one
master and one slave, and the index data syncs between them.

How can I send a search request to the master server and have it
answered by all the slave servers? Or does this have to be controlled manually?



Thanks & Best Regards.



Jarvis .



Re: How can i make a distribute search on Solr?

2007-09-18 Thread Ryan McKinley


So it means that distributed search is not a basic component in Solr project.



I think you just need load balancing.  Solr is not a load balancer, you 
need to find something that works for you and configure that elsewhere. 
Solr works fine without persistent connections, so even simple round-robin 
DNS works fine.
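
If DNS is not an option, the same round-robin idea can be done in the client.
A minimal Python sketch, assuming every server holds a full copy of the index
(as with Collection Distribution); the class name RoundRobin and the example
URLs are made up for illustration and are not part of Solr.

```python
import itertools


class RoundRobin:
    """Hand out backend URLs in rotation, one per request."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        # cycle() repeats the list forever in the original order.
        self._cycle = itertools.cycle(list(backends))

    def next_backend(self):
        return next(self._cycle)
```

Usage would be something like calling next_backend() before each query and
sending the request to whichever URL comes back. Note that this spreads load
only; it does not split the index or combine results.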


Depending on your usage/loads/requirements it may or may not make sense 
to have your master DB in the mix.


Stu is referring to Federated Search - where each index has some of the 
data and results are combined before they are returned.  This is not yet 
supported out of the box


ryan