Re: GIT does not support empty directories

2010-04-16 Thread Ted Dunning
Put a readme file in the directory and be done with it.

On Fri, Apr 16, 2010 at 8:40 AM, Robert Muir rcm...@gmail.com wrote:

 I don't like the idea of complicating Lucene/Solr's build system any more
 than it already is, unless it's absolutely necessary. It's already too
 complicated.

 Instead of adding more hacks, what is actually broken (git) is what should
 be fixed, as the link states:

 Currently the design of the git index (staging area) only permits *files*
 to
 be listed, and nobody competent enough to make the change to allow empty
 directories has cared enough about this situation to remedy it.

 On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org
 wrote:

  Seriously.
  I sympathize with your point that git should support empty directories.
   But as a practical matter, it's easy to make the ant build tolerant of
  them.
 
  ~ David Smiley
  
  From: Robert Muir [rcm...@gmail.com]
  Sent: Friday, April 16, 2010 6:53 AM
  To: solr-dev@lucene.apache.org
  Subject: Re: GIT does not support empty directories
 
  Seriously? We should hack our ant files around the bugs in every crappy
  source control system that comes out?
 
  Fix Git.
 
  On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org
  wrote:
 
   I've run into this too.  I don't think this needs to be documented, I
  think
   it needs to be *fixed* -- that is, the relevant ant tasks need to not
  assume
   these directories exist and create them if not.
  
   ~ David Smiley
  
   -Original Message-
   From: Lance Norskog [mailto:goks...@gmail.com]
   Sent: Wednesday, April 14, 2010 11:14 PM
   To: solr-dev
   Subject: GIT does not support empty directories
  
   There are some empty directories in the Solr source tree, both in 1.4
   and the trunk.
  
   example/work
   example/webapp
   example/logs
  
   Git does not support empty directories:
  
  
 
 https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
  
   And so, when you check out from the Apache GIT repository, these empty
   directories do not appear and 'ant example' and 'ant run-example'
   fail. There is no 'how to use the solr git stuff' wiki page; that
   seems like the right place to document this. I'm not git-smart enough
   to write that page.
   --
   Lance Norskog
   goks...@gmail.com
  
 
 
 
  --
  Robert Muir
  rcm...@gmail.com
 



 --
 Robert Muir
 rcm...@gmail.com



Re: GIT does not support empty directories

2010-04-16 Thread Ted Dunning
That is where I learned the trick.

On Fri, Apr 16, 2010 at 1:05 PM, Andrzej Bialecki a...@getopt.org wrote:

 On 2010-04-16 21:33, Ted Dunning wrote:
  Put a readme file in the directory and be done with it.

 That's a trick I used with CVS 15 years ago ... these newfangled gizmos
 aren't so smart after all ;)


Re: Implementing near duplicate detection algorithm using IDF statistics

2010-03-24 Thread Ted Dunning
For reference, you can get a rental copy of this article for less than the
cost of the full PDF download here:


http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd

(joining the ACM is also a good thing to do)

(and yes, this is licensed by the ACM)

On Wed, Mar 24, 2010 at 2:28 AM, Thomas Heigl thomas.he...@systemone.at wrote:

 Hello,

 For my current project I need to implement an index-time mechanism to
 detect (near) duplicate documents. The TextProfileSignature available
 out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright
 but does not use global collection statistics in deciding which terms
 will be used for calculating the signature.
 Most state-of-the-art hash-based duplication detection algorithms make
 use of this information to improve precision and recall (e.g.

 http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122
 )

 Is it possible to access collection statistics - especially IDF values
 for all non-discarded terms in the current document - from within an
 implementation of the Signature class?

 Kind regards,

 Thomas

 --
 DDI Thomas Heigl
 Software Engineer
 
 System One
 Gesellschaft für technologiegestützte
 Kommunikationsprozesse m.b.H.
 Stiftgasse 6/2/6
 thomas.he...@systemone.at
 http://www.systemone.at
 Powered by Open-Xchange.com

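A hedged sketch of one way to get at what Thomas asks for: Solr 1.4's Signature
class only receives raw field text, but a custom UpdateRequestProcessor does see
the SolrQueryRequest, and through it the searcher's docFreq statistics. The class
name, the single-valued "text" field, and the whitespace tokenization are all
assumptions for illustration, not existing Solr API.

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class IdfSignatureProcessor extends UpdateRequestProcessor {
      private final SolrQueryRequest req;

      public IdfSignatureProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
        super(next);
        this.req = req;
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrIndexSearcher searcher = req.getSearcher();
        int numDocs = searcher.maxDoc();
        SolrInputDocument doc = cmd.getSolrInputDocument();
        for (String token : ((String) doc.getFieldValue("text")).split("\\s+")) {
          int df = searcher.docFreq(new Term("text", token.toLowerCase()));
          double idf = Math.log(numDocs / (double) (df + 1)) + 1.0; // classic Lucene IDF
          // ... feed (token, idf) into the signature, keeping only top-weighted terms
        }
        super.processAdd(cmd);
      }
    }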


Re: Branding Solr+Lucene

2010-03-22 Thread Ted Dunning
On Mon, Mar 22, 2010 at 11:30 AM, Yonik Seeley ysee...@gmail.com wrote:

 On Mon, Mar 22, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:
  I'm confused... what is the need for a new name?  The only place where
  there is a conflict is in the top level svn tree...

 Agree, no need to re-brand.


I don't see any need to rebrand.  The artifacts will still be called Lucene
and Solr, regardless.  In SVN, the natural thing to do is go with the
momentum that puts solr under lucene.  As far as the TLP goes, I would think
that it would still be Lucene with two delivered artifacts.


Re: rough outline of where Solr's going

2010-03-16 Thread Ted Dunning
The key word here is end-user.

On Tue, Mar 16, 2010 at 10:57 AM, Kevin Osborn osbo...@yahoo.com wrote:

 I definitely agree with Chris here. Although Lucene and Solr are highly
 related, the version numbering should communicate whether Solr has changed
 in a significant or minor way to the end-user. A minor change in Lucene
 could cause major changes in Solr. And vice-versa, a major change in Lucene
 could actually result in very little change for the Solr end user.



[jira] Commented: (SOLR-1375) BloomFilter on a field

2010-03-16 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846166#action_12846166
 ] 

Ted Dunning commented on SOLR-1375:
---

Sorry to comment late here, but when indexing in Hadoop, it is really nice to 
avoid any central dependency.  It is also nice to focus the map-side join on 
items likely to match.  Thirdly, reduce-side indexing is typically really 
important.

The conclusions from these three considerations vary by duplication rate.  
Using reduce-side indexing gets rid of most of the problems of duplicate 
versions of a single document (with the same sort key) since the reducer can 
scan to see whether it has the final copy handy before adding a document to the 
index.

There remain problems where we have to not index documents that already exist 
in the index or to generate a deletion list that can assist in applying the 
index update.  The former problem is usually the more severe one because it 
isn't unusual for data sources to just include a full dump of all documents and 
assume that the consumer will figure out which are new or updated.  Here you 
would like to only index new and modified documents.  

My own preference for this is to avoid the complication of the map-side join 
using Bloom filters and simply export a very simple list of stub documents that 
correspond to the documents in the index.  These stub documents should be much 
smaller than the average document (unless you are indexing tweets) which makes 
passing around great masses of stub documents not such a problem since Hadoop 
shuffle, copy and sort times are all dominated by Lucene index times.  Passing 
stub documents allows the reducer to simply iterate through all documents with 
the same key keeping the latest version or any stub that is encountered.  For 
documents without a stub, normal indexing can be done with the slight addition of 
exporting a list of stub documents for the new additions.

The same thing could be done with a map-side join, but the trade-off is that 
you now need considerably more memory for the mapper to store the entire bitmap 
in memory, as opposed to needing (somewhat) more time to pass the stub documents 
around.  How that trade-off plays out in the real world isn't clear.  My 
personal preference is to keep heap space small since the time cost is pretty 
minimal for me.

This problem also turns up in our PDF conversion pipeline where we keep 
checksums of each PDF that has already been converted to viewable forms.  In 
that case, the ratio of real document size to stub size is even more 
pronounced.

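A minimal sketch of that stub-document reduce in Hadoop 0.20's old mapred API.
DocWritable and its fields are hypothetical stand-ins for whatever value type
carries the document payload, a version, and the stub flag.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class StubAwareReducer extends MapReduceBase
        implements Reducer<Text, StubAwareReducer.DocWritable,
                           Text, StubAwareReducer.DocWritable> {

      /** Hypothetical value type: a stub flag, a version, and the payload. */
      public static class DocWritable implements Writable {
        boolean stub;
        long version;
        Text payload = new Text();

        public DocWritable() {}
        public DocWritable(DocWritable other) {  // copy; Hadoop reuses iterator values
          stub = other.stub; version = other.version; payload = new Text(other.payload);
        }
        public void write(DataOutput out) throws IOException {
          out.writeBoolean(stub); out.writeLong(version); payload.write(out);
        }
        public void readFields(DataInput in) throws IOException {
          stub = in.readBoolean(); version = in.readLong(); payload.readFields(in);
        }
      }

      public void reduce(Text docId, Iterator<DocWritable> values,
                         OutputCollector<Text, DocWritable> out, Reporter reporter)
          throws IOException {
        DocWritable latest = null;
        boolean alreadyIndexed = false;
        while (values.hasNext()) {
          DocWritable v = values.next();
          if (v.stub) {
            alreadyIndexed = true;        // a stub means the index already holds this doc
          } else if (latest == null || v.version > latest.version) {
            latest = new DocWritable(v);  // keep only the newest full version
          }
        }
        if (!alreadyIndexed && latest != null) {
          out.collect(docId, latest);     // index it, and export a stub for next time
        }
      }
    }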

 BloomFilter on a field
 --

 Key: SOLR-1375
 URL: https://issues.apache.org/jira/browse/SOLR-1375
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, 
 SOLR-1375.patch, SOLR-1375.patch

   Original Estimate: 120h
  Remaining Estimate: 120h

 * A bloom filter is a read-only probabilistic set. It's useful
 for verifying that a key exists in a set, though it can return
 false positives. http://en.wikipedia.org/wiki/Bloom_filter 
 * The use case is indexing in Hadoop and checking for duplicates
 against a Solr cluster, which (when using the term dictionary or a
 query) is too slow and exceeds the time consumed for indexing.
 When a match is found, the host, segment, and term are returned.
 If the same term is found on multiple servers, multiple results
 are returned by the distributed process. (We'll need to add in
 the core name, I just realized.) 
 * When new segments are created, and commit is called, a new
 bloom filter is generated from a given field (default:id) by
 iterating over the term dictionary values. There's a bloom
 filter file per segment, which is managed on each Solr shard.
 When segments are merged away, their corresponding .blm files are
 also removed. In a future version we'll have a central server
 for the bloom filters so we're not abusing the thread pool of
 the Solr proxy and the networking of the Solr cluster (this will
 be done sooner than later after testing this version). I held
 off because the central server requires syncing the Solr
 servers' files (which is like reverse replication). 
 * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
 up only the necessary classes so we don't have a giant Hadoop
 jar in lib.
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
 * Distributed code is added and seems to work; I extended
 TestDistributedSearch to test over multiple HTTP servers. I
 chose this approach rather than the manual method used by (for
 example) TermVectorComponent.testDistributed because I'm new to
 Solr's distributed search.
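
A rough sketch of the per-segment filter build described above, using the Hadoop
0.20 BloomFilter the patch mentions and Lucene 2.9-era term enumeration; the
sizing parameters and the UTF-8 key encoding are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class BloomBuilder {
      public static BloomFilter build(IndexReader reader, String field) throws IOException {
        BloomFilter bloom = new BloomFilter(10000000, 3, Hash.MURMUR_HASH);
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
          do {
            Term t = terms.term();
            if (t == null || !t.field().equals(field)) break;  // walked past the field
            bloom.add(new Key(t.text().getBytes("UTF-8")));
          } while (terms.next());
        } finally {
          terms.close();
        }
        return bloom;  // caller serializes this as the segment's .blm file
      }
    }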

[jira] Commented: (SOLR-1814) select count(distinct fieldname) in SOLR

2010-03-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843632#action_12843632
 ] 

Ted Dunning commented on SOLR-1814:
---


Trove is GPL.

The Mahout project has a partial set of replacements for Trove collections in 
case you want to go forward with this.  Our plan is to consider breaking out 
the collections package from Mahout at some point in case you don't want to 
drag along the rest of Mahout.


 select count(distinct fieldname) in SOLR
 

 Key: SOLR-1814
 URL: https://issues.apache.org/jira/browse/SOLR-1814
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Affects Versions: 1.4, 1.5, 1.6, 2.0
Reporter: Marcus Herou
 Fix For: 1.4, 1.5, 1.6, 2.0

 Attachments: CountComponent.java


 I have seen questions on the mailing list about having the functionality for 
 counting distinct values of a field. We at Tailsweep want that as well, for 
 example in our blog search.
 Example:
 You had 1345 hits on 244 blogs
 The 244 part is not possible in SOLR today (correct me if I am wrong). So 
 I've written a component which does this. Attaching it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
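
A hedged sketch of what such a component's core loop might look like against the
Solr 1.4 / Lucene 2.9 APIs; this is not Marcus's attached CountComponent, and the
"blog" field is made up. A value counts once if any document carrying it is in
the result set.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.solr.search.DocSet;

    public class DistinctCounter {
      public static int countDistinct(IndexReader reader, DocSet hits, String field)
          throws IOException {
        int distinct = 0;
        TermEnum terms = reader.terms(new Term(field, ""));
        TermDocs termDocs = reader.termDocs();
        try {
          do {
            Term t = terms.term();
            if (t == null || !t.field().equals(field)) break;
            termDocs.seek(t);
            while (termDocs.next()) {
              if (hits.exists(termDocs.doc())) { distinct++; break; }
            }
          } while (terms.next());
        } finally {
          termDocs.close();
          terms.close();
        }
        return distinct;  // e.g. countDistinct(reader, hits, "blog") -> 244
      }
    }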



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835963#action_12835963
 ] 

Ted Dunning commented on SOLR-1724:
---


Will this HTTP access also allow a cluster with incrementally updated cores to 
replicate a core after a node failure?

 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5

 Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
 hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
 SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
 SOLR-1724.patch, SOLR-1724.patch


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: SolrCloud - Using collections, slices and shards in the wild

2010-02-10 Thread Ted Dunning
This is analogous to the multiple data sources that we have at DeepDyve
(http://www.deepdyve.com).
In a fully sharded and balanced environment, I have found it *much* more
efficient to put all data sources into a single collection and use a filter
to select one or the other.  The rationale is that data sources are
distributed in size according to a rough long-tail distribution.  For the
largest ones, the filters are about as efficient as a separate index because
they are such a large fraction of the index.  For the small ones, the
filtered query is so fast that other issues form the bottleneck anyway.  The
operational economies of not managing hundreds of indexes and the much
better load balancing make the integrated solution massively better for
me.  We currently use Katta and this system works really, really well.

One big difference in our environments is that for me, the dominant query
pattern involves most data sources while for you, the dominant pattern will
likely involve a single data source.

On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford jon.giff...@gmail.com wrote:

 1) Support one index per customer, and many customers (thus, many
 independent indices)




-- 
Ted Dunning, CTO
DeepDyve


Re: SolrCloud - Using collections, slices and shards in the wild

2010-02-10 Thread Ted Dunning
Yeah, my decision likely would have been different if it weren't for the
fact that Katta is industrial strength already.

On Wed, Feb 10, 2010 at 12:59 PM, Jon Gifford jon.giff...@gmail.com wrote:

  We currently use Katta and this system works really, really well.

 Katta does look nice, but the SolrCloud stuff seems simpler and closer
 to what I need. We shall see :-)




-- 
Ted Dunning, CTO
DeepDyve


Re: priority queue in query component

2010-02-09 Thread Ted Dunning
Katta has a very flexible and usable option for this even in the absence of
replicas.

The idea is that shards may report results, may report failure, may report
late, may never report or may have a transport layer issue.  All kinds of
behavior should be handled.

What is done with Katta is that each search has a deadline and a partial
results policy.  At any time, if all results have been received, a complete
set of results is returned.  If a deadline is reached, then the policy is
interrogated with the results so far.  The policy has the option to return a
failure, partial results (with timeouts reported on missing shards) or to
set a new deadline and possibly a new policy (so that the tolerance for
missing results gets more relaxed as time passes).  The policy is also called
each time a new result is received or a failure is noted.

Transport layer issues and explicit error returns are handled by the
framework.  Any time one of these is encountered, the search is immediately
dispatched to a replica of the shard if one exists.  In that case, that
query may have a late start and may not return by the deadline, depending on
policy.  If no replica is available that has not been queried, an error
result is recorded for that shard.

Note that Katta even supports fail-fast in this scenario since the partial
result policy can return a new deadline for all partial results that have no
hard failures and can return a failure if it notes any shard failures.
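
A paraphrase of that mechanism as an interface; the names here are illustrative,
not Katta's actual API.

    import java.util.Map;
    import java.util.Set;

    public interface PartialResultPolicy {
      /** Consulted at each deadline and whenever a shard result or failure arrives. */
      Decision decide(Set<String> shardsReceived, Map<String, Throwable> shardFailures,
                      Set<String> shardsMissing, long nowMillis);

      final class Decision {
        public final boolean complete;               // return what we have now
        public final boolean fail;                   // give up with an error instead
        public final long newDeadlineMillis;         // otherwise extend the deadline
        public final PartialResultPolicy nextPolicy; // and possibly relax the policy

        public Decision(boolean complete, boolean fail,
                        long newDeadlineMillis, PartialResultPolicy nextPolicy) {
          this.complete = complete;
          this.fail = fail;
          this.newDeadlineMillis = newDeadlineMillis;
          this.nextPolicy = nextPolicy;
        }
      }
    }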

On Tue, Feb 9, 2010 at 5:25 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 The SolrCloud branch now has load balancing and fail-over amongst
 shard replicas.
 Partial results aren't available yet (if there are no up replicas for
 a shard), but that is planned.

 -Yonik
 http://www.lucidimagination.com


 On Tue, Feb 9, 2010 at 8:21 AM, Jan Høydahl / Cominvent
 jan@cominvent.com wrote:
  Isn't that OK as long as there is the option of allowing partial results
 if you really want?
  Keeping the logic simple has its benefits. Let client be responsible for
 query resubmit strategy, and let load balancer (or shard manager) be
  responsible for marking a node/shard as dead/unresponsive and choosing
 another for the next query.
 
  --
  Jan Høydahl  - search architect
  Cominvent AS - www.cominvent.com
 
  On 9. feb. 2010, at 04.36, Lance Norskog wrote:
 
   At this point, Distributed Search does not support any recovery if/when
   one or more shards fail. If any fail or time out, the whole query
   fails.
 
  On Sat, Feb 6, 2010 at 9:34 AM, mike anderson saidthero...@gmail.com
 wrote:
  so if we received the response from shard2 before shard1, we would
 just
  queue it up and wait for the response to shard1.
 
  This crossed my mind, but my concern was how to handle the case when
 shard1
  never responds. Is this something I need to worry about?
 
  -mike
 
  On Sat, Feb 6, 2010 at 11:33 AM, Yonik Seeley 
  yo...@lucidimagination.com wrote:
 
  It seems like changing an element in a priority queue breaks the
  invariants, and hence it's not doable with a priority queue and with
  the current strategy of adding sub-responses as they are received.
 
  One way to continue using a priority queue would be to add
  sub-responses to the queue in the preferred order... so if we received
  the response from shard2 before shard1, we would just queue it up and
  wait for the response to shard1.
 
  -Yonik
  http://www.lucidimagination.com
 
 
  On Sat, Feb 6, 2010 at 10:35 AM, mike anderson 
 saidthero...@gmail.com
  wrote:
  I have a need to favor documents from one shard over another when
  duplicates
  occur. I found this code in the query component:
 
    String prevShard = uniqueDoc.put(id, srsp.getShard());
    if (prevShard != null) {
      // duplicate detected
      numFound--;

      // For now, just always use the first encountered since we can't
      // currently remove the previous one added to the priority queue.
      // If we switched to the Java5 PriorityQueue, this would be easier.
      continue;
      // make which duplicate is used deterministic based on shard
      // if (prevShard.compareTo(srsp.shard) >= 0) {
      //   TODO: remove previous from priority queue
      //   continue;
      // }
    }
 
 
  Is there a ticket open for this issue? What would it take to fix?
 
  Thanks,
  Mike
 
 
 
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 
 




-- 
Ted Dunning, CTO
DeepDyve


Re: priority queue in query component

2010-02-06 Thread Ted Dunning
It also seems like it should be possible to include the shard number in the
ordering of responses.
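
A minimal sketch of that suggestion: break score ties by each shard's position in
the request so the preferred shard deterministically wins duplicates. The Hit
class is a stand-in for illustration, not Solr's actual ShardDoc.

    import java.util.Comparator;

    public class ShardOrdering {
      public static class Hit {
        public final float score;
        public final int shardIndex;  // position of the shard in the shards= list
        public Hit(float score, int shardIndex) {
          this.score = score;
          this.shardIndex = shardIndex;
        }
      }

      /** Higher score first; ties broken in favor of the earlier shard. */
      public static final Comparator<Hit> BY_SCORE_THEN_SHARD = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
          int byScore = Float.compare(b.score, a.score);
          if (byScore != 0) return byScore;
          return a.shardIndex < b.shardIndex ? -1 : (a.shardIndex == b.shardIndex ? 0 : 1);
        }
      };
    }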

On Sat, Feb 6, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 It seems like changing an element in a priority queue breaks the
 invariants, and hence it's not doable with a priority queue and with
 the current strategy of adding sub-responses as they are received.

 One way to continue using a priority queue would be to add
 sub-responses to the queue in the preferred order... so if we received
 the response from shard2 before shard1, we would just queue it up and
 wait for the response to shard1.

 -Yonik
 http://www.lucidimagination.com


 On Sat, Feb 6, 2010 at 10:35 AM, mike anderson saidthero...@gmail.com
 wrote:
  I have a need to favor documents from one shard over another when
 duplicates
  occur. I found this code in the query component:
 
   String prevShard = uniqueDoc.put(id, srsp.getShard());
   if (prevShard != null) {
     // duplicate detected
     numFound--;

     // For now, just always use the first encountered since we can't
     // currently remove the previous one added to the priority queue.
     // If we switched to the Java5 PriorityQueue, this would be easier.
     continue;
     // make which duplicate is used deterministic based on shard
     // if (prevShard.compareTo(srsp.shard) >= 0) {
     //   TODO: remove previous from priority queue
     //   continue;
     // }
   }
 
 
  Is there a ticket open for this issue? What would it take to fix?
 
  Thanks,
  Mike
 




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828961#action_12828961
 ] 

Ted Dunning commented on SOLR-1301:
---

{quote}
Based on these observations, I have a few questions. (I am a beginner to the 
Hadoop & Solr world, so please forgive me if my questions are silly.)
1. As per the above observations, the SOLR-1045 patch is functionally better 
(performance I have not verified yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both JIRA issues are trying to solve the same problem, do we really 
need two separate issues?
{quote}

In the Katta community, the recommended practice started with SOLR-1045 (what I 
call map-side indexing) behavior, but I think that the consensus now is that 
SOLR-1301 behavior (what I call reduce-side indexing) is much, much better.  
This is not necessarily the obvious result given your observations.  There are 
some operational differences between Katta and Solr that might make the 
conclusions different, but what I have observed is the following:

a) index merging is a really bad idea that seems very attractive to begin with 
because it is actually pretty expensive and doesn't solve the real problems of 
bad document distribution across shards.  It is much better to simply have lots 
of shards per machine (aka micro-sharding) and use one reducer per shard.  For 
large indexes, this gives entirely acceptable performance.  On a pretty small 
cluster, we can index 50-100 million large documents in multiple ways in 2-3 
hours.  Index merging gives you no benefit compared to reduce-side indexing and 
just increases code complexity.

b) map-side indexing leaves you with indexes that are heavily skewed by being 
composed of documents from a single input split.  At retrieval time, this 
means that different shards have very different term frequency profiles and 
very different numbers of relevant documents.  This makes lots of statistics 
very difficult, including term frequency computation, term weighting and 
determining the number of documents to retrieve.  Map-side indexing virtually 
guarantees that you have to do two cluster queries, one to gather term 
frequency statistics and another to do the actual query.  With reduce-side 
indexing, you can provide strong probabilistic bounds on how different the 
statistics in each shard can be, so you can use local term statistics and you 
can depend on the score distribution being the same, which radically decreases 
the number of documents you need to retrieve from each shard.

c) reduce-side indexing improves the balance of computation during retrieval.  
If (as is the rule) some document subset is hotter than another due, say, to 
data-source boosting or recency boosting, you will have very bad cluster 
utilization with skewed shards from map-side indexing, while with reduce-side 
indexing all shards will cost about the same for any query, leading to good 
cluster utilization and faster queries.

d) reduce-side indexing has properties that can be mathematically stated 
and proved.  Map-side indexing only has comparable properties if you make 
unrealistic assumptions about your original data.

e) micro-sharding allows very simple and very effective use of multiple cores 
on multiple machines in a search cluster.  This can be very difficult to do 
with large shards or a single index.

Now, as you say, these advantages may evaporate if you are looking to produce a 
single output index.  That seems, however, to contradict the whole point of 
scaling.   If you need to scale indexing, presumably you also need to scale 
search speed and throughput.  As such you probably want to have many shards 
rather than few.  Conversely, if you can stand to search a single index, then 
you probably can stand to index on a single machine. 

Another thing to think about is the fact that Solr doesn't yet do micro-sharding 
or clustering very well and, in particular, doesn't handle multiple shards per 
core.  That will be changing before long, however, and it is very dangerous to 
design for the past rather than the future.

In case you didn't notice, I strongly suggest you stick with reduce-side 
indexing.
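
A minimal sketch of the one-reducer-per-micro-shard routing assumed in (a), in
Hadoop's old mapred API; keying documents by a Text id is an assumption.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class MicroShardPartitioner implements Partitioner<Text, Text> {
      public void configure(JobConf conf) {}

      // one reducer per micro-shard: route each doc id to a stable, non-negative shard
      public int getPartition(Text docId, Text doc, int numShards) {
        return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
      }
    }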

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop

[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-29 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806547#action_12806547
 ] 

Ted Dunning commented on SOLR-1301:
---


It is critical to put indexes in the task-local area on both local and HDFS 
storage, not just because of task cleanup, but also because a task may be 
run more than once.  Hadoop handles all the race conditions that would 
otherwise happen as a result.


 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When the reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
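
The essential flow of the SolrRecordWriter described above, heavily condensed;
setting up the EmbeddedSolrServer from the partial solr.home is elided, and the
batch size and field names are assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MiniSolrWriter {
      private final EmbeddedSolrServer solr;
      private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

      public MiniSolrWriter(EmbeddedSolrServer solr) { this.solr = solr; }

      // the SolrDocumentConverter step: turn a (key, value) pair into a document
      public void write(String id, String body) throws IOException, SolrServerException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text", body);
        batch.add(doc);
        if (batch.size() >= 1000) flush();  // periodic batch submission
      }

      private void flush() throws IOException, SolrServerException {
        if (!batch.isEmpty()) {
          solr.add(batch);
          batch.clear();
        }
      }

      public void close() throws IOException, SolrServerException {
        flush();
        solr.commit();   // as described: commit() and optimize() when the task ends
        solr.optimize();
      }
    }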



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-21 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803371#action_12803371
 ] 

Ted Dunning commented on SOLR-1724:
---

{quote}
... I agree, I'm not really into ephemeral
ZK nodes for Solr hosts/nodes. The reason is contact with ZK is
highly superficial and can be intermittent. 
{quote}
I have found that when I was having trouble with ZK connectivity, the problems 
were simply surfacing issues that I had anyway.  You do have to configure the 
ZK client to not have long pauses (that is incompatible with SOLR how?) and you 
may need to adjust the timeouts on the ZK side.  More importantly, any issues 
with ZK connectivity will have their parallels with any other heartbeat 
mechanism, and replicating a heartbeat system that tries to match ZK for 
reliability is going to be a significant source of very nasty bugs.  Better not 
to rewrite something that already works.  Keep in mind that ZK *connection* 
issues are not the same as session expiration.  Katta has a fairly important 
set of bugfixes now to make that distinction, and ZK will soon handle 
connection loss on its own. 

It isn't a bad idea to keep shards around for a while if a node goes down.  
That can seriously decrease the cost of momentary outages such as for a 
software upgrade.  The idea is that when the node comes back, it can advertise 
availability of some shards and replication of those shards should cease.



 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5

 Attachments: SOLR-1724.patch


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Case-insensitive searches and facet case

2010-01-20 Thread Ted Dunning
To amplify this correct answer, use one field for searching (querying).
This would be lower cased.  Then use a second field for faceting (case
preserved).  The only gotcha here is that your original data may have
inconsistent casing.  My usual answer for that is to either impose a
conventional case pattern (which takes you back to one field if you like) or
to do a spelling corrector analysis to find the most common case pattern for
each unique lower cased string.  Then during indexing, I impose that pattern
on the facet field.

On Wed, Jan 20, 2010 at 9:46 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

 Is there a way to maintain case-impartiality whilst allowing facets to be
 returned 'case-preserved'?


 Yes, use different fields.  Generally facet fields are string which will
 maintain exact case.  You can leverage the copyField capabilities in
 schema.xml to clone a field and analyze it differently.




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Ted Dunning
Jason V and Jason R have done just that.

Great idea.  Cool work.  But a unified management interface would *really*
be nice.

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki a...@getopt.org wrote:

 Well, then if we don't intend to support updates in this iteration then
 perhaps there is no need to change anything in Solr, just extend Katta to
 run Solr searchers ... :P




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Ted Dunning
+1

Hadoop still calls it a copy of a block if you have replication factor of
1.  Why not?

(for that matter, I still call it an integer if it has a value of 1)

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki a...@getopt.org wrote:

  I originally started off with replica too... but since there may be only
  one copy of a physical shard, it seemed strange to call it a replica.


 Yeah .. it's a replica with a replication factor of 1 :)




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Ted Dunning
Control is easily retained if you make pluggable the selection of shards to
which you want to do the horizontal broadcast.  The shard management layer
shouldn't know or care what query you are doing and in most cases it should
just use the trivial all-shards selection policy.

On Sun, Jan 17, 2010 at 7:34 AM, Yonik Seeley yo...@lucidimagination.com wrote:

  I would argue that the current model has been adopted out of necessity,
 and
  not because of the users' preference.

 I think it's both - I've seen quite a few people that really wanted to
 partition by time for example (and they made some compelling cases for
 doing so).  Seems like a good goal would be to support the customer
 having various levels of control.




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-16 Thread Ted Dunning
My experience with Katta is that very quickly my developers adopted index as
the aggregate of all the shards, which is exactly what Andrzej is proposing.
Confusion with the "index contains shards, nodes host shards" terminology
has been minimal.

On Sat, Jan 16, 2010 at 11:40 AM, Andrzej Bialecki a...@getopt.org wrote:

 We're currently using collection.  Notice how you had to add
 (global) to clarify what you meant.  I fear that a sentence like what
 index are you querying would need constant clarification.


 I avoided the word collection, because Solr deploys various cores under
 collectionX names, leading users to assume that core == collection.
  Global index is two words but it's unambiguous. I'm fine with
  collection if we clarify the definition and avoid using this term for
  other stuff.




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-16 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801296#action_12801296
 ] 

Ted Dunning commented on SOLR-1724:
---

{quote}
We actually started out that way... (when a node went down there wasn't really 
any trace it ever existed) but have been moving away from it.
ZK may not just be a reflection of the cluster but may also control certain 
aspects of the cluster that you want persistent. For example, marking a node as 
disabled (i.e. don't use it). One could create APIs on the node to enable and 
disable and have that reflected in ZK, but it seems like more work than simply 
saying change this znode.
{quote}

I see this as a conflation of two or three goals that leads to trouble.  All 
of the goals are worthy and important, but the conflation of them leads to 
difficult problems.  Taken separately, the goals are easily met.

One goal is the reflection of current cluster state.  That is most reliably 
done using ephemeral files roughly as I described.

Another goal is the reflection of constraints or desired state of the cluster.  
This is best handled as you describe, with permanent files since you don't want 
this desired state to disappear when a node disappears.

The real issue is making sure that the source of whatever information is most 
directly connected to the physical manifestation of that information.  
Moreover, it is important in some cases (node state, for instance) that the 
state stay correct even when the source of that state loses control by 
crashing, hanging or becoming otherwise indisposed.  Inserting an intermediary 
into this chain of control is a bad idea.  Replicating ZK's rather well 
implemented ephemeral state mechanism with ad hoc heartbeats is also a bad idea 
(remember how *many* bugs there have been in Hadoop relative to heartbeats and 
the namenode?).

A somewhat secondary issue is whether the cluster master has to be involved in 
every query.  That seems like a really bad bottleneck to me and Katta provides 
a proof of existence that this is not necessary.

After trying several options in production, what I find is best is that the 
master lay down a statement of desired state and the nodes publish their status 
in a different and ephemeral fashion.  The master can record a history, or there 
may be general directions such as your disabled list, however you like, but that 
shouldn't be mixed into the node status, because you otherwise get into a 
situation where ephemeral files can no longer be used for what they are good at.
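
A minimal sketch of that split with the ZooKeeper client API: desired state as
persistent znodes written by the master, live status as ephemeral znodes written
by each node. The paths and payloads here are made up.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NodeState {
      // master: constraints and desired state survive any node or master crash
      public static void setDesired(ZooKeeper zk, String node, byte[] desired)
          throws KeeperException, InterruptedException {
        zk.create("/desired/" + node, desired,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      }

      // node: live status vanishes with the session if the node crashes or hangs
      public static void publishStatus(ZooKeeper zk, String node, byte[] status)
          throws KeeperException, InterruptedException {
        zk.create("/live/" + node, status,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      }
    }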



 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Ted Dunning
This can also be a big performance win.  Jason Venner reports significant
index and cluster start time improvements by indexing to local disk, zipping
and then uploading the resulting zip file.  Hadoop has significant file open
overhead so moving one zip file wins big over many index component files.
There is a secondary bandwidth win as well.
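
A sketch of that zip-then-upload trick: pack the local index into a single zip so
HDFS sees one file open instead of one per index component. Paths and buffer size
are examples.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IndexUploader {
      public static void upload(File indexDir, String hdfsDest) throws Exception {
        File zip = new File(indexDir.getName() + ".zip");
        ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip));
        byte[] buf = new byte[65536];
        for (File f : indexDir.listFiles()) {      // one entry per index component file
          out.putNextEntry(new ZipEntry(f.getName()));
          FileInputStream in = new FileInputStream(f);
          for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
          in.close();
          out.closeEntry();
        }
        out.close();
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path(zip.getPath()), new Path(hdfsDest)); // one stream
      }
    }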

On Fri, Jan 15, 2010 at 8:34 AM, Andrzej Bialecki (JIRA) j...@apache.org wrote:


 HDFS doesn't support enough POSIX to support writing Lucene indexes
 directly to HDFS - for this reason indexes are always created on local
 storage of each node, and then after closing they are copied to HDFS.





Re: [jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Ted Dunning
The reason I would expect a major speed win when indexing to local disk and
copying later is that you get much more efficient reading of documents with
normal Hadoop mechanisms.  Throwing documents to the various Solr master
indexers is bound to be slower than having 20 machines reading at local disk
speeds, if only because of network delays.

On Fri, Jan 15, 2010 at 12:09 PM, Grant Ingersoll gsing...@apache.org wrote:

 I can see why that is a win over the existing, but I still don't get why it
 wouldn't be faster just to index to a suite of Solr master indexers and save
 all this file slogging around.  But, I guess that is a separate patch all
 together.




Re: [jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Ted Dunning
We index comparable amounts of data in a few hours.

On Fri, Jan 15, 2010 at 1:08 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 That and on-demand expandability, so I
 can reindex 2 terabytes of data in half a day vs weeks or more
 with 4 Solr masters has compelling advantages.




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-15 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801051#action_12801051
 ] 

Ted Dunning commented on SOLR-1724:
---


Katta had some interesting issues in the design of this.

These are discussed here: http://oss.101tec.com/jira/browse/KATTA-43

The basic design consideration is that failure of a node needs to automagically 
update the ZK state accordingly.  This allows all important updates to files to 
go one direction as well.


 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Solr Cloud wiki and branch notes

2010-01-15 Thread Ted Dunning
On Fri, Jan 15, 2010 at 4:36 PM, Andrzej Bialecki a...@getopt.org wrote:

 My 0.02 PLN on the subject ...


Polish currency seems pretty strong lately.  There are a lot of good ideas
for this small sum.



 Terminology

 * (global) search index
 * index shard:
 * partitioning:
 * search node:
 * search host:
 * Shard Manager:


I think that these terms are excellent.


 The replication and load balancing is a problem with many existing
 solutions, and this one in particular reminds me strongly of the Hadoop
 HDFS. In fact, early on during the development of Hadoop [1] I wondered
 whether we could reuse HDFS to manage Lucene indexes instead of opaque
 blocks of fixed size. It turned out to be infeasible, but the model of
 Namenode/Datanode still looks useful in our case, too.


I have seen the analogy with Hadoop in managing a Katta cluster.  The
randomized assignment provides very many of the same robustness benefits as
a map-reduce architecture provides for parallel computing.


 I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper
 that we could reuse in our design. The following is just a straightforward
 port of the Namenode/Datanode concept.

 Let's imagine a component called ShardManager that is responsible for
 managing the following data:

 * list of shard ID-s that together form the complete search index,
 * for each shard ID, list of search nodes that serve this shard.
 * issuing replication requests
 * maintaining the partitioning function (see below), so that updates are
 directed to correct shards
 * maintaining heartbeat to check for dead nodes
 * providing search clients with a list of nodes to query in order to obtain
 all results from the search index.


I think that this is close.

I think that the list of search nodes that serve each shard should be
maintained by the nodes themselves.  Moreover, ZK provides the ability to
have this list magically update if the node dies.

This means that the need for heartbeats virtually disappears.

In addition, I think that a substrate like ZK should be used to provide
search clients with the information about which nodes have which shards and
the clients should themselves decide how to cover the set of shards with a
list of nodes.  This means that the ShardManager is *completely* out of the
real-time pathway.
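
A sketch of that client-side cover computation against the ZooKeeper API; the
/shards/<shardId>/<node> znode layout is an assumption.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class ShardCover {
      public static Map<String, String> pickCover(ZooKeeper zk)
          throws KeeperException, InterruptedException {
        Random rand = new Random();
        Map<String, String> cover = new HashMap<String, String>();
        for (String shard : zk.getChildren("/shards", true)) {      // watch for changes
          List<String> nodes = zk.getChildren("/shards/" + shard, true);
          if (!nodes.isEmpty()) {
            cover.put(shard, nodes.get(rand.nextInt(nodes.size()))); // one node per shard
          }
        }
        return cover;  // cacheable; the watches fire when nodes come or go
      }
    }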



 ... I believe most of the above functionality could be facilitated by
 Zookeeper, including the election of the node that runs the ShardManager.


Absolutely.


 Updates
 ---
 We need a partitioning schema that splits documents more or less evenly
 among shards, and at the same time allows us to split or merge unbalanced
 shards. The simplest function that we could imagine is the following:

hash(docId) % numShards

 though this has the disadvantage that any larger update will affect
 multiple shards, thus creating an avalanche of replication requests ... so a
 sequential model would be probably better, where ranges of docIds are
 assigned to shards.


A hybrid is quite possible:

hash(floor(docId / sequence-size)) % numShards

This gives sequential assignment of sequence-size documents at a time.
Sequence-size should be small to distribute query results and update loads
across all nodes.  Sequence-size should be large to avoid replication of all
shards after a focused update.  Balance is necessary.
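
The same function spelled out, with the sign handled explicitly; sequenceSize is
the tuning knob described above.

    public class HybridPartitioner {
      public static int shardFor(long docId, int numShards, long sequenceSize) {
        long block = docId / sequenceSize;                 // floor(docId / sequence-size)
        int h = (int) (block ^ (block >>> 32));            // hash(...)
        return ((h % numShards) + numShards) % numShards;  // ... % numShards, sign-safe
      }
    }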


 Now, if any particular shard is too unbalanced, e.g. too large, it could be
 further split in two halves, and the ShardManager would have to record this
 exception. This is a very similar process to a region split in HBase, or a
 page split in btree DBs. Conversely, shards that are too small could be
 joined. This is the icing on the cake, so we can leave it for later.


Leaving for later is a great idea.  With relatively small shards, I am able
to parallelize indexing to the point that a terabyte or so of documents
index in a few hours.  Combined with a small sequence-size in the shard
distribution function so that all shards grow together, it is easy to plan
for 3x growth or more without the need to shard splitting.  With a complete
index being so cheap, I can afford to simply reindex from scratch with a
different shard count if I feel like it.



 Search
 --
 There should be a component sometimes referred to as query integrator (or
 search front-end) that is the entry and exit point for user search requests.
 On receiving a search request this component gets a list of randomly
 selected nodes from SearchManager to contact (the list containing all shards
 that form the global index), sends the query and integrates partial results
 (under a configurable policy for timeouts/early termination), and sends back
 the assembled results to the user.


Yes in outline.

A few details:

I think that the shard cover computation should, in fact, be done on the
client side.  One reason is that the node/shard state is relatively static
and if all clients retrieve the full state this is cachable and simple.

Another 

Re: SolrCloud logical shards

2010-01-14 Thread Ted Dunning
Logical shard sounds good as the collection of all identical physical
shards.

Another concept from Katta that is AFAIK missing from the Solr lexicon is
the distinction between node and shard.  In Katta, a node is a server worker
instance that contains and queries physical shards.  There is usually one
node per physical server, but not always.  In Katta an important performance
and reliability optimization is that nodes do not contain identical shard
sets.  That is, shards are assigned randomly even when replicated.  This
improves robustness, code simplicity and load balancing.

On Thu, Jan 14, 2010 at 9:08 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 Should we use logical shard for this, or does anyone have any better ideas?




-- 
Ted Dunning, CTO
DeepDyve


Re: SolrCloud logical shards

2010-01-14 Thread Ted Dunning
I think that most of these complications go away to a remarkable degree if
you combine Katta-style random assignment with small shards.

The major simplifications there include:

- no need to move individual documents, nor to split or merge shards, no
need for search-server to search-server communications

- search servers do search and nothing else

- placement, balance, replication and query balancing policy are factored out
of all real-time paths

- real-time updates can be accommodated in the same framework with minimal
changes to the shard management layer

- the shard management is completely agnostic to the actual search
semantics.

On Thu, Jan 14, 2010 at 9:46 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 I'm actually starting to lean toward slice instead of logical shard.
 In the future we'll want to enable overlapping shards I think (due to
 an Amazon Dynamo type of replication, or due to merging shards, etc),
 and a separate word for a logical slice of the index seems desirable.




-- 
Ted Dunning, CTO
DeepDyve


Re: SolrCloud logical shards

2010-01-14 Thread Ted Dunning
I have found that users of the system like to use index as the composite of
all nodes/shards/slices that is searched in response to a query.  It is the
ultimate logical entity.   Really, this is the same abstraction that users
of Lucene have.  They really don't want to care that a Lucene index is made
up of several files and even possibly several indexes in various states of
merging.  The same should really be true of a parallel system, but more so.


On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
yo...@lucidimagination.com wrote:

 On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
  yo...@lucidimagination.com wrote:
  I'm actually starting to lean toward slice instead of logical shard.

  Alternate terminology could be index for the actual physical Lucene
  index (and also enough of the URL that unambiguously identifies it),
 and then shard could be the logical entity.

 But I've kind of gotten used to thinking of shards as the actual
 physical queryable things...

 -Yonik
 http://www.lucidimagination.com




-- 
Ted Dunning, CTO
DeepDyve


Re: SolrCloud logical shards

2010-01-14 Thread Ted Dunning
Shard has the interesting additional implication that it is part of a
composite index made up of many sub-indexes.

A Lucene index could be a complete index or a shard.  I would presume the
same of what might be called a core.

On Thu, Jan 14, 2010 at 3:21 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Uri,

  core to represent a single index and shard to be
  represented by a single core

 Can you elaborate on what you mean, isn't a core a single index
 too? It seems like shard was used to represent a remote index
 (perhaps?). Though here I'd prefer remote core, because to the
 uninitiated Solr outsider it's immediately obvious (i.e. they
 need only know what a core is, in the Solr glossary or term
 dictionary).

 In Google vernacular, which is where the name shard came from, a
 shard is basically a local sub-index
 http://research.google.com/archive/googlecluster.html where
 there would be many shards per server. However that's a
 digression at this point.

 I personally prefer relatively straightforward names that are
 self-evident, rather than inventing new language for fairly
 simple concepts. Slice, even though it comes from our buddy
 Yonik, probably doesn't make any immediate sense to external
 users when compared with the word shard. Of course software
 projects have a tendency to create their own words to somewhat
 mystify users into believing in some sort of magic occurring
 underneath. If that's what we're after, it's cool, I mean that
 makes sense. And I don't mean to be derogatory here however this
 is an open source project created in part to educate users on
 search and be made as easily accessible as possible, to the
 greatest number of users possible. I think Doug did a great job
 of this when Lucene started with amazingly succinct code for
 fairly complex concepts (e.g., anti-mystification of search).

 Jason

 On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness ubon...@gmail.com wrote:
  Although Jason has some valid points here, I'm with Yonik here. I do
 believe
  that we've gotten used to the terms core to represent a single index
 and
  shard to be represented by a single core. A node seems to indicate a
  machine or a JVM. Changing any of these (informal perhaps) definitions
 will
  only cause confusion. That's why I think a slice is a good solution
 now...
   first it's a new term for a new view of the index (logical shards AFAIK
   don't really exist yet) so people won't need to get used to it, but it's also
  descriptive and intuitive. I do like Jason's idea about having a protocol
  attached to the URL's.
 
  Cheers,
  Uri
 
  Jason Rutherglen wrote:
 
  But I've kind of gotten used to thinking of shards as the
  actual physical queryable things...
 
 
  I think a mistake was made referring to Solr cores as shards.
  It's the same thing with 2 different names. Slices adds yet
  another name which seems to imply the same thing yet again. I'd
  rather see disambiguation here, and call them cores (partially
  because that's what's in the code and on the wiki), and cores
  only. It's a Solr specific term, it's going to be confused with
  microprocessor cores, but at least there's only one name, which
  as search people, we know creates fewer posting lists :).
 
  Logical groupings of cores can occur, which can be aptly named
  core groups. This way I can submit a query to a core group, and
  it's reasonable to assume I'm hitting N cores. Further, cores
  could point to a logical or physical entity via a URL. (As a
  side note, I've always found it odd that the shards param to
  RequestHandler lacks the protocol, what if I want to use HTTPS
  for example?).
 
  So there could be http://host/solr/core1 (physical),
  core://megacorename (logical),
  coregroup://supergreatcoregroupname (a group of cores) in the
  shards parameter (whose name can perhaps be changed for
  clarity in a future release). Then people can mix and match and
  we won't have many different XML elements floating around. We'd
  have a simple list of URLs that are transposed into a real
  physical network request.
 
 
  On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
  yo...@lucidimagination.com wrote:
 
 
  On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
  yo...@lucidimagination.com wrote:
 
 
  On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
  yo...@lucidimagination.com wrote:
 
 
  I'm actually starting to lean toward slice instead of logical
  shard.
 
 
   Alternate terminology could be index for the actual physical Lucene
   index (and also enough of the URL that unambiguously identifies it),
  and then shard could be the logical entity.
 
  But I've kind of gotten used to thinking of shards as the actual
  physical queryable things...
 
  -Yonik
  http://www.lucidimagination.com
 
 
 
 
 




-- 
Ted Dunning, CTO
DeepDyve


Re: SolrCloud logical shards

2010-01-14 Thread Ted Dunning
My definition of right is simple and modularized with minimal conceptual
upheaval.

As such, simply giving SOLR a good shard manager that broadcasts queries
without respect to content is preferable to something very fancy.

On Thu, Jan 14, 2010 at 4:31 PM, Lance Norskog goks...@gmail.com wrote:

 Logical-to-physical mapping should not assume that the logical has an
 integral number of the physical. Overlapping and partial physical
 shards should be addressable as a logical shard. If you're going to do
 something this major, do it right.




-- 
Ted Dunning, CTO
DeepDyve