Re: GIT does not support empty directories
Put a readme file in the directory and be done with it.

On Fri, Apr 16, 2010 at 8:40 AM, Robert Muir rcm...@gmail.com wrote: I don't like the idea of complicating lucene/solr's build system any more than it already is, unless it's absolutely necessary. It's already too complicated. Instead of adding more hacks, what is actually broken (git) is what should be fixed, as the link states: Currently the design of the git index (staging area) only permits *files* to be listed, and nobody competent enough to make the change to allow empty directories has cared enough about this situation to remedy it.

On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org wrote: Seriously. I sympathize with your point that git should support empty directories. But as a practical matter, it's easy to make the ant build tolerant of them. ~ David Smiley

From: Robert Muir [rcm...@gmail.com] Sent: Friday, April 16, 2010 6:53 AM To: solr-dev@lucene.apache.org Subject: Re: GIT does not support empty directories

Seriously? We should hack our ant files around the bugs in every crappy source control system that comes out? Fix Git.

On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote: I've run into this too. I don't think this needs to be documented, I think it needs to be *fixed* -- that is, the relevant ant tasks need to not assume these directories exist and create them if not. ~ David Smiley

-Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Wednesday, April 14, 2010 11:14 PM To: solr-dev Subject: GIT does not support empty directories

There are some empty directories in the Solr source tree, both in 1.4 and the trunk: example/work, example/webapp, example/logs. Git does not support empty directories: https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F And so, when you check out from the Apache GIT repository, these empty directories do not appear, and 'ant example' and 'ant run-example' fail. 
There is no 'how to use the solr git stuff' wiki page; that seems like the right place to document this. I'm not git-smart enough to write that page. -- Lance Norskog goks...@gmail.com -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: GIT does not support empty directories
That is where I learned the trick. On Fri, Apr 16, 2010 at 1:05 PM, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-16 21:33, Ted Dunning wrote: Put a readme file in the directory and be done with it. That's a trick I used with CVS 15 years ago ... these newfangled gizmos aren't so smart after all ;)
Re: Implementing near duplicate detection algorithm using IDF statistics
For reference, you can get a rental copy of this article for less than the cost of the full PDF download here: http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd (joining the ACM is also a good thing to do) (and yes, this is licensed by the ACM)

On Wed, Mar 24, 2010 at 2:28 AM, Thomas Heigl thomas.he...@systemone.at wrote: Hello, For my current project I need to implement an index-time mechanism to detect (near) duplicate documents. The TextProfileSignature available out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright but does not use global collection statistics in deciding which terms will be used for calculating the signature. Most state-of-the-art hash-based duplication detection algorithms make use of this information to improve precision and recall (e.g. http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122 ) Is it possible to access collection statistics - especially IDF values for all non-discarded terms in the current document - from within an implementation of the Signature class? Kind regards, Thomas -- DDI Thomas Heigl Software Engineer System One Gesellschaft für technologiegestützte Kommunikationsprozesse m.b.H. Stiftgasse 6/2/6 thomas.he...@systemone.at http://www.systemone.at Powered by Open-Xchange.com
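The selection step at the heart of such collection-statistics approaches can be sketched in plain Java. This is a hypothetical illustration, not Solr's actual Signature API: it keeps the k rarest (highest-IDF) distinct terms of a document and hashes them, so near-duplicates that share their rare terms produce the same signature. The class name, the in-memory docFreq map, and the IDF smoothing are all assumptions for the sketch; in Solr the document frequencies would have to come from the index reader.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Hypothetical sketch of an IDF-based duplicate signature, in the spirit of
// "Collection Statistics for Fast Duplicate Document Detection".
// NOT Solr's Signature class; docFreq would normally come from the index.
public class IdfSignature {
    private final Map<String, Integer> docFreq; // term -> document frequency
    private final int numDocs;
    private final int k;                        // how many rare terms to keep

    public IdfSignature(Map<String, Integer> docFreq, int numDocs, int k) {
        this.docFreq = docFreq;
        this.numDocs = numDocs;
        this.k = k;
    }

    double idf(String term) {
        // Smoothed IDF; terms never seen in the collection get the highest weight.
        int df = docFreq.getOrDefault(term, 0);
        return Math.log((numDocs + 1.0) / (df + 1.0));
    }

    /** Signature = MD5 of the k highest-IDF (rarest) distinct terms. */
    public String signature(List<String> terms) {
        List<String> distinct = new ArrayList<>(new TreeSet<>(terms));
        distinct.sort(Comparator.comparingDouble(this::idf).reversed());
        List<String> rare = new ArrayList<>(distinct.subList(0, Math.min(k, distinct.size())));
        Collections.sort(rare); // make the signature order-independent
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String t : rare) md5.update(t.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Two documents that differ only in common (low-IDF) terms collapse to the same signature, which is the precision/recall gain over a purely local scheme like TextProfileSignature.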
Re: Branding Solr+Lucene
On Mon, Mar 22, 2010 at 11:30 AM, Yonik Seeley ysee...@gmail.com wrote: On Mon, Mar 22, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote: I'm confused... what is the need for a new name? The only place where there is a conflict is in the top level svn tree... Agree, no need to re-brand. I don't see any need to rebrand. The artifacts will still be called Lucene and Solr, regardless. In SVN, the natural thing to do is go with the momentum that puts solr under lucene. Insofar as the TLP, I would think that it would still be Lucene with two delivered artifacts.
Re: rough outline of where Solr's going
The key word here is end-user. On Tue, Mar 16, 2010 at 10:57 AM, Kevin Osborn osbo...@yahoo.com wrote: I definitely agree with Chris here. Although Lucene and Solr are highly related, the version numbering should communicate whether Solr has changed in a significant or minor way to the end-user. A minor change in Lucene could cause major changes in Solr. And vice-versa, a major change in Lucene could actually result in very little change for the Solr end user.
[jira] Commented: (SOLR-1375) BloomFilter on a field
[ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846166#action_12846166 ] Ted Dunning commented on SOLR-1375: --- Sorry to comment late here, but when indexing in hadoop, it is really nice to avoid any central dependence. It is also nice to focus the map-side join on items likely to match. Thirdly, reduce side indexing is typically really important. The conclusions from these three considerations vary by duplication rate. Using reduce-side indexing gets rid of most of the problems of duplicate versions of a single document (with the same sort key) since the reducer can scan to see whether it has the final copy handy before adding a document to the index. There remain problems where we have to not index documents that already exist in the index or to generate a deletion list that can assist in applying the index update. The former problem is usually the more severe one because it isn't unusual for data sources to just include a full dump of all documents and assume that the consumer will figure out which are new or updated. Here you would like to only index new and modified documents. My own preference for this is to avoid the complication of the map-side join using Bloom filters and simply export a very simple list of stub documents that correspond to the documents in the index. These stub documents should be much smaller than the average document (unless you are indexing tweets) which makes passing around great masses of stub documents not such a problem since Hadoop shuffle, copy and sort times are all dominated by Lucene index times. Passing stub documents allows the reducer to simply iterate through all documents with the same key, keeping the latest version or any stub that is encountered. For documents without a stub, normal indexing can be done with the slight addition of exporting a list of stub documents for the new additions. 
The same thing could be done with a map-side join, but the trade-off is that you now need considerably more memory for the mapper to store the entire bitmap in memory, as opposed to needing (somewhat) more time to pass the stub documents around. How that trade-off plays out in the real world isn't clear. My personal preference is to keep heap space small since the time cost is pretty minimal for me. This problem also turns up in our PDF conversion pipeline where we keep check-sums of each PDF that has already been converted to viewable forms. In that case, the ratio of real document size to stub size is even more preponderant.

BloomFilter on a field -- Key: SOLR-1375 URL: https://issues.apache.org/jira/browse/SOLR-1375 Project: Solr Issue Type: New Feature Components: update Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch Original Estimate: 120h Remaining Estimate: 120h

* A bloom filter is a read-only probabilistic set. It's useful for verifying that a key exists in a set, though it can return false positives. http://en.wikipedia.org/wiki/Bloom_filter
* The use case is indexing in Hadoop and checking for duplicates against a Solr cluster, which (when using the term dictionary or a query) is too slow and exceeds the time consumed for indexing. When a match is found, the host, segment, and term are returned. If the same term is found on multiple servers, multiple results are returned by the distributed process. (We'll need to add in the core name, I just realized.)
* When new segments are created, and commit is called, a new bloom filter is generated from a given field (default: id) by iterating over the term dictionary values. There's a bloom filter file per segment, which is managed on each Solr shard. When segments are merged away, their corresponding .blm files are also removed. 
In a future version we'll have a central server for the bloom filters so we're not abusing the thread pool of the Solr proxy and the networking of the Solr cluster (this will be done sooner rather than later, after testing this version). I held off because the central server requires syncing the Solr servers' files (which is like reverse replication).
* The patch uses the BloomFilter from Hadoop 0.20. I want to jar up only the necessary classes so we don't have a giant Hadoop jar in lib. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
* Distributed code is added and seems to work; I extended TestDistributedSearch to test over multiple HTTP servers. I chose this approach rather than the manual method used by (for example) TermVectorComponent.testDistributed because I'm new to Solr's distributed search
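For illustration, the membership structure the patch relies on works roughly like this minimal sketch. It is not Hadoop's org.apache.hadoop.util.bloom.BloomFilter, just a self-contained stand-in showing the mechanics the issue describes: k hash probes into a bit set, no false negatives, and a false-positive rate that depends on sizing.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch illustrating the per-segment duplicate check
// described in SOLR-1375. The real patch uses Hadoop's BloomFilter class;
// this stand-in just shows the mechanics.
public class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Double hashing: derive k probe positions from two base hashes.
    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = 0x9e3779b9 ^ Integer.rotateLeft(h1, 16);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Called per indexed id when a segment's .blm sidecar is built. */
    public void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(probe(key, i));
    }

    /** May return a false positive, never a false negative. */
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

A Hadoop indexing job can consult such a filter per shard before emitting a document: a negative answer guarantees the id is new, while a positive answer (possibly false) triggers the slower authoritative check.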
[jira] Commented: (SOLR-1814) select count(distinct fieldname) in SOLR
[ https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843632#action_12843632 ] Ted Dunning commented on SOLR-1814: --- Trove is GPL. The Mahout project has a partial set of replacements for Trove collections in case you want to go forward with this. Our plan is to consider breaking out the collections package from Mahout at some point in case you don't want to drag along the rest of Mahout. select count(distinct fieldname) in SOLR Key: SOLR-1814 URL: https://issues.apache.org/jira/browse/SOLR-1814 Project: Solr Issue Type: New Feature Components: SearchComponents - other Affects Versions: 1.4, 1.5, 1.6, 2.0 Reporter: Marcus Herou Fix For: 1.4, 1.5, 1.6, 2.0 Attachments: CountComponent.java I have seen questions on the mailing list about having the functionality for counting distinct values on a field. We at Tailsweep want that as well, for example in our blog search. Example: You had 1345 hits on 244 blogs. The 244 part is not possible in SOLR today (correct me if I am wrong). So I've written a component which does this. Attaching it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
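The aggregation the component performs can be sketched as follows. This is not the attached CountComponent; it is a hypothetical simplification showing the "You had 1345 hits on 244 blogs" computation over matched documents.

```java
import java.util.*;

// Sketch of the count(distinct field) idea from SOLR-1814: alongside the hit
// count, report how many distinct values a field takes across the hits.
// Hypothetical simplification, not the attached CountComponent.
public class DistinctCounter {
    /** Returns {numHits, distinctValues} for one field over matched docs. */
    public static long[] countDistinct(List<Map<String, String>> hits, String field) {
        Set<String> values = new HashSet<>();
        for (Map<String, String> doc : hits) {
            String v = doc.get(field);
            if (v != null) values.add(v);
        }
        return new long[] { hits.size(), values.size() };
    }
}
```

Note that a hash set is the simplest exact approach; Ted's comment about Trove/Mahout collections is about doing the same counting with primitive-friendly (and thus much smaller) hash structures.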
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835963#action_12835963 ] Ted Dunning commented on SOLR-1724: --- Will this http access also allow a cluster with incrementally updated cores to replicate a core after a node failure? Real Basic Core Management with Zookeeper - Key: SOLR-1724 URL: https://issues.apache.org/jira/browse/SOLR-1724 Project: Solr Issue Type: New Feature Components: multicore Affects Versions: 1.4 Reporter: Jason Rutherglen Fix For: 1.5 Attachments: commons-lang-2.4.jar, gson-1.4.jar, hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch Though we're implementing cloud, I need something real soon I can play with and deploy. So this'll be a patch that only deploys new cores, and that's about it. The arch is real simple: On Zookeeper there'll be a directory that contains files that represent the state of the cores of a given set of servers which will look like the following: /production/cores-1.txt /production/cores-2.txt /production/core-host-1-actual.txt (ephemeral node per host) Where each core-N.txt file contains: hostname,corename,instanceDir,coredownloadpath coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, etc and core-host-actual.txt contains: hostname,corename,instanceDir,size Every time a new core-N.txt file is added, the listening host finds its entry in the list and begins the process of trying to match the entries. Upon completion, it updates its /core-host-1-actual.txt file to its completed state or logs an error. When all host actual files are written (without errors), then a new core-1-actual.txt file is written which can be picked up by another process that can create a new core proxy. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
Re: SolrCloud - Using collections, slices and shards in the wild
This is analogous to the multiple data sources that we have at Deepdyve (http://www.deepdyve.com). In a fully sharded and balanced environment, I have found it *much* more efficient to put all data sources into a single collection and use a filter to select one or the other. The rationale is that data sources are distributed in size according to a rough long-tail distribution. For the largest ones, the filters are about as efficient as a separate index because they are such a large fraction of the index. For the small ones, the filtered query is so fast that other issues form the bottleneck anyway. The operational economies of not managing hundreds of indexes and the much better load balancing makes the integrated solution massively better for me. We currently use Katta and this system works really, really well. One big difference in our environments is that for me, the dominant query pattern involves most data sources while for you, the dominant pattern will likely involve a single data source. On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford jon.giff...@gmail.com wrote: 1) Support one index per customer, and many customers (thus, many independent indices) -- Ted Dunning, CTO DeepDyve
Re: SolrCloud - Using collections, slices and shards in the wild
Yeah my decision likely would have been different if it weren't for the fact that Katta is industrial strength already. On Wed, Feb 10, 2010 at 12:59 PM, Jon Gifford jon.giff...@gmail.com wrote: We currently use Katta and this system works really, really well. Katta does look nice, but the SolrCloud stuff seems simpler and closer to what I need. We shall see :-) -- Ted Dunning, CTO DeepDyve
Re: priority queue in query component
Katta has a very flexible and usable option for this even in the absence of replicas. The idea is that shards may report results, may report failure, may report late, may never report or may have a transport layer issue. All kinds of behavior should be handled. What is done with katta is that each search has a deadline and a partial results policy. At any time, if all results have been received, a complete set of results is returned. If a deadline is reached, then the policy is interrogated with the results so far. The policy has the option to return a failure, partial results (with timeouts reported on missing shards) or to set a new deadline and possibly a new policy (so that the number of missing results gets more relaxed as time passes). The policy is also called each time a new result is received or failure is noted. Transport layer issues and explicit error returns are handled by the framework. Any time one of these is encountered, the search is immediately dispatched to a replica of the shard if one exists. In that case, that query may have a late start and may not return by the deadline, depending on policy. If no replica is available that has not been queried, an error result is recorded for that shard. Note that Katta even supports fail-fast in this scenario since the partial result policy can return a new deadline for all partial results that have no hard failures and can return a failure if it notes any shard failures.

On Tue, Feb 9, 2010 at 5:25 AM, Yonik Seeley yo...@lucidimagination.com wrote: The SolrCloud branch now has load balancing and fail-over amongst shard replicas. Partial results aren't available yet (if there are no up replicas for a shard), but that is planned. -Yonik http://www.lucidimagination.com

On Tue, Feb 9, 2010 at 8:21 AM, Jan Høydahl / Cominvent jan@cominvent.com wrote: Isn't that OK as long as there is the option of allowing partial results if you really want? Keeping the logic simple has its benefits. 
Let the client be responsible for the query resubmit strategy, and let the load balancer (or shard manager) be responsible for marking a node/shard as dead/unresponsive and choosing another for the next query. -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com

On 9. feb. 2010, at 04.36, Lance Norskog wrote: At this point, Distributed Search does not support any recovery if one or more shards fail. If any fail or time out, the whole query fails.

On Sat, Feb 6, 2010 at 9:34 AM, mike anderson saidthero...@gmail.com wrote: so if we received the response from shard2 before shard1, we would just queue it up and wait for the response to shard1. This crossed my mind, but my concern was how to handle the case when shard1 never responds. Is this something I need to worry about? -mike

On Sat, Feb 6, 2010 at 11:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: It seems like changing an element in a priority queue breaks the invariants, and hence it's not doable with a priority queue and with the current strategy of adding sub-responses as they are received. One way to continue using a priority queue would be to add sub-responses to the queue in the preferred order... so if we received the response from shard2 before shard1, we would just queue it up and wait for the response to shard1. -Yonik http://www.lucidimagination.com

On Sat, Feb 6, 2010 at 10:35 AM, mike anderson saidthero...@gmail.com wrote: I have a need to favor documents from one shard over another when duplicates occur. I found this code in the query component:

String prevShard = uniqueDoc.put(id, srsp.getShard());
if (prevShard != null) {  // duplicate detected
  numFound--;
  // For now, just always use the first encountered since we can't currently
  // remove the previous one added to the priority queue. If we switched
  // to the Java5 PriorityQueue, this would be easier.
  continue;
  // make which duplicate is used deterministic based on shard
  // if (prevShard.compareTo(srsp.shard) >= 0) {
  //   // TODO: remove previous from priority queue
  //   continue;
  // }
}

Is there a ticket open for this issue? What would it take to fix? Thanks, Mike -- Lance Norskog goks...@gmail.com -- Ted Dunning, CTO DeepDyve
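The deadline-and-policy scheme Ted describes for Katta can be sketched like this. All names here are hypothetical, not Katta's actual API; the point is that the partial-results decision is a pluggable callback that can relax as time passes.

```java
// Sketch of a Katta-style partial-results policy: each search carries a
// deadline; when it expires, the policy inspects the results so far and
// either returns them, fails, or extends the deadline. Hypothetical names,
// not Katta's actual API.
public class PartialResults {
    enum Decision { RETURN_PARTIAL, FAIL, EXTEND }

    interface Policy {
        Decision onDeadline(int received, int expected, int hardFailures);
    }

    /** Example policy: fail fast on any hard shard failure; accept >= 80% of
     *  shards at the deadline; otherwise grant one extension, then relax. */
    static Policy lenientAfterFirstDeadline() {
        return new Policy() {
            boolean extendedOnce = false;
            public Decision onDeadline(int received, int expected, int hardFailures) {
                if (hardFailures > 0) return Decision.FAIL;
                if (received * 10 >= expected * 8) return Decision.RETURN_PARTIAL;
                if (!extendedOnce) { extendedOnce = true; return Decision.EXTEND; }
                return Decision.RETURN_PARTIAL; // requirements relax as time passes
            }
        };
    }
}
```

The framework would call the policy at the deadline and on every newly arrived result or failure; transport-layer retries against replicas happen below this layer, exactly as described above.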
Re: priority queue in query component
It also seems like it should be possible to include the shard number in the ordering of responses.

On Sat, Feb 6, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: It seems like changing an element in a priority queue breaks the invariants, and hence it's not doable with a priority queue and with the current strategy of adding sub-responses as they are received. One way to continue using a priority queue would be to add sub-responses to the queue in the preferred order... so if we received the response from shard2 before shard1, we would just queue it up and wait for the response to shard1. -Yonik http://www.lucidimagination.com

On Sat, Feb 6, 2010 at 10:35 AM, mike anderson saidthero...@gmail.com wrote: I have a need to favor documents from one shard over another when duplicates occur. I found this code in the query component:

String prevShard = uniqueDoc.put(id, srsp.getShard());
if (prevShard != null) {  // duplicate detected
  numFound--;
  // For now, just always use the first encountered since we can't currently
  // remove the previous one added to the priority queue. If we switched
  // to the Java5 PriorityQueue, this would be easier.
  continue;
  // make which duplicate is used deterministic based on shard
  // if (prevShard.compareTo(srsp.shard) >= 0) {
  //   // TODO: remove previous from priority queue
  //   continue;
  // }
}

Is there a ticket open for this issue? What would it take to fix? Thanks, Mike -- Ted Dunning, CTO DeepDyve
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828961#action_12828961 ] Ted Dunning commented on SOLR-1301: ---

{quote} Based on these observations, I have a few questions. (I am a beginner to the Hadoop Solr world. So, please forgive me if my questions are silly): 1. As per the above observation, the SOLR-1045 patch is functionally better (performance I have not verified yet). Can anyone tell me what actual advantage the SOLR-1301 patch offers over the SOLR-1045 patch? 2. If both the jira issues are trying to solve the same problem, do we really need 2 separate issues? {quote}

In the katta community, the recommended practice started with SOLR-1045 behavior (what I call map-side indexing), but I think that the consensus now is that SOLR-1301 behavior (what I call reduce-side indexing) is much, much better. This is not necessarily the obvious result given your observations. There are some operational differences between katta and SOLR that might make the conclusions different, but what I have observed is the following:

a) Index merging is a really bad idea that seems very attractive to begin with, because it is actually pretty expensive and doesn't solve the real problem of bad document distribution across shards. It is much better to simply have lots of shards per machine (aka micro-sharding) and use one reducer per shard. For large indexes, this gives entirely acceptable performance. On a pretty small cluster, we can index 50-100 million large documents in multiple ways in 2-3 hours. Index merging gives you no benefit compared to reduce-side indexing and just increases code complexity.

b) Map-side indexing leaves you with indexes that are heavily skewed by being composed of documents from a single input split. At retrieval time, this means that different shards have very different term frequency profiles and very different numbers of relevant documents. This makes lots of statistics very difficult, including term frequency computation, term weighting and determining the number of documents to retrieve. Map-side merge virtually guarantees that you have to do two cluster queries, one to gather term frequency statistics and another to do the actual query. With reduce-side indexing, you can provide strong probabilistic bounds on how different the statistics in each shard can be, so you can use local term statistics, and you can depend on the score distribution being the same, which radically decreases the number of documents you need to retrieve from each shard.

c) Reduce-side indexing improves the balance of computation during retrieval. If (as is the rule) some document subset is hotter than another document subset due, say, to data-source boosting or recency boosting, you will have very bad cluster utilization with skewed shards from map-side indexing, while with reduce-side indexing all shards will cost about the same for any query, leading to good cluster utilization and faster queries.

d) Reduce-side indexing has properties that can be mathematically stated and proved. Map-side indexing only has comparable properties if you make unrealistic assumptions about your original data.

e) Micro-sharding allows very simple and very effective use of multiple cores on multiple machines in a search cluster. This can be very difficult to do with large shards or a single index.

Now, as you say, these advantages may evaporate if you are looking to produce a single output index. That seems, however, to contradict the whole point of scaling. If you need to scale indexing, presumably you also need to scale search speed and throughput. As such you probably want to have many shards rather than few. Conversely, if you can stand to search a single index, then you probably can stand to index on a single machine. 
Another thing to think about is the fact SOLR doesn't yet do micro-sharding or clustering very well and, in particular, doesn't handle multiple shards per core. That will be changing before long, however, and it is very dangerous to design for the past rather than the future. In case you didn't notice, I strongly suggest you stick with reduce-side indexing. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: 1.5 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop
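The reduce-side dedup step described in these comments (stub documents plus keep-latest-version) can be sketched as a plain reducer-side function. The record and stub shapes here are hypothetical simplifications, not the SOLR-1301 API: all records for one document key arrive at the same reducer, and stubs mark ids (with their versions) already present in the index.

```java
import java.util.List;

// Sketch of reduce-side dedup with stub documents: a stub is a tiny
// placeholder exported for every id already in the index; full records
// carry the document body. Hypothetical shapes, not the SOLR-1301 API.
public class ReduceSideDedup {
    static class Rec {
        final boolean stub;   // true = placeholder for an already-indexed doc
        final long version;
        final String body;    // empty for stubs
        Rec(boolean stub, long version, String body) {
            this.stub = stub; this.version = version; this.body = body;
        }
    }

    /** Returns the document body to index for this key, or null to skip it. */
    public static String resolve(List<Rec> recordsForKey) {
        long stubVersion = -1; // newest version already in the index, if any
        Rec newest = null;
        for (Rec r : recordsForKey) {
            if (r.stub) stubVersion = Math.max(stubVersion, r.version);
            else if (newest == null || r.version > newest.version) newest = r;
        }
        // Index only if we saw a full document newer than anything already indexed.
        return (newest != null && newest.version > stubVersion) ? newest.body : null;
    }
}
```

Because the shuffle groups by key, this single pass handles both duplicate versions within the input and "already in the index" suppression, which is exactly why stub passing competes well with a map-side Bloom filter join.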
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806547#action_12806547 ] Ted Dunning commented on SOLR-1301: --- It is critical to put indexes in the task local area on both local and hdfs storage areas not just because of task cleanup, but also because a task may be run more than once. Hadoop handles all the race conditions that would otherwise happen as a result. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: 1.5 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. 
When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803371#action_12803371 ] Ted Dunning commented on SOLR-1724: --- {quote} ... I agree, I'm not really into ephemeral ZK nodes for Solr hosts/nodes. The reason is contact with ZK is highly superficial and can be intermittent. {quote} I have found that when I was having trouble with ZK connectivity, the problems were simply surfacing issues that I had anyway. You do have to configure the ZK client to not have long pauses (that is incompatible with SOLR how?) and you may need to adjust the timeouts on the ZK side. More importantly, any issues with ZK connectivity will have their parallels with any other heartbeat mechanism, and replicating a heartbeat system that tries to match ZK for reliability is going to be a significant source of very nasty bugs. Better not to rewrite what already works. Keep in mind that ZK *connection* issues are not the same as session expiration. Katta has a fairly important set of bugfixes now to make that distinction and ZK will soon handle connection loss on its own. It isn't a bad idea to keep shards around for a while if a node goes down. That can seriously decrease the cost of momentary outages such as for a software upgrade. The idea is that when the node comes back, it can advertise availability of some shards and replication of those shards should cease. Real Basic Core Management with Zookeeper - Key: SOLR-1724 URL: https://issues.apache.org/jira/browse/SOLR-1724 Project: Solr Issue Type: New Feature Components: multicore Affects Versions: 1.4 Reporter: Jason Rutherglen Fix For: 1.5 Attachments: SOLR-1724.patch Though we're implementing cloud, I need something real soon I can play with and deploy. So this'll be a patch that only deploys new cores, and that's about it. 
The arch is real simple: On Zookeeper there'll be a directory that contains files that represent the state of the cores of a given set of servers which will look like the following: /production/cores-1.txt /production/cores-2.txt /production/core-host-1-actual.txt (ephemeral node per host) Where each core-N.txt file contains: hostname,corename,instanceDir,coredownloadpath coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, etc and core-host-actual.txt contains: hostname,corename,instanceDir,size Every time a new core-N.txt file is added, the listening host finds its entry in the list and begins the process of trying to match the entries. Upon completion, it updates its /core-host-1-actual.txt file to its completed state or logs an error. When all host actual files are written (without errors), then a new core-1-actual.txt file is written which can be picked up by another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Case-insensitive searches and facet case
To amplify this correct answer, use one field for searching (querying). This would be lower cased. Then use a second field for faceting (case preserved). The only gotcha here is that your original data may have inconsistent casing. My usual answer for that is to either impose a conventional case pattern (which takes you back to one field if you like) or to do a spelling corrector analysis to find the most common case pattern for each unique lower cased string. Then during indexing, I impose that pattern on the facet field. On Wed, Jan 20, 2010 at 9:46 AM, Erik Hatcher erik.hatc...@gmail.com wrote: Is there a way to maintain case-impartiality whilst allowing facets to be returned 'case-preserved'? Yes, use different fields. Generally facet fields are string which will maintain exact case. You can leverage the copyField capabilities in schema.xml to clone a field and analyze it differently. -- Ted Dunning, CTO DeepDyve
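The "most common case pattern" analysis Ted describes can be sketched like this: count the surface forms seen for each lower-cased key during a first pass over the data, then impose the most frequent form on the facet field at index time. The class and method names are assumptions for the sketch.

```java
import java.util.*;

// Sketch of most-common-case normalization for a facet field: observe all
// surface forms in a first pass, then canonicalize each value to the most
// frequent casing of its lower-cased key. Hypothetical names.
public class CasePicker {
    /** lowercased key -> (surface form -> count) */
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void observe(String value) {
        counts.computeIfAbsent(value.toLowerCase(Locale.ROOT), k -> new HashMap<>())
              .merge(value, 1, Integer::sum);
    }

    /** Canonical (most frequent) casing for this value's lower-cased key. */
    public String canonical(String value) {
        Map<String, Integer> variants = counts.get(value.toLowerCase(Locale.ROOT));
        if (variants == null) return value; // unseen: keep as-is
        return Collections.max(variants.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

In a Solr setup this would run as a preprocessing step feeding the case-preserved facet field, while the copyField'd search field stays lower cased.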
Re: Solr Cloud wiki and branch notes
Jason V and Jason R have done just that. Great idea. Cool work. But a unified management interface would *really* be nice. On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki a...@getopt.org wrote: Well, then if we don't intend to support updates in this iteration then perhaps there is no need to change anything in Solr, just extend Katta to run Solr searchers ... :P -- Ted Dunning, CTO DeepDyve
Re: Solr Cloud wiki and branch notes
+1 Hadoop still calls it a copy of a block if you have replication factor of 1. Why not? (for that matter, I still call it an integer if it has a value of 1) On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki a...@getopt.org wrote: I originally started off with replica too... but there may only be one copy of a physical shard, it seemed strange to call it a replica. Yeah .. it's a replica with a replication factor of 1 :) -- Ted Dunning, CTO DeepDyve
Re: Solr Cloud wiki and branch notes
Control is easily retained if you make pluggable the selection of shards to which you want to do the horizontal broadcast. The shard management layer shouldn't know or care what query you are doing and in most cases it should just use the trivial all-shards selection policy. On Sun, Jan 17, 2010 at 7:34 AM, Yonik Seeley yo...@lucidimagination.com wrote: I would argue that the current model has been adopted out of necessity, and not because of the users' preference. I think it's both - I've seen quite a few people that really wanted to partition by time for example (and they made some compelling cases for doing so). Seems like a good goal would be to support the customer having various levels of control. -- Ted Dunning, CTO DeepDyve
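A rough sketch of what such a pluggable selection policy could look like (hypothetical names, not a Solr API; the default policy broadcasts to all shards, while a custom one restricts by time partition):

```python
# Conceptual sketch of pluggable shard selection. The management
# layer asks the policy which shards a query should be sent to;
# the trivial default returns all of them.
class AllShardsPolicy:
    def select(self, shards, query):
        return list(shards)  # trivial default: broadcast to every shard

class TimePartitionPolicy:
    """Example custom policy: only query shards for wanted time partitions."""
    def __init__(self, wanted_years):
        self.wanted_years = set(wanted_years)

    def select(self, shards, query):
        return [s for s in shards if s["year"] in self.wanted_years]

shards = [{"id": "s1", "year": 2008},
          {"id": "s2", "year": 2009},
          {"id": "s3", "year": 2010}]
recent = TimePartitionPolicy([2009, 2010]).select(shards, "some query")
```

The point of the factoring is that the broadcast machinery never inspects the query; only the (replaceable) policy does.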
Re: Solr Cloud wiki and branch notes
My experience with Katta is that very quickly my developers adopted "index" as the aggregate of all the shards, which is exactly what Andrzej is proposing. Confusion with the "index contains shards, nodes host shards" terminology has been minimal. On Sat, Jan 16, 2010 at 11:40 AM, Andrzej Bialecki a...@getopt.org wrote: We're currently using "collection". Notice how you had to add "(global)" to clarify what you meant. I fear that a sentence like "what index are you querying" would need constant clarification. I avoided the word "collection", because Solr deploys various cores under collectionX names, leading users to assume that core == collection. "Global index" is two words but it's unambiguous. I'm fine with "collection" if we clarify the definition and avoid using this term for other stuff. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801296#action_12801296 ] Ted Dunning commented on SOLR-1724: --- {quote} We actually started out that way... (when a node went down there wasn't really any trace it ever existed) but have been moving away from it. ZK may not just be a reflection of the cluster but may also control certain aspects of the cluster that you want persistent. For example, marking a node as disabled (i.e. don't use it). One could create APIs on the node to enable and disable and have that reflected in ZK, but it seems like more work than simply saying change this znode. {quote} I see this as a conflation of two or three goals that leads to trouble. All of the goals are worthy and important, but the conflation of them leads to difficult problems. Taken separately, the goals are easily met. One goal is the reflection of current cluster state. That is most reliably done using ephemeral files roughly as I described. Another goal is the reflection of constraints or desired state of the cluster. This is best handled as you describe, with permanent files, since you don't want this desired state to disappear when a node disappears. The real issue is making sure that the source of any piece of information is the thing most directly connected to the physical manifestation of that information. Moreover, it is important in some cases (node state, for instance) that the state stay correct even when the source of that state loses control by crashing, hanging or becoming otherwise indisposed. Inserting an intermediary into this chain of control is a bad idea. Replicating ZK's rather well-implemented ephemeral state mechanism with ad hoc heartbeats is also a bad idea (remember how *many* bugs there have been in Hadoop relative to heartbeats and the NameNode?). A somewhat secondary issue is whether the cluster master has to be involved in every query.
That seems like a really bad bottleneck to me, and Katta provides an existence proof that this is not necessary. After trying several options in production, what I find is best is that the master lay down a statement of desired state and the nodes publish their status in a different and ephemeral fashion. The master can record a history, and there may be general directions such as your disabled list, however you like, but that shouldn't be mixed into the node status, because you otherwise get into a situation where ephemeral files can no longer be used for what they are good at. Real Basic Core Management with Zookeeper - Key: SOLR-1724 URL: https://issues.apache.org/jira/browse/SOLR-1724 Project: Solr Issue Type: New Feature Components: multicore Affects Versions: 1.4 Reporter: Jason Rutherglen Fix For: 1.5 Though we're implementing cloud, I need something real soon I can play with and deploy. So this'll be a patch that only deploys new cores, and that's about it. The arch is real simple: On Zookeeper there'll be a directory that contains files that represent the state of the cores of a given set of servers, which will look like the following: /production/cores-1.txt /production/cores-2.txt /production/core-host-1-actual.txt (ephemeral node per host) Where each core-N.txt file contains: hostname,corename,instanceDir,coredownloadpath coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, etc., and core-host-actual.txt contains: hostname,corename,instanceDir,size Every time a new core-N.txt file is added, the listening host finds its entry in the list and begins the process of trying to match the entries. Upon completion, it updates its /core-host-1-actual.txt file to its completed state or logs an error. When all host actual files are written (without errors), then a new core-1-actual.txt file is written which can be picked up by another process that can create a new core proxy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
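The separation Ted argues for, persistent desired state written by the master and ephemeral actual state published by nodes, can be illustrated with a toy in-memory registry (conceptual Python sketch only; real code would use ZooKeeper's persistent and ephemeral znodes, which ZK deletes automatically on session loss):

```python
# Toy model of the two kinds of state. Not ZooKeeper code: the class
# and method names are invented for illustration.
class MiniRegistry:
    def __init__(self):
        self.persistent = {}   # desired state: survives node death
        self.ephemeral = {}    # actual state: owned by live sessions

    def set_desired(self, path, value):
        """Master records desired state / constraints (e.g. disabled list)."""
        self.persistent[path] = value

    def publish_actual(self, session, path, value):
        """A node publishes its own status, tied to its session."""
        self.ephemeral[path] = (session, value)

    def session_died(self, session):
        """Mimic ZK: ephemeral entries vanish with their owning session."""
        self.ephemeral = {p: (s, v) for p, (s, v) in self.ephemeral.items()
                          if s != session}

    def live_nodes(self):
        return sorted(self.ephemeral)

reg = MiniRegistry()
reg.set_desired("/production/disabled", "host3")          # constraint, persistent
reg.publish_actual("sess-1", "/production/host1", "serving core0")
reg.publish_actual("sess-2", "/production/host2", "serving core1")
reg.session_died("sess-2")  # host2 crashes; its actual state disappears
```

The disabled-list constraint survives the crash while the crashed node's status does not, which is exactly why the two kinds of state shouldn't be mixed in one file.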
Re: [jira] Commented: (SOLR-1301) Solr + Hadoop
This can also be a big performance win. Jason Venner reports significant index and cluster start time improvements by indexing to local disk, zipping and then uploading the resulting zip file. Hadoop has significant file open overhead so moving one zip file wins big over many index component files. There is a secondary bandwidth win as well. On Fri, Jan 15, 2010 at 8:34 AM, Andrzej Bialecki (JIRA) j...@apache.org wrote: HDFS doesn't support enough POSIX to support writing Lucene indexes directly to HDFS - for this reason indexes are always created on local storage of each node, and then after closing they are copied to HDFS.
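The zip-then-upload trick can be sketched as follows (illustrative Python; the HDFS upload step is omitted since it would use the Hadoop FileSystem API, and the index file names below are stand-ins):

```python
# Sketch of the "zip then upload" optimization: bundle the many files
# of a local Lucene index into one archive, so HDFS opens a single
# large file instead of many small ones.
import os
import shutil
import tempfile

def bundle_index(index_dir: str, out_dir: str) -> str:
    """Zip the index directory; returns the archive path."""
    base = os.path.join(out_dir, os.path.basename(index_dir.rstrip("/")))
    return shutil.make_archive(base, "zip", root_dir=index_dir)

# Demo with a fake local index directory.
work = tempfile.mkdtemp()
idx = os.path.join(work, "shard-0")
os.makedirs(idx)
for name in ("_0.cfs", "segments_1", "segments.gen"):
    with open(os.path.join(idx, name), "wb") as f:
        f.write(b"\x00" * 16)  # stand-in for real index files
archive = bundle_index(idx, work)
# Upload of `archive` to HDFS would happen here.
```

One archive per shard means one file-open on the HDFS side, which is where the reported start-time win comes from.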
Re: [jira] Commented: (SOLR-1301) Solr + Hadoop
The reason I would expect a major speed win when indexing to local disk and copying later is that you get much more efficient reading of documents with normal Hadoop mechanisms. Throwing documents at the various Solr master indexers is bound to be slower than having 20 machines reading at local disk speeds, if only because of network delays. On Fri, Jan 15, 2010 at 12:09 PM, Grant Ingersoll gsing...@apache.org wrote: I can see why that is a win over the existing, but I still don't get why it wouldn't be faster just to index to a suite of Solr master indexers and save all this file slogging around. But, I guess that is a separate patch all together.
Re: [jira] Commented: (SOLR-1301) Solr + Hadoop
We index comparable amounts of data in a few hours. On Fri, Jan 15, 2010 at 1:08 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: That and on-demand expandability, so I can reindex 2 terabytes of data in half a day vs weeks or more with 4 Solr masters has compelling advantages. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper
[ https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801051#action_12801051 ] Ted Dunning commented on SOLR-1724: --- Katta had some interesting issues in the design of this. These are discussed here: http://oss.101tec.com/jira/browse/KATTA-43 The basic design consideration is that failure of a node needs to automagically update the ZK state accordingly. This allows all important updates to files to go one direction as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Solr Cloud wiki and branch notes
On Fri, Jan 15, 2010 at 4:36 PM, Andrzej Bialecki a...@getopt.org wrote: My 0.02 PLN on the subject ... Polish currency seems pretty strong lately. There are a lot of good ideas for this small sum.

Terminology
* (global) search index
* index shard:
* partitioning:
* search node:
* search host:
* Shard Manager:

I think that these terms are excellent. The replication and load balancing is a problem with many existing solutions, and this one in particular reminds me strongly of Hadoop's HDFS. In fact, early on during the development of Hadoop [1] I wondered whether we could reuse HDFS to manage Lucene indexes instead of opaque blocks of fixed size. It turned out to be infeasible, but the model of Namenode/Datanode still looks useful in our case, too. I have seen the analogy with Hadoop in managing a Katta cluster. The randomized assignment provides very many of the same robustness benefits as a map-reduce architecture provides for parallel computing. I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper that we could reuse in our design. The following is just a straightforward port of the Namenode/Datanode concept. Let's imagine a component called ShardManager that is responsible for managing the following data:
* list of shard IDs that together form the complete search index,
* for each shard ID, list of search nodes that serve this shard,
* issuing replication requests,
* maintaining the partitioning function (see below), so that updates are directed to correct shards,
* maintaining heartbeat to check for dead nodes,
* providing search clients with a list of nodes to query in order to obtain all results from the search index.

I think that this is close. I think that the list of search nodes that serve each shard should be maintained by the nodes themselves. Moreover, ZK provides the ability to have this list magically update if the node dies. This means that the need for heartbeats virtually disappears.
In addition, I think that a substrate like ZK should be used to provide search clients with the information about which nodes have which shards, and the clients should themselves decide how to cover the set of shards with a list of nodes. This means that the ShardManager is *completely* out of the real-time pathway. ... I believe most of the above functionality could be facilitated by Zookeeper, including the election of the node that runs the ShardManager. Absolutely. Updates --- We need a partitioning schema that splits documents more or less evenly among shards, and at the same time allows us to split or merge unbalanced shards. The simplest function that we could imagine is the following: hash(docId) % numShards though this has the disadvantage that any larger update will affect multiple shards, thus creating an avalanche of replication requests ... so a sequential model would probably be better, where ranges of docIds are assigned to shards. A hybrid is quite possible: hash(floor(docId / sequence-size)) % numShards this gives sequential assignment of sequence-size documents at a time. Sequence-size should be small to distribute query results and update loads across all nodes. Sequence-size should be large to avoid replication of all shards after a focused update. Balance is necessary. Now, if any particular shard is too unbalanced, e.g. too large, it could be further split into two halves, and the ShardManager would have to record this exception. This is a very similar process to a region split in HBase, or a page split in btree DBs. Conversely, shards that are too small could be joined. This is the icing on the cake, so we can leave it for later. Leaving for later is a great idea. With relatively small shards, I am able to parallelize indexing to the point that a terabyte or so of documents index in a few hours.
Combined with a small sequence-size in the shard distribution function so that all shards grow together, it is easy to plan for 3x growth or more without the need for shard splitting. With a complete index being so cheap, I can afford to simply reindex from scratch with a different shard count if I feel like it. Search -- There should be a component sometimes referred to as query integrator (or search front-end) that is the entry and exit point for user search requests. On receiving a search request this component gets a list of randomly selected nodes from SearchManager to contact (the list containing all shards that form the global index), sends the query and integrates partial results (under a configurable policy for timeouts/early termination), and sends back the assembled results to the user. Yes, in outline. A few details: I think that the shard cover computation should, in fact, be done on the client side. One reason is that the node/shard state is relatively static and if all clients retrieve the full state this is cacheable and simple. Another
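The two partitioning functions from the thread can be written out directly (illustrative Python; the bit-mixing hash is an arbitrary stand-in for whatever stable integer hash a real implementation would use):

```python
# The partitioning schemes from the discussion. The hybrid assigns
# runs of sequence_size consecutive docIds to one shard, so a focused
# update touches few shards while load still spreads across all shards.
def hash_int(x: int) -> int:
    # Stand-in stable hash; Python's hash() is identity for small ints,
    # so mix the bits a little for illustration.
    x = (x ^ (x >> 16)) * 0x45d9f3b
    return (x ^ (x >> 16)) & 0x7FFFFFFF

def simple_partition(doc_id: int, num_shards: int) -> int:
    """hash(docId) % numShards"""
    return hash_int(doc_id) % num_shards

def hybrid_partition(doc_id: int, num_shards: int, sequence_size: int) -> int:
    """hash(floor(docId / sequence-size)) % numShards"""
    return hash_int(doc_id // sequence_size) % num_shards

# docIds 0..99 with sequence_size=10 land in runs of 10 per shard:
assignments = [hybrid_partition(d, 4, 10) for d in range(100)]
```

The tension described above is visible in `sequence_size`: larger runs mean fewer shards touched per focused update, smaller runs mean better spread of query and update load.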
Re: SolrCloud logical shards
"Logical shard" sounds good as the name for the collection of all identical physical shards. Another concept from Katta that is AFAIK missing from the Solr lexicon is the distinction between node and shard. In Katta, a node is a server worker instance that contains and queries physical shards. There is usually one node per physical server, but not always. In Katta, an important performance and reliability optimization is that nodes do not contain identical shard sets. That is, shards are assigned randomly even when replicated. This improves robustness, code simplicity and load balancing. On Thu, Jan 14, 2010 at 9:08 AM, Yonik Seeley yo...@lucidimagination.com wrote: Should we use logical shard for this, or does anyone have any better ideas? -- Ted Dunning, CTO DeepDyve
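Katta-style randomized assignment is easy to sketch (illustrative Python, not Katta code; the RNG is seeded only to make the demo reproducible):

```python
# Sketch of Katta-style randomized placement: each shard gets
# `replication` copies on distinct nodes, chosen at random, so no
# two nodes end up with identical shard sets by construction.
import random

def assign_shards(shards, nodes, replication=2, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility in this demo
    placement = {}
    for shard in shards:
        # random.sample picks `replication` distinct nodes
        placement[shard] = rng.sample(nodes, replication)
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = assign_shards([f"shard-{i}" for i in range(8)], nodes,
                          replication=2)
```

Because replicas of different shards land on different node subsets, losing one node degrades capacity a little everywhere instead of taking out a whole replica set, which is the robustness win Ted describes.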
Re: SolrCloud logical shards
I think that most of these complications go away to a remarkable degree if you use Katta-style random assignment of small shards. The major simplifications there include:
- no need to move individual documents, nor to split or merge shards, no need for search-server to search-server communications
- search servers do search and nothing else
- placement, balance, replication and query balancing policy is factored out of all real-time paths
- real-time updates can be accommodated in the same framework with minimal changes to the shard management layer
- the shard management is completely agnostic to the actual search semantics.

On Thu, Jan 14, 2010 at 9:46 AM, Yonik Seeley yo...@lucidimagination.com wrote: I'm actually starting to lean toward slice instead of logical shard. In the future we'll want to enable overlapping shards I think (due to an Amazon Dynamo type of replication, or due to merging shards, etc), and a separate word for a logical slice of the index seems desirable. -- Ted Dunning, CTO DeepDyve
Re: SolrCloud logical shards
I have found that users of the system like to use index as the composite of all nodes/shards/slices that is searched in response to a query. It is the ultimate logical entity. Really, this is the same abstraction that users of Lucene have. They really don't want to care that a Lucene index is made up of several files and even possibly several indexes in various states of merging. The same should really be true of a parallel system, but more so. On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: I'm actually starting to lean toward slice instead of logical shard. Alternate terminology could be index for the actual physical Lucene index (and also enough of the URL that unambiguously identifies it), and then shard could be the logical entity. But I've kind of gotten used to thinking of shards as the actual physical queryable things... -Yonik http://www.lucidimagination.com -- Ted Dunning, CTO DeepDyve
Re: SolrCloud logical shards
Shard has the interesting additional implication that it is part of a composite index made up of many sub-indexes. A Lucene index could be a complete index or a shard. I would presume the same of what might be called a core. On Thu, Jan 14, 2010 at 3:21 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Uri, core to represent a single index and shard to be represented by a single core Can you elaborate on what you mean, isn't a core a single index too? It seems like shard was used to represent a remote index (perhaps?). Though here I'd prefer remote core, because to the uninitiated Solr outsider it's immediately obvious (i.e. they need only know what a core is, in the Solr glossary or term dictionary). In Google vernacular, which is where the name shard came from, a shard is basically a local sub-index http://research.google.com/archive/googlecluster.html where there would be many shards per server. However that's a digression at this point. I personally prefer relatively straightforward names, that are self-evident, rather than inventing new language for fairly simple concepts. Slice, even though it comes from our buddy Yonik, probably doesn't make any immediate sense to external users when compared with the word shard. Of course software projects have a tendency to create their own words to somewhat mystify users into believing in some sort of magic occurring underneath. If that's what we're after, it's cool, I mean that makes sense. And I don't mean to be derogatory here, however this is an open source project created in part to educate users on search and be made as easily accessible as possible, to the greatest number of users possible. I think Doug did a great job of this when Lucene started, with amazingly succinct code for fairly complex concepts (e.g., anti-mystification of search). Jason On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness ubon...@gmail.com wrote: Although Jason has some valid points here, I'm with Yonik here.
I do believe that we've gotten used to the terms core to represent a single index and shard to be represented by a single core. A node seems to indicate a machine or a JVM. Changing any of these (informal perhaps) definitions will only cause confusion. That's why I think a slice is a good solution now... first it's a new term for a new view of the index (logical shards AFAIK don't really exist yet) so people won't need to get used to it, but it's also descriptive and intuitive. I do like Jason's idea about having a protocol attached to the URLs. Cheers, Uri Jason Rutherglen wrote: But I've kind of gotten used to thinking of shards as the actual physical queryable things... I think a mistake was made referring to Solr cores as shards. It's the same thing with 2 different names. Slices adds yet another name which seems to imply the same thing yet again. I'd rather see disambiguation here, and call them cores (partially because that's what's in the code and on the wiki), and cores only. It's a Solr specific term, it's going to be confused with microprocessor cores, but at least there's only one name, which as search people, we know creates fewer posting lists :). Logical groupings of cores can occur, which can be aptly named core groups. This way I can submit a query to a core group, and it's reasonable to assume I'm hitting N cores. Further, cores could point to a logical or physical entity via a URL. (As a side note, I've always found it odd that the shards param to RequestHandler lacks the protocol, what if I want to use HTTPS for example?). So there could be http://host/solr/core1 (physical), core://megacorename (logical), coregroup://supergreatcoregroupname (a group of cores) in the shards parameter (whose name can perhaps be changed for clarity in a future release). Then people can mix and match and we won't have many different XML elements floating around. We'd have a simple list of URLs that are transposed into a real physical network request.
On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: I'm actually starting to lean toward slice instead of logical shard. Alternate terminology could be index for the actual physical Lucene index (and also enough of the URL that unambiguously identifies it), and then shard could be the logical entity. But I've kind of gotten used to thinking of shards as the actual physical queryable things... -Yonik http://www.lucidimagination.com -- Ted Dunning, CTO DeepDyve
Re: SolrCloud logical shards
My definition of right is simple and modularized with minimal conceptual upheaval. As such, simply giving SOLR a good shard manager that broadcasts queries without respect to content is a preferable solution than something very fancy. On Thu, Jan 14, 2010 at 4:31 PM, Lance Norskog goks...@gmail.com wrote: Logical-to-physical mapping should not assume that the logical has an integral number of the physical. Overlapping and partial physical shards should be addressable as a logical shard. If you're going to do something this major, do it right. -- Ted Dunning, CTO DeepDyve