Re: Organizing the Lucene meetup (Was: ApacheCon US)
There is an initial schedule online at: http://wiki.apache.org/lucene-java/LuceneAtApacheConUs2009 Isabel, I still plan to do the Katta introduction. Is someone officially maintaining the page, or should I just go ahead and remove the question mark myself? Stefan
Re: [ACUS09] IMPORTANT SPEAKER CONFIRMATION MESSAGE
Sorry I'm a day late, but I can confirm I can do a 20 min Katta intro. On Jul 16, 2009, at 12:37 AM, Michael Busch wrote: I confirm I'm coming and that I'd like to give the talk below. Alternatively we could also split the talk into two separate talks, "Lucene Basics" and "New Features in Lucene and Advanced Indexing Techniques". - Intro to Katta (Stefan?) (20 mins)
Re: [APACHECON] Planning
Hi Grant, sorry I lost track here, is there a list of accepted presentations somewhere? Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Jun 17, 2009, at 8:42 AM, Grant Ingersoll wrote: Note, you may not have permission to view that page. Sorry. Not my call. Also note that it is _MY_ understanding that airfare is no longer covered as part of the speaker package. Maybe others can confirm this. I'm not sure how this affects people's willingness to speak, but it is a downer in my mind. However, the ASF does have a Travel Assistance Committee that people can apply to for assistance. I don't know the details of that. On Jun 17, 2009, at 10:42 AM, Grant Ingersoll wrote: OK, we've been allotted 2 days for Lucene: http://wiki.apache.org/concom-planning/ParticipatingPmcs . More later with info about the Calls for Presentations (CFPs). Now we need to figure out what we are going to do. Also, we need, ASAP, a description that satisfies: In order to get registration open ASAP, we need promo text for each track. If you're copied on this email, I'll be nagging you for a text for your project (listed below). If there's someone else I should nag instead, please let me know. If you know who I should be nagging for the last three tracks below, please let me know that too. What should the promo text look like? We need 150-200 words, explaining - what the track will cover (an outline is fine; you don't need abstracts and bios if you're not ready for that), - who the intended audience is, and - why people will want to attend/what they'll get out of it. If you're planning something amazing, cool, new or exciting, we want some information about that. Is there going to be a panel discussion with some of the central project members telling people what to expect in the widely anticipated next release? How about a hands-on masterclass on that really tricky part of the project that everyone has trouble with?
Or everything you need to know to decide which technologies to use in which situations, and how to get the most out of your limited resources?
ScaleCamp: get together the night before Hadoop Summit
Hi All, We are planning a community event the night before the Hadoop Summit. This "BarCamp" (http://en.wikipedia.org/wiki/BarCamp) event will be held at the same venue as the Summit (Santa Clara Marriott). Refreshments will be served to encourage socializing. To kick off conversations for the social part of the evening, we are offering people the opportunity to present an experience report of their project (within a 15 min presentation). We have 12 slots in 3 parallel tracks max. The focus should be on projects leveraging technologies from the Hadoop ecosystem. Please join us and mingle with the rest of the Hadoop community. To find out more about this event and sign up, please visit: http://www.scaleunlimited.com/events/scale_camp Please submit your presentation here: http://www.scaleunlimited.com/about-us/contact Stefan P.S. Please spread the word! P.P.S. Apologies for the cross-posting.
Re: [ANNOUNCE] Katta 0.5 released
Hi Steve, I don't like babysitting a build system, so I like convention over configuration. Maven sounds good for that, but after many years of being a Maven fan I just could not understand why essential plugins are still so buggy and why I have to spend so much energy when I want to customize something. Obviously Maven describes the project; it is not a build script. I also like dependency management. Gradle will be a great build tool. It has convention over configuration, it uses Java syntax (Groovy) to write the build script, has dependency management, etc. It is actually really cool, but we adopted it too early. It had bugs that blocked our productivity. Now we are back at Ant and use Ivy for dependency management. Ivy isn't well documented but works pretty solidly for us. Ant is solid, though I don't like writing scripts in a declarative language - XML - and Ant's multi-project build capabilities aren't the greatest. Anyhow, we decided on Ant since it is a solid workhorse. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Apr 9, 2009, at 9:50 AM, Steven A Rowe wrote: Oops, just saw on the wiki that "Gradle" (never heard of it before) is the build system (former build system, I gather from the release announcement) - I'm still interested in why the switch was made, though. - Steve On 4/9/2009 at 12:22 PM, Steven A Rowe wrote: On 4/9/2009 at 3:16 AM, Stefan Groschupf wrote: Release 0.5 of Katta is now available. Congratulations on the release! [...] switched to Ant and Ivy as a build system [...] AFAICT, the build was previously performed with Maven 2 - is there a public discussion available anywhere concerning this switch? (I looked and couldn't find anything.) If there's no public discussion available, can you say a few words about the rationale behind the switch?
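For readers unfamiliar with the Ant+Ivy combination mentioned above, a minimal ivy.xml dependency descriptor might look roughly like this. This is an illustrative sketch, not Katta's actual descriptor; the organisation/module names are assumptions, and only the version numbers are taken from the 0.5 release notes:

```xml
<!-- Hypothetical ivy.xml sketch; org/module names are illustrative -->
<ivy-module version="2.0">
  <info organisation="net.sf.katta" module="katta"/>
  <dependencies>
    <dependency org="org.apache.lucene" name="lucene-core" rev="2.4.1"/>
    <dependency org="org.apache.hadoop" name="hadoop-core" rev="0.19.0"/>
    <dependency org="org.apache.zookeeper" name="zookeeper" rev="3.1.0"/>
  </dependencies>
</ivy-module>
```

An Ant target would then call Ivy's retrieve task to resolve these into a lib directory, which is the "convention over configuration" part Ivy adds on top of plain Ant.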
[ANNOUNCE] Katta 0.5 released
(...apologies for the cross-posting...) Release 0.5 of Katta is now available. Katta - Lucene in the cloud. http://katta.sourceforge.net This release fixes bugs from 0.4, including one that sorted results incorrectly under load. 0.5 also upgrades ZooKeeper to version 3.1, Lucene to version 2.4.1, and Hadoop to version 0.19.0. The new API supports Lucene Query objects instead of just Strings, adds support for Amazon EC2, switches to Ant and Ivy as the build system, and brings some more minor improvements. Also, we improved our online documentation and added sample code that illustrates how to create a sharded Lucene index with Hadoop. See changes at http://oss.101tec.com/jira/browse/KATTA?report=com.atlassian.jira.plugin.system.project:changelog-panel A binary distribution is available at https://sourceforge.net/projects/katta/ Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
[ANN] katta-0.1.0 release - distribute lucene indexes in a grid
After 5 months of work we are happy to announce the first developer preview release of Katta. This release contains all the functionality needed to serve a large, sharded Lucene index on many servers. Katta stands on the shoulders of the giants Lucene, Hadoop and ZooKeeper. Main features: + Plays well with Hadoop + Apache Version 2 License + Node failure tolerance + Master failover + Shard replication + Pluggable network topologies (shard distribution and selection policies) + Node load balancing at the client Please give Katta a test drive and give us some feedback! Download: http://sourceforge.net/project/platformdownload.php?group_id=225750 Website: http://katta.sourceforge.net/ Getting started in less than 3 min: http://katta.wiki.sourceforge.net/Getting+started Installation on a grid: http://katta.wiki.sourceforge.net/Installation Katta presentation today (09/17/08) at the Hadoop user group, Yahoo Mission College: http://upcoming.yahoo.com/event/1075456/ * slides will be available online later Many thanks for the hard work: Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec) I apologize for the cross-posting. Yours, the Katta Team. ~~~ 101tec Inc., Menlo Park, California http://www.101tec.com
Re: Lucene-based Distributed Index Leveraging Hadoop
Hi, In terms of which project best fits my needs, my gut feeling is that dlucene is pretty close. It supports incremental updates, and doesn't build in dependencies on systems like HDFS or Terracotta (I don't yet understand all the implications of those systems, so would rather keep things simple if possible). Upgrades... The way we solve this with Katta is that we simply deploy a new small index and use * in the client instead of a fixed index name. Then once a night we merge all the small indexes (since they slow things down) together into a big new index. To solve the problem of duplicate documents, each document gets a timestamp, and in the client we do a simple dedup based on a key and always use the document with the latest timestamp. Dependencies... Katta is independent of those technologies; it is Lucene, ZooKeeper and Hadoop RPC (instead of RMI, HTTP or Apache Mina). We support loading index shards from a Hadoop file system, but you can also load them from a mounted remote HDD, NAS or whatever you like. The obvious drawback being that dlucene doesn't seem to be an active public project. Mark needs to answer this, but dlucene is checked in to the Katta svn and I saw Marko checking in changes to dlucene. There was a discussion between Mark and me about bringing dlucene and Katta together, and I really would love to see that happen, but unfortunately we had a lot of pressure from our customer to deliver something, so we had to focus on other things. More developers getting involved would clearly help here.. :-) Thanks for the reply Stefan. I'll certainly be taking a look through the code for Katta since no doubt there's a lot to learn in there. Katta will be deployed into a production system of our customer in less than 4 weeks - so we are working hard to iron out issues. However, Katta has been running for 6 weeks in a 10-node test environment under heavy load. Stefan
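The client-side dedup described above (documents share a key; keep only the one with the latest timestamp) can be sketched roughly like this. The class and field names are hypothetical illustrations, not Katta's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of client-side dedup across index shards:
// for hits sharing a key, keep only the one with the latest timestamp.
public class HitDeduper {

    public static class Hit {
        final String key;     // application-level unique document key
        final long timestamp; // write time attached to each document
        Hit(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    // Merge hits from all shards, keeping the newest hit per key.
    public static List<Hit> dedup(List<Hit> hits) {
        Map<String, Hit> newest = new HashMap<>();
        for (Hit hit : hits) {
            Hit seen = newest.get(hit.key);
            if (seen == null || hit.timestamp > seen.timestamp) {
                newest.put(hit.key, hit);
            }
        }
        return new ArrayList<>(newest.values());
    }
}
```

This runs after results from all shards are gathered, so a document updated in the nightly "small" index shadows its stale copy in the big merged index.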
Re: Lucene-based Distributed Index Leveraging Hadoop
Hi All, Hi Mark, It was interesting to hear Mark Butler present his implementation of Distributed Lucene at the Hadoop User Group meeting in London on Tuesday. There's obviously been quite a bit of discussion on the subject, and lots of interested parties. Mark, not sure if you're on this list, but thanks for sharing. Is there any material published about this? I would be very interested to see Mark's slides and hear about the discussion. Is this the forum to ask about open projects? I'm interested in joining a project as long as its goals aren't too distant from what I'm looking for. Based mostly on gut feeling I'd rather go for a stand-alone project that isn't dependent on HDFS/Hadoop, but I'm willing to be convinced otherwise. Rich, as you know there are a couple of projects in this area - Solr, Compass, dlucene and Katta - and since all are open source I guess the easiest way to get involved is to join the mailing lists. I can only speak for Katta - we are very interested in getting more people involved to get other perspectives. There is quite some activity in our project since it is part of an upcoming production system, but low traffic on the mailing list (so far all developers work in the same room). You can find our mailing list on our SourceForge page: http://katta.wiki.sourceforge.net/ Please keep in mind that Katta is very young, and Compass or Solr might be more interesting if you need something working now, though they might have different goals and focus than dlucene or Katta. Stefan Groschupf
Re: Lucene Performance and usage alternatives
An alternative is always to distribute the index to a set of servers. If you need to scale, I guess this is the only long-term perspective. You can do your own homegrown Lucene distribution or look into existing ones. I'm currently working on Katta (http://katta.wiki.sourceforge.net/) - there is no release yet, but we are in the QA and test cycles. But there are others as well - Solr, for example, provides distribution too. Stefan On Aug 5, 2008, at 7:21 AM, ezer wrote: I just made a program using the Java API of Lucene. It is working fine for my actual index size, but I am worried about performance with a bigger index and simultaneous user access. 1) I am worried about the fact of having to write the program in Java. I searched for alternatives like the C port, but I saw that the version is a little old and not many people seem to use it. 2) I am also thinking of compiling the code with gcj to generate native code and not use the JVM. Has anybody tried it? Could it be an advantage that would approximate the performance of a C program? 3) I won't use an application server; I will call the program directly from a PHP page. Is there any architecture model suggested for doing that? I mean for many users accessing the program concurrently. Wouldn't the fact of initiating one instance each time someone does a query, and opening the index, degrade performance? ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: Lucene-based Distributed Index Leveraging Hadoop
Should we start from scratch or with a code contribution? Does someone still want to contribute their implementation? I just noticed - too late though - that Ning already contributed the code to Hadoop. So I guess my question should be rephrased: what is the idea behind moving this into its own project?
Re: Lucene-based Distributed Index Leveraging Hadoop
Hi All, we are also very much interested in such a system and actually have to realize one for a project within the next 3 months. I would prefer to work on an open-source solution instead of doing another one behind closed doors, though we would need to start coding pretty soon. We have 3 full-time developers we could contribute for this time to such a project. I'm happy to do all the organizational work, like setting up the complete infrastructure, to get it started. I suggest we start with a SourceForge project since this is fast to set up, and if we qualify for Apache as a Lucene or Hadoop subproject, migrate there later - or is it easy to start an Apache Incubator project? We might just need a nice name for the project. Doug, any idea? :-) Should we start from scratch or with a code contribution? Does someone still want to contribute their implementation? Thanks. Stefan On Feb 6, 2008, at 10:57 AM, Ning Li wrote: There have been several proposals for a Lucene-based distributed index architecture. 1) Doug Cutting's "Index Server Project Proposal" at http://www.mail-archive.com/general@lucene.apache.org/msg00338.html 2) Solr's "Distributed Search" at http://wiki.apache.org/solr/DistributedSearch 3) Mark Butler's "Distributed Lucene" at http://wiki.apache.org/hadoop/DistributedLucene We have also been working on a Lucene-based distributed index architecture. Our design differs from the above proposals in the way it leverages Hadoop as much as possible. In particular, HDFS is used to reliably store Lucene instances, Map/Reduce is used to analyze documents and update Lucene instances in parallel, and Hadoop's IPC framework is used. Our design is geared for applications that require a highly scalable index and where batch updates to each Lucene instance are acceptable (versus finer-grained document-at-a-time updates). We have a working implementation of our design and are in the process of evaluating its performance. An overview of our design is provided below.
We welcome feedback and would like to know if you are interested in working on it. If so, we would be happy to make the code publicly available. At the same time, we would like to collaborate with people working on existing proposals and see if we can consolidate our efforts. TERMINOLOGY A distributed "index" is partitioned into "shards". Each shard corresponds to a Lucene instance and contains a disjoint subset of the documents in the index. Each shard is stored in HDFS and served by one or more "shard servers". Here we only talk about a single distributed index, but in practice multiple indexes can be supported. A "master" keeps track of the shard servers and the shards being served by them. An "application" updates and queries the global index through an "index client". An index client communicates with the shard servers to execute a query. KEY RPC METHODS This section lists the key RPC methods in our design. To simplify the discussion, some of their parameters have been omitted. On the Shard Servers // Execute a query on this shard server's Lucene instance. // This method is called by an index client. SearchResults search(Query query); On the Master // Tell the master to update the shards, i.e., Lucene instances. // This method is called by an index client. boolean updateShards(Configuration conf); // Ask the master where the shards are located. // This method is called by an index client. LocatedShards getShardLocations(); // Send a heartbeat to the master. This method is called by a // shard server. In the response, the master informs the // shard server when to switch to a newer version of the index. ShardServerCommand sendHeartbeat(); QUERYING THE INDEX To query the index, an application sends a search request to an index client. The index client then calls the shard server search() method for each shard of the index, merges the results and returns them to the application. 
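The scatter/gather step just described (the index client calls each shard's search() and merges the results) can be sketched as follows. This is an illustration, not the actual implementation; ScoredDoc and the score-ordered merge policy are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of merging per-shard results by descending score.
// Each shard returns its own top-n; the client keeps the global top-n.
public class ResultMerger {

    public static class ScoredDoc {
        final String docId;
        final float score;
        ScoredDoc(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Merge result lists from several shards into a single top-n list.
    public static List<ScoredDoc> merge(List<List<ScoredDoc>> shardResults, int n) {
        // Min-heap of size n keeps the n highest-scoring docs seen so far.
        PriorityQueue<ScoredDoc> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (List<ScoredDoc> shard : shardResults) {
            for (ScoredDoc doc : shard) {
                heap.offer(doc);
                if (heap.size() > n) {
                    heap.poll(); // drop the current lowest score
                }
            }
        }
        List<ScoredDoc> merged = new ArrayList<>(heap);
        merged.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return merged;
    }
}
```

Because shards are disjoint subsets of the documents, merging by score alone is enough here; no cross-shard dedup is needed.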
The index client caches the mapping between shards and shard servers by periodically calling the master's getShardLocations() method. UPDATING THE INDEX USING MAP/REDUCE To update the index, an application sends an update request to an index client. The index client then calls the master's updateShards() method, which schedules a Map/Reduce job to update the index. The Map/Reduce job updates the shards in parallel and copies the new index files of each shard (i.e., Lucene instance) to HDFS. The updateShards() method includes a "configuration", which provides information for updating the shards. More specifically, the configuration includes the following information: - Input path. This provides the location of updated documents, e.g., HDFS files or directories, or HBase tables. - Input formatter. This specifies how to format the input documents. - Analysis. This defines the analyzer to use on the input. The analyzer det
Re: [PROPOSAL] index server project
Hi, do people think we are already at a stage where we can set up some basic infrastructure like a mailing list and wiki and move the discussion to the new mailing list? Maybe set up an incubator project? I would be happy to help with such basic tasks. Stefan On Oct 31, 2006, at 22:03, Yonik Seeley wrote: On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote: Yonik Seeley wrote: > On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote: >> We assume that, within an index, a file with a given name is written >> only once. > Is this necessary, and will we need the lockless patch (that avoids > renaming or rewriting *any* files), or is Lucene's current index > behavior sufficient? It's not strictly required, but it would make index synchronization a lot simpler. Yes, I was assuming the lockless patch would be committed to Lucene before this project gets very far. Something more than that would be required in order to keep old versions, but this could be as simple as a Directory subclass that refuses to remove files for a time. Or a snapshot (hard links) mechanism. Lucene would also need a way to open a specific index version (rather than just the latest), but I guess that could also be hacked into Directory by hiding later "segments" files (assumes lockless is committed). > It's unfortunate the master needs to be involved on every document add. That should not normally be the case. Ahh... I had assumed that "id" in the following method was a document id: IndexLocation getUpdateableIndex(String id); I see now it's an index id. But what is an index id exactly? Looking at the example API you laid down, it must be a single physical index (as opposed to a logical index). In which case, is it entirely up to the client to manage multi-shard indices? For example, if we had a "photo" index broken up into 3 shards, each shard would have a separate index id and it would be up to the client to know this, and to query across the different "photo0", "photo1", "photo2" indices.
The master would have no clue those indices were related. Hmmm, that doesn't work very well for deletes though. It seems like there should be the concept of a logical index that is composed of multiple shards, where each shard has multiple copies. Or were you thinking that a cluster would only contain a single logical index, and hence all different index ids are simply different shards of that single logical index? That would seem to be consistent with ClientToMasterProtocol.getSearchableIndexes() lacking an id argument. I was not imagining a real-time system, where the next query after a document is added would always include that document. Is that a requirement? That's harder. Not real-time, but it would be nice if we kept it close to what Lucene can currently provide. Most people seem fine with a latency of minutes. At this point I'm mostly trying to see if this functionality would meet the needs of Solr, Nutch and others. It depends on the project scope and how extensible things are. It seems like the master would be a WAR, capable of running standalone. What about index servers (slaves)? Would this project include just the interfaces to be implemented by Solr/Nutch nodes, some common implementation code behind the interfaces in the form of a library, or also complete standalone WARs? I'd need to be able to extend the ClientToSlave protocol to add additional methods for Solr (for passing in extra parameters and returning various extra data such as facets, highlighting, etc). Must we include a notion of document identity and/or document version in the mechanism? Would that facilitate updates and coherency? It doesn't need to be in the interfaces, I don't think, so it depends on the scope of the index server implementations. -Yonik ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
Re: [PROPOSAL] index server project
Hi, The major goal is scale, right? A distributed server provides more oomph than a single-node server can. Another important goal from my point of view would be index management, like index updates during production. Stefan
Re: [Fwd: [PROPOSAL] index server project]
Hi Doug, we discussed the need for such a tool several times internally and developed some workarounds for Nutch, so I would definitely be interested in contributing to such a project. Having a separate project that depends on Hadoop would be the best case for our use cases. Best, Stefan On Oct 18, 2006, at 23:35, Doug Cutting wrote: FYI, I just pitched a new project you might be interested in on [EMAIL PROTECTED] Dunno if you subscribe to that list, so I'm spamming you. If it sounds interesting, please reply there. My management at Y! is interested in this, so I'm 'in'. Doug Original Message Subject: [PROPOSAL] index server project Date: Wed, 18 Oct 2006 14:17:30 -0700 From: Doug Cutting <[EMAIL PROTECTED]> Reply-To: general@lucene.apache.org To: general@lucene.apache.org It seems that Nutch and Solr would benefit from a shared index serving infrastructure. Other Lucene-based projects might also benefit from this. So perhaps we should start a new project to build such a thing. This could start either in java/contrib, or as a separate sub-project, depending on interest. Here are some quick ideas about how this might work. An RPC mechanism would be used to communicate between nodes (probably Hadoop's). The system would be configured with a single master node that keeps track of where indexes are located, and a number of slave nodes that would maintain, search and replicate indexes. Clients would talk to the master to find out which indexes to search or update, then talk directly to slaves to perform searches and updates. Following is an outline of how this might look. We assume that, within an index, a file with a given name is written only once. Index versions are sets of files, and a new version of an index is likely to share most files with the prior version. Versions are numbered. An index server should keep old versions of each index for a while, not immediately removing old files.
public class IndexVersion { String id; // unique name of the index int version; // the version of the index } public class IndexLocation { IndexVersion indexVersion; InetSocketAddress location; } public interface ClientToMasterProtocol { IndexLocation[] getSearchableIndexes(); IndexLocation getUpdateableIndex(String id); } public interface ClientToSlaveProtocol { // normal update void addDocument(String index, Document doc); int[] removeDocuments(String index, Term term); void commitVersion(String index); // batch update void addIndex(String index, IndexLocation indexToAdd); // search SearchResults search(IndexVersion i, Query query, Sort sort, int n); } public interface SlaveToMasterProtocol { // sends currently searchable indexes // receives updated indexes that we should replicate/update public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes); } public interface SlaveToSlaveProtocol { String[] getFileSet(IndexVersion indexVersion); byte[] getFileContent(IndexVersion indexVersion, String file); // based on experience in Hadoop, we probably wouldn't really use // RPC to send file content, but rather HTTP. } The master thus maintains the set of indexes that are available for search, keeps track of which slave should handle changes to an index, and initiates index synchronization between slaves. The master can be configured to replicate indexes a specified number of times. The client library can cache the current set of searchable indexes and periodically refresh it. Searches are broadcast to one index with each id and return merged results. The client will load-balance both searches and updates. Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately, indexes could be partitioned by a hash of each document's unique id, permitting deletions to be routed to the appropriate slave. Does this make sense? Does it sound like it would be useful to Solr? To Nutch? To others? Who would be interested and able to work on it?
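The hash-partitioning alternative for routing deletions mentioned at the end of the proposal could look like this. It is only a sketch; the modulo scheme and the class name are assumptions, not part of the proposal's API:

```java
// Sketch of routing a deletion to a single slave by hashing the
// document's unique id, instead of broadcasting it to all slaves.
public class DeleteRouter {

    // Map a document id to a shard index in [0, numShards).
    public static int shardFor(String docId, int numShards) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() returns a negative value.
        return Math.floorMod(docId.hashCode(), numShards);
    }
}
```

With this scheme a delete for a given id always lands on the same slave, at the cost that documents must also have been indexed on the shard their id hashes to.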
Doug ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com