Re: Replication for SolrCloud
Thanks for the suggestion, Erick. However, what we need here is not a patch but a clarification from a practical perspective. I think Solr replication is a great feature for scaling reads, and it increases reliability to some degree. On HDFS, however, it is not as useful as plain sharding: sharding scales both reads and writes at the same time, and it avoids the consistency concerns that come with replication. So I doubt that Solr replication on HDFS is really meaningful. I will try to reach out to Mark Miller and would appreciate it if he or anyone else can provide more convincing points on this.

Thanks,
Mao

On Sat, Apr 18, 2015 at 4:44 PM Erick Erickson erickerick...@gmail.com wrote:

AFAIK, the HDFS replication of Solr indexes isn't something that was designed; it just came along for the ride given HDFS replication. Having a shard with one leader and two followers keep nine copies of the index around _is_ overkill; nobody argues that at all. I know the folks at Cloudera (who contributed the original HDFS implementation) have discussed various options around this. In the grand scheme of things there have been other priorities, and since disk space is relatively cheap, nobody has torn into the guts of Solr and/or HDFS over it. That said, I'm also sure that this will get some attention as priorities change. All patches welcome, of course ;). But if you're inclined to work on this issue, I'd _really_ discuss it with Mark Miller etc. before investing too much effort in it. I don't know the tradeoffs well enough to have an opinion on the right implementation.

Best,
Erick

On Sat, Apr 18, 2015 at 1:59 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Some comments inline:

On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:

On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) juergen.wag...@devoteam.com wrote:

Replication on the storage layer will provide reliable storage for the index and other data of Solr. In particular, this replication does not guarantee that your index files are consistent at any given time, as there may be intermediate states that are only partially replicated. Replication is only a convergent process, not an instant, atomic operation. With frequent changes, this becomes an issue.

Firstly, thanks for your reply. However, I can't agree with you on this. HDFS guarantees consistency even with replicas - you always read what you write, and no partially replicated state will ever be read; this is guaranteed by the HDFS server and client. Hence HBase can rely on HDFS for consistency and availability without implementing another replication mechanism - if I understand correctly.

A Lucene index is not one file but a collection of files, which are written independently. So if you replicate them out of order, Lucene might consider the index corrupted (because of missing files).

I don't think HBase works that way.

Replication inside SolrCloud as an application will not only maintain the consistency of the search-level interfaces to your indexes, but will also scale in terms of the application (query throughput).

Splitting one shard into two can increase query throughput too.

Imagine a database: if you change one record, this may also result in an index change. If the record and the index are stored in different storage blocks, one will get replicated first. However, the replication target will only be consistent again once both have been replicated. So you would have to suspend all access until the entire replication has completed. That's undesirable. If you replicate on the application (database management system) level, the application can employ a more fine-grained approach to replication, guaranteeing application-level consistency.

In HBase, a region is located on a single region server at any given time, which guarantees its consistency. Because reads and writes always land on exactly one region, there is no concern about parallel writes happening on multiple replicas of the same region. The replication of HDFS is totally transparent to HBase: when an HDFS write call returns, HBase knows the data is written and replicated, so losing one copy of the data won't impact HBase at all. So HDFS means consistency and reliability for HBase. However, HBase doesn't use replicas (either its own or HDFS's) to scale reads. If one region is too hot for reads or writes, you split that region into two regions, so that its reads and writes can be distributed across two region servers. Hence HBase scales. I think this is the simplicity and beauty of HBase. Again, I am curious whether SolrCloud has a better reason to use replication on HDFS. As I described, HDFS provides consistency and reliability, while scalability can be achieved via sharding, even without Solr replication.

That's something that has been considered and may even be on the roadmap for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237
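For anyone following along, the "nine copies" arithmetic comes from multiplying the two layers: three SolrCloud replicas, each stored on HDFS with the default block replication of 3. A minimal sketch of how one might trim the HDFS side when running Solr on HDFS - the directory factory properties below are the documented ones, but the hostname and paths are placeholders, and the settings should be checked against your Solr version:

    <!-- solrconfig.xml: keep the index on HDFS instead of local disk -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <!-- point Solr at a Hadoop conf dir whose hdfs-site.xml lowers replication -->
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </directoryFactory>

    <!-- hdfs-site.xml: lower block replication for the files Solr writes -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>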
Re: JSON Facet Analytics API in Solr 5.1
Oh... and btw, I think the readability of the JSON will matter less and less going forward. Queries will grow in size anyway (due to nested facets), and the ability to quickly validate a query using some parser will be more useful and practical than relying on the human eye to do the check instead. I assume that both ES and Solr will end up having some higher-level language for people to express queries and facets/aggregations in readable form (anyone remember SQL?), and this will be transformed to JSON (or another native format) down the road. In my opinion, the most important thing for any non-trivial JSON-based query format now is to make sure it is parser-friendly and that grammars can easily be defined for it.

On Sun, Apr 19, 2015 at 8:09 AM, Lukáš Vlček lukas.vl...@gmail.com wrote:

Late here, but let me add one more thing: IIRC the recommendation for JSON is to never use data as a key in objects. One of the benefits of not using data as keys in JSON is easier validation using JSON Schema. If one wants to validate a JSON query for Elasticsearch today, it is necessary to implement a custom parser (and a grammar first, of course).

Lukas

On Sat, Apr 18, 2015 at 11:46 PM, Yonik Seeley ysee...@gmail.com wrote:

Another minor benefit of the flatter structure is that the smart merging of multiple JSON parameters works a little better in conjunction with facets. For example, if you already had a top_genre facet, you could insert a top_author facet more easily:

json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}

(For anyone who doesn't know what smart merging is, see http://yonik.com/solr-json-request-api/ )

-Yonik

On Sat, Apr 18, 2015 at 11:36 AM, Yonik Seeley ysee...@gmail.com wrote:

Thank you everyone for the feedback! I've implemented and committed the flatter structure: https://issues.apache.org/jira/browse/SOLR-7422 So either form can now be used (and I'll be switching to the flatter form in examples when it actually reduces the number of levels). For those who want to try it out, I've just made a 5.2-dev snapshot: https://github.com/yonik/lucene-solr/releases

-Yonik
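To make the two forms concrete: the nested request below and the flat parameter form that follows describe the same facet tree. This is a sketch against a hypothetical books collection with made-up field names; the JSON Facet API accepts the relaxed JSON shown (unquoted keys):

    curl http://localhost:8983/solr/books/query -d 'q=*:*&json.facet=
    {
      top_genre: {
        type: terms,
        field: genre,
        limit: 3,
        facet: {
          top_author: { type: terms, field: author, limit: 5 }
        }
      }
    }'

With smart merging, the same sub-facet can instead be supplied as a separate parameter and merged into an already-defined top_genre facet without restating it:

    json.facet.top_genre={type:terms, field:genre, limit:3}
    json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}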
Re: Replication for SolrCloud
In simple terms: HDFS is good for file-oriented replication; Solr is good for index replication. Consequently, if an application's update operations (like Solr's) are not atomic at the file level, HDFS replication is not adequate - as is the case for Solr with live index updates. Running Solr on HDFS (as a file system) will pose limitations due to HDFS's properties; indexing, however, still won't use Hadoop. If you produce indexes and distribute them as finalized, read-only structures (e.g., through Hadoop jobs), HDFS is fine, and Solr does not need to be much aware of it. The third approach in the picture is record-based replication, handled by HBase, Cassandra or ZooKeeper, depending on requirements.

Cheers,
Jürgen
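For the "build offline, deploy read-only" pattern Jürgen describes, Cloudera Search ships a MapReduce-based indexer that builds Lucene indexes inside Hadoop and can merge them into a live SolrCloud collection. A sketch of an invocation - the jar name, paths, and ZooKeeper address here are placeholders, and the exact flags should be verified against the tool's --help output:

    hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
      --zk-host zk1:2181/solr \
      --collection collection1 \
      --morphline-file morphline.conf \
      --output-dir hdfs://namenode:8020/tmp/outdir \
      --go-live \
      hdfs://namenode:8020/input/docs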
Re: Replication for SolrCloud
Please see my responses inline:

On Fri, Apr 17, 2015 at 10:59 PM Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Some comments inline:

On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:

On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) juergen.wag...@devoteam.com wrote:

Replication on the storage layer will provide reliable storage for the index and other data of Solr. In particular, this replication does not guarantee that your index files are consistent at any given time, as there may be intermediate states that are only partially replicated. Replication is only a convergent process, not an instant, atomic operation. With frequent changes, this becomes an issue.

Firstly, thanks for your reply. However, I can't agree with you on this. HDFS guarantees consistency even with replicas - you always read what you write, and no partially replicated state will ever be read; this is guaranteed by the HDFS server and client. Hence HBase can rely on HDFS for consistency and availability without implementing another replication mechanism - if I understand correctly.

A Lucene index is not one file but a collection of files, which are written independently. So if you replicate them out of order, Lucene might consider the index corrupted (because of missing files).

I don't think HBase works that way. Again, HDFS replication is transparent to HBase. You can set the HDFS replication factor to 1 and HBase will still work, though it will lose the tolerance to disk failures that HDFS replicas provide. HBase also doesn't directly utilize HDFS replicas: increasing the HDFS replication factor won't improve HBase's scalability. To achieve better read/write throughput, splitting shards is the only approach.

Replication inside SolrCloud as an application will not only maintain the consistency of the search-level interfaces to your indexes, but will also scale in terms of the application (query throughput).

Splitting one shard into two can increase query throughput too.

Imagine a database: if you change one record, this may also result in an index change. If the record and the index are stored in different storage blocks, one will get replicated first. However, the replication target will only be consistent again once both have been replicated. So you would have to suspend all access until the entire replication has completed. That's undesirable. If you replicate on the application (database management system) level, the application can employ a more fine-grained approach to replication, guaranteeing application-level consistency.

In HBase, a region is located on a single region server at any given time, which guarantees its consistency. Because reads and writes always land on exactly one region, there is no concern about parallel writes happening on multiple replicas of the same region. The replication of HDFS is totally transparent to HBase: when an HDFS write call returns, HBase knows the data is written and replicated, so losing one copy of the data won't impact HBase at all. So HDFS means consistency and reliability for HBase. However, HBase doesn't use replicas (either its own or HDFS's) to scale reads. If one region is too hot for reads or writes, you split that region into two regions, so that its reads and writes can be distributed across two region servers. Hence HBase scales. I think this is the simplicity and beauty of HBase. Again, I am curious whether SolrCloud has a better reason to use replication on HDFS. As I described, HDFS provides consistency and reliability, while scalability can be achieved via sharding, even without Solr replication.

That's something that has been considered and may even be on the roadmap for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

But one problem that isn't solved by HDFS replication is near-real-time indexing, where you want documents to become searchable as fast as possible. SolrCloud replication supports that by replicating documents as they come in and indexing them on several replicas: a new index searcher is opened over the flushed index files as well as over the internal data structures of the index writer. If we switched to relying on HDFS replication, this would be awfully expensive. However, as Jürgen mentioned, HDFS can certainly help with replicating static indexes.

My understanding is that near-real-time indexing does not need to rely on replication. https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching describes soft commits but doesn't mention replication. Cloudera Search, which is Solr on HDFS, also claims near-real-time indexing without mentioning replication (see http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html).
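For reference, the soft-commit behavior that page describes is configured in solrconfig.xml; a minimal sketch, with intervals that are purely illustrative:

    <!-- solrconfig.xml: near-real-time visibility via soft commits -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- hard commit: makes data durable, does not open a new searcher -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- soft commit: makes new documents searchable, roughly every second -->
      <autoSoftCommit>
        <maxTime>1000</maxTime>
      </autoSoftCommit>
    </updateHandler>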
Re: Addition to solr wiki editor list
Done, and thanks! The Reference Guide (https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide) is another place to look; it's getting considerable attention at this point. It is curated a bit more tightly than the Wiki, however, so if you see anything wrong there, please leave a comment on the page where you spot the problem. If you haven't downloaded the PDF of the ref guide, I recommend it highly!

On Sun, Apr 19, 2015 at 3:06 PM, Mirko Cegledi mrkc...@gmail.com wrote:

Hi there! I'd like to be added to the list of people who are able to edit the Solr wiki at https://wiki.apache.org/solr. I'm working as a Java developer for a German company that uses Solr a lot (and I like it a lot), and I would like to be able to correct things as soon as I find them without going through the IRC channel to get them changed. My wiki name should be campfire. Thanks in advance!
Re: help with schema containing nested documents
There are no nested schemas as such - there is only a single superset schema that includes all the fields for both parents and children. Obviously, the fields that are not common should be optional. The rest depends on what kind of parent/child relation you are trying to set up: whether it is explicit, using block indexing, or looser, using some other kind of cross-referencing.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 18 April 2015 at 06:47, Nicolae Pandrea npand...@expedia.com wrote:

Hi, I need some documentation/samples on how to create a SOLR schema with nested documents. I have been looking online but could not find anything. Thank you in advance, Nick Pandrea
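For the block-indexing route, a minimal sketch of what indexing and querying parent/child documents can look like - the collection and field names are made up, and flagging parents with a content_type field is a common convention rather than something Solr requires:

    curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
      -H 'Content-Type: application/json' -d '
    [
      {
        "id": "book1",
        "content_type": "parent",
        "title": "Some Book",
        "_childDocuments_": [
          { "id": "book1-review1", "content_type": "child", "comment": "great read" }
        ]
      }
    ]'

A block join query can then return the parents of matching children:

    q={!parent which="content_type:parent"}comment:great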
Re: MoreLikeThis (mlt) in sharded SolrCloud
Thanks, Anshum. Looks like there's no way for this to work in 5.1 for us, so I'll just have to wait for the fixes. It's a relief to know it wasn't just me, though.

--Ere

On 18.4.2015 at 2.45, Anshum Gupta wrote:

The other issue that would fix half of your problems is: https://issues.apache.org/jira/browse/SOLR-7143

On Fri, Apr 17, 2015 at 4:35 PM, Anshum Gupta ans...@anshumgupta.net wrote:

Ah, I meant SOLR-7418 (https://issues.apache.org/jira/browse/SOLR-7418).

On Fri, Apr 17, 2015 at 4:30 PM, Anshum Gupta ans...@anshumgupta.net wrote:

Hi Ere, those seem like valid issues. I've created an issue, SOLR-7275 (https://issues.apache.org/jira/browse/SOLR-7275), and will create more as I find more of them. I plan to get to them and fix them over the weekend.

On Wed, Apr 15, 2015 at 5:13 AM, Ere Maijala ere.maij...@helsinki.fi wrote:

Hi, I'm trying to gather information on how mlt works, or is supposed to work, with SolrCloud and a sharded collection. I've read issues SOLR-6248, SOLR-5480 and SOLR-4414, and the docs at https://wiki.apache.org/solr/MoreLikeThis, but I'm still struggling with multiple issues. I've been testing with Solr 5.1 and the Getting Started sample cloud. With a freshly extracted Solr, these are the steps I've taken:

    bin/solr start -e cloud -noprompt
    bin/post -c gettingstarted docs/
    bin/post -c gettingstarted example/exampledocs/books.json

After this I've tried different variations of queries with limited success:

http://localhost:8983/solr/gettingstarted/select?q={!mlt}non-existing
causes java.lang.NullPointerException at org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:80)

http://localhost:8983/solr/gettingstarted/select?q={!mlt}978-0641723445
causes java.lang.NullPointerException at org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:84)

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=title}978-0641723445
causes java.lang.NullPointerException at org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=cat}978-0641723445
actually gives results

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=author,cat}978-0641723445
again causes java.lang.NullPointerException at org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)

I guess the actual question is: how am I supposed to use the handler to replicate the behavior of the non-distributed mlt that was formerly used with qt=morelikethis and the following configuration in solrconfig.xml?

    <requestHandler name="morelikethis" class="solr.MoreLikeThisHandler">
      <lst name="defaults">
        <str name="mlt.fl">title,title_short,callnumber-label,topic,language,author,publishDate</str>
        <str name="mlt.qf">
          title^75
          title_short^100
          callnumber-label^400
          topic^300
          language^30
          author^75
          publishDate
        </str>
        <int name="mlt.mintf">1</int>
        <int name="mlt.mindf">1</int>
        <str name="mlt.boost">true</str>
        <int name="mlt.count">5</int>
        <int name="rows">5</int>
      </lst>
    </requestHandler>

The real-life full schema and config can be found at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
Re: Solr Cloud reclaiming disk space from deleted documents
I assume you don't have much free space available on your disk. Note that during optimization (a merge into a single segment), a shard replica's space usage may peak at 2x-3x its normal size until the optimization completes. Is that a problem? Not if optimization occurs over the shards serially and your index is broken into many small shards.

On Apr 18, 2015 1:54 AM, Rishi Easwaran rishi.easwa...@aol.com wrote:

Thanks Shawn for the quick reply. Our indexes are running on SSD, so 3 should be OK. Any recommendation on bumping it up? I guess we will have to run optimize for the entire Solr cloud and see if we can reclaim space.

Thanks,
Rishi.

-----Original Message-----
From: Shawn Heisey apa...@elyograg.org
To: solr-user solr-user@lucene.apache.org
Sent: Fri, Apr 17, 2015 6:22 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents

On 4/17/2015 2:15 PM, Rishi Easwaran wrote:

Running into an issue and wanted to see if anyone had some suggestions. We are seeing this with both Solr 4.6 and 4.10.3 code. We are running an extremely update-heavy application, with millions of writes and deletes happening to our indexes constantly. The issue we are seeing is that Solr cloud is not reclaiming the disk space held by deleted documents so it can be used for new inserts. We used to run optimize periodically with our old multicore setup; not sure if that works for Solr cloud.

Num Docs: 28762340
Max Doc: 48079586
Deleted Docs: 19317246
Version: 1429299216227
Gen: 16525463
Size: 109.92 GB

In our solrconfig.xml we use the following configs:

    <indexConfig>
      <!-- Values here affect all index writers and act as a default unless overridden. -->
      <useCompoundFile>false</useCompoundFile>
      <maxBufferedDocs>1000</maxBufferedDocs>
      <maxMergeDocs>2147483647</maxMergeDocs>
      <maxFieldLength>1</maxFieldLength>
      <mergeFactor>10</mergeFactor>
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxThreadCount">3</int>
        <int name="maxMergeCount">15</int>
      </mergeScheduler>
      <ramBufferSizeMB>64</ramBufferSizeMB>
    </indexConfig>

This part of my response won't help with the issue you wrote about, but it can affect performance, so I'm going to mention it. If your indexes are stored on regular spinning disks, reduce mergeScheduler/maxThreadCount to 1. If they are stored on SSDs, then a value of 3 is OK. Spinning disks cannot do seeks (read/write head moves) fast enough to handle multiple merging threads properly; all the seek activity required will really slow down merging, which is a very bad thing when your indexing load is high. SSDs do not have to seek, so multiple threads are fine there.

An optimize is the only way to reclaim all of the disk space held by deleted documents. Over time, as segments are merged automatically, deleted-document space will be recovered automatically as well, but it won't be perfect, especially as segments are merged multiple times into very large segments. If you send an optimize command to a core/collection in SolrCloud, the entire collection will be optimized: the cloud will do one shard replica (core) at a time until the entire collection has been optimized. There is currently no way to ask it to optimize only a single core, or to do multiple cores simultaneously, even if they are on different servers.

Thanks,
Shawn
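For anyone wanting to try what Shawn describes, an optimize can be issued over plain HTTP; the collection name below is a placeholder. The expungeDeletes commit option is a lighter-weight alternative that typically rewrites only segments with a high proportion of deleted documents:

    # full optimize: merges segments down and drops all deleted documents
    curl 'http://localhost:8983/solr/yourcollection/update?optimize=true&maxSegments=1'

    # lighter option: merge away only segments dominated by deletes
    curl 'http://localhost:8983/solr/yourcollection/update?commit=true&expungeDeletes=true'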