DIH - LastModifiedDate - Format
Hi, I am using MySQL as the datastore and java.util.Date for the last_modified_date. I'm seeing that DIH doesn't seem to pick up records. Is there a date format that I should use so that DIH compares dates properly and picks up the records for indexing? Thanks, -Peri.S
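For reference, DIH records the last_index_time it uses for delta imports in dataimport.properties with the pattern yyyy-MM-dd HH:mm:ss, so a MySQL DATETIME/TIMESTAMP column stored in that same form compares cleanly against ${dataimporter.last_index_time}. A minimal sketch of that format (the table and column names in the comment are hypothetical, not from the original message):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DihDateFormat {
    // DIH writes last_index_time with this pattern; a MySQL DATETIME
    // column in the same form compares correctly in a deltaQuery such as:
    //   deltaQuery="SELECT id FROM docs
    //               WHERE last_modified_date > '${dataimporter.last_index_time}'"
    static final SimpleDateFormat DIH_FORMAT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static String format(Date d) {
        return DIH_FORMAT.format(d);
    }

    public static void main(String[] args) {
        // Render the epoch in the DIH format (local time zone).
        System.out.println(format(new Date(0L)));
    }
}
```

If the column is a java.util.Date serialized some other way (e.g. epoch millis in a BIGINT), the comparison in the deltaQuery will silently match nothing, which would explain records not being picked up.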
Re: update in SolrCloud through C++ client
If only availability is your concern, you can always keep a list of servers to which your C++ clients will send requests, and round-robin amongst them. If one of the servers goes down, you will either not be able to reach it or will get a 500+ error in the HTTP response; you can then take it out of circulation (and probably retry in the background, with some kind of ping every minute or so to these down servers, to ascertain whether they have come back, and then add them back to the list). This is what SolrJ does currently, and it doesn't technically need any ZooKeeper interaction.

The biggest benefit that SolrJ provides (since 4.6, I think) is that it uses ZK to find the shard leader to send an update to, which saves a hop. You can technically do the same by retrieving and listening to cluster-state updates with a C++ ZK client (one is available) and doing what SolrJ currently does. This would be good; the only drawback, apart from the effort, is that improvements are still happening in the area of managing clusters and how their state is saved in ZK. These changes might not break your code, but you also might not be able to take advantage of them without additional effort.

An alternative approach is to link SolrJ into your C++ client using JNI. This has the added benefit of using the javabin format for requests, which would have some performance benefits.

In short, it comes down to what your performance requirements are. If indexing speed and throughput are not that big a deal, just go with a list of servers and load-balance amongst the active ones. I would suggest you try this anyway before second-guessing that you need the optimization. If not, I would probably try the JNI route, and if that fails, use a C ZK client to read the cluster state and use that knowledge to decide where to send requests.

On 14 Feb 2014 10:58, "neerajp" wrote: > Hello All, > I am using Solr for indexing my data. My client is in C++, so I make curl > requests to the Solr server for indexing.
> Now, I want to do indexing in SolrCloud mode using ZooKeeper for HA. I > read the SolrCloud wiki (http://wiki.apache.org/solr/SolrCloud). > > What I understand from the wiki is that we should always check a Solr instance's > status (up & running) in SolrCloud before making an update request. Can I > not > send the update request to ZooKeeper and let ZooKeeper forward it to the > appropriate replica/leader? In the latter case I need not worry about which > servers are up and running before making an indexing request. > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/update-in-SolrCloud-through-C-client-tp4117340.html > Sent from the Solr - User mailing list archive at Nabble.com. >
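The round-robin-with-blacklist strategy described above can be sketched in a few lines. This is an illustrative data structure only, not SolrJ's actual API; the class and method names are made up for the example:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Round-robin over a fixed server list: take from the front, put back at
// the end; drop a server on failure; let a background ping restore it.
public class RoundRobinServers {
    private final Deque<String> live = new ArrayDeque<>();
    private final List<String> down = new ArrayList<>();

    public RoundRobinServers(List<String> urls) {
        live.addAll(urls);
    }

    // Next server to try for a request.
    public synchronized String next() {
        String url = live.pollFirst();
        if (url == null) throw new IllegalStateException("no live servers");
        live.addLast(url);
        return url;
    }

    // Call on a connect failure or a 5xx response.
    public synchronized void markDown(String url) {
        if (live.remove(url)) down.add(url);
    }

    // Call from a periodic ping thread when the server answers again.
    public synchronized void markUp(String url) {
        if (down.remove(url)) live.addLast(url);
    }

    public synchronized int liveCount() {
        return live.size();
    }
}
```

A C++ client would carry the same state; the only Solr-specific parts are deciding which HTTP status codes count as "down" and how often the ping thread re-checks the blacklist.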
Luke 4.6.1 released
Hello! Luke 4.6.1 has just been released. Grab it here: https://github.com/DmitryKey/luke/releases/tag/4.6.1 Fixes: loading the jar from the command line now works fine. -- Dmitry Kan Blog: http://dmitrykan.blogspot.com Twitter: twitter.com/dmitrykan
Re: Solr Hot Cpu and high load
Thanks Tri

*a. Are your docs distributed evenly across shards: number of docs and size of the shards*
>> Yes, the size of all the shards is equal (an ignorable delta on the order of KB), and so are the # of docs.

*b. Is your test client querying all nodes, or do all the queries go to those 2 busy nodes?*
>> All nodes are receiving exactly the same number of queries.

I have one more question. Do stored fields have a significant impact on the performance of Solr queries? Is having 50% of the fields stored (out of 100 fields) significantly worse than having 20% of the fields stored? (significantly == on the order of 100s of milliseconds, assuming all fields are of the same size and type)

How are stored fields retrieved in general (always from disk, or loaded into memory on the first query and read from memory from then on)?

Thanks
Nitin

On Fri, Feb 14, 2014 at 11:45 AM, Tri Cao wrote: > 1. Yes, that's the right way to go, well, in theory at least :) > 2. Yes, queries are always fanned out to all shards and will be as slow as the > slowest shard. When I looked into > Solr's distributed querying implementation a few months back, the support > for graceful degradation for things > like network failures and slow shards was not there yet. > 3. I doubt mmap settings would impact your read-only load, and it seems > you can easily > fit your index in RAM. You could try to warm the file cache to make sure, > with "cat $solr_dir > /dev/null". > > It's odd that only 2 nodes are at 100% in your setup. I would check a > couple of things: > a. Are your docs distributed evenly across shards: number of docs and size > of the shards > b. Is your test client querying all nodes, or do all the queries go to those > 2 busy nodes? > > Regards, > Tri > > On Feb 14, 2014, at 10:52 AM, Nitin Sharma > wrote: > > Hello folks > > We are currently using SolrCloud 4.3.1.
We have an 8-node SolrCloud cluster > with 32 cores, 60 GB of RAM, and SSDs. We are using ZK to manage the > solrconfig used by our collections. > > We have many collections, and some of them are very large relative > to the others. The shards of these big collections are > on the order of gigabytes. We decided to split the bigger collections evenly > across all nodes (8 shards and 2 replicas) with maxNumShards > 1. > > We ran a test with a read-only load on one big collection and we still see > only 2 nodes running at 100% CPU, while the rest blaze through the queries > much faster (under 30% CPU) [despite all collections being sharded across all > nodes]. > > I checked the JVM usage and found that none of the pools have high > utilization (except survivor space, which is at 100%). The GC cycles are on > the order of ms and mostly scavenges; mark-and-sweep occurs about once every > 30 minutes. > > A few questions: > > 1. Sharding all collections (small and large) evenly across all nodes > > distributes the load and makes the system characteristics of all machines > similar. Is this a recommended way to do it? > 2. SolrCloud does a distributed query by default. So if a node is at > > 100% CPU, does it slow down the response time for the other nodes waiting > for this query? (Or is there a timeout if it cannot get a response from > a node within x seconds?) > 3. Our collections use MmapDirectory, but I specifically haven't enabled > > anything related to mmaps (locked pages under ulimit). Does that adversely > affect performance? Or can it lock pages even without this? > > Thanks a lot in advance. > Nitin > > -- - N
Re: SolrCloud Zookeeper disconnection/reconnection
Thanks a lot for your answer. Is there a web page, on the wiki for instance, where we could find JVM settings or recommendations that we should use for Solr with certain index configurations? Ludovic. - Jouve France. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Zookeeper-disconnection-reconnection-tp4117101p4117653.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud Zookeeper disconnection/reconnection
Start with http://wiki.apache.org/solr/SolrPerformanceProblems It has a section on GC tuning and a link to some example settings. On 16 Feb 2014 21:19, "lboutros" wrote: > Thanks a lot for your answer. > > Is there a web page, on the wiki for instance, where we could find JVM > settings or recommendations that we should use for Solr with certain index > configurations? > > Ludovic. > > > > > > - > Jouve > France. > -- > View this message in context: > http://lucene.472066.n3.nabble.com/SolrCloud-Zookeeper-disconnection-reconnection-tp4117101p4117653.html > Sent from the Solr - User mailing list archive at Nabble.com. >
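For orientation, the example settings linked from that page are CMS-based; a typical Solr 4.x startup looked something like the following. The heap sizes here are illustrative only and must be tuned to your own index:

```
java -Xms4g -Xmx4g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar start.jar
```

Setting -Xms equal to -Xmx avoids heap-resize pauses, and the occupancy flags make CMS start collecting before the old generation fills, which helps with the long-pause ZK disconnections discussed in this thread.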
Re: Luke 4.6.1 released
Does it work with Solr? I couldn't tell from this repo what the description is, or its Solr relevance. I am sure all the long-timers know, but for more recent Solr people the additional information would be useful. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Mon, Feb 17, 2014 at 3:02 AM, Dmitry Kan wrote: > Hello! > > Luke 4.6.1 has just been released. Grab it here: > > https://github.com/DmitryKey/luke/releases/tag/4.6.1 > > Fixes: > loading the jar from the command line now works fine. > > -- > Dmitry Kan > Blog: http://dmitrykan.blogspot.com > Twitter: twitter.com/dmitrykan
Re: Luke 4.6.1 released
Yes, it works with Solr. Bill Bell Sent from mobile > On Feb 16, 2014, at 3:38 PM, Alexandre Rafalovitch wrote: > > Does it work with Solr? I couldn't tell from this repo what the description is, > or its Solr relevance. > > I am sure all the long-timers know, but for more recent Solr people, > the additional information would be useful. > > Regards, > Alex.
Re: Solr Hot Cpu and high load
Stored fields are what the Solr documentCache in solrconfig.xml is all about. My general feeling is that stored fields are mostly irrelevant to search speed, especially if lazy loading is enabled. The only time stored fields come into play is when assembling the final result list, i.e. the 10 or 20 documents that you return. That does imply disk I/O, and if you have massive fields there's also decompression to add to the CPU load.

So, as usual, "it depends". Try measuring with the returned fields restricted to just the fields you need for one set of tests, then try returning _everything_ for another.

Best,
Erick

On Sun, Feb 16, 2014 at 12:18 PM, Nitin Sharma wrote: > I have one more question. Do stored fields have a significant impact on > the performance of Solr queries? Is having 50% of the fields stored (out of 100 > fields) significantly worse than having 20% of the fields stored? > > How are stored fields retrieved in general (always from disk, or loaded into > memory on the first query and read from memory from then on)?
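The two solrconfig.xml settings Erick refers to live in the <query> section and look roughly like this (the sizes are illustrative defaults, not tuned values):

```xml
<query>
  <!-- Caches stored fields of recently returned documents. -->
  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>

  <!-- Only read stored fields actually requested via fl. -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

With lazy loading on, a query with fl=id,title only pays the decompression cost for those two fields, which is why restricting fl is the cheap first experiment.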
Solr index filename doesn't match Solr version
Hello, I recently upgraded from Solr 4.0 to Solr 4.6. I checked the Solr index folder and found these files:

_aars_*Lucene41*_0.doc
_aars_*Lucene41*_0.pos
_aars_*Lucene41*_0.tim
_aars_*Lucene41*_0.tip

I don't know why they don't have *Lucene46* in the file name. Is there something wrong? Thanks, Tien
query parameters
In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like to use fq to force the following conditions:

1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

Snippet of what I have (I haven't checked whether there is an "IN" operator, like in SQL, for the list values):

explicit
10
edismax
true
plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5
(organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')
(expiration:[NOW TO *] OR (*:* -expiration:*))^6
div(clicks,max(displays,1))^8
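One way to express conditions 1-4 as a single fq is to use a negated open-range query for "field is empty" and OR'd term lists for membership. This is a sketch only: it assumes organisations and roles are (multivalued) string fields, and that the client expands $org and $r into the term lists; ORG1, ORG2, R1, R2 below are placeholders, not values from the original message:

```
fq=((*:* -organisations:[* TO *]) AND (*:* -roles:[* TO *]))
 OR (organisations:(ORG1 OR ORG2) AND (*:* -roles:[* TO *]))
 OR ((*:* -organisations:[* TO *]) AND roles:(R1 OR R2))
 OR (organisations:(ORG1 OR ORG2) AND roles:(R1 OR R2))
```

The (*:* -field:[* TO *]) idiom matches documents where the field has no value; there is no SQL-style IN operator, but field:(a OR b OR c) behaves the same way for a term list.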
Re: Solr index filename doesn't match Solr version
On 2/16/2014 7:25 PM, Nguyen Manh Tien wrote:
> I upgraded recently from solr 4.0 to solr 4.6.
> I checked the Solr index folder and found these files:
>
> _aars_*Lucene41*_0.doc
> _aars_*Lucene41*_0.pos
> _aars_*Lucene41*_0.tim
> _aars_*Lucene41*_0.tip
>
> I don't know why they don't have *Lucene46* in the file name.

This is an indication that this part of the index is using a file format introduced in Lucene 4.1. Here's what I have for one of my index segments on a Solr 4.6.1 server:

_5s7_2h.del
_5s7.fdt
_5s7.fdx
_5s7.fnm
_5s7_Lucene41_0.doc
_5s7_Lucene41_0.pos
_5s7_Lucene41_0.tim
_5s7_Lucene41_0.tip
_5s7_Lucene45_0.dvd
_5s7_Lucene45_0.dvm
_5s7.nvd
_5s7.nvm
_5s7.si
_5s7.tvd
_5s7.tvx

It shows the same pieces as your list, but I am also using docValues in my index, and those files indicate that they are using the format from Lucene 4.5. I'm not sure why there are not version numbers in *all* of the file extensions -- that happens in the Lucene layer, which is a bit of a mystery to me.

Thanks,
Shawn
Increasing number of SolrIndexSearcher (Leakage)?
Hello, My Solr got an OOM recently after I upgraded from Solr 4.0 to 4.6.1. I checked a heap dump and found that it has many SolrIndexSearcher (SIS) objects (24); I expect only 1 SIS because we have 1 core.

I ran some experiments:
- Right after starting Solr, there is only 1 SolrIndexSearcher.
- *But after I index some docs and run a softCommit, or a hardCommit with openSearcher=false, the number of SolrIndexSearchers increases by 1.*
- On a hard commit with openSearcher=true, the number of SolrIndexSearchers (SIS) doesn't increase, but I found in the log that it opens a new searcher; I guess the old SIS is closed.

I don't know why the number of SIS instances increases like this and finally causes an OutOfMemory; can SolrIndexSearcher instances be leaked?

Regards, Tien
Re: Solr index filename doesn't match Solr version
Lucene's main file formats actually don't change a lot in 4.x (or even 5.x), and the newer codecs just delegate to previous versions for most file types. The newer file formats don't typically include Lucene's version in file names. For example, the Lucene 4.6 codec basically delegates the stored fields and term vector file formats to 4.1, the doc format to 4.0, etc., and only implements the new segment info/field infos formats (the .si and .fnm files).

https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/lucene/core/src/java/org/apache/lucene/codecs/lucene46/Lucene46Codec.java#L50

Hope this helps,
Tri

On Feb 16, 2014, at 08:52 PM, Shawn Heisey wrote: On 2/16/2014 7:25 PM, Nguyen Manh Tien wrote: I upgraded recently from solr 4.0 to solr 4.6. I checked the Solr index folder and found these files: _aars_*Lucene41*_0.doc _aars_*Lucene41*_0.pos _aars_*Lucene41*_0.tim _aars_*Lucene41*_0.tip I don't know why they don't have *Lucene46* in the file name. This is an indication that this part of the index is using a file format introduced in Lucene 4.1.
Re: Increasing number of SolrIndexSearcher (Leakage)?
On 2/16/2014 11:34 PM, Nguyen Manh Tien wrote:
> My Solr got an OOM recently after I upgraded from Solr 4.0 to 4.6.1.
> I checked a heap dump and found that it has many SolrIndexSearcher (SIS)
> objects (24); I expect only 1 SIS because we have 1 core.
>
> I ran some experiments:
> - Right after starting Solr, there is only 1 SolrIndexSearcher.
> - *But after I index some docs and run a softCommit, or a hardCommit with
> openSearcher=false, the number of SolrIndexSearchers increases by 1.*
> - On a hard commit with openSearcher=true, the number of SolrIndexSearchers
> (SIS) doesn't increase, but I found in the log that it opens a new searcher;
> I guess the old SIS is closed.
>
> I don't know why the number of SIS instances increases like this and finally
> causes an OutOfMemory; can SolrIndexSearcher instances be leaked?

It's always possible that you've hit a bug that results in a memory leak, but it is not likely. I'm running version 4.6.1 in production without any problems, and a lot of other people are as well. I suspect that there's a misconfiguration, a buggy JVM, or something else out of the ordinary. We'll need answers to a bunch of questions:

What filesystem and operating system are you running on?
What vendor and version is your JVM?
Can you use a file-sharing site or a paste website to share your full solrconfig.xml file?
What servlet container are you using to run Solr?

Depending on what we learn from these answers, more questions might be coming. Are there any messages at WARN or ERROR in your Solr logfile? Note that I am not referring to the logging tab in the admin UI here - you'll need to look at the actual logfile.

Thanks,
Shawn