Re: Relevancy : Keyword stuffing
Thank you, Markus and Chris, for the pointers. For SweetSpotSimilarity I am thinking that perhaps a set of closed ranges exposed via the similarity config is easier to maintain as data changes than making adjustments to fit a function. Another piece of info that would've been handy is the average position plus the position info for the first few occurrences of each term. This would perhaps allow higher boosting for term occurrences earlier in the doc. In my case the extra keywords are towards the end of the doc, but that info does not seem to be propagated into the scorer. Thanks again, Mihran On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote: You should start by checking out the SweetSpotSimilarity .. it was heavily designed around the idea of dealing with things like excessively verbose titles, and keyword stuffing in summary text ... so you can configure your expectation for what a normal length doc is, and they will be penalized for being longer than that. Similarly you can say what a 'reasonable' tf is, and docs that exceed that wouldn't get added boost (which in conjunction with the lengthNorm penalty penalizes docs that stuff keywords) https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg -Hoss http://www.lucidworks.com/
Re: Nginx proxy for Solritas
Have a look at the requests being made to Solr while using /browse (without nginx) and that will show you what resources need to be accessible. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com http://www.lucidworks.com/ On Mar 16, 2015, at 4:42 PM, LongY zhangyulin8...@hotmail.com wrote: Thank you for the reply. I also thought the relevant resources (CSS, images, JavaScript) need to be accessible through Nginx. I copied the velocity folder to the solr-webapp/webapp folder. It didn't work. So how can I make the /browse resources accessible via the Nginx rule? -- View this message in context: http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193352.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Nginx proxy for Solritas
Thanks to Erik and Shawn, I figured out the solution.

* place the main.css from the velocity folder into /usr/share/nginx/html/solr/collection1/admin/file/
* don't forget to change the permission of main.css with sudo chmod 755 main.css
* add main.css to the Nginx configuration file:

server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;
    index main.css;
    server_name localhost;
    location ~* /solr/\w+/browse {
        proxy_pass http://localhost:8983;
        allow 127.0.0.1;
        deny all;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
    }
}

That will work. Also /var/log/nginx/error.log is good for debugging. -- View this message in context: http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193415.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexing db records via SolrJ
Hi, I had checked this post. I don't know whether this is possible, but my question is whether I can use the DIH configuration for indexing via SolrJ. Best Regards, Sreedevi S On Mon, Mar 16, 2015 at 4:17 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello, Did you see the great post http://lucidworks.com/blog/indexing-with-solrj/ ? On Mon, Mar 16, 2015 at 1:30 PM, sreedevi s sreedevi.payik...@gmail.com wrote: Hi, I am a beginner in Solr. I have a scenario where I need to index data from my MySQL db and query it. I have figured out how to provide my db data import configs using DIH. I also know how to query my index via SolrJ. How can I do indexing via the SolrJ client for my db as well, other than reading the db records into documents one by one? In other words, is there any way I can make use of my configuration files and achieve the same? We need to use Java APIs, so all indexing and querying can be done only via SolrJ. Best Regards, Sreedevi S -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: indexing db records via SolrJ
Hello, Did you see the great post http://lucidworks.com/blog/indexing-with-solrj/ ? On Mon, Mar 16, 2015 at 1:30 PM, sreedevi s sreedevi.payik...@gmail.com wrote: Hi, I am a beginner in Solr. I have a scenario where I need to index data from my MySQL db and query it. I have figured out how to provide my db data import configs using DIH. I also know how to query my index via SolrJ. How can I do indexing via the SolrJ client for my db as well, other than reading the db records into documents one by one? In other words, is there any way I can make use of my configuration files and achieve the same? We need to use Java APIs, so all indexing and querying can be done only via SolrJ. Best Regards, Sreedevi S -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr returns incorrect results after sorting
I noticed you have an '&' immediately preceding the geodist() asc at the very end of the query/URL; that's supposed to be a comma, since group.sort is a comma-delimited list of sorts. ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley On Mon, Mar 16, 2015 at 7:51 AM, kumarraj rajitpro2...@gmail.com wrote: Hi, I am using group.sort to internally sort the values first based on store (using a function), then stock, and finally distance, and to sort the output results based on price, but Solr does not return the correct results after sorting. Below is the sample query: q=*:*&start=0&rows=200&sort=pricecommon_double desc&d=321&spatial=true&sfield=store_location&fl=geodist(),*&pt=37.1037311,-76.5104751&group.ngroups=true&group.limit=1&group.facet=true&group.field=code_string&group=true&group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0)) desc,inStock_boolean desc&geodist() asc I am expecting all the docs to be sorted by price from high to low after grouping, but I see the records not matching that order. Do you see any issues with the query, or are functions in group.sort not supported in Solr? Regards, Raj -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexing db records via SolrJ
On 3/16/2015 7:15 AM, sreedevi s wrote: I had checked this post. I don't know whether this is possible, but my question is whether I can use the DIH configuration for indexing via SolrJ You can use SolrJ for accessing DIH. I have code that does this, but only for full index rebuilds. It won't be particularly obvious how to do it. Writing code that can interpret DIH status and know when it finishes, succeeds, or fails is very tricky because DIH only uses human-readable status info, not machine-readable, and the info is not very consistent. I can't just share my code, because it's extremely convoluted ... but the general gist is to create a SolrQuery object, use setRequestHandler to set the handler to /dataimport or whatever your DIH handler is, and set the other parameters on the request like command to full-import and so on. Thanks, Shawn
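A minimal SolrJ sketch of the approach Shawn describes, not taken from the thread: the core URL and the /dataimport handler name are assumptions, and interpreting the status response is left to the caller, as he notes.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.util.NamedList;

public class DihTrigger {
    public static void main(String[] args) throws Exception {
        // URL of the core whose solrconfig.xml registers the DIH handler (placeholder)
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Build a request against the DIH handler instead of /select
        SolrQuery req = new SolrQuery();
        req.setRequestHandler("/dataimport");
        req.set("command", "full-import");
        req.set("clean", "true");
        req.set("commit", "true");
        server.query(req);

        // Poll the same handler for status; the response is human-readable, so
        // deciding when the import is busy, finished, or failed is up to the caller
        SolrQuery status = new SolrQuery();
        status.setRequestHandler("/dataimport");
        status.set("command", "status");
        NamedList<Object> answer = server.query(status).getResponse();
        System.out.println(answer.get("status"));
        server.shutdown();
    }
}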
Re: [Poll]: User need for Solr security
Hi John, The ManifoldCF in Action book is publicly available to anyone: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/ For Solr integration please see: https://svn.apache.org/repos/asf/manifoldcf/integration/solr-5.x/trunk/README.txt Ahmet On Friday, March 13, 2015 2:50 AM, johnmu...@aol.com johnmu...@aol.com wrote: I would love to see record-level (or even field-level) restricted access in Solr / Lucene. This should be group-level, LDAP-like, or some rule-based scheme (which can be dynamic). If the solution means having a second core, so be it. The following is the closest I found: https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I cannot use Manifold CF (Connector Framework). Does anyone know how Manifold does it? - MJ -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, March 12, 2015 6:51 PM To: solr-user@lucene.apache.org Subject: RE: [Poll]: User need for Solr security Jan - we don't really need any security for our products, nor for most clients. However, one client does deal with very sensitive data, so we proposed to encrypt the transfer of data and the data on disk through a Lucene Directory. It won't fill all gaps, but it would adhere to such a client's guidelines. I think many approaches to security in Solr/Lucene would find advocates, be it index encryption or authentication/authorization or transport security, which is now possible. I understand the reluctance of the PMC, and I agree with it, but some users would definitely benefit and it would certainly make Solr/Lucene the search platform to use for some enterprises. Markus -Original message- From:Henrique O. Santos hensan...@gmail.com Sent: Thursday 12th March 2015 23:43 To: solr-user@lucene.apache.org Subject: Re: [Poll]: User need for Solr security Hi, I’m currently working with indexes that need document-level security. Based on the user logged in, query results would omit documents that this user doesn’t have access to, with LDAP integration and such. I think that would be nice to have in a future Solr release. Henrique. On Mar 12, 2015, at 7:32 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Securing various Solr APIs has once again surfaced as a discussion in the developer list. See e.g. SOLR-7236. It would be useful to get some feedback from Solr users about needs in the field. Please reply to this email and let us know what security aspect(s) would be most important for your company to see supported in a future version of Solr. Examples: local user management, AD/LDAP integration, SSL, authenticated login to Admin UI, authorization for Admin APIs, e.g. admin user vs read-only user, etc. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
indexing db records via SolrJ
Hi, I am a beginner in Solr. I have a scenario where I need to index data from my MySQL db and query it. I have figured out how to provide my db data import configs using DIH. I also know how to query my index via SolrJ. How can I do indexing via the SolrJ client for my db as well, other than reading the db records into documents one by one? In other words, is there any way I can make use of my configuration files and achieve the same? We need to use Java APIs, so all indexing and querying can be done only via SolrJ. Best Regards, Sreedevi S
[ANNOUNCE] Luke 4.10.4 released
Hello, Luke 4.10.4 has been released. Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.4 The release has been tested against a solr-4.10.4 based index. Changes: trivial pom upgrade to lucene 4.10.4; got rid of the index version warning on the index summary tab; Luke is now distributed as a tar.gz with the luke binary and a launcher script. A version of Luke built atop Apache Pivot is currently cooking in its own branch; you can already try it out for some basic index loading and search operations: https://github.com/DmitryKey/luke/tree/pivot-luke -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Solr returns incorrect results after sorting
Hi, I am using group.sort to internally sort the values first based on store (using a function), then stock, and finally distance, and to sort the output results based on price, but Solr does not return the correct results after sorting. Below is the sample query: q=*:*&start=0&rows=200&sort=pricecommon_double desc&d=321&spatial=true&sfield=store_location&fl=geodist(),*&pt=37.1037311,-76.5104751&group.ngroups=true&group.limit=1&group.facet=true&group.field=code_string&group=true&group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0)) desc,inStock_boolean desc&geodist() asc I am expecting all the docs to be sorted by price from high to low after grouping, but I see the records not matching that order. Do you see any issues with the query, or are functions in group.sort not supported in Solr? Regards, Raj -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Deleted Docs Issue
bq: If this operation is continuously done I would end up with a large set of deleted docs which will affect the performance of the queries I hit on this solr. No, you won't. They'll be merged away as background segments are merged. Here's a great visualization of the process; the third one down is the default TieredMergePolicy. In general, even in the case of replacing all the docs, you'll have 10% of your corpus be deleted docs. The % of deleted docs in a segment weighs quite heavily when it comes to the decision of which segment to merge (note that merging purges the deleted docs). Also in general, the results of small tests like this simply do not generalize, i.e. the number of deleted docs in a 200 doc sample size can't be extrapolated to a reasonable-sized corpus. Finally, I don't know if this is something temporary, but the implication of "If total commit operations I hit are 20" is that you're committing after every batch of docs is sent to Solr. You should not do this; let your autocommit settings handle it. Here's Mike's blog: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html Best, Erick On Mon, Mar 16, 2015 at 8:51 AM, Shawn Heisey apa...@elyograg.org wrote: On 3/16/2015 9:11 AM, vicky desai wrote: I am having an issue with my solr setup. In my solr config I have set the following property *<mergeFactor>10</mergeFactor>* The mergeFactor setting is deprecated ... but you are setting it to the default value of 10 anyway, so that's not really a big deal. It's possible that mergeFactor will no longer work in 5.0, but I'm not sure on that. You should instead use the settings specific to the merge policy, which normally is TieredMergePolicy. Note that when mergeFactor is 10, you *will* end up with more than 10 segments in your index. There are multiple merge tiers, and each one can have up to 10 segments before it is merged. Now consider the following situation. I have *200* documents in my index and I need to update all 200 docs. If the total commit operations I hit are *20*, i.e. I update in batches of 10 docs, merging is done after every 10th update, so the max segment count I can have is 10, which is fine. However, even when merging happens, deleted docs are not cleared and I end up with 100 deleted docs in the index. If this operation is done continuously I would end up with a large set of deleted docs, which will affect the performance of the queries I hit on this Solr. Because there are multiple merge tiers and you cannot easily pre-determine which segments will be chosen for a particular merge, the merge behavior may not be exactly what you expect. The only guaranteed way to get rid of your deleted docs is to do an optimize operation, which forces a merge of the entire index down to a single segment. This gets rid of all deleted docs in those segments. If you index more data while you are doing the optimize, then you may end up with additional deleted docs. Thanks, Shawn
Re: Solr tlog and soft commit
Can someone please reply to these questions? Thanks in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-tlog-and-soft-commit-tp4193105p4193311.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Whole RAM consumed while Indexing.
First start by lengthening your soft and hard commit intervals substantially. Start with 60000 and work backwards, I'd say. Ramkumar has tuned the heck out of his installation to get the commit intervals to be that short ;). I'm betting that you'll see your RAM usage go way down, but that's a guess until you test. Best, Erick On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Erick, You are correct. Some **overlapping searchers warning messages** are coming in the logs. The **numDocs numbers** are changing while documents are being added during indexing. Any help? On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson erickerick...@gmail.com wrote: First, the soft commit interval is very short. Very, very, very, very short. 300ms is just short of insane unless it's a typo ;). Here's a long background: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ But the short form is that you're opening searchers every 300 ms. The hard commit is better, but every 3 seconds is still far too short IMO. I'd start with soft commits of 60000 and hard commits of 60000 (60 seconds), meaning that you're going to have to wait 1 minute for docs to show up unless you explicitly commit. You're throwing away all the caches configured in solrconfig.xml more than 3 times a second, executing autowarming, etc, etc, etc. Changing these to longer intervals might cure the problem, but if not then, as Hoss would say, details matter. I suspect you're also seeing overlapping searchers warning messages in your log, and it's _possible_ that what's happening is that you're just exceeding the max warming searchers and never opening a new searcher with the newly-indexed documents. But that's a total shot in the dark. How are you looking for docs (and not finding them)? Does the numDocs number in the solr admin screen change? Best, Erick On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Alexandre, *Hard Commit* is: <autoCommit> <maxTime>${solr.autoCommit.maxTime:3000}</maxTime> <openSearcher>false</openSearcher> </autoCommit> *Soft Commit* is: <autoSoftCommit> <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime> </autoSoftCommit> And I am committing 20000 documents each time. Is this a good config for committing? Or am I doing something wrong? On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: What's your commit strategy? Explicit commits? Soft commits/hard commits (in solrconfig.xml)? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 12 March 2015 at 23:19, Nitin Solanki nitinml...@gmail.com wrote: Hello, I have written a python script to do 20000 documents indexing each time on Solr. I have 28 GB RAM with 8 CPU. When I started indexing, at that time 15 GB RAM was free. While indexing, all RAM is consumed but **not** a single document is indexed. Why so? And it threw *HTTPError: HTTP Error 503: Service Unavailable* in the python script. I think it is due to heavy load on Zookeeper by which all nodes went down. I am not sure about that. Any help please.. Or is anything else happening? And how do I overcome this issue? Please assist me towards the right path. Thanks.. Warm Regards, Nitin Solanki
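A small SolrJ sketch of the indexing pattern Erick recommends, not from the thread: no explicit commit() calls from the client, just a commitWithin on each add (or rely entirely on autoCommit in solrconfig.xml). The URL, batch size, and field names are placeholders.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder URL

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 20000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);           // placeholder fields
            doc.addField("text", "body of doc " + i);
            batch.add(doc);
            if (batch.size() == 1000) {
                // commitWithin of 60000 ms: no explicit commit() from the client,
                // documents become visible within a minute at the latest
                server.add(batch, 60000);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch, 60000);
        }
        server.shutdown();
    }
}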
Relevancy : Keyword stuffing
Hi all, I have a use case where the data is generated by SEO-minded authors, and more often than not they perfectly guess the synonym expansions for the document titles, skewing results in their favor. At the moment I don't have an offline processing infrastructure to detect these (I can't punish these docs either... just have to level the playing field). I am experimenting with taking the max of the term scores, cutting off scores after a certain number of terms, etc., but would appreciate any hints if anyone has experience dealing with a similar use case in Solr. Much appreciated, Mihran
Re: indexing db records via SolrJ
We import anywhere from five to fifty million small documents a day from a postgres database. I wrestled with getting the DIH stuff to work for us for about a year and was much happier when I ditched that approach and switched to writing the few hundred lines of relatively simple code to handle directly the logic of what gets updated and how it gets queried from postgres ourselves. The DIH stuff is great for lots of cases, but if you are getting to the point of trying to hack its undocumented internals, I suspect you are better off spending a day or two of your time just writing all of the update logic yourself. We found a relatively simple combination of postgres triggers, export to csv based on those triggers, and then just calling update/csv to work best for us. -hal On 3/16/15 9:59 AM, Shawn Heisey wrote: On 3/16/2015 7:15 AM, sreedevi s wrote: I had checked this post. I don't know whether this is possible, but my question is whether I can use the DIH configuration for indexing via SolrJ You can use SolrJ for accessing DIH. I have code that does this, but only for full index rebuilds. It won't be particularly obvious how to do it. Writing code that can interpret DIH status and know when it finishes, succeeds, or fails is very tricky because DIH only uses human-readable status info, not machine-readable, and the info is not very consistent. I can't just share my code, because it's extremely convoluted ... but the general gist is to create a SolrQuery object, use setRequestHandler to set the handler to /dataimport or whatever your DIH handler is, and set the other parameters on the request like command to full-import and so on. Thanks, Shawn -- Hal Roberts Fellow Berkman Center for Internet & Society Harvard University
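A short sketch of the final step Hal mentions: streaming an exported CSV file to Solr with SolrJ. Not from the thread; the file path, core URL, and the /update/csv handler path are assumptions (in some setups CSV goes to /update with a text/csv content type instead).

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder

        // Stream a CSV file exported from Postgres to the CSV update handler
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("/tmp/export.csv"), "text/csv; charset=utf-8"); // placeholder path
        req.setParam("commit", "true"); // or rely on autoCommit instead
        server.request(req);
        server.shutdown();
    }
}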
Solr Deleted Docs Issue
Hi, I am having an issue with my solr setup. In my solr config I have set the following property *<mergeFactor>10</mergeFactor>* Now consider the following situation. I have *200* documents in my index and I need to update all 200 docs. If the total commit operations I hit are *20*, i.e. I update in batches of 10 docs, merging is done after every 10th update, so the max segment count I can have is 10, which is fine. However, even when merging happens, deleted docs are not cleared and I end up with 100 deleted docs in the index. If this operation is done continuously I would end up with a large set of deleted docs, which will affect the performance of the queries I hit on this Solr. Can anyone please tell me if I have missed a config or if this is expected behaviour? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Deleted-Docs-Issue-tp4193292.html Sent from the Solr - User mailing list archive at Nabble.com.
thresholdTokenFrequency changes suggestion frequency..
Hi, I don't understand why the suggestion frequency differs from the original frequency. Example - I have a word = *who* and its original frequency is *100*, but when I look up its suggestion, the suggested frequency changes to *50*. I think this is happening because of *thresholdTokenFrequency*. When I set thresholdTokenFrequency to *0.1* it gives one frequency for the 'who' suggestion, while setting it to *0.0001* gives a different frequency. Why so? I don't see the logic behind this. As I understand it, the suggestion frequency should be the same as the original index frequency - *The spellcheck.extendedResults=true parameter provides frequency of each original term in the index (origFreq) as well as the frequency of each suggestion in the index (frequency).*
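For comparing origFreq and the suggestion frequency directly, here is a small SolrJ sketch (not from the thread) that issues a spellcheck request with extendedResults; the /spell handler name and the core URL are assumptions that depend on your solrconfig.xml.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SuggestionFreqCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder

        SolrQuery q = new SolrQuery("who");   // term whose suggestions we want to inspect
        q.setRequestHandler("/spell");        // assumed spellcheck handler name
        q.set("spellcheck", "true");
        q.set("spellcheck.extendedResults", "true");
        q.set("spellcheck.count", "10");

        SpellCheckResponse spell = server.query(q).getSpellCheckResponse();
        if (spell != null) {
            for (SpellCheckResponse.Suggestion s : spell.getSuggestions()) {
                // origFreq of the queried token vs. frequency of each suggested alternative
                System.out.println(s.getToken() + " origFreq=" + s.getOriginalFrequency());
                for (int i = 0; i < s.getAlternatives().size(); i++) {
                    System.out.println("  " + s.getAlternatives().get(i)
                        + " freq=" + s.getAlternativeFrequencies().get(i));
                }
            }
        }
        server.shutdown();
    }
}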
Re: Solr Deleted Docs Issue
On 3/16/2015 9:11 AM, vicky desai wrote: I am having an issue with my solr setup. In my solr config I have set the following property *<mergeFactor>10</mergeFactor>* The mergeFactor setting is deprecated ... but you are setting it to the default value of 10 anyway, so that's not really a big deal. It's possible that mergeFactor will no longer work in 5.0, but I'm not sure on that. You should instead use the settings specific to the merge policy, which normally is TieredMergePolicy. Note that when mergeFactor is 10, you *will* end up with more than 10 segments in your index. There are multiple merge tiers, and each one can have up to 10 segments before it is merged. Now consider the following situation. I have *200* documents in my index and I need to update all 200 docs. If the total commit operations I hit are *20*, i.e. I update in batches of 10 docs, merging is done after every 10th update, so the max segment count I can have is 10, which is fine. However, even when merging happens, deleted docs are not cleared and I end up with 100 deleted docs in the index. If this operation is done continuously I would end up with a large set of deleted docs, which will affect the performance of the queries I hit on this Solr. Because there are multiple merge tiers and you cannot easily pre-determine which segments will be chosen for a particular merge, the merge behavior may not be exactly what you expect. The only guaranteed way to get rid of your deleted docs is to do an optimize operation, which forces a merge of the entire index down to a single segment. This gets rid of all deleted docs in those segments. If you index more data while you are doing the optimize, then you may end up with additional deleted docs. Thanks, Shawn
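A minimal SolrJ sketch of the explicit optimize Shawn refers to (the core URL is a placeholder); note that it rewrites the entire index, so it should be used sparingly, e.g. after a full reindex, if at all.

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ForceMerge {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder

        // optimize() merges the whole index down to a single segment, which is the
        // only guaranteed way to purge all deleted docs; documents indexed while the
        // optimize runs can still leave new deletions behind.
        server.optimize();
        server.shutdown();
    }
}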
RE: Relevancy : Keyword stuffing
Hello - setting (e)dismax's tie breaker to 0, or much lower than the default, would `solve` this for now. Markus -Original message- From:Mihran Shahinian slowmih...@gmail.com Sent: Monday 16th March 2015 16:29 To: solr-user@lucene.apache.org Subject: Relevancy : Keyword stuffing Hi all, I have a use case where the data is generated by SEO-minded authors, and more often than not they perfectly guess the synonym expansions for the document titles, skewing results in their favor. At the moment I don't have an offline processing infrastructure to detect these (I can't punish these docs either... just have to level the playing field). I am experimenting with taking the max of the term scores, cutting off scores after a certain number of terms, etc., but would appreciate any hints if anyone has experience dealing with a similar use case in Solr. Much appreciated, Mihran
maxQueryFrequency v/s thresholdTokenFrequency
Hello everyone, can anybody please explain the difference between maxQueryFrequency and thresholdTokenFrequency? I found this link - http://wiki.apache.org/solr/SpellCheckComponent#thresholdTokenFrequency - but I am unable to understand it. I find the two very confusing. Your help is appreciated. Warm Regards, Nitin
RE: Relevancy : Keyword stuffing
Hello - Chris' suggestion is indeed a good one, but it can be tricky to configure the parameters properly. Regarding position information, you can override dismax to have it use SpanFirstQuery. It allows for setting strict boundaries from the front of the document to a given position. You can also override SpanFirstQuery to incorporate a gradient, to decrease boosting as distance from the front increases. I don't know how you ingest document bodies, but if they are unstructured HTML, you may want to install proper main content extraction if you haven't already. Having decent control over HTML is a powerful tool. You may also want to look at Lucene's BM25 implementation. It is simple to set up and easier to control. It isn't as rough a tool as TFIDF with regard to length normalization. Plus it allows you to smooth TF, which in your case should also help. If you'd like to scrutinize SSS and get some proper results, you are more than welcome to share them here :) Markus -Original message- From:Mihran Shahinian slowmih...@gmail.com Sent: Monday 16th March 2015 22:41 To: solr-user@lucene.apache.org Subject: Re: Relevancy : Keyword stuffing Thank you, Markus and Chris, for the pointers. For SweetSpotSimilarity I am thinking that perhaps a set of closed ranges exposed via the similarity config is easier to maintain as data changes than making adjustments to fit a function. Another piece of info that would've been handy is the average position plus the position info for the first few occurrences of each term. This would perhaps allow higher boosting for term occurrences earlier in the doc. In my case the extra keywords are towards the end of the doc, but that info does not seem to be propagated into the scorer. Thanks again, Mihran On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote: You should start by checking out the SweetSpotSimilarity .. it was heavily designed around the idea of dealing with things like excessively verbose titles, and keyword stuffing in summary text ... so you can configure your expectation for what a normal length doc is, and they will be penalized for being longer than that. Similarly you can say what a 'reasonable' tf is, and docs that exceed that wouldn't get added boost (which in conjunction with the lengthNorm penalty penalizes docs that stuff keywords) https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg -Hoss http://www.lucidworks.com/
discrepancy between LuceneQParser and ExtendedDismaxQParser
Hello, I found a discrepancy between LuceneQParser and ExtendedDismaxQParser when executing the following query: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451 When executing it through the Solr Admin panel and placing the query in the q field, I get the following debug output for LuceneQParser -- debug: { rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, querystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, parsedquery: +((+MatchAllDocsQuery(*:*) -text:area) area:[100 TO 300]) +objectId:40105451, parsedquery_toString: +((+*:* -text:area) area:[100 TO 300]) +objectId: \u0001\u0000\u0000\u0000\u0000\u0000\u0013\u000fkk, explain: { 40105451: \n14.3511 = (MATCH) sum of:\n 0.034590416 = (MATCH) product of:\n0.06918083 = (MATCH) sum of:\n 0.06918083 = (MATCH) sum of:\n 0.06918083 = (MATCH) MatchAllDocsQuery, product of:\n 0.06918083 = queryNorm\n0.5 = coord(1/2)\n 14.316509 = (MATCH) weight(objectId: \u0001\u0000\u0000\u0000\u0000\u0000\u0013\u000fkk in 1109978) [DefaultSimilarity], result of:\n14.316509 = score(doc=1109978,freq=1.0), product of:\n 0.9952025 = queryWeight, product of:\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n0.06918083 = queryNorm\n 14.385524 = fieldWeight in 1109978, product of:\n1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n1.0 = fieldNorm(doc=1109978)\n }, -- So one object is found, which is expected. For ExtendedDismaxQParser (the only difference is that the edismax checkbox is checked) I see this output -- debug: { rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, querystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, parsedquery: (+(+((+DisjunctionMaxQuery((text:*\\:*)) -DisjunctionMaxQuery((text:area))) area:[100 TO 300]) +objectId:40105451))/no_coord, parsedquery_toString: +(+((+(text:*\\:*) -(text:area)) area:[100 TO 300]) +objectId: \u0001\u0000\u0000\u0000\u0000\u0000\u0013\u000fkk), explain: {}, -- oops, no objects found! I hastily filed https://issues.apache.org/jira/browse/SOLR-7249 (sorry, my bad). You may refer to it for additional info (not going to duplicate it here). Thanks -- Best regards, Arsen mailto:barracuda...@mail.ru
Re: Nginx proxy for Solritas
On 3/16/2015 2:42 PM, LongY wrote: Thank you for the reply. I also thought the relevant resources (CSS, images, JavaScript) need to be accessible through Nginx. I copied the velocity folder to the solr-webapp/webapp folder. It didn't work. So how can I make the /browse resources accessible via the Nginx rule? The /browse handler causes your browser to make requests directly to Solr on handlers other than /browse. You must figure out what those requests are and allow them in the proxy configuration. I do not know whether they are relative URLs ... I would not be terribly surprised to learn that they have port 8983 in them rather than the port 80 on your proxy. Hopefully that's not the case, or you'll really have problems making it work on port 80. I've never spent any real time with the /browse handler. Requiring direct access to Solr is completely unacceptable for us. Thanks, Shawn
Re: Nginx proxy for Solritas
Thank you for the reply. I also thought the relevant resources (CSS, images, JavaScript) need to be accessible through Nginx. I copied the velocity folder to the solr-webapp/webapp folder. It didn't work. So how can I make the /browse resources accessible via the Nginx rule? -- View this message in context: http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193352.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: discrepancy between LuceneQParser and ExtendedDismaxQParser
There was a Solr release with a bug that required that you put a space between the left parenthesis and the *:*. The edismax parsed query here indicates that the *:* has not parsed properly. You have area, but in your jira you had a range query. -- Jack Krupansky On Mon, Mar 16, 2015 at 6:42 PM, Arsen barracuda...@mail.ru wrote: Hello, Found discrepancy between LuceneQParser and ExtendedDismaxQParser when executing following query: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451 When executing it through Solr Admin panel and placing query in q field I having following debug output for LuceneQParser -- debug: { rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, querystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, parsedquery: +((+MatchAllDocsQuery(*:*) -text:area) area:[100 TO 300]) +objectId:40105451, parsedquery_toString: +((+*:* -text:area) area:[100 TO 300]) +objectId: \u0001\u\u\u\u\u\u0013\u000fkk, explain: { 40105451: \n14.3511 = (MATCH) sum of:\n 0.034590416 = (MATCH) product of:\n0.06918083 = (MATCH) sum of:\n 0.06918083 = (MATCH) sum of:\n0.06918083 = (MATCH) MatchAllDocsQuery, product of:\n 0.06918083 = queryNorm\n0.5 = coord(1/2)\n 14.316509 = (MATCH) weight(objectId: \u0001\u\u\u\u\u\u0013\u000fkk in 1109978) [DefaultSimilarity], result of:\n14.316509 = score(doc=1109978,freq=1.0), product of:\n 0.9952025 = queryWeight, product of:\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n 0.06918083 = queryNorm\n 14.385524 = fieldWeight in 1109978, product of:\n1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n 1.0 = fieldNorm(doc=1109978)\n }, -- So, one object found which is expectable For ExtendedDismaxQParser (only difference is checkbox edismax checked) I am seeing this output -- debug: { rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, querystring: ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451, parsedquery: (+(+((+DisjunctionMaxQuery((text:*\\:*)) -DisjunctionMaxQuery((text:area))) area:[100 TO 300]) +objectId:40105451))/no_coord, parsedquery_toString: +(+((+(text:*\\:*) -(text:area)) area:[100 TO 300]) +objectId: \u0001\u\u\u\u\u\u0013\u000fkk), explain: {}, -- oops, no objects found! I hastened to fill https://issues.apache.org/jira/browse/SOLR-7249 (sorry, my bad) You may refer to it for additional info (not going to duplicate it here) Thanks -- Best regards, Arsen mailto:barracuda...@mail.ru
Re: Whole RAM consumed while Indexing.
Yes, and doing so is painful and takes lots of people and hardware resources to get there for large amounts of data and queries :) As Erick says, work backwards from 60s and first establish how high the commit interval can be and still satisfy your use case.. On 16 Mar 2015 16:04, Erick Erickson erickerick...@gmail.com wrote: First start by lengthening your soft and hard commit intervals substantially. Start with 60000 and work backwards, I'd say. Ramkumar has tuned the heck out of his installation to get the commit intervals to be that short ;). I'm betting that you'll see your RAM usage go way down, but that's a guess until you test. Best, Erick On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Erick, You are correct. Some **overlapping searchers warning messages** are coming in the logs. The **numDocs numbers** are changing while documents are being added during indexing. Any help? On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson erickerick...@gmail.com wrote: First, the soft commit interval is very short. Very, very, very, very short. 300ms is just short of insane unless it's a typo ;). Here's a long background: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ But the short form is that you're opening searchers every 300 ms. The hard commit is better, but every 3 seconds is still far too short IMO. I'd start with soft commits of 60000 and hard commits of 60000 (60 seconds), meaning that you're going to have to wait 1 minute for docs to show up unless you explicitly commit. You're throwing away all the caches configured in solrconfig.xml more than 3 times a second, executing autowarming, etc, etc, etc. Changing these to longer intervals might cure the problem, but if not then, as Hoss would say, details matter. I suspect you're also seeing overlapping searchers warning messages in your log, and it's _possible_ that what's happening is that you're just exceeding the max warming searchers and never opening a new searcher with the newly-indexed documents. But that's a total shot in the dark. How are you looking for docs (and not finding them)? Does the numDocs number in the solr admin screen change? Best, Erick On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Alexandre, *Hard Commit* is: <autoCommit> <maxTime>${solr.autoCommit.maxTime:3000}</maxTime> <openSearcher>false</openSearcher> </autoCommit> *Soft Commit* is: <autoSoftCommit> <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime> </autoSoftCommit> And I am committing 20000 documents each time. Is this a good config for committing? Or am I doing something wrong? On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: What's your commit strategy? Explicit commits? Soft commits/hard commits (in solrconfig.xml)? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 12 March 2015 at 23:19, Nitin Solanki nitinml...@gmail.com wrote: Hello, I have written a python script to do 20000 documents indexing each time on Solr. I have 28 GB RAM with 8 CPU. When I started indexing, at that time 15 GB RAM was free. While indexing, all RAM is consumed but **not** a single document is indexed. Why so? And it threw *HTTPError: HTTP Error 503: Service Unavailable* in the python script. I think it is due to heavy load on Zookeeper by which all nodes went down. I am not sure about that. Any help please.. Or is anything else happening? And how do I overcome this issue? Please assist me towards the right path. Thanks..
Warm Regards, Nitin Solanki
Re: indexing db records via SolrJ
Take a look at some of the integrations people are using with apache storm, we do something similar on a larger scale , having created a pgsql spout and having a solr indexing bolt. -msj On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts hrobe...@cyber.law.harvard.edu wrote: We import anywhere from five to fifty million small documents a day from a postgres database. I wrestled to get the DIH stuff to work for us for about a year and was much happier when I ditched that approach and switched to writing the few hundred lines of relatively simple code to handle directly the logic of what gets updated and how it gets queried from postgres ourselves. The DIH stuff is great for lots of cases, but if you are getting to the point of trying to hack its undocumented internals, I suspect you are better off spending a day or two of your time just writing all of the update logic yourself. We found a relatively simple combination of postgres triggers, export to csv based on those triggers, and then just calling update/csv to work best for us. -hal On 3/16/15 9:59 AM, Shawn Heisey wrote: On 3/16/2015 7:15 AM, sreedevi s wrote: I had checked this post.I dont know whether this is possible but my query is whether I can use the configuration for DIH for indexing via SolrJ You can use SolrJ for accessing DIH. I have code that does this, but only for full index rebuilds. It won't be particularly obvious how to do it. Writing code that can intepret DIH status and know when it finishes, succeeds, or fails is very tricky because DIH only uses human-readable status info, not machine-readable, and the info is not very consistent. I can't just share my code, because it's extremely convoluted ... but the general gist is to create a SolrQuery object, use setRequestHandler to set the handler to /dataimport or whatever your DIH handler is, and set the other parameters on the request like command to full-import and so on. Thanks, Shawn -- Hal Roberts Fellow Berkman Center for Internet Society Harvard University
Re: indexing db records via SolrJ
Do you have any references to such integrations (Solr + Storm)? Thanks From: mike st. john mstj...@gmail.com Sent: Monday, March 16, 2015 2:39 PM To: solr-user@lucene.apache.org Subject: Re: indexing db records via SolrJ Take a look at some of the integrations people are using with apache storm, we do something similar on a larger scale , having created a pgsql spout and having a solr indexing bolt. -msj On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts hrobe...@cyber.law.harvard.edu wrote: We import anywhere from five to fifty million small documents a day from a postgres database. I wrestled to get the DIH stuff to work for us for about a year and was much happier when I ditched that approach and switched to writing the few hundred lines of relatively simple code to handle directly the logic of what gets updated and how it gets queried from postgres ourselves. The DIH stuff is great for lots of cases, but if you are getting to the point of trying to hack its undocumented internals, I suspect you are better off spending a day or two of your time just writing all of the update logic yourself. We found a relatively simple combination of postgres triggers, export to csv based on those triggers, and then just calling update/csv to work best for us. -hal On 3/16/15 9:59 AM, Shawn Heisey wrote: On 3/16/2015 7:15 AM, sreedevi s wrote: I had checked this post.I dont know whether this is possible but my query is whether I can use the configuration for DIH for indexing via SolrJ You can use SolrJ for accessing DIH. I have code that does this, but only for full index rebuilds. It won't be particularly obvious how to do it. Writing code that can intepret DIH status and know when it finishes, succeeds, or fails is very tricky because DIH only uses human-readable status info, not machine-readable, and the info is not very consistent. I can't just share my code, because it's extremely convoluted ... but the general gist is to create a SolrQuery object, use setRequestHandler to set the handler to /dataimport or whatever your DIH handler is, and set the other parameters on the request like command to full-import and so on. Thanks, Shawn -- Hal Roberts Fellow Berkman Center for Internet Society Harvard University
Re: Data Import Handler - reading GET
Have you tried it as ${dih.request.foo}? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 16 March 2015 at 14:51, Kiran J kiranjuni...@gmail.com wrote: Hi, In the data import handler, I can read the clean query parameter using ${dih.request.clean} and pass it on to the queries. Is it possible to read any query parameter from the URL, e.g. ${foo}? Thanks
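A hedged sketch of how such a parameter could be passed from SolrJ so it becomes visible inside data-config.xml as ${dih.request.foo}; the handler path, parameter name, and value are placeholders, and the same effect can be had by simply appending &foo=... to the /dataimport URL.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DihCustomParam {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder

        SolrQuery req = new SolrQuery();
        req.setRequestHandler("/dataimport");   // assumed DIH handler path
        req.set("command", "full-import");
        // Any extra request parameter travels with the import request and can be
        // referenced in data-config.xml as ${dih.request.foo}, e.g. in an entity
        // query: WHERE updated_at > '${dih.request.foo}' (hypothetical column)
        req.set("foo", "2015-03-16 00:00:00");
        server.query(req);
        server.shutdown();
    }
}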
Data Import Handler - reading GET
Hi, In the data import handler, I can read the clean query parameter using ${dih.request.clean} and pass it on to the queries. Is it possible to read any query parameter from the URL, e.g. ${foo}? Thanks
Re: Relevancy : Keyword stuffing
You should start by checking out the SweetSpotSimilarity .. it was heavily designed around the idea of dealing with things like excessively verbose titles, and keyword stuffing in summary text ... so you can configure your expectation for what a normal length doc is, and they will be penalized for being longer than that. Similarly you can say what a 'reasonable' tf is, and docs that exceed that wouldn't get added boost (which in conjunction with the lengthNorm penalty penalizes docs that stuff keywords) https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg -Hoss http://www.lucidworks.com/
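In Solr these knobs are normally set through SweetSpotSimilarityFactory in the schema, but a plain-Lucene sketch makes the two levers Hoss mentions concrete: the length-norm "plateau" and a dampened tf (the baseline variant here). All parameter values below are made up for illustration, not recommendations.

import org.apache.lucene.misc.SweetSpotSimilarity;

public class SweetSpotDemo {
    public static void main(String[] args) {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();

        // Docs between 200 and 1000 terms get the full length norm; shorter or
        // longer docs are penalized increasingly steeply (illustrative values).
        sim.setLengthNormFactors(200, 1000, 0.5f, true);

        // Baseline tf: occurrences up to the min contribute a flat baseline, and
        // beyond that the contribution grows only like a square root, which blunts
        // the payoff of stuffing extra copies of a keyword (illustrative values).
        sim.setBaselineTfFactors(1.5f, 3f);

        // Print the curves to see where the "sweet spots" sit
        for (int freq = 1; freq <= 41; freq += 10) {
            System.out.println("tf(" + freq + ") = " + sim.tf(freq));
        }
        for (int len = 100; len <= 2100; len += 500) {
            System.out.println("lengthNorm(" + len + ") = " + sim.computeLengthNorm(len));
        }
    }
}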
Re: [Poll]: User need for Solr security
Hi, We tend to recommend ManifoldCF for document-level security, since that is exactly what it is built for, so I doubt we'll see that as a built-in feature in Solr. However, the Solr integration is really not that advanced, and I also see customers implementing similar logic themselves with success. On the document feeding side you need to add a few more fields to all your documents, typically include_acl and exclude_acl. Populate those fields with data from LDAP about who (what groups) does and does not have access to that document. If it is open information, index a special token open in the include field. Then, assuming your search client application has authenticated a user, you would construct a filter with this user's groups, e.g. fq=include_acl:(groupA OR open)&fq=-exclude_acl:(groupA) The filter would be constructed either in your application or in a Solr search component or query parser. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 13 Mar 2015, at 01:48, johnmu...@aol.com wrote: I would love to see record-level (or even field-level) restricted access in Solr / Lucene. This should be group-level, LDAP-like, or some rule-based scheme (which can be dynamic). If the solution means having a second core, so be it. The following is the closest I found: https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I cannot use Manifold CF (Connector Framework). Does anyone know how Manifold does it? - MJ -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, March 12, 2015 6:51 PM To: solr-user@lucene.apache.org Subject: RE: [Poll]: User need for Solr security Jan - we don't really need any security for our products, nor for most clients. However, one client does deal with very sensitive data, so we proposed to encrypt the transfer of data and the data on disk through a Lucene Directory. It won't fill all gaps, but it would adhere to such a client's guidelines. I think many approaches to security in Solr/Lucene would find advocates, be it index encryption or authentication/authorization or transport security, which is now possible. I understand the reluctance of the PMC, and I agree with it, but some users would definitely benefit and it would certainly make Solr/Lucene the search platform to use for some enterprises. Markus -Original message- From:Henrique O. Santos hensan...@gmail.com Sent: Thursday 12th March 2015 23:43 To: solr-user@lucene.apache.org Subject: Re: [Poll]: User need for Solr security Hi, I’m currently working with indexes that need document-level security. Based on the user logged in, query results would omit documents that this user doesn’t have access to, with LDAP integration and such. I think that would be nice to have in a future Solr release. Henrique. On Mar 12, 2015, at 7:32 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Securing various Solr APIs has once again surfaced as a discussion in the developer list. See e.g. SOLR-7236. It would be useful to get some feedback from Solr users about needs in the field. Please reply to this email and let us know what security aspect(s) would be most important for your company to see supported in a future version of Solr. Examples: local user management, AD/LDAP integration, SSL, authenticated login to Admin UI, authorization for Admin APIs, e.g. admin user vs read-only user, etc. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
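A small SolrJ sketch of the filter construction Jan describes, done in the search application; the include_acl/exclude_acl field names and the "open" token follow his example, while the group lookup, query text, and core URL are placeholders.

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class AclSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder

        // Groups of the authenticated user, e.g. resolved from LDAP (placeholder values)
        List<String> groups = Arrays.asList("groupA", "groupB");
        StringBuilder joined = new StringBuilder();
        for (String g : groups) {
            if (joined.length() > 0) {
                joined.append(" OR ");
            }
            joined.append(g);
        }

        SolrQuery q = new SolrQuery("some user query"); // placeholder query
        // Only docs whose include_acl matches one of the user's groups (or the
        // special "open" token), and whose exclude_acl does not.
        q.addFilterQuery("include_acl:(" + joined + " OR open)");
        q.addFilterQuery("-exclude_acl:(" + joined + ")");
        System.out.println(server.query(q).getResults().getNumFound());
        server.shutdown();
    }
}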
Re: Nginx proxy for Solritas
The links to the screenshots aren’t working for me. I’m not sure what the issue is - but do be aware that /browse with its out-of-the-box templates refers to resources (CSS, images, JavaScript) that aren’t under /browse, so you’ll need to allow those to be accessible as well with different rules. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com http://www.lucidworks.com/ On Mar 16, 2015, at 3:39 PM, LongY zhangyulin8...@hotmail.com wrote: Dear Community Members, I have searched over the forum and googled a lot but still didn't find a solution, which finally got me here for help. I am implementing an Nginx reverse proxy for Solritas (VelocityResponseWriter) from the example included in Solr. Nginx listens on port 80, and Solr runs on port 8983. This is my Nginx configuration (it only permits localhost to access the /browse request handler): *location ~* /solr/\w+/browse { proxy_pass http://localhost:8983; allow 127.0.0.1; deny all; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header Host $http_host; }* When I input http://localhost/solr/collection1/browse in the browser address bar, the output I get is this: http://lucene.472066.n3.nabble.com/file/n4193346/left.png The expected output should look like this: http://lucene.472066.n3.nabble.com/file/n4193346/right.png I tested the Admin page with this Nginx configuration with some minor modifications and it worked well, but when used with the Velocity templates it did not render the output properly. Any input is welcome. Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193346.html Sent from the Solr - User mailing list archive at Nabble.com.
Nginx proxy for Solritas
Dear Community Members, I have searched over the forum and googled a lot but still didn't find a solution, which finally got me here for help. I am implementing an Nginx reverse proxy for Solritas (VelocityResponseWriter) from the example included in Solr. Nginx listens on port 80, and Solr runs on port 8983. This is my Nginx configuration (it only permits localhost to access the /browse request handler):

location ~* /solr/\w+/browse {
    proxy_pass http://localhost:8983;
    allow 127.0.0.1;
    deny all;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
}

When I input http://localhost/solr/collection1/browse in the browser address bar, the output I get is this: http://lucene.472066.n3.nabble.com/file/n4193346/left.png The expected output should look like this: http://lucene.472066.n3.nabble.com/file/n4193346/right.png I tested the Admin page with this Nginx configuration with some minor modifications and it worked well, but when used with the Velocity templates it did not render the output properly. Any input is welcome. Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193346.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr 4.7.2 mergeFactor/ Merge policy issue
Hi, I can confirm similar behaviour, but for solr 4.3.1. We use default values for the merge-related settings. Even though mergeFactor=10 by default, there are 13 segments in one core and 30 segments in another. I am not sure it proves there is a bug in the merging, because it depends on the TieredMergePolicy. Relevant discussion from the past: http://lucene.472066.n3.nabble.com/TieredMergePolicy-reclaimDeletesWeight-td4071487.html Apart from other policy parameters you could play with ReclaimDeletesWeight, in case you'd like to affect how the segments with deletes in them get merged. See http://stackoverflow.com/questions/18361300/informations-about-tieredmergepolicy Regarding your attachment: I believe it got cut by the mailing list system, could you share it via a file sharing system? On Sat, Mar 14, 2015 at 7:36 AM, Summer Shire shiresum...@gmail.com wrote: Hi All, Did anyone get a chance to look at my config and the infoStream file? I am very curious to see what you think. thanks, Summer On Mar 6, 2015, at 5:20 PM, Summer Shire shiresum...@gmail.com wrote: Hi All, Here's more of an update on where I am at with this. I enabled infoStream logging and quickly figured out that I need to get rid of maxBufferedDocs. So Erick you were absolutely right on that. I increased my ramBufferSize to 100MB and reduced maxMergeAtOnce to 3 and segmentsPerTier to 3 as well. My config looks like this:

<indexConfig>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <!-- <maxMergeSizeForForcedMerge>9223372036854775807</maxMergeSizeForForcedMerge> -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">3</int>
    <int name="segmentsPerTier">3</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
  <infoStream file="/tmp/INFOSTREAM.txt">true</infoStream>
</indexConfig>

I am attaching a sample infostream log file. In the infoStream logs, though, you can see how the segments keep on adding up, and it shows (just an example) allowedSegmentCount=10 vs count=9 (eligible count=9) tooBigCount=0 I looked at TieredMergePolicy.java to see how allowedSegmentCount is getting calculated:

// Compute max allowed segs in the index
long levelSize = minSegmentBytes;
long bytesLeft = totIndexBytes;
double allowedSegCount = 0;
while(true) {
  final double segCountLevel = bytesLeft / (double) levelSize;
  if (segCountLevel < segsPerTier) {
    allowedSegCount += Math.ceil(segCountLevel);
    break;
  }
  allowedSegCount += segsPerTier;
  bytesLeft -= segsPerTier * levelSize;
  levelSize *= maxMergeAtOnce;
}
int allowedSegCountInt = (int) allowedSegCount;

and the minSegmentBytes is calculated as follows:

// Compute total index bytes & print details about the index
long totIndexBytes = 0;
long minSegmentBytes = Long.MAX_VALUE;
for(SegmentInfoPerCommit info : infosSorted) {
  final long segBytes = size(info);
  if (verbose()) {
    String extra = merging.contains(info) ? " [merging]" : "";
    if (segBytes >= maxMergedSegmentBytes/2.0) {
      extra += " [skip: too large]";
    } else if (segBytes < floorSegmentBytes) {
      extra += " [floored]";
    }
    message("  seg=" + writer.get().segString(info) + " size=" + String.format(Locale.ROOT, "%.3f", segBytes/1024/1024.) + " MB" + extra);
  }
  minSegmentBytes = Math.min(segBytes, minSegmentBytes);
  // Accum total byte size
  totIndexBytes += segBytes;
}

any input is welcome. myinfoLog.rtf thanks, Summer On Mar 5, 2015, at 8:11 AM, Erick Erickson erickerick...@gmail.com wrote: I would, BTW, either just get rid of the maxBufferedDocs altogether or make it much higher, i.e. 100000.
I don't think this is really your problem, but you're creating a lot of segments here. But I'm kind of at a loss as to what would be different about your setup. Is there _any_ chance that you have some secondary process looking at your index that's maintaining open searchers? Any custom code that's perhaps failing to close searchers? Is this a Unix or Windows system? And just to be really clear, you _only_ seeing more segments being added, right? If you're only counting files in the index directory, it's _possible_ that merging is happening, you're just seeing new files take the place of old ones. Best, Erick On Wed, Mar 4, 2015 at 7:12 PM, Shawn Heisey apa...@elyograg.org wrote: On 3/4/2015 4:12 PM, Erick Erickson wrote: I _think_, but don't know for sure, that the merging stuff doesn't get triggered until you commit, it doesn't just happen. Shot in the dark... I believe that new segments are created when the indexing buffer (ramBufferSizeMB) fills up, even without commits. I'm pretty sure that anytime a new segment is created, the merge policy is