Re: Real Time Search and External File Fields
I chose 16 as a place to start. You usually reach diminishing returns pretty quickly; I feel it's a mistake to set your autowarm counts to, say, 256 (and I've seen this in the thousands) unless you have some proof that it's useful to bump higher. But certainly if you set them to 16 and see spikes just after a searcher is opened that aren't tolerable, feel free to make them larger.

You've hit on exactly why newSearcher and firstSearcher are there. The theory behind autowarm counts is that the last N entries are likely to be useful in the near future. There's no guarantee at all that this is true, and newSearcher/firstSearcher are certain to exercise what _you_ think is most important.

As for why autowarm counts are set to 0 in the examples, there's no overarching reason. Certainly if the soft commit interval is 1 second, autowarming is largely useless, so having it also at 0 makes sense.

Best,
Erick

On Sat, Oct 8, 2016 at 12:31 PM, Walter Underwood wrote:
> With time-oriented data, you can use an old trick (goes back to Infoseek in 1995).
>
> Make a “today” collection that is very fresh. Nightly, migrate new documents to the “not today” collection. The today collection will be small and can be updated quickly. The archive collection will be large and slow to update, but who cares?
>
> You can also send all docs to both collections and de-dupe.
>
> Every night, you start over with the “today” collection.
>
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Oct 8, 2016, at 12:18 PM, Mike Lissner wrote:
>>
>> On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson wrote:
>>
>>> What you haven't mentioned is how often you add new docs. Is it once a day? Steadily from 8:00 to 17:00?
>>
>> Alas, it's a steady trickle during business hours. We're ingesting court documents as they're posted on court websites, then sending alerts as soon as possible.
>>
>>> Whatever, your soft commit really should be longer than your autowarm interval. Configure autowarming to reference queries (firstSearcher or newSearcher events, or autowarm counts in queryResultCache and filterCache. Say 16 in each of these latter for a start) such that they cause the external file to load. That _should_ prevent any queries from being blocked, since the autowarming will happen in the background, and while it's happening incoming queries will be served by the old searcher.
>>
>> I want to make sure I understand this properly and document this for future people that may find this thread. Here's what I interpret your advice to be:
>>
>> 0. Slacken my auto soft commit interval to something more like a minute.
>>
>> 1. Set up a query in the newSearcher listener that uses my external file field.
>> 1a. Do the same in firstSearcher if I want newly started Solr to warm up before getting queries (this doesn't matter to me, so I'm skipping this).
>>
>> and/or
>>
>> 2. Set autowarmCount in queryResultCache and filterCache to 16 so that the top 16 query results from the previous searcher are regenerated in the new searcher.
>>
>> Doing #1 seems like a safe strategy since it's guaranteed to hit the external file field. #2 feels like a bonus.
>>
>> I'm a bit confused about the example autowarmCount for the caches, which is 0. Why not set this to something higher? I guess it's a RAM utilization vs. speed tradeoff? A low number like 16 seems like it'd have minimal impact on RAM?
>>
>> Thanks for all the great replies and for everything you do for Solr. I truly appreciate your efforts.
>>
>> Mike
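Put together in solrconfig.xml, Erick's advice might look roughly like the sketch below. The field name `external_pagerank`, the warming queries, and the function sort are assumptions about this particular setup, not settings taken from the thread:

```xml
<!-- Sketch: autowarm the top 16 entries from the old searcher's caches. -->
<filterCache class="solr.FastLRUCache" size="512"
             initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache" size="512"
                  initialSize="512" autowarmCount="16"/>

<!-- Static warming queries that exercise the external file field so it
     loads before the new searcher starts serving traffic. The sort by
     a function over the (hypothetical) external_pagerank field is one
     way to force the load; adapt to however the field is actually used. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">field(external_pagerank) desc</str>
    </lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">field(external_pagerank) desc</str>
    </lst>
  </arr>
</listener>
```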
Re: Stream expressions: Break up multivalue field into usable tuples
Great. I'm not sure if you noticed, but SOLR-9537 has been committed and will be in 6.3, so now you can directly wrap a facet expression with the scoreNodes expression.

Yeah, other scoring algorithms would be a great thing. We can adjust the ScoreNodesStream to make this more flexible. Feel free to create a ticket to kick off the discussion.

The fetch() expression (SOLR-9337) is also ready to commit. This will allow you to run the following construct:

classify(fetch(top(scoreNodes(facet()))))

This runs the facets, scores them, takes the top N, fetches a text field (product description), and runs a classifier to personalize the recommendation. This will work with graph expressions just as well as facets. classify() uses the model that is optimized by the train() function and stored in SolrCloud. This makes combining graph queries with AI models very simple to deploy in recommender systems.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, Oct 8, 2016 at 8:54 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> Joel -- thanks! Got this working and now feel in better shape to grok what's happening.
>
> Out of curiosity, is there any work being done to customize scoreNodes scoring? There's a bunch of other forms of similarity I wouldn't mind playing with as well.
>
> On Thu, Sep 22, 2016 at 6:06 PM Joel Bernstein wrote:
> > You could use the facet() expression, which works with multi-value fields. This emits aggregated tuples useful for recommendations. For example:
> >
> > facet(baskets,
> >       q="item:taco",
> >       buckets="item",
> >       bucketSorts="count(*) desc",
> >       bucketSizeLimit="100",
> >       count(*))
> >
> > You can feed this to scoreNodes() to score the tuples for a recommendation. scoreNodes is a graph expression, so it expects tuples to be formatted like a node set. Specifically it looks for the following fields: node, field and collection, which it uses to retrieve the IDF for each node.
> >
> > The select() function can turn your facet response into a node set, so scoreNodes can operate on it:
> >
> > scoreNodes(
> >   select(facet(baskets,
> >                q="item:taco",
> >                buckets="item",
> >                bucketSorts="count(*) desc",
> >                bucketSizeLimit=100,
> >                count(*)),
> >          item as node,
> >          count(*),
> >          replace(collection, null, withValue=baskets),
> >          replace(field, null, withValue=item)))
> >
> > There is a ticket open to have scoreNodes operate directly on the facet() function so you don't have to deal with the select() function: https://issues.apache.org/jira/browse/SOLR-9537. I'd like to get to this soon.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Sep 22, 2016 at 5:02 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> > > I have documents like the following in my search index:
> > >
> > > {
> > >   "shopper_id": 1234,
> > >   "basket_id": 2512,
> > >   "items_bought": ["eggs", "tacos", "nachos"]
> > > }
> > >
> > > {
> > >   "shopper_id": 1236,
> > >   "basket_id": 2515,
> > >   "items_bought": ["eggs", "tacos", "chicken", "bubble gum"]
> > > }
> > >
> > > I would like to use some of the stream expression capabilities (in this case I'm looking at the recsys stuff), but it seems like I need to break up my data into tuples like:
> > >
> > > {
> > >   "shopper_id": 1234,
> > >   "basket_id": 2512,
> > >   "item": "egg"
> > > },
> > > {
> > >   "shopper_id": 1234,
> > >   "basket_id": 2512,
> > >   "item": "taco"
> > > },
> > > {
> > >   "shopper_id": 1234,
> > >   "basket_id": 2512,
> > >   "item": "nacho"
> > > }
> > > ...
> > >
> > > For various other reasons, I'd prefer to keep my original data model with one Solr doc == one shopper basket.
> > >
> > > Now is there a way to take the documents above, output from a search tuple source, and apply a stream mutator to emit baskets with a field broken up like above? (Do let me know if I'm missing something completely here.)
> > >
> > > Thanks!
> > > -Doug
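Spelled out with arguments, the classify/fetch/top/scoreNodes construct Joel describes might look something like the sketch below. The parameter names (`n`, `sort`, `fl`, `on`, `field`), the model/collection names, and the exact nesting are assumptions based on how these streaming expressions are typically invoked, not verbatim syntax from this thread:

```
classify(
  model(models, id="basketModel"),
  fetch(products,
        top(n=10,
            scoreNodes(facet(baskets,
                             q="item:taco",
                             buckets="item",
                             bucketSorts="count(*) desc",
                             bucketSizeLimit="100",
                             count(*))),
            sort="nodeScore desc"),
        fl="description",
        on="node=id"),
  field="description")
```

Here facet() aggregates the baskets, scoreNodes() weights the buckets by IDF, top() keeps the ten best-scoring nodes, fetch() joins in the product description text, and classify() applies a model previously stored by train().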
Re: Stream expressions: Break up multivalue field into usable tuples
Joel -- thanks! Got this working and now feel in better shape to grok what's happening.

Out of curiosity, is there any work being done to customize scoreNodes scoring? There's a bunch of other forms of similarity I wouldn't mind playing with as well.

On Thu, Sep 22, 2016 at 6:06 PM Joel Bernstein wrote:

You could use the facet() expression, which works with multi-value fields. This emits aggregated tuples useful for recommendations. For example:

facet(baskets,
      q="item:taco",
      buckets="item",
      bucketSorts="count(*) desc",
      bucketSizeLimit="100",
      count(*))

You can feed this to scoreNodes() to score the tuples for a recommendation. scoreNodes is a graph expression, so it expects tuples to be formatted like a node set. Specifically it looks for the following fields: node, field and collection, which it uses to retrieve the IDF for each node.

The select() function can turn your facet response into a node set, so scoreNodes can operate on it:

scoreNodes(
  select(facet(baskets,
               q="item:taco",
               buckets="item",
               bucketSorts="count(*) desc",
               bucketSizeLimit=100,
               count(*)),
         item as node,
         count(*),
         replace(collection, null, withValue=baskets),
         replace(field, null, withValue=item)))

There is a ticket open to have scoreNodes operate directly on the facet() function so you don't have to deal with the select() function: https://issues.apache.org/jira/browse/SOLR-9537. I'd like to get to this soon.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 22, 2016 at 5:02 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> I have documents like the following in my search index:
>
> {
>   "shopper_id": 1234,
>   "basket_id": 2512,
>   "items_bought": ["eggs", "tacos", "nachos"]
> }
>
> {
>   "shopper_id": 1236,
>   "basket_id": 2515,
>   "items_bought": ["eggs", "tacos", "chicken", "bubble gum"]
> }
>
> I would like to use some of the stream expression capabilities (in this case I'm looking at the recsys stuff), but it seems like I need to break up my data into tuples like:
>
> {
>   "shopper_id": 1234,
>   "basket_id": 2512,
>   "item": "egg"
> },
> {
>   "shopper_id": 1234,
>   "basket_id": 2512,
>   "item": "taco"
> },
> {
>   "shopper_id": 1234,
>   "basket_id": 2512,
>   "item": "nacho"
> }
> ...
>
> For various other reasons, I'd prefer to keep my original data model with one Solr doc == one shopper basket.
>
> Now is there a way to take the documents above, output from a search tuple source, and apply a stream mutator to emit baskets with a field broken up like above? (Do let me know if I'm missing something completely here.)
>
> Thanks!
> -Doug
Re: solr 5 leaving tomcat, will I be the only one fearing about this?
On 9/10/16 11:11am, Aristedes Maniatis wrote:
> * deployment is also scattered:
>   - Solr platform-specific package manager (pkg in FreeBSD in my case, which I've had to write myself since it didn't exist)
>   - updating config files above
>   - writing custom scripts to push Zookeeper configuration into production
>   - creating collections/cores using the API rather than in a config file

Oh, and pushing additional jars (like a JDBC adapter) into a special folder. Again, not easily testable or version controlled.

Ari

--
--> Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
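For what it's worth, the jar-dropping can at least be made explicit in solrconfig.xml with `<lib>` directives, so the dependency is declared in a version-controlled file even if the jar itself still has to be shipped separately. A sketch, where the directory paths and jar name pattern are placeholders:

```xml
<!-- solrconfig.xml: load extra jars (e.g. a JDBC driver) from a
     declared location instead of a magic folder. Paths below are
     illustrative, not defaults. -->
<config>
  <lib dir="/opt/solr-extras/lib" regex="postgresql-.*\.jar"/>
  <lib dir="${solr.install.dir:../..}/contrib/dataimporthandler/lib"
       regex=".*\.jar"/>
</config>
```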
Re: solr 5 leaving tomcat, will I be the only one fearing about this?
On 9/10/16 2:09am, Shawn Heisey wrote:
> One of the historical challenges on this mailing list is that we were
> rarely aware of what steps the user had taken to install or start Solr,
> and we had to support pretty much any scenario. Since 5.0, the number
> of supported ways to deploy and start Solr is greatly reduced, and those
> ways were written by the project, so we tend to have a better
> understanding of what is happening when a user starts Solr. We also
> usually know the relative location of the logfiles and Solr's data.

This migration is causing a lot of grief for us as well, and we are still struggling to get all the bits in place.

Before:
* gradle build script
* gradle project includes our own unit tests, run in Jenkins
* generates war file
* relevant configuration is embedded into the build
* deployment-specific variables (db uris, passwords, ip addresses) conveniently contained in one context.xml file

Now:
* Solr version is no longer bound to our tests or configuration
* configuration is now scattered in three places:
  - Zookeeper
  - solr.xml in the data directory
  - jetty files as part of the Solr install that you need to replace (for example to set JNDI properties)
* deployment is also scattered:
  - Solr platform-specific package manager (pkg in FreeBSD in my case, which I've had to write myself since it didn't exist)
  - updating config files above
  - writing custom scripts to push Zookeeper configuration into production
  - creating collections/cores using the API rather than in a config file
* unit testing no longer possible since you can't run a mock Zookeeper instance
* Zookeeper is very hard to integrate with deployment processes (salt, puppet, etc.) since configuration is no longer a set of version-controlled files
* you can't change the configuration of one node as a 'soft deployment': the whole cluster needs to be changed at once

If we didn't need a less broken replication solution, I'd stay on Solr 4 forever. I really liked the old war deployment. It bound the Solr version and configuration management into our version-controlled source repository, except for one context.xml file that contained server-specific deployment options. Nice. The new arrangement is a mess.

Ari

--
--> Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
Re: Real Time Search and External File Fields
With time-oriented data, you can use an old trick (goes back to Infoseek in 1995).

Make a “today” collection that is very fresh. Nightly, migrate new documents to the “not today” collection. The today collection will be small and can be updated quickly. The archive collection will be large and slow to update, but who cares?

You can also send all docs to both collections and de-dupe.

Every night, you start over with the “today” collection.

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 8, 2016, at 12:18 PM, Mike Lissner wrote:
>
> On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson wrote:
>
>> What you haven't mentioned is how often you add new docs. Is it once a day? Steadily from 8:00 to 17:00?
>
> Alas, it's a steady trickle during business hours. We're ingesting court documents as they're posted on court websites, then sending alerts as soon as possible.
>
>> Whatever, your soft commit really should be longer than your autowarm interval. Configure autowarming to reference queries (firstSearcher or newSearcher events, or autowarm counts in queryResultCache and filterCache. Say 16 in each of these latter for a start) such that they cause the external file to load. That _should_ prevent any queries from being blocked, since the autowarming will happen in the background, and while it's happening incoming queries will be served by the old searcher.
>
> I want to make sure I understand this properly and document this for future people that may find this thread. Here's what I interpret your advice to be:
>
> 0. Slacken my auto soft commit interval to something more like a minute.
>
> 1. Set up a query in the newSearcher listener that uses my external file field.
> 1a. Do the same in firstSearcher if I want newly started Solr to warm up before getting queries (this doesn't matter to me, so I'm skipping this).
>
> and/or
>
> 2. Set autowarmCount in queryResultCache and filterCache to 16 so that the top 16 query results from the previous searcher are regenerated in the new searcher.
>
> Doing #1 seems like a safe strategy since it's guaranteed to hit the external file field. #2 feels like a bonus.
>
> I'm a bit confused about the example autowarmCount for the caches, which is 0. Why not set this to something higher? I guess it's a RAM utilization vs. speed tradeoff? A low number like 16 seems like it'd have minimal impact on RAM?
>
> Thanks for all the great replies and for everything you do for Solr. I truly appreciate your efforts.
>
> Mike
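In SolrCloud, the two collections in Walter's scheme can be queried together by listing both in the `collection` parameter, or hidden behind a single alias. Collection and field names below are placeholders, and note that distributed grouping (one way to de-dupe) has its own caveats:

```
# Query both collections in one request:
http://localhost:8983/solr/today/select?q=foo&collection=today,archive

# De-dupe docs sent to both collections by grouping on a unique field:
http://localhost:8983/solr/today/select?q=foo&collection=today,archive
    &group=true&group.field=doc_id&group.limit=1

# Or create an alias so clients always query one name:
http://localhost:8983/solr/admin/collections?action=CREATEALIAS
    &name=everything&collections=today,archive
```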
Re: Real Time Search and External File Fields
On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson wrote:
> What you haven't mentioned is how often you add new docs. Is it once a day? Steadily from 8:00 to 17:00?

Alas, it's a steady trickle during business hours. We're ingesting court documents as they're posted on court websites, then sending alerts as soon as possible.

> Whatever, your soft commit really should be longer than your autowarm interval. Configure autowarming to reference queries (firstSearcher or newSearcher events, or autowarm counts in queryResultCache and filterCache. Say 16 in each of these latter for a start) such that they cause the external file to load. That _should_ prevent any queries from being blocked, since the autowarming will happen in the background, and while it's happening incoming queries will be served by the old searcher.

I want to make sure I understand this properly and document this for future people that may find this thread. Here's what I interpret your advice to be:

0. Slacken my auto soft commit interval to something more like a minute.

1. Set up a query in the newSearcher listener that uses my external file field.
1a. Do the same in firstSearcher if I want newly started Solr to warm up before getting queries (this doesn't matter to me, so I'm skipping this).

and/or

2. Set autowarmCount in queryResultCache and filterCache to 16 so that the top 16 query results from the previous searcher are regenerated in the new searcher.

Doing #1 seems like a safe strategy since it's guaranteed to hit the external file field. #2 feels like a bonus.

I'm a bit confused about the example autowarmCount for the caches, which is 0. Why not set this to something higher? I guess it's a RAM utilization vs. speed tradeoff? A low number like 16 seems like it'd have minimal impact on RAM?

Thanks for all the great replies and for everything you do for Solr. I truly appreciate your efforts.

Mike
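Step 0 above would land in the `<updateHandler>` section of solrconfig.xml. A sketch with illustrative intervals (pick the longest your alerting latency can tolerate):

```xml
<!-- solrconfig.xml: illustrative commit settings. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit for durability; does NOT open a new searcher. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit controls visibility of new docs: one minute
       rather than one second, so searchers open less often. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```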
Re: Real Time Search and External File Fields
On Sat, Oct 8, 2016 at 8:46 AM Shawn Heisey wrote:
>> Most soft commit documentation talks about setting up soft commits with maxTime of about a second.
>
> IMHO any documentation that recommends autoSoftCommit with a maxTime of one second is bad documentation, and needs to be fixed. Where have you seen such a recommendation?

You know, I must have made that up, sorry. But the documentation you linked to (on the Lucidworks blog) and the example file say 15 seconds for hard commits, so I think that got me thinking that soft commits could be more frequent. Should soft commits be less frequent than hard commits (openSearcher=false)? If so, I didn't find that to be at all clear.

> right now Solr/Lucene has no way of knowing that your external file has not changed, so it must read the file every time it builds a searcher.

Is it crazy to file a feature request asking that Solr/Lucene keep the modtime of this file and only reload it if it has changed? Seems like an easy win.

> I doubt this feature was designed to deal well with an extremely large external file like yours.

Perhaps not. It's probably worth mentioning that part of the reason the file is so large is because pagerank uses very small and accurate floats. So a typical line is:

1=9.50539603222e-08

Not something smaller like:

1=3.2

Pagerank also provides a value for every item in the index, so that makes the file long. I'd suspect that anybody with a pagerank-boosted index of moderate size would have a similarly-sized file.

> If the info changes that infrequently, can you just incorporate it directly into the index with a standard field, with the info coming in as a part of your normal indexing process?

We've considered that, but whenever you re-run pagerank, it updates EVERY value. So I guess we could try updating every doc in our index whenever we run pagerank, but that's a nasty solution.

> It seems unlikely that Solr would stop serving queries while setting up a new searcher. The old searcher should continue to serve requests until the new searcher is ready. If this is happening, that definitely seems like a bug.

I'm positive I've observed this, though you're right, some queries still seem to come through. Is it possible that queries relying on the field are stopped while the field is loading? I've observed this two ways:

1. From the front end, things were stalling every time I was doing a hard commit (openSearcher=true). I had hard commits coming in every ten minutes via cron job, and sure enough, at ten, twenty, thirty... minutes after every hour, I'd see stalls.

2. Watching the logs, I saw a flood of queries come through after the line:

Loaded external value source external_pagerank

Some queries were coming through before this line, but I think none of those queries use the external file field (external_pagerank).

Mike
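As an aside on the file-size point: if full float precision isn't actually needed for ranking, shortening the value representation can shrink the external file considerably (at the cost of some ranking resolution). A hypothetical sketch, not anything Solr provides; file paths and the digit count are arbitrary:

```python
def compact_external_file(src_path: str, dst_path: str, sig_digits: int = 4) -> None:
    """Rewrite an external file field file (key=float lines), rounding
    each value to a few significant digits to reduce file size."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            key, _, value = line.rstrip("\n").partition("=")
            # "%.4g" keeps 4 significant digits:
            # 9.50539603222e-08 -> 9.505e-08
            dst.write(f"{key}={float(value):.{sig_digits}g}\n")
```

Run over a 100+ MB pagerank file, this roughly halves each line while preserving the relative ordering of all but near-equal scores.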
Re: Real Time Search and External File Fields
On 10/7/2016 6:19 PM, Mike Lissner wrote:
> Soft commits seem to be exactly the thing for this, but whenever I open a new searcher (which soft commits seem to do), the external file is reloaded, and all queries are halted until it finishes loading. When I just measured, this took about 30 seconds to complete. Most soft commit documentation talks about setting up soft commits with maxTime of about a second.

IMHO any documentation that recommends autoSoftCommit with a maxTime of one second is bad documentation, and needs to be fixed. Where have you seen such a recommendation?

Unless the index is extremely small and has been thoroughly optimized for NRT (which usually means *no* autowarming), achieving commit times of less than one second is usually not possible.

This is the page that usually comes out when people start talking about commits:

http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

On the topic of one-second commit latency, that page has this to say:

"Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the /user/ is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free."

The kind of intervals for autoCommit and autoSoftCommit that I like to see is at LEAST one minute, and preferably longer if you can stand it.

> Is there anything I can do to make the external file field not get reloaded constantly? It only changes about once a month, and I want to use soft commits to power the alerts feature.

Anytime you want changes to show up in your index, you need a new searcher. When you're using an external file field, part of that info will come from that external source, and right now Solr/Lucene has no way of knowing that your external file has not changed, so it must read the file every time it builds a searcher.

I doubt this feature was designed to deal well with an extremely large external file like yours. The code looks like it goes line by line reading the file, and although I'm sure that process has been optimized as far as it can be, it still takes a lot of time when there are millions of lines.

If the info changes that infrequently, can you just incorporate it directly into the index with a standard field, with the info coming in as a part of your normal indexing process? I'm sure the performance would be MUCH better if Solr didn't have to reference the external file.

It seems unlikely that Solr would stop serving queries while setting up a new searcher. The old searcher should continue to serve requests until the new searcher is ready. If this is happening, that definitely seems like a bug.

Thanks,
Shawn
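For reference, an external file field setup typically looks something like the sketch below. The field and type names are placeholders, and the exact `valType` spelling varies by Solr version; the reload listeners are what tie the file read to searcher events, as discussed above:

```xml
<!-- schema.xml: the values live in a file named
     external_<fieldname> in the index data directory,
     one key=value line per document. -->
<fieldType name="pagerankFile" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="external_pagerank" type="pagerankFile"/>

<!-- solrconfig.xml: reload external file fields eagerly when a
     searcher opens, rather than lazily on first use. -->
<listener event="newSearcher"
          class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher"
          class="org.apache.solr.schema.ExternalFileFieldReloader"/>
```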
Re: solr 5 leaving tomcat, will I be the only one fearing about this?
On 10/7/2016 5:13 PM, Renee Sun wrote:
> I just read through the following link Shawn shared in his reply:
> https://wiki.apache.org/solr/WhyNoWar
>
> While the following statement is true:
>
> "Supporting a single set of binary bits is FAR easier than worrying about what kind of customized environment the user has chosen for their deployment."

One of the historical challenges on this mailing list is that we were rarely aware of what steps the user had taken to install or start Solr, and we had to support pretty much any scenario. Since 5.0, the number of supported ways to deploy and start Solr is greatly reduced, and those ways were written by the project, so we tend to have a better understanding of what is happening when a user starts Solr. We also usually know the relative location of the logfiles and Solr's data.

> But it also probably will reduce the flexibility... for example, we tune for scalability at the tomcat level, such as its thread pool etc. I assume the standalone Solr (which still uses Jetty underneath) would expose sufficient configurable 'knobs' that allow me to tune Solr to meet our data workload.

Yes, Jetty's full configuration would be at your disposal. The configuration which tends to matter the most -- maxThreads -- has been pre-tuned to a very large value. The other settings have been set to values that work well for the vast majority of Solr installs.

> If we want to minimize the migration work, our existing business logic component will remain in tomcat, so the fact that we will have jetty and tomcat coexisting in the production system is a bit strange... or is it?
>
> Even if I could port our webapps to use Jetty, I assume that given the way Solr embeds Jetty, I wouldn't be able to integrate at that level, so I'd probably end up with 2 Jetty container instances running on the same server, correct? It is still too early for me to be sure how this will impact our system, but I am a little worried.

You can take the server/solr-webapp/webapp directory from the download in 5.3 and later, which contains the same information that used to be in the .war file, and install it as a webapp into Tomcat. If you use something other than "/solr" as the context URL, then the admin UI in 6.0 and later won't work right, but Solr itself (the HTTP API) should work just fine no matter what the context URL is.

I don't have precise information about how to do this kind of install, but I know it CAN be done. The Tomcat documentation refers to this as an "exploded web application", and I think you can deploy such an application either directly by copying it into the correct directory, or with a context file. In the Jetty that comes with Solr, the exploded webapp is deployed with a context file.

Unless they use Tomcat-specific code, you might also be able to take your other webapps and install them into the Jetty that comes with Solr. The Jetty bits that are included with Solr are identical to a Jetty package downloaded from eclipse.org, with a few components removed that Solr doesn't use.

I wrote the "WhyNoWar" wiki page, so any errors that it contains are most likely mine. I'm not aware of any errors, but it's possible that some are there.

Thanks,
Shawn
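For the context-file route, a Tomcat descriptor might look roughly like this. All paths are placeholders, and this is an untested sketch -- as noted above, this kind of install isn't precisely documented:

```xml
<!-- $CATALINA_BASE/conf/Catalina/localhost/solr.xml
     Deploys the exploded Solr webapp at the /solr context
     (the filename determines the context URL). -->
<Context docBase="/opt/solr/server/solr-webapp/webapp"
         reloadable="false">
  <!-- Tell Solr where its home (solr.xml, core data) lives. -->
  <Environment name="solr/home" type="java.lang.String"
               value="/var/solr/data" override="true"/>
</Context>
```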