Martin,

It's still not clear to me whether you solved the problem completely or only partially. Does reducing the number of cores free up some resources for searching during a commit? Does committing the cores one by one prevent the "freeze"?
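A minimal sketch of what "committing one by one" would look like from the client side, assuming the per-core URL form solr/coreN/update?commit=true mentioned further down this thread. The core names and the helper name are hypothetical. Note that further down a commit sent to a single core was observed to be distributed to all cores, so this only shows the request shape, not a confirmed fix.

```python
from urllib.parse import urlencode


def per_core_commit_urls(base_url, core_names):
    """Build one explicit-commit URL per core, in the order given."""
    query = urlencode({"commit": "true"})
    base = base_url.rstrip("/")
    return [f"{base}/{core}/update?{query}" for core in core_names]


# Hypothetical core names; the thread only speaks of "coreN".
cores = [f"core{n}" for n in range(3)]
urls = per_core_commit_urls("http://localhost:8080/solr", cores)
for url in urls:
    print(url)
```

Each URL would be requested in turn (for example with curl), waiting for one commit to finish before sending the next.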
Thanks

On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:
> Mikhail
>
> To avoid freezes we deployed the patches that are now on the 4.1 trunk (bug 3985). But this wasn't good enough, because SOLR would still take very long to restart when that was necessary.
>
> I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more.
>
> IMO it would be ideal if the lucene/solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the lucene/solr internals. If it would make things easier, a good first step would be to have dynamically updateable numerical fields only.
>
> /Martin
>
> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> > Martin,
> >
> > I don't think solrconfig.xml would shed any light on this. I've just found what I didn't get in your setup: the way to explicitly assign a core to a collection. Now I've understood most of the details after all!
> >
> > The ball is in your court: let us know whether you have managed to get your cores to commit one by one to avoid the freeze, or whether you could eliminate the pauses by allocating more hardware.
> >
> > Thanks in advance!
> >
> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
> > > Mikhail,
> > >
> > > PSB
> > >
> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:
> > > > > I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.
> > > > You should see something like this in the log:
> > > > ... SolrCmdDistributor .... Distrib commit to: ...
> > > Yup, a commit towards a single core results in a commit on all cores.
> > > > > Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.
> > > > I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?
> > > We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments:
> > >
> > > -DnumShards=16
> > > -DzkHost=<zookeeperhost:port>
> > > -Xmx10G
> > > -Xms10G
> > > -Xmn2G
> > > -server
> > >
> > > Would you like to see the solrconfig.xml?
> > >
> > > /Martin
> > > > > > Also, from my POV such deployments should start from at least *16* 4-way vboxes; it's more expensive, but availability should be much better during CPU-consuming operations.
> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?
> > > > I prefer to start from 16 hosts with 4 cores each.
> > > > > > Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G.
> > > > > > Are you sure that the total size of the cores' index directories is less than 45G?
> > > > > The total index size is 230 GB, so it won't fit in ram, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a ram disk, but this didn't have a measurable effect.
> > > > >
> > > > > Thanks,
> > > > > /Martin
> > > > > > Thanks
> > > > > >
> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
> > > > > > > Mikhail
> > > > > > >
> > > > > > > PSB
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> > > > > > > > Martin,
> > > > > > > >
> > > > > > > > Please find additional questions from me below.
> > > > > > > >
> > > > > > > > Simone,
> > > > > > > >
> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to learning how it works with huge files in production.
> > > > > > > >
> > > > > > > > Thank You, Guys!
> > > > > > > >
> > > > > > > > On 20.11.2012 at 18:06, "Martin Koch" <m...@issuu.com> wrote:
> > > > > > > > > Hi Mikhail
> > > > > > > > >
> > > > > > > > > Please see answers below.
> > > > > > > > >
> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
> > > > > > > > > > Martin,
> > > > > > > > > >
> > > > > > > > > > Thank you for telling your own "war-story". It's really useful for the community.
> > > > > > > > > > The first question might not seem really conscious, but would you tell me what blocks searching during an EFF reload, when it's triggered by a handler or by a listener?
> > > > > > > > > We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force a reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).
> > > > > > > > Is there a chance to get a thread dump when they are blocked?
> > > > > > > Well, I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.
> > > > > > > > > > I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it
> > > > > > > > > Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.
> > > > > > > > Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
> > > > > > > >
> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?
> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a solr core.
> > > > > > > > Also, about the symlink - don't you share that file via some NFS?
> > > > > > > No, we generate the EFF on the local solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.
> > > > > > > > How many cores do you run per box?
> > > > > > > This box is a 16-virtual core (8 hyperthreaded cores) with 60GB of RAM. We run 16 solr cores on this box in Jetty.
> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?
> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> > > > > > > > I assume you use 64 bit linux and mmap directory. Please confirm that.
> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or where that would be configured in solr - can you explain that?
> > > > > > > > > > cause a scalability problem or a long time to reload? Would it help if we had, let's say, an ExternalDatabaseField which pulls values from jdbc, i.e.
> > > > > > > > > I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra).
> > > > > > > > Ok. Let's keep it in mind as a possible direction.
> > > > > > > Alternatively, an API that would allow updating a single field for a document might be an option.
> > > > > > > > > > why can't all cores read these values simultaneously?
> > > > > > > > > Again, this is a solr implementation detail that I can't answer :)
> > > > > > > > > > Can you confirm that the IDs in the file are ordered by the index term order?
> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > > > > > AFAIK it can impact load time.
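To make the ordering point concrete, here is a small sketch of sorting EFF lines by key rather than by whole line. It assumes a key=value line layout and that the index orders string terms by their UTF-8 bytes; both are assumptions for illustration, and sort_eff_lines is a hypothetical helper, not part of Solr.

```python
def sort_eff_lines(lines):
    """Sort 'key=value' lines by the UTF-8 bytes of the key only."""
    return sorted(lines, key=lambda line: line.rpartition("=")[0].encode("utf-8"))


# Made-up keys chosen so that key order and whole-line order disagree.
lines = ["b=1", "a0=2", "a=3"]
by_key = sort_eff_lines(lines)
by_whole_line = sorted(lines)
print(by_key)
print(by_whole_line)
```

A byte-wise sort of whole lines (for example UNIX sort under LC_ALL=C) gives the same order as the key sort whenever all keys have the same length, as with fixed-length IDs; the two can differ when one key is a prefix of another.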
> > > > > > > > > Yes, it does
> > > > > > > > Ok, I've got that you are aware of it, and your IDs are just strings, not integers.
> > > > > > > Yes, ids are strings.
> > > > > > > > > > Regarding your post-query solution, can you tell me: if a query found 10000 docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by the rank?
> > > > > > > > > In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end.
> > > > > > > > >
> > > > > > > > > In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the api to augment only the results that needed to be returned to the front-end.
> > > > > > > > >
> > > > > > > > > Even so, patching up a JSON structure with 10000 results should be possible.
> > > > > > > > You are right. I'm concerned anyway, because retrieving the whole result is expensive, and not always possible.
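The API-side augmentation Martin describes can be sketched roughly as follows: only the documents on the requested page get the extra field attached. The function name, the reads field and the in-memory dictionary standing in for the external store are all hypothetical.

```python
def augment_page(ranked_docs, read_counts, start=0, rows=10, id_field="id"):
    """Attach the current read count to the documents on the requested page only."""
    page = ranked_docs[start:start + rows]
    return [dict(doc, reads=read_counts.get(doc[id_field], 0)) for doc in page]


# Stand-ins for a ranked search response and an external popularity store.
docs = [{"id": f"doc{n}"} for n in range(25)]
store = {"doc10": 1000001, "doc11": 42}

page = augment_page(docs, store, start=10, rows=2)
print(page)
```

In a real deployment the paging would normally be done by the search request itself, so the API layer would only ever see the page it has to augment.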
> > > > > > > In our case, getting the whole result is almost impossible, because that would be millions of documents, and returning the Nth result seems to be a quadratic (or worse) operation in SOLR.
> > > > > > > > > > I'd really appreciate it if you comment on the questions above.
> > > > > > > > > > PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free ExternalFileField" help you?
> > > > > > > > > It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed?
> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file content, but after a commit you need to read the whole file and scan all key terms and postings. That's because the EFF sits on top of the top-level searcher; it's the Solr-like way. In some future we might have a per-segment EFF; in this case adding a segment will trigger a full file scan, but in the index only that new segment will be scanned. It should be faster. You know, straightforward sharing of internal data structures between different index views/generations is not possible. If you are asking about applying delta changes to the external file, that's something we did ourselves: http://goo.gl/P8GFq . This feature is much more doubtful and vague, although it might be the next contribution after SOLR-4085.
> > > > > > > > > /Martin
> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you what you're hoping for.
> > > > > > > > > > >
> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > > > >
> > > > > > > > > > > The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for.
> > > > > > > > > > >
> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated.
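As an illustration of the behaviour described above (every core scanning the whole file and keeping only its own IDs), here is a small sketch. It is not Solr's code; the key=value layout, the function name and the tiny data set are assumptions.

```python
def load_eff_for_core(eff_lines, core_ids):
    """Scan every 'key=value' line and keep only the keys this core owns."""
    values = {}
    for line in eff_lines:
        key, sep, raw = line.strip().rpartition("=")
        if sep and key in core_ids:
            values[key] = float(raw)
    return values


# Hypothetical three-line file and two cores.
eff_lines = ["doc1=10", "doc2=3.5", "doc3=7"]
cores = {"core0": {"doc1", "doc3"}, "core1": {"doc2"}}

per_core = {name: load_eff_for_core(eff_lines, ids) for name, ids in cores.items()}
lines_scanned = len(eff_lines) * len(cores)  # the whole file is read once per core
print(per_core, lines_scanned)
```

With this pattern the total scanning work grows with the number of cores times the file size, which is consistent with the multiplied reload time reported above when the cores take turns.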
> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45 character ID.
> > > > > > > > > > >
> > > > > > > > > > > We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see here <https://issues.apache.org/jira/browse/SOLR-3985> for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs.
> > > > > > > > > > >
> > > > > > > > > > > The next issue is that you cannot sort on EFF fields (though you can return them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1 branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > > > > > >
> > > > > > > > > > > So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 1000000 to 1000001 reads.
> > > > > > > > > > >
> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking.
> > > > > > > > > > > >
> > > > > > > > > > > > Does SolrCloud support external file fields?
How > does > > it > > > > > cope > > > > > > > with > > > > > > > > > > > > sharding and replication? Where should the external > > file > > > be > > > > > > > placed > > > > > > > > now > > > > > > > > > > > that > > > > > > > > > > > > the index folder is not local but in the cloud? > > > > > > > > > > > > > > > > > > > > > > > > Are there otherwise other best practices to deal with > > the > > > > use > > > > > > > cases > > > > > > > > > > > > external file fields were used for, like > > > > popularity/ranking, > > > > > in > > > > > > > > > > > SolrCloud? > > > > > > > > > > > > Custom ValueSources going to something external? > > > > > > > > > > > > > > > > > > > > > > > > Thanks in advance, > > > > > > > > > > > > Simone > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Sincerely yours > > > > > > > > > > Mikhail Khludnev > > > > > > > > > > Principal Engineer, > > > > > > > > > > Grid Dynamics > > > > > > > > > > > > > > > > > > > > <http://www.griddynamics.com> > > > > > > > > > > <mkhlud...@griddynamics.com> > > > > > > > > > > > > > > > > > > 20.11.2012 18:06 пользователь "Martin Koch" <m...@issuu.com> > > > > > написал: > > > > > > > > > > > > > > > > > Hi Mikhail > > > > > > > > > > > > > > > > > > Please see answers below. > > > > > > > > > > > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev < > > > > > > > > > mkhlud...@griddynamics.com> wrote: > > > > > > > > > > > > > > > > > > > Martin, > > > > > > > > > > > > > > > > > > > > Thank you for telling your own "war-story". It's really > > > useful > > > > > for > > > > > > > > > > community. 
> > > > > > > > > > The first question might seems not really conscious, but > > > would > > > > > you > > > > > > > tell > > > > > > > > > me > > > > > > > > > > what blocks searching during EFF reload, when it's > > triggered > > > by > > > > > > > handler > > > > > > > > > or > > > > > > > > > > by listener? > > > > > > > > > > > > > > > > > > > > > > > > > > > > We continuously index new documents using CommitWithin to > get > > > > > regular > > > > > > > > > commits. However, we observed that the EFFs were not > re-read, > > > so > > > > we > > > > > > had > > > > > > > > to > > > > > > > > > do external commits (curl '.../solr/update?commit=true') to > > > force > > > > > > > reload. > > > > > > > > > When this is done, solr blocks. I can't tell you exactly > why > > > it's > > > > > > doing > > > > > > > > > that (it was related to SOLR-3985). > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't really get the sentence about sequential commits > > and > > > > > number > > > > > > > of > > > > > > > > > > cores. Do I get right that file is replicated via > > Zookeeper? > > > > > > Doesn't > > > > > > > it > > > > > > > > > > > > > > > > > > > > > > > > > > > > Again, this is observed behavior. When we issue a commit > on a > > > > > system > > > > > > > > with a > > > > > > > > > system with many solr cores using EFFs, the system blocks > > for a > > > > > long > > > > > > > time > > > > > > > > > (15 minutes). We do NOT use zookeeper for anything. The > EFF > > > is a > > > > > > > symlink > > > > > > > > > from each cores index dir to the actual file, which is > > updated > > > by > > > > > an > > > > > > > > > external process. > > > > > > > > > > > > > > > > > > > > > > > > > > > > causes scalability problem or long time to reload? Will > it > > > help > > > > > if > > > > > > > > we'll > > > > > > > > > > have, let's say ExternalDatabaseField which will pull > > values > > > > from > > > > > > > jdbc. > > > > > > > > > ie. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the possibility of having some fields being > retrieved > > > > from > > > > > an > > > > > > > > > external, dynamically updatable store would be really > > > > interesting. > > > > > > This > > > > > > > > > could be JDBC, something in-memory like redis, or a NoSql > > > product > > > > > > (e.g. > > > > > > > > > Cassandra). > > > > > > > > > > > > > > > > > > > > > > > > > > > > why all cores can't read these values simultaneously? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Again, this is a solr implementation detail that I can't > > answer > > > > :) > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you confirm that IDs in the file is ordered by the > > index > > > > term > > > > > > > > order? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, we sorted the files (standard UNIX sort). > > > > > > > > > > > > > > > > > > > > > > > > > > > > AFAIK it can impact load time. > > > > > > > > > > > > > > > > > > > Yes, it does. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regarding your post-query solution can you tell me if > query > > > > found > > > > > > > 10000 > > > > > > > > > > docs, but I need to display only first page with 100 > rows, > > > > > whether > > > > > > I > > > > > > > > need > > > > > > > > > > to pull all 10K results to frontend to order them by the > > > rank? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > In our architecture, the clients query an API that > generates > > > the > > > > > SOLR > > > > > > > > > query, retrieves the relevant additional fields that we > > needs, > > > > and > > > > > > > > returns > > > > > > > > > the relevant JSON to the front-end. > > > > > > > > > > > > > > > > > > In our use case, results are returned from SOLR by the > 10's, > > > not > > > > by > > > > > > the > > > > > > > > > 1000's, so it is a manageable job. 
Even so, if solr > returned > > > > > > thousands > > > > > > > of > > > > > > > > > results, it would be up to the implementation of the api to > > > > augment > > > > > > > only > > > > > > > > > the results that needed to be returned to the front-end. > > > > > > > > > > > > > > > > > > Even so, patching up a JSON structure with 10000 results > > should > > > > be > > > > > > > > > possible. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm really appreciate if you comment on the questions > > above. > > > > > > > > > > PS: It's time to pitch, how much > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 > "Commit-free > > > > > > > > > > ExternalFileField" can help you? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It looks very interesting :) Does it make it possible to > > > avoid > > > > > > > > re-reading > > > > > > > > > the EFF on every commit, and only re-read the values that > > have > > > > > > actually > > > > > > > > > changed? > > > > > > > > > > > > > > > > > > /Martin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch < > > m...@issuu.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give > > you > > > > > what > > > > > > > > you're > > > > > > > > > > > hoping fore. > > > > > > > > > > > > > > > > > > > > > > We tried using Solr Cloud, and have given up again. > > > > > > > > > > > > > > > > > > > > > > The EFF is placed in the parent of the index directory > in > > > > each > > > > > > > core; > > > > > > > > > each > > > > > > > > > > > core reads the entire EFF and picks out the IDs that it > > is > > > > > > > > responsible > > > > > > > > > > for. > > > > > > > > > > > > > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks > > (doesn't > > > > > answer > > > > > > > > > > queries) > > > > > > > > > > > while re-reading the EFF. 
Even worse, it seems that the > > > time > > > > to > > > > > > > > re-read > > > > > > > > > > the > > > > > > > > > > > EFF is multiplied by the number of cores in use (i.e. > the > > > EFF > > > > > is > > > > > > > > > re-read > > > > > > > > > > by > > > > > > > > > > > each core sequentially). The contents of the EFF become > > > > active > > > > > > > after > > > > > > > > > the > > > > > > > > > > > first EXTERNAL commit (commitWithin does NOT work here) > > > after > > > > > the > > > > > > > > file > > > > > > > > > > has > > > > > > > > > > > been updated. > > > > > > > > > > > > > > > > > > > > > > In our case, the EFF was quite large - around 450MB - > and > > > we > > > > > use > > > > > > 16 > > > > > > > > > > shards, > > > > > > > > > > > so when we triggered an external commit to force > > > re-reading, > > > > > the > > > > > > > > whole > > > > > > > > > > > system would block for several (10-15) minutes. This > > won't > > > > work > > > > > > in > > > > > > > a > > > > > > > > > > > production environment. The reason for the size of the > > EFF > > > is > > > > > > that > > > > > > > we > > > > > > > > > > have > > > > > > > > > > > around 7M documents in the index; each document has a > 45 > > > > > > character > > > > > > > > ID. > > > > > > > > > > > > > > > > > > > > > > We got some help to try to fix the problem so that the > > > > re-read > > > > > of > > > > > > > the > > > > > > > > > EFF > > > > > > > > > > > proceeds in the background (see > > > > > > > > > > > here<https://issues.apache.org/jira/browse/SOLR-3985> > > for > > > > > > > > > > > a fix on the 4.1 branch). However, even though the > > re-read > > > > > > proceeds > > > > > > > > in > > > > > > > > > > the > > > > > > > > > > > background, the time required to launch solr now takes > at > > > > least > > > > > > as > > > > > > > > long > > > > > > > > > > as > > > > > > > > > > > re-reading the EFFs. Again, this is not good enough for > > our > > > > > > needs. 
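A quick back-of-the-envelope check of the figures quoted above, assuming 1 MB = 10^6 bytes and that the blocking time is spread evenly over sequential per-core re-reads (the thread suggests this but does not confirm it).

```python
eff_bytes = 450 * 10**6   # "around 450MB" (assuming decimal megabytes)
documents = 7 * 10**6     # "around 7M documents"
id_chars = 45             # "45 character ID"
cores = 16                # 16 shards, one core each

bytes_per_line = eff_bytes / documents
# What is left per line for the separator, the value and the newline.
non_id_bytes = bytes_per_line - id_chars

block_minutes = (10, 15)  # "several (10-15) minutes"
per_core_seconds = tuple(m * 60 / cores for m in block_minutes)

print(round(bytes_per_line, 1), round(non_id_bytes, 1), per_core_seconds)
```

So the file size is roughly what 7M lines of about 64 bytes would give, and under the even-split assumption each core would spend somewhere around 40 to 55 seconds on its own re-read.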
> > > > > > > > > > >
> > > > > > > > > > > The next issue is that you cannot sort on EFF fields (though you can return them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1 branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > > > > > >
> > > > > > > > > > > So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 1000000 to 1000001 reads.
> > > > > > > > > > >
> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
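The update rule in Martin's message can be sketched like this. As a simplification, the sketch applies the 2% threshold to the raw read count rather than to the resulting boost, and needs_reindex is a hypothetical name.

```python
def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """True when the read count has drifted by more than `threshold` from the indexed value."""
    if indexed_reads == 0:
        return current_reads != 0
    return abs(current_reads - indexed_reads) / indexed_reads > threshold


# (value currently in the index, latest value from the external counter)
samples = [(1000000, 1000001), (1000, 1030), (0, 5)]
for indexed, current in samples:
    print(indexed, current, needs_reindex(indexed, current))
```

The exact read count shown to the user would still come from the external counter at query time; the threshold only decides when the boost value stored in the index is stale enough to justify re-indexing the document.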
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking.
> > > > > > > > > > > >
> > > > > > > > > > > > Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud?
> > > > > > > > > > > >
> > > > > > > > > > > > Are there otherwise other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > > Simone
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Sincerely yours
> > > > > > > > > > Mikhail Khludnev
> > > > > > > > > > Principal Engineer,
> > > > > > > > > > Grid Dynamics
> > > > > > > > > >
> > > > > > > > > > <http://www.griddynamics.com>
> > > > > > > > > > <mkhlud...@griddynamics.com>
> > > > > >
> > > > > > --
> > > > > > Sincerely yours
> > > > > > Mikhail Khludnev
> > > > > > Principal Engineer,
> > > > > > Grid Dynamics
> > > > > >
> > > > > > <http://www.griddynamics.com>
> > > > > > <mkhlud...@griddynamics.com>
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> > > > Principal Engineer,
> > > > Grid Dynamics
> > > >
> > > > <http://www.griddynamics.com>
> > > > <mkhlud...@griddynamics.com>
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>