Mark, your comment is quite valuable. Let me mention the keyword here so it can be found later: NoOpDistributingUpdateProcessorFactory. Thanks!
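For anyone landing here via that keyword, a minimal solrconfig.xml sketch of such a chain (the chain name "nodistrib" is a placeholder; the processor classes are the stock Solr 4.x ones):

```xml
<!-- An update chain with distributed update processing disabled.
     SolrCloud auto-inserts DistributedUpdateProcessor into every chain
     unless some DistributingUpdateProcessorFactory is already present,
     so the no-op factory below keeps commits local to the core. -->
<updateRequestProcessorChain name="nodistrib">
  <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A commit could then be aimed at a single core with something like curl 'http://localhost:8080/solr/core1/update?commit=true&update.chain=nodistrib'.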
On Wed, Nov 28, 2012 at 5:56 PM, Mark Miller <markrmil...@gmail.com> wrote: > Keep in mind that the distrib update proc will be auto inserted into > chains! You have to include a proc that disables it - see the FAQ: > http://wiki.apache.org/solr/SolrCloud#FAQ > > - Mark > > On Nov 28, 2012, at 7:25 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> > wrote: > > > Martin, > > Right as far node in Zookeeper DistributedUpdateProcessor will broadcast > > commits to all peers. To hack this you can introduce dedicated > > UpdateProcessorChain without DistributedUpdateProcessor and send commit > to > > that chain. > > 28.11.2012 13:16 пользователь "Martin Koch" <m...@issuu.com> написал: > > > >> Mikhail > >> > >> I haven't experimented further yet. I think that the previous experiment > >> of issuing a commit to a specific core proved that all cores get the > >> commit, so I don't think that this approach will work. > >> > >> Thanks, > >> /Martin > >> > >> > >> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev < > >> mkhlud...@griddynamics.com> wrote: > >> > >>> Martin, > >>> > >>> It's still not clear to me whether you solve the problem completely or > >>> partially: > >>> Does reducing number of cores free some resources for searching during > >>> commit? > >>> Does the commiting one-by-one core prevents the "freeze"? > >>> > >>> Thanks > >>> > >>> > >>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote: > >>> > >>>> Mikhail > >>>> > >>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk > >>>> (bug > >>>> 3985). But this wasn't good enough, because SOLR would still take very > >>>> long > >>>> to restart when that was necessary. > >>>> > >>>> I don't see how we could throw more hardware at the problem without > >>>> making > >>>> it worse, really - the only solution here would be *fewer* shards, not > >>>> > >>>> more. 
> >>>> > >>>> IMO it would be ideal if the lucene/solr community could come up with > a > >>>> good way of updating fields in a document without reindexing. This > could > >>>> be > >>>> by linking to some external data store, or in the lucene/solr > internals. > >>>> If > >>>> it would make things easier, a good first step would be to have > >>>> dynamically > >>>> updateable numerical fields only. > >>>> > >>>> /Martin > >>>> > >>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev < > >>>> mkhlud...@griddynamics.com> wrote: > >>>> > >>>>> Martin, > >>>>> > >>>>> I don't think solrconfig.xml shed any light on. I've just found what > I > >>>>> didn't get in your setup - the way of how to explicitly assigning > core > >>>> to > >>>>> collection. Now, I realized most of details after all! > >>>>> Ball is on your side, let us know whether you have managed your cores > >>>> to > >>>>> commit one by one to avoid freeze, or could you eliminate pauses by > >>>>> allocating more hardware? > >>>>> Thanks in advance! > >>>>> > >>>>> > >>>>> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote: > >>>>> > >>>>>> Mikhail, > >>>>>> > >>>>>> PSB > >>>>>> > >>>>>> On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev < > >>>>>> mkhlud...@griddynamics.com> wrote: > >>>>>> > >>>>>>> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> > >>>> wrote: > >>>>>>> > >>>>>>>> > >>>>>>>> I wasn't aware until now that it is possible to send a commit to > >>>> one > >>>>>> core > >>>>>>>> only. What we observed was the effect of curl > >>>>>>>> localhost:8080/solr/update?commit=true but perhaps we should > >>>>> experiment > >>>>>>>> with solr/coreN/update?commit=true. A quick trial run seems to > >>>>> indicate > >>>>>>>> that a commit to a single core causes commits on all cores. > >>>>>>>> > >>>>>>> You should see something like this in the log: > >>>>>>> ... SolrCmdDistributor .... Distrib commit to: ... 
> >>>>>>> > >>>>>>> Yup, a commit towards a single core results in a commit on all > >>>> cores. > >>>>>> > >>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> Perhaps I should clarify that we are using SOLR as a black box; > >>>> we do > >>>>>> not > >>>>>>>> touch the code at all - we only install the distribution WAR > >>>> file and > >>>>>>>> proceed from there. > >>>>>>>> > >>>>>>> I still don't understand how you deploy/launch Solr. How many > >>>> jettys > >>>>> you > >>>>>>> start whether you have -DzkRun -DzkHost -DnumShards=2 or you > >>>> specifies > >>>>>>> shards= param for every request and distributes updates yourself? > >>>> What > >>>>>>> collections do you create and with which settings? > >>>>>>> > >>>>>>> We let SOLR do the sharding using one collection with 16 SOLR cores > >>>>>> holding one shard each. We launch only one instance of jetty with > the > >>>>>> folllowing arguments: > >>>>>> > >>>>>> -DnumShards=16 > >>>>>> -DzkHost=<zookeeperhost:port> > >>>>>> -Xmx10G > >>>>>> -Xms10G > >>>>>> -Xmn2G > >>>>>> -server > >>>>>> > >>>>>> Would you like to see the solrconfig.xml? > >>>>>> > >>>>>> /Martin > >>>>>> > >>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Also from my POV such deployments should start at least from > >>>> *16* > >>>>>> 4-way > >>>>>>>>> vboxes, it's more expensive, but should be much better > >>>> available > >>>>>> during > >>>>>>>>> cpu-consuming operations. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 > >>>> hosts > >>>>>>> with > >>>>>>>> 16 cores? Or am I misunderstanding something :) ? > >>>>>>>> > >>>>>>> I prefer to start from 16 hosts with 4 cores each. > >>>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Other details, if you use single jetty for all of them, are you > >>>>> sure > >>>>>>> that > >>>>>>>>> jetty's threadpool doesn't limit requests? is it large enough? > >>>>>>>>> You have 60G and set -Xmx=10G. 
are you sure that total size of > >>>>> cores > >>>>>>>> index > >>>>>>>>> directories is less than 45G? > >>>>>>>>> > >>>>>>>>> The total index size is 230 GB, so it won't fit in ram, but > >>>> we're > >>>>>> using > >>>>>>>> an > >>>>>>>> SSD disk to minimize disk access time. We have tried putting the > >>>> EFF > >>>>>>> onto a > >>>>>>>> ram disk, but this didn't have a measurable effect. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> /Martin > >>>>>>>> > >>>>>>>> > >>>>>>>>> Thanks > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> > >>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Mikhail > >>>>>>>>>> > >>>>>>>>>> PSB > >>>>>>>>>> > >>>>>>>>>> On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev < > >>>>>>>>>> mkhlud...@griddynamics.com> wrote: > >>>>>>>>>> > >>>>>>>>>>> Martin, > >>>>>>>>>>> > >>>>>>>>>>> Please find additional question from me below. > >>>>>>>>>>> > >>>>>>>>>>> Simone, > >>>>>>>>>>> > >>>>>>>>>>> I'm sorry for hijacking your thread. The only what I've > >>>> heard > >>>>>> about > >>>>>>>> it > >>>>>>>>> at > >>>>>>>>>>> recent ApacheCon sessions is that Zookeeper is supposed to > >>>>>>> replicate > >>>>>>>>>> those > >>>>>>>>>>> files as configs under solr home. And I'm really looking > >>>>> forward > >>>>>> to > >>>>>>>>> know > >>>>>>>>>>> how it works with huge files in production. > >>>>>>>>>>> > >>>>>>>>>>> Thank You, Guys! > >>>>>>>>>>> > >>>>>>>>>>> 20.11.2012 18:06 пользователь "Martin Koch" <m...@issuu.com > >>>>> > >>>>>>> написал: > >>>>>>>>>>>> > >>>>>>>>>>>> Hi Mikhail > >>>>>>>>>>>> > >>>>>>>>>>>> Please see answers below. > >>>>>>>>>>>> > >>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev < > >>>>>>>>>>>> mkhlud...@griddynamics.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Martin, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thank you for telling your own "war-story". It's really > >>>>>> useful > >>>>>>>> for > >>>>>>>>>>>>> community. 
> >>>>>>>>>>>>> The first question might seems not really conscious, > >>>> but > >>>>>> would > >>>>>>>> you > >>>>>>>>>> tell > >>>>>>>>>>> me > >>>>>>>>>>>>> what blocks searching during EFF reload, when it's > >>>>> triggered > >>>>>> by > >>>>>>>>>> handler > >>>>>>>>>>> or > >>>>>>>>>>>>> by listener? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> We continuously index new documents using CommitWithin > >>>> to get > >>>>>>>> regular > >>>>>>>>>>>> commits. However, we observed that the EFFs were not > >>>> re-read, > >>>>>> so > >>>>>>> we > >>>>>>>>> had > >>>>>>>>>>> to > >>>>>>>>>>>> do external commits (curl '.../solr/update?commit=true') > >>>> to > >>>>>> force > >>>>>>>>>> reload. > >>>>>>>>>>>> When this is done, solr blocks. I can't tell you exactly > >>>> why > >>>>>> it's > >>>>>>>>> doing > >>>>>>>>>>>> that (it was related to SOLR-3985). > >>>>>>>>>>> > >>>>>>>>>>> Is there a chance to get a thread dump when they are > >>>> blocked? > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> Well I could try to recreate the situation. But the setup is > >>>>> fairly > >>>>>>>>> simple: > >>>>>>>>>> Create a large EFF in a largeish index with many shards. > >>>> Issue a > >>>>>>>> commit, > >>>>>>>>>> and then try to do a search. Solr will not respond to the > >>>> search > >>>>>>> before > >>>>>>>>> the > >>>>>>>>>> commit has completed, and this will take a long time. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> I don't really get the sentence about sequential > >>>> commits > >>>>> and > >>>>>>>> number > >>>>>>>>>> of > >>>>>>>>>>>>> cores. Do I get right that file is replicated via > >>>>> Zookeeper? > >>>>>>>>> Doesn't > >>>>>>>>>> it > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Again, this is observed behavior. 
When we issue a commit > >>>> on a > >>>>>>>> system > >>>>>>>>>> with > >>>>>>>>>>> a > >>>>>>>>>>>> system with many solr cores using EFFs, the system blocks > >>>>> for a > >>>>>>>> long > >>>>>>>>>> time > >>>>>>>>>>>> (15 minutes). We do NOT use zookeeper for anything. The > >>>> EFF > >>>>>> is a > >>>>>>>>>> symlink > >>>>>>>>>>>> from each cores index dir to the actual file, which is > >>>>> updated > >>>>>> by > >>>>>>>> an > >>>>>>>>>>>> external process. > >>>>>>>>>>> > >>>>>>>>>>> Hold on, I asked about Zookeeper because the subj mentions > >>>>>>> SolrCloud. > >>>>>>>>>>> > >>>>>>>>>>> Do you use SolrCloud, SolrShards, or these cores are just > >>>>>> replicas > >>>>>>> of > >>>>>>>>> the > >>>>>>>>>>> same index? > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Ah - we use solr 4 out of the box, so I guess this is > >>>> SolrCloud. > >>>>>> I'm > >>>>>>> a > >>>>>>>>> bit > >>>>>>>>>> unsure about the terminology here, but we've got a single > >>>> index > >>>>>>> divided > >>>>>>>>>> into 16 shard. Each shard is hosted in a solr core. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> Also, about simlink - Don't you share that file via some > >>>> NFS? > >>>>>>>>>>> > >>>>>>>>>>> No, we generate the EFF on the local solr host (there is > >>>> only > >>>>> one > >>>>>>>>>> physical > >>>>>>>>>> host that holds all shards), so there is no need for NFS or > >>>>> copying > >>>>>>>> files > >>>>>>>>>> around. No need for Zookeeper either. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> how many cores you run per box? > >>>>>>>>>>> > >>>>>>>>>> This box is a 16-virtual core (8 hyperthreaded cores) with > >>>> 60GB > >>>>> of > >>>>>>>> RAM. > >>>>>>>>> We > >>>>>>>>>> run 16 solr cores on this box in Jetty. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> Do boxes has plenty of ram to cache filesystem beside of > >>>> jvm > >>>>>> heaps? > >>>>>>>>>>> > >>>>>>>>>>> Yes. We've allocated 10GB for jetty, and left the rest for > >>>> the > >>>>>> OS. 
> >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> I assume you use 64 bit linux and mmap directory. Please > >>>>> confirm > >>>>>>>> that. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> We use 64-bit linux. I'm not sure about the mmap directory or > >>>>> where > >>>>>>>> that > >>>>>>>>>> would be configured in solr - can you explain that? > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> causes scalability problem or long time to reload? > >>>> Will it > >>>>>> help > >>>>>>>> if > >>>>>>>>>>> we'll > >>>>>>>>>>>>> have, let's say ExternalDatabaseField which will pull > >>>>> values > >>>>>>> from > >>>>>>>>>> jdbc. > >>>>>>>>>>> ie. > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> I think the possibility of having some fields being > >>>> retrieved > >>>>>>> from > >>>>>>>> an > >>>>>>>>>>>> external, dynamically updatable store would be really > >>>>>>> interesting. > >>>>>>>>> This > >>>>>>>>>>>> could be JDBC, something in-memory like redis, or a NoSql > >>>>>> product > >>>>>>>>> (e.g. > >>>>>>>>>>>> Cassandra). > >>>>>>>>>>> > >>>>>>>>>>> Ok. Let's have it in mind as a possible direction. > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Alternatively, an API that would allow updating a single > >>>> field > >>>>> for > >>>>>> a > >>>>>>>>>> document might be an option. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> why all cores can't read these values simultaneously? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Again, this is a solr implementation detail that I can't > >>>>> answer > >>>>>>> :) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> Can you confirm that IDs in the file is ordered by the > >>>>> index > >>>>>>> term > >>>>>>>>>>> order? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort). > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> AFAIK it can impact load time. 
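Since key order just came up: a sketch of keeping an external file in key order with standard UNIX sort (the file name external_popularity and the ids are invented; the id=value line format is the one ExternalFileField reads):

```shell
# Build a tiny example data file, one "id=value" line per document.
printf 'doc9=1.5\ndoc1=0.2\ndoc5=3.0\n' > external_popularity

# Sort byte-wise on the id part only (field 1 before '='), matching
# Lucene's binary term order for string ids.
LC_ALL=C sort -t'=' -k1,1 -o external_popularity external_popularity

cat external_popularity
```

After sorting, the smallest id comes first (doc1=0.2 here).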
> >>>>>>>>>>>>> > >>>>>>>>>>>> Yes, it does > >>>>>>>>>>> > >>>>>>>>>>> Ok, I've got that you aware of it, and your IDs are just > >>>>> strings, > >>>>>>> not > >>>>>>>>>>> integers. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> Yes, ids are strings. > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> Regarding your post-query solution can you tell me if > >>>> query > >>>>>>> found > >>>>>>>>>> 10000 > >>>>>>>>>>>>> docs, but I need to display only first page with 100 > >>>> rows, > >>>>>>>> whether > >>>>>>>>> I > >>>>>>>>>>> need > >>>>>>>>>>>>> to pull all 10K results to frontend to order them by > >>>> the > >>>>>> rank? > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> In our architecture, the clients query an API that > >>>> generates > >>>>>> the > >>>>>>>> SOLR > >>>>>>>>>>>> query, retrieves the relevant additional fields that we > >>>>> needs, > >>>>>>> and > >>>>>>>>>>> returns > >>>>>>>>>>>> the relevant JSON to the front-end. > >>>>>>>>>>>> > >>>>>>>>>>>> In our use case, results are returned from SOLR by the > >>>> 10's, > >>>>>> not > >>>>>>> by > >>>>>>>>> the > >>>>>>>>>>>> 1000's, so it is a manageable job. Even so, if solr > >>>> returned > >>>>>>>>> thousands > >>>>>>>>>> of > >>>>>>>>>>>> results, it would be up to the implementation of the api > >>>> to > >>>>>>> augment > >>>>>>>>>> only > >>>>>>>>>>>> the results that needed to be returned to the front-end. > >>>>>>>>>>>> > >>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results > >>>>> should > >>>>>>> be > >>>>>>>>>>>> possible. > >>>>>>>>>>> > >>>>>>>>>>> You are right. I'm concerned anyway because retrieving > >>>> whole > >>>>>> result > >>>>>>>> is > >>>>>>>>>>> expensive, and not always possible. 
> >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> In our case, getting the whole result is almost impossible, > >>>>> because > >>>>>>>> that > >>>>>>>>>> would be millions of documents, and returning the Nth result > >>>>> seems > >>>>>> to > >>>>>>>> be > >>>>>>>>> a > >>>>>>>>>> quadratic (or worse) operation in SOLR. > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> I'm really appreciate if you comment on the questions > >>>>> above. > >>>>>>>>>>>>> PS: It's time to pitch, how much > >>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085 > >>>> "Commit-free > >>>>>>>>>>>>> ExternalFileField" can help you? > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> It looks very interesting :) Does it make it possible > >>>> to > >>>>>> avoid > >>>>>>>>>>> re-reading > >>>>>>>>>>>> the EFF on every commit, and only re-read the values that > >>>>> have > >>>>>>>>> actually > >>>>>>>>>>>> changed? > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> You don't need commit (in SOLR-4085) to reload file > >>>> content, > >>>>> but > >>>>>>>> after > >>>>>>>>>>> commit you need to read whole file and scan all key terms > >>>> and > >>>>>>>> postings. > >>>>>>>>>>> That's because EFF sits on top of top level searcher. it's > >>>> a > >>>>>>>> Solr-like > >>>>>>>>>> way. > >>>>>>>>>>> In some future we might have per-segment EFF, in this case > >>>>>> adding a > >>>>>>>>>> segment > >>>>>>>>>>> will trigger full file scan, but in the index only that new > >>>>>> segment > >>>>>>>>> will > >>>>>>>>>> be > >>>>>>>>>>> scanned. It should be faster. You know, straightforward > >>>> sharing > >>>>>>>>> internal > >>>>>>>>>>> data structures between different index views/generations > >>>> is > >>>>> not > >>>>>>>>>> possible. > >>>>>>>>>>> If you are asking about applying delta changes on external > >>>> file > >>>>>>>> that's > >>>>>>>>>>> something what we did ourselves http://goo.gl/P8GFq . 
This > >>>>>> feature > >>>>>>>> is > >>>>>>>>>> much > >>>>>>>>>>> more doubtful and vague, although it might be the next > >>>>>> contribution > >>>>>>>>> after > >>>>>>>>>>> SOLR-4085. > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> /Martin > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch < > >>>>> m...@issuu.com> > >>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not > >>>> give > >>>>> you > >>>>>>>> what > >>>>>>>>>>> you're > >>>>>>>>>>>>>> hoping fore. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The EFF is placed in the parent of the index > >>>> directory in > >>>>>>> each > >>>>>>>>>> core; > >>>>>>>>>>> each > >>>>>>>>>>>>>> core reads the entire EFF and picks out the IDs that > >>>> it > >>>>> is > >>>>>>>>>>> responsible > >>>>>>>>>>>>> for. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks > >>>>> (doesn't > >>>>>>>> answer > >>>>>>>>>>>>> queries) > >>>>>>>>>>>>>> while re-reading the EFF. Even worse, it seems that > >>>> the > >>>>>> time > >>>>>>> to > >>>>>>>>>>> re-read > >>>>>>>>>>>>> the > >>>>>>>>>>>>>> EFF is multiplied by the number of cores in use > >>>> (i.e. the > >>>>>> EFF > >>>>>>>> is > >>>>>>>>>>> re-read > >>>>>>>>>>>>> by > >>>>>>>>>>>>>> each core sequentially). The contents of the EFF > >>>> become > >>>>>>> active > >>>>>>>>>> after > >>>>>>>>>>> the > >>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work > >>>> here) > >>>>>> after > >>>>>>>> the > >>>>>>>>>>> file > >>>>>>>>>>>>> has > >>>>>>>>>>>>>> been updated. 
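For readers following along: the field itself is declared in schema.xml and backed by a plain id=value text file in each core's data directory (the parent of index/, as described above). A minimal sketch — the names "popularity" and "externalPopularity" are placeholders; with this declaration the data file must be named external_popularity:

```xml
<!-- Values are read from a file, not from the index. keyField must be the
     uniqueKey field; defVal is used for ids missing from the file. -->
<fieldType name="externalPopularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>

<field name="popularity" type="externalPopularity" indexed="false" stored="false"/>
```

The values are usable in function queries (e.g. fl=field(popularity) or a boost function); as discussed in this thread, they only become visible after a commit reopens the searcher.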
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB > >>>> - and > >>>>>> we > >>>>>>>> use > >>>>>>>>> 16 > >>>>>>>>>>>>> shards, > >>>>>>>>>>>>>> so when we triggered an external commit to force > >>>>>> re-reading, > >>>>>>>> the > >>>>>>>>>>> whole > >>>>>>>>>>>>>> system would block for several (10-15) minutes. This > >>>>> won't > >>>>>>> work > >>>>>>>>> in > >>>>>>>>>> a > >>>>>>>>>>>>>> production environment. The reason for the size of > >>>> the > >>>>> EFF > >>>>>> is > >>>>>>>>> that > >>>>>>>>>> we > >>>>>>>>>>>>> have > >>>>>>>>>>>>>> around 7M documents in the index; each document has > >>>> a 45 > >>>>>>>>> character > >>>>>>>>>>> ID. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We got some help to try to fix the problem so that > >>>> the > >>>>>>> re-read > >>>>>>>> of > >>>>>>>>>> the > >>>>>>>>>>> EFF > >>>>>>>>>>>>>> proceeds in the background (see > >>>>>>>>>>>>>> here<https://issues.apache.org/jira/browse/SOLR-3985 > >>>>> > >>>>> for > >>>>>>>>>>>>>> a fix on the 4.1 branch). However, even though the > >>>>> re-read > >>>>>>>>> proceeds > >>>>>>>>>>> in > >>>>>>>>>>>>> the > >>>>>>>>>>>>>> background, the time required to launch solr now > >>>> takes at > >>>>>>> least > >>>>>>>>> as > >>>>>>>>>>> long > >>>>>>>>>>>>> as > >>>>>>>>>>>>>> re-reading the EFFs. Again, this is not good enough > >>>> for > >>>>> our > >>>>>>>>> needs. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields > >>>>>> (though > >>>>>>>> you > >>>>>>>>>> can > >>>>>>>>>>>>> return > >>>>>>>>>>>>>> them as values using &fl=field(my_eff_field). This is > >>>>> also > >>>>>>>> fixed > >>>>>>>>> in > >>>>>>>>>>> the > >>>>>>>>>>>>> 4.1 > >>>>>>>>>>>>>> branch here < > >>>>>> https://issues.apache.org/jira/browse/SOLR-4022 > >>>>>>>> . > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not > >>>> that > >>>>>>> great. 
> >>>>>>>>> Our > >>>>>>>>>>>>> solution > >>>>>>>>>>>>>> is as follows: The actual value of the popularity > >>>> measure > >>>>>>> (say, > >>>>>>>>>>> reads) > >>>>>>>>>>>>> that > >>>>>>>>>>>>>> we want to report to the user is inserted into the > >>>> search > >>>>>>>>> response > >>>>>>>>>>>>>> post-query by our query front-end. This value will > >>>> then > >>>>> be > >>>>>>> the > >>>>>>>>>>>>>> authoritative value at the time of the query. The > >>>> value > >>>>> of > >>>>>>> the > >>>>>>>>>>> popularity > >>>>>>>>>>>>>> measure that we use for boosting in the ranking of > >>>> the > >>>>>> search > >>>>>>>>>> results > >>>>>>>>>>> is > >>>>>>>>>>>>>> only updated when the value has changed enough so > >>>> that > >>>>> the > >>>>>>>> impact > >>>>>>>>>> on > >>>>>>>>>>> the > >>>>>>>>>>>>>> boost will be significant (say, more than 2%). This > >>>> does > >>>>>>>> require > >>>>>>>>>>> frequent > >>>>>>>>>>>>>> re-indexing of the documents that have significant > >>>>> changes > >>>>>> in > >>>>>>>> the > >>>>>>>>>>> number > >>>>>>>>>>>>> of > >>>>>>>>>>>>>> reads, but at least we won't have to update a > >>>> document if > >>>>>> it > >>>>>>>>> moves > >>>>>>>>>>> from, > >>>>>>>>>>>>>> say, 1000000 to 1000001 reads. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni < > >>>>>>>>> simo...@apache.org > >>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to > >>>>> SolrCloud. > >>>>>>>>>> However, > >>>>>>>>>>> in > >>>>>>>>>>>>>> this > >>>>>>>>>>>>>>> index, an external file field is used for > >>>> popularity > >>>>>>> ranking. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Does SolrCloud supports external file fields? How > >>>> does > >>>>> it > >>>>>>>> cope > >>>>>>>>>> with > >>>>>>>>>>>>>>> sharding and replication? 
Where should the external > >>>>> file > >>>>>> be > >>>>>>>>>> placed > >>>>>>>>>>> now > >>>>>>>>>>>>>> that > >>>>>>>>>>>>>>> the index folder is not local but in the cloud? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Are there otherwise other best practices to deal > >>>> with > >>>>> the > >>>>>>> use > >>>>>>>>>> cases > >>>>>>>>>>>>>>> external file fields were used for, like > >>>>>>> popularity/ranking, > >>>>>>>> in > >>>>>>>>>>>>>> SolrCloud? > >>>>>>>>>>>>>>> Custom ValueSources going to something external? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks in advance, > >>>>>>>>>>>>>>> Simone > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> -- > >>>>>>>>>>>>> Sincerely yours > >>>>>>>>>>>>> Mikhail Khludnev > >>>>>>>>>>>>> Principal Engineer, > >>>>>>>>>>>>> Grid Dynamics > >>>>>>>>>>>>> > >>>>>>>>>>>>> <http://www.griddynamics.com> > >>>>>>>>>>>>> <mkhlud...@griddynamics.com> > >>>>>>>>>>>>> > >>>>>>>>>>> 20.11.2012 18:06 пользователь "Martin Koch" < > >>>> m...@issuu.com> > >>>>>>>> написал: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Mikhail > >>>>>>>>>>>> > >>>>>>>>>>>> Please see answers below. > >>>>>>>>>>>> > >>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev < > >>>>>>>>>>>> mkhlud...@griddynamics.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Martin, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thank you for telling your own "war-story". It's really > >>>>>> useful > >>>>>>>> for > >>>>>>>>>>>>> community. > >>>>>>>>>>>>> The first question might seems not really conscious, > >>>> but > >>>>>> would > >>>>>>>> you > >>>>>>>>>> tell > >>>>>>>>>>>> me > >>>>>>>>>>>>> what blocks searching during EFF reload, when it's > >>>>> triggered > >>>>>> by > >>>>>>>>>> handler > >>>>>>>>>>>> or > >>>>>>>>>>>>> by listener? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> We continuously index new documents using CommitWithin > >>>> to get > >>>>>>>> regular > >>>>>>>>>>>> commits. 
However, we observed that the EFFs were not > >>>> re-read, > >>>>>> so > >>>>>>> we > >>>>>>>>> had > >>>>>>>>>>> to > >>>>>>>>>>>> do external commits (curl '.../solr/update?commit=true') > >>>> to > >>>>>> force > >>>>>>>>>> reload. > >>>>>>>>>>>> When this is done, solr blocks. I can't tell you exactly > >>>> why > >>>>>> it's > >>>>>>>>> doing > >>>>>>>>>>>> that (it was related to SOLR-3985). > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> I don't really get the sentence about sequential > >>>> commits > >>>>> and > >>>>>>>> number > >>>>>>>>>> of > >>>>>>>>>>>>> cores. Do I get right that file is replicated via > >>>>> Zookeeper? > >>>>>>>>> Doesn't > >>>>>>>>>> it > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Again, this is observed behavior. When we issue a commit > >>>> on a > >>>>>>>> system > >>>>>>>>>>> with a > >>>>>>>>>>>> system with many solr cores using EFFs, the system blocks > >>>>> for a > >>>>>>>> long > >>>>>>>>>> time > >>>>>>>>>>>> (15 minutes). We do NOT use zookeeper for anything. The > >>>> EFF > >>>>>> is a > >>>>>>>>>> symlink > >>>>>>>>>>>> from each cores index dir to the actual file, which is > >>>>> updated > >>>>>> by > >>>>>>>> an > >>>>>>>>>>>> external process. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> causes scalability problem or long time to reload? > >>>> Will it > >>>>>> help > >>>>>>>> if > >>>>>>>>>>> we'll > >>>>>>>>>>>>> have, let's say ExternalDatabaseField which will pull > >>>>> values > >>>>>>> from > >>>>>>>>>> jdbc. > >>>>>>>>>>>> ie. > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> I think the possibility of having some fields being > >>>> retrieved > >>>>>>> from > >>>>>>>> an > >>>>>>>>>>>> external, dynamically updatable store would be really > >>>>>>> interesting. > >>>>>>>>> This > >>>>>>>>>>>> could be JDBC, something in-memory like redis, or a NoSql > >>>>>> product > >>>>>>>>> (e.g. > >>>>>>>>>>>> Cassandra). 
> >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> why all cores can't read these values simultaneously? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Again, this is a solr implementation detail that I can't > >>>>> answer > >>>>>>> :) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> Can you confirm that IDs in the file is ordered by the > >>>>> index > >>>>>>> term > >>>>>>>>>>> order? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort). > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> AFAIK it can impact load time. > >>>>>>>>>>>>> > >>>>>>>>>>>> Yes, it does. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> Regarding your post-query solution can you tell me if > >>>> query > >>>>>>> found > >>>>>>>>>> 10000 > >>>>>>>>>>>>> docs, but I need to display only first page with 100 > >>>> rows, > >>>>>>>> whether > >>>>>>>>> I > >>>>>>>>>>> need > >>>>>>>>>>>>> to pull all 10K results to frontend to order them by > >>>> the > >>>>>> rank? > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> In our architecture, the clients query an API that > >>>> generates > >>>>>> the > >>>>>>>> SOLR > >>>>>>>>>>>> query, retrieves the relevant additional fields that we > >>>>> needs, > >>>>>>> and > >>>>>>>>>>> returns > >>>>>>>>>>>> the relevant JSON to the front-end. > >>>>>>>>>>>> > >>>>>>>>>>>> In our use case, results are returned from SOLR by the > >>>> 10's, > >>>>>> not > >>>>>>> by > >>>>>>>>> the > >>>>>>>>>>>> 1000's, so it is a manageable job. Even so, if solr > >>>> returned > >>>>>>>>> thousands > >>>>>>>>>> of > >>>>>>>>>>>> results, it would be up to the implementation of the api > >>>> to > >>>>>>> augment > >>>>>>>>>> only > >>>>>>>>>>>> the results that needed to be returned to the front-end. > >>>>>>>>>>>> > >>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results > >>>>> should > >>>>>>> be > >>>>>>>>>>>> possible. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> I'm really appreciate if you comment on the questions > >>>>> above. 
> >>>>>>>>>>>>> PS: It's time to pitch, how much > >>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085 > >>>> "Commit-free > >>>>>>>>>>>>> ExternalFileField" can help you? > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> It looks very interesting :) Does it make it possible > >>>> to > >>>>>> avoid > >>>>>>>>>>> re-reading > >>>>>>>>>>>> the EFF on every commit, and only re-read the values that > >>>>> have > >>>>>>>>> actually > >>>>>>>>>>>> changed? > >>>>>>>>>>>> > >>>>>>>>>>>> /Martin > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch < > >>>>> m...@issuu.com> > >>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not > >>>> give > >>>>> you > >>>>>>>> what > >>>>>>>>>>> you're > >>>>>>>>>>>>>> hoping fore. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The EFF is placed in the parent of the index > >>>> directory in > >>>>>>> each > >>>>>>>>>> core; > >>>>>>>>>>>> each > >>>>>>>>>>>>>> core reads the entire EFF and picks out the IDs that > >>>> it > >>>>> is > >>>>>>>>>>> responsible > >>>>>>>>>>>>> for. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks > >>>>> (doesn't > >>>>>>>> answer > >>>>>>>>>>>>> queries) > >>>>>>>>>>>>>> while re-reading the EFF. Even worse, it seems that > >>>> the > >>>>>> time > >>>>>>> to > >>>>>>>>>>> re-read > >>>>>>>>>>>>> the > >>>>>>>>>>>>>> EFF is multiplied by the number of cores in use > >>>> (i.e. the > >>>>>> EFF > >>>>>>>> is > >>>>>>>>>>>> re-read > >>>>>>>>>>>>> by > >>>>>>>>>>>>>> each core sequentially). The contents of the EFF > >>>> become > >>>>>>> active > >>>>>>>>>> after > >>>>>>>>>>>> the > >>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work > >>>> here) > >>>>>> after > >>>>>>>> the > >>>>>>>>>>> file > >>>>>>>>>>>>> has > >>>>>>>>>>>>>> been updated. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB > >>>> - and > >>>>>> we > >>>>>>>> use > >>>>>>>>> 16 > >>>>>>>>>>>>> shards, > >>>>>>>>>>>>>> so when we triggered an external commit to force > >>>>>> re-reading, > >>>>>>>> the > >>>>>>>>>>> whole > >>>>>>>>>>>>>> system would block for several (10-15) minutes. This > >>>>> won't > >>>>>>> work > >>>>>>>>> in > >>>>>>>>>> a > >>>>>>>>>>>>>> production environment. The reason for the size of > >>>> the > >>>>> EFF > >>>>>> is > >>>>>>>>> that > >>>>>>>>>> we > >>>>>>>>>>>>> have > >>>>>>>>>>>>>> around 7M documents in the index; each document has > >>>> a 45 > >>>>>>>>> character > >>>>>>>>>>> ID. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We got some help to try to fix the problem so that > >>>> the > >>>>>>> re-read > >>>>>>>> of > >>>>>>>>>> the > >>>>>>>>>>>> EFF > >>>>>>>>>>>>>> proceeds in the background (see > >>>>>>>>>>>>>> here<https://issues.apache.org/jira/browse/SOLR-3985 > >>>>> > >>>>> for > >>>>>>>>>>>>>> a fix on the 4.1 branch). However, even though the > >>>>> re-read > >>>>>>>>> proceeds > >>>>>>>>>>> in > >>>>>>>>>>>>> the > >>>>>>>>>>>>>> background, the time required to launch solr now > >>>> takes at > >>>>>>> least > >>>>>>>>> as > >>>>>>>>>>> long > >>>>>>>>>>>>> as > >>>>>>>>>>>>>> re-reading the EFFs. Again, this is not good enough > >>>> for > >>>>> our > >>>>>>>>> needs. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields > >>>>>> (though > >>>>>>>> you > >>>>>>>>>> can > >>>>>>>>>>>>> return > >>>>>>>>>>>>>> them as values using &fl=field(my_eff_field). This is > >>>>> also > >>>>>>>> fixed > >>>>>>>>> in > >>>>>>>>>>> the > >>>>>>>>>>>>> 4.1 > >>>>>>>>>>>>>> branch here < > >>>>>> https://issues.apache.org/jira/browse/SOLR-4022 > >>>>>>>> . > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not > >>>> that > >>>>>>> great. 
>>>> Our solution is as follows: the actual value of the popularity
>>>> measure (say, reads) that we want to report to the user is inserted
>>>> into the search response post-query by our query front-end. This
>>>> value will then be the authoritative value at the time of the query.
>>>> The value of the popularity measure that we use for boosting in the
>>>> ranking of the search results is only updated when the value has
>>>> changed enough that the impact on the boost will be significant
>>>> (say, more than 2%). This does require frequent re-indexing of the
>>>> documents that have significant changes in the number of reads, but
>>>> at least we won't have to update a document if it moves from, say,
>>>> 1000000 to 1000001 reads.
>>>>
>>>> /Martin Koch - ISSUU - senior systems architect.
>>>>
>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm planning to move a quite big Solr index to SolrCloud. However,
>>>>> in this index, an external file field is used for popularity
>>>>> ranking.
>>>>>
>>>>> Does SolrCloud support external file fields? How does it cope with
>>>>> sharding and replication?
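The "only reindex on significant change" rule Martin describes can be sketched as below; the function name and the 2% default are illustrative, not from the thread.

```python
def needs_reindex(old_reads: int, new_reads: int,
                  threshold: float = 0.02) -> bool:
    """Reindex a document only when its popularity has changed enough
    to matter for boosting (the ~2% rule described in the thread)."""
    if old_reads == 0:
        # Any first reads are a significant change from zero.
        return new_reads > 0
    return abs(new_reads - old_reads) / old_reads >= threshold

# 1,000,000 -> 1,000,001 reads: far below 2%, skip the update.
assert not needs_reindex(1_000_000, 1_000_001)
# 1,000,000 -> 1,050,000 reads: a 5% change, reindex.
assert needs_reindex(1_000_000, 1_050_000)
```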
>>>>> Where should the external file be placed now that the index
>>>>> folder is not local but in the cloud?
>>>>>
>>>>> Are there otherwise other best practices to deal with the use
>>>>> cases external file fields were used for, like popularity/ranking,
>>>>> in SolrCloud? Custom ValueSources going to something external?
>>>>>
>>>>> Thanks in advance,
>>>>> Simone

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>