Keep in mind that the distrib update proc will be auto inserted into chains! You have to include a proc that disables it - see the FAQ: http://wiki.apache.org/solr/SolrCloud#FAQ
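[Editorial note: the workaround referenced above - a chain that swaps the distributing processor for a no-op so commits are not broadcast to all peers - would look roughly like this in solrconfig.xml. This is an untested sketch; the chain name is illustrative.]

```xml
<!-- solrconfig.xml sketch: a chain whose distributing processor is the
     no-op variant, so an update/commit sent to it is applied only to the
     receiving core instead of being forwarded to all peers. -->
<updateRequestProcessorChain name="nodistrib">
  <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A commit would then be aimed at one core with something like curl 'http://localhost:8080/solr/coreN/update?commit=true&update.chain=nodistrib'.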
- Mark

On Nov 28, 2012, at 7:25 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Martin,
> Right: as long as the node is in Zookeeper, DistributedUpdateProcessor will broadcast commits to all peers. To work around this you can introduce a dedicated UpdateProcessorChain without DistributedUpdateProcessor and send the commit to that chain.
>
> On 28.11.2012 13:16, "Martin Koch" <m...@issuu.com> wrote:
>
>> Mikhail
>>
>> I haven't experimented further yet. I think that the previous experiment of issuing a commit to a specific core proved that all cores get the commit, so I don't think that this approach will work.
>>
>> Thanks,
>> /Martin
>>
>> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>
>>> Martin,
>>>
>>> It's still not clear to me whether you solved the problem completely or only partially:
>>> Does reducing the number of cores free some resources for searching during a commit?
>>> Does committing the cores one by one prevent the "freeze"?
>>>
>>> Thanks
>>>
>>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:
>>>
>>>> Mikhail
>>>>
>>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk (SOLR-3985). But this wasn't good enough, because Solr would still take very long to restart when that was necessary.
>>>>
>>>> I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more.
>>>>
>>>> IMO it would be ideal if the Lucene/Solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the Lucene/Solr internals. If it would make things easier, a good first step would be to have dynamically updateable numerical fields only.
>>>>
>>>> /Martin
>>>>
>>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>
>>>>> Martin,
>>>>>
>>>>> I don't think solrconfig.xml sheds any light on it. I've just found what I didn't get about your setup - the way cores are explicitly assigned to the collection. Now I've got most of the details after all!
>>>>> The ball is on your side; let us know whether you have managed to commit your cores one by one to avoid the freeze, or whether you could eliminate the pauses by allocating more hardware.
>>>>> Thanks in advance!
>>>>>
>>>>> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
>>>>>
>>>>>> Mikhail,
>>>>>>
>>>>>> PSB (please see below)
>>>>>>
>>>>>> On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>>>
>>>>>>> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:
>>>>>>>
>>>>>>>> I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.
>>>>>>>
>>>>>>> You should see something like this in the log:
>>>>>>> ... SolrCmdDistributor ... Distrib commit to: ...
>>>>>>
>>>>>> Yup, a commit towards a single core results in a commit on all cores.
>>>>>>
>>>>>>>> Perhaps I should clarify that we are using Solr as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.
>>>>>>>
>>>>>>> I still don't understand how you deploy/launch Solr.
>>>>>>> How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?
>>>>>>
>>>>>> We let Solr do the sharding, using one collection with 16 Solr cores holding one shard each. We launch only one instance of jetty with the following arguments:
>>>>>>
>>>>>> -DnumShards=16
>>>>>> -DzkHost=<zookeeperhost:port>
>>>>>> -Xmx10G
>>>>>> -Xms10G
>>>>>> -Xmn2G
>>>>>> -server
>>>>>>
>>>>>> Would you like to see the solrconfig.xml?
>>>>>>
>>>>>> /Martin
>>>>>>
>>>>>>>>> Also, from my POV such deployments should start from at least *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.
>>>>>>>>
>>>>>>>> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?
>>>>>>>
>>>>>>> I prefer to start from 16 hosts with 4 cores each.
>>>>>>>
>>>>>>>>> Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough?
>>>>>>>>> You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?
>>>>>>>>
>>>>>>>> The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> /Martin
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
>>>>>>>>>
>>>>>>>>>> Mikhail
>>>>>>>>>>
>>>>>>>>>> PSB
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Martin,
>>>>>>>>>>>
>>>>>>>>>>> Please find additional questions from me below.
>>>>>>>>>>>
>>>>>>>>>>> Simone,
>>>>>>>>>>>
>>>>>>>>>>> I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under the Solr home. And I'm really looking forward to knowing how it works with huge files in production.
>>>>>>>>>>>
>>>>>>>>>>> Thank you, guys!
>>>>>>>>>>>
>>>>>>>>>>> On 20.11.2012 18:06, "Martin Koch" <m...@issuu.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Mikhail
>>>>>>>>>>>>
>>>>>>>>>>>> Please see answers below.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Martin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for telling your own "war story". It's really useful for the community.
>>>>>>>>>>>>> The first question might seem an odd one, but would you tell me what blocks searching during EFF reload, when it's triggered by a handler or by a listener?
>>>>>>>>>>>>
>>>>>>>>>>>> We continuously index new documents using commitWithin to get regular commits.
>>>>>>>>>>>> However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force a reload. When this is done, Solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).
>>>>>>>>>>>
>>>>>>>>>>> Is there a chance to get a thread dump when they are blocked?
>>>>>>>>>>
>>>>>>>>>> Well, I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.
>>>>>>>>>>
>>>>>>>>>>>>> I don't really get the sentence about sequential commits and the number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it
>>>>>>>>>>>>
>>>>>>>>>>>> Again, this is observed behavior. When we issue a commit on a system with many Solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use Zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.
>>>>>>>>>>>
>>>>>>>>>>> Hold on, I asked about Zookeeper because the subject mentions SolrCloud.
>>>>>>>>>>>
>>>>>>>>>>> Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?
>>>>>>>>>>
>>>>>>>>>> Ah - we use Solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a Solr core.
>>>>>>>>>>
>>>>>>>>>>> Also, about the symlink - don't you share that file via some NFS?
>>>>>>>>>>
>>>>>>>>>> No, we generate the EFF on the local Solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.
>>>>>>>>>>
>>>>>>>>>>> How many cores do you run per box?
>>>>>>>>>>
>>>>>>>>>> This box has 16 virtual cores (8 hyperthreaded cores) and 60GB of RAM. We run 16 Solr cores on this box in Jetty.
>>>>>>>>>>
>>>>>>>>>>> Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?
>>>>>>>>>>
>>>>>>>>>> Yes. We've allocated 10GB for jetty, and left the rest for the OS.
>>>>>>>>>>
>>>>>>>>>>> I assume you use 64-bit linux and mmap directory. Please confirm that.
>>>>>>>>>>
>>>>>>>>>> We use 64-bit linux. I'm not sure about the mmap directory or where that would be configured in Solr - can you explain that?
>>>>>>>>>>
>>>>>>>>>>>>> causes scalability problems or a long time to reload? Will it help if we have, let's say, an ExternalDatabaseField which will pull values from JDBC, i.e.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the possibility of having some fields retrieved from an external, dynamically updatable store would be really interesting.
>>>>>>>>>>>> This could be JDBC, something in-memory like Redis, or a NoSQL product (e.g. Cassandra).
>>>>>>>>>>>
>>>>>>>>>>> OK. Let's keep it in mind as a possible direction.
>>>>>>>>>>
>>>>>>>>>> Alternatively, an API that would allow updating a single field for a document might be an option.
>>>>>>>>>>
>>>>>>>>>>>>> Why can't all cores read these values simultaneously?
>>>>>>>>>>>>
>>>>>>>>>>>> Again, this is a Solr implementation detail that I can't answer :)
>>>>>>>>>>>>
>>>>>>>>>>>>> Can you confirm that the IDs in the file are ordered by the index term order?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort).
>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIK it can impact load time.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, it does.
>>>>>>>>>>>
>>>>>>>>>>> OK, I've got it that you are aware of it, and that your IDs are just strings, not integers.
>>>>>>>>>>
>>>>>>>>>> Yes, IDs are strings.
>>>>>>>>>>
>>>>>>>>>>>>> Regarding your post-query solution, can you tell me: if the query found 10000 docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by the rank?
>>>>>>>>>>>>
>>>>>>>>>>>> In our architecture, the clients query an API that generates the Solr query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end.
>>>>>>>>>>>>
>>>>>>>>>>>> In our use case, results are returned from Solr by the 10's, not by the 1000's, so it is a manageable job.
>>>>>>>>>>>> Even so, if Solr returned thousands of results, it would be up to the implementation of the API to augment only the results that needed to be returned to the front-end.
>>>>>>>>>>>>
>>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results should be possible.
>>>>>>>>>>>
>>>>>>>>>>> You are right. I'm concerned anyway, because retrieving the whole result is expensive, and not always possible.
>>>>>>>>>>
>>>>>>>>>> In our case, getting the whole result is almost impossible, because that would be millions of documents, and returning the Nth result seems to be a quadratic (or worse) operation in Solr.
>>>>>>>>>>
>>>>>>>>>>>>> I'd really appreciate it if you commented on the questions above.
>>>>>>>>>>>>> PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free ExternalFileField" help you?
>>>>>>>>>>>>
>>>>>>>>>>>> It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed?
>>>>>>>>>>>
>>>>>>>>>>> You don't need a commit (in SOLR-4085) to reload the file content, but after a commit you need to read the whole file and scan all key terms and postings. That's because the EFF sits on top of the top-level searcher; it's a Solr-like way. At some point in the future we might have a per-segment EFF; in that case adding a segment will still trigger a full file scan, but in the index only that new segment will be scanned. It should be faster.
>>>>>>>>>>> You know, straightforward sharing of internal data structures between different index views/generations is not possible. If you are asking about applying delta changes to the external file, that's something we did ourselves: http://goo.gl/P8GFq . This feature is much more doubtful and vague, although it might be the next contribution after SOLR-4085.
>>>>>>>>>>>
>>>>>>>>>>>> /Martin
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not give you what you're hoping for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We tried using SolrCloud, and have given up again.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the current 4.0.0 release of Solr, Solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated.
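[Editorial note: for readers unfamiliar with the feature, the EFF setup described in this thread would look roughly like this in schema.xml. The field and file names below are illustrative, not taken from ISSUU's actual configuration.]

```xml
<!-- schema.xml sketch: an ExternalFileField used for a popularity measure.
     "external_popularity", "reads", and keyField "id" are illustrative names. -->
<fieldType name="external_popularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="reads" type="external_popularity" indexed="false" stored="false"/>
```

The values live in a plain-text file named external_reads (for a field named "reads") in the data directory - the parent of index/, as described above - with one id=value pair per line; keeping the file sorted by key, as the thread notes, matters for load time.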
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45-character ID.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, launching Solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields (though you can return them as values using &fl=field(my_eff_field)). This is also fixed on the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022 .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not that great.
>>>>>>>>>>>>>> Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 1000000 to 1000001 reads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there otherwise other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>> Simone
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Sincerely yours
>>>>>>>>>>>>> Mikhail Khludnev
>>>>>>>>>>>>> Principal Engineer,
>>>>>>>>>>>>> Grid Dynamics
>>>>>>>>>>>>>
>>>>>>>>>>>>> <http://www.griddynamics.com>
>>>>>>>>>>>>> <mkhlud...@griddynamics.com>
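[Editorial note: Martin's post-query scheme described above - patch the authoritative popularity value into responses at query time, and re-index a document only when its count has drifted enough to matter for boosting - can be sketched as follows. Function and field names are illustrative, not from ISSUU's actual code.]

```python
def should_reindex(indexed_reads, current_reads, threshold=0.02):
    """Re-index a document only when its popularity has drifted enough
    (more than `threshold`, e.g. 2%) to change its ranking boost noticeably."""
    if indexed_reads == 0:
        return current_reads > 0
    return abs(current_reads - indexed_reads) / indexed_reads > threshold


def augment_results(solr_docs, read_counts):
    """Patch the authoritative read count into each result post-query,
    so users always see the current value even when the index is stale."""
    return [dict(doc, reads=read_counts.get(doc["id"], doc.get("reads", 0)))
            for doc in solr_docs]


# A move from 1,000,000 to 1,000,001 reads does not warrant a re-index...
print(should_reindex(1_000_000, 1_000_001))  # False
# ...but a 3% jump does.
print(should_reindex(1_000_000, 1_030_000))  # True
```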