Re: SolrCloud and external file fields

Mark Miller Wed, 28 Nov 2012 05:57:22 -0800

Keep in mind that the distributed update processor is automatically inserted
into update chains! To bypass it, you have to include a processor that
disables it - see the FAQ:
http://wiki.apache.org/solr/SolrCloud#FAQ
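
A minimal sketch of such a chain in solrconfig.xml, assuming the
solr.NoOpDistributingUpdateProcessorFactory class described in that FAQ is
available in your Solr version. The chain name "nodistrib" is only an
illustrative choice, not something from this thread:

```xml
<!-- Hypothetical chain name "nodistrib"; adjust to your configuration. -->
<updateRequestProcessorChain name="nodistrib">
  <!-- Placeholder processor: its presence keeps the distributed update
       processor from being auto-inserted into this chain. -->
  <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
  <!-- Applies the update or commit to the local core only. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A commit could then be addressed to one core through that chain with a
request along the lines of
.../solr/coreN/update?commit=true&update.chain=nodistrib, where coreN and the
chain name are placeholders.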


- Mark

On Nov 28, 2012, at 7:25 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> 
wrote:

> Martin,
> Right: as long as the node is registered in ZooKeeper,
> DistributedUpdateProcessor will broadcast commits to all peers. To work
> around this, you can introduce a dedicated UpdateProcessorChain without
> DistributedUpdateProcessor and send the commit to that chain.
> On 28.11.2012 at 13:16, "Martin Koch" <m...@issuu.com> wrote:
> 
>> Mikhail
>> 
>> I haven't experimented further yet. I think that the previous experiment
>> of issuing a commit to a specific core proved that all cores get the
>> commit, so I don't think that this approach will work.
>> 
>> Thanks,
>> /Martin
>> 
>> 
>> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>> 
>>> Martin,
>>> 
>>> It's still not clear to me whether you solved the problem completely or
>>> partially:
>>> Does reducing the number of cores free some resources for searching during
>>> commit?
>>> Does committing the cores one by one prevent the "freeze"?
>>> 
>>> Thanks
>>> 
>>> 
>>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:
>>> 
>>>> Mikhail
>>>> 
>>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
>>>> (bug 3985). But this wasn't good enough, because SOLR would still take
>>>> very long to restart when that was necessary.
>>>>
>>>> I don't see how we could throw more hardware at the problem without
>>>> making it worse, really - the only solution here would be *fewer*
>>>> shards, not more.
>>>>
>>>> IMO it would be ideal if the lucene/solr community could come up with a
>>>> good way of updating fields in a document without reindexing. This could
>>>> be by linking to some external data store, or in the lucene/solr
>>>> internals. If it would make things easier, a good first step would be to
>>>> have dynamically updateable numerical fields only.
>>>> 
>>>> /Martin
>>>> 
>>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
>>>> mkhlud...@griddynamics.com> wrote:
>>>> 
>>>>> Martin,
>>>>> 
>>>>> I don't think solrconfig.xml would shed any light on this. I've just
>>>>> found what I didn't get in your setup - how to explicitly assign a core
>>>>> to a collection. Now I understand most of the details after all!
>>>>> The ball is in your court: let us know whether you have managed to get
>>>>> your cores to commit one by one to avoid the freeze, or whether you could
>>>>> eliminate the pauses by allocating more hardware.
>>>>> Thanks in advance!
>>>>> 
>>>>> 
>>>>> On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
>>>>> 
>>>>>> Mikhail,
>>>>>> 
>>>>>> PSB
>>>>>> 
>>>>>> On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
>>>>>> mkhlud...@griddynamics.com> wrote:
>>>>>> 
>>>>>>> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I wasn't aware until now that it is possible to send a commit to one
>>>>>>>> core only. What we observed was the effect of curl
>>>>>>>> localhost:8080/solr/update?commit=true but perhaps we should
>>>>>>>> experiment with solr/coreN/update?commit=true. A quick trial run
>>>>>>>> seems to indicate that a commit to a single core causes commits on
>>>>>>>> all cores.
>>>>>>>>
>>>>>>> You should see something like this in the log:
>>>>>>> ... SolrCmdDistributor .... Distrib commit to: ...
>>>>>>>
>>>>>> Yup, a commit towards a single core results in a commit on all cores.
>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Perhaps I should clarify that we are using SOLR as a black box; we do
>>>>>>>> not touch the code at all - we only install the distribution WAR file
>>>>>>>> and proceed from there.
>>>>>>>>
>>>>>>> I still don't understand how you deploy/launch Solr. How many Jetty
>>>>>>> instances do you start? Do you use -DzkRun -DzkHost -DnumShards=2, or
>>>>>>> do you specify the shards= param for every request and distribute
>>>>>>> updates yourself? What collections do you create and with which
>>>>>>> settings?
>>>>>>>
>>>>>> We let SOLR do the sharding using one collection with 16 SOLR cores
>>>>>> holding one shard each. We launch only one instance of jetty with the
>>>>>> following arguments:
>>>>>> 
>>>>>> -DnumShards=16
>>>>>> -DzkHost=<zookeeperhost:port>
>>>>>> -Xmx10G
>>>>>> -Xms10G
>>>>>> -Xmn2G
>>>>>> -server
>>>>>> 
>>>>>> Would you like to see the solrconfig.xml?
>>>>>> 
>>>>>> /Martin
>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Also, from my POV, such deployments should start from at least *16*
>>>>>>>>> 4-way vboxes. That is more expensive, but they should stay much more
>>>>>>>>> available during cpu-consuming operations.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
>>>>>>>> with 16 cores? Or am I misunderstanding something :) ?
>>>>>>>>
>>>>>>> I prefer to start from 16 hosts with 4 cores each.
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Other details: if you use a single jetty for all of them, are you
>>>>>>>>> sure that jetty's threadpool doesn't limit requests? Is it large
>>>>>>>>> enough?
>>>>>>>>> You have 60G and set -Xmx10G. Are you sure that the total size of
>>>>>>>>> the cores' index directories is less than 45G?
>>>>>>>>>
>>>>>>>> The total index size is 230 GB, so it won't fit in ram, but we're
>>>>>>>> using an SSD disk to minimize disk access time. We have tried putting
>>>>>>>> the EFF onto a ram disk, but this didn't have a measurable effect.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> /Martin
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Mikhail
>>>>>>>>>> 
>>>>>>>>>> PSB
>>>>>>>>>> 
>>>>>>>>>> On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
>>>>>>>>>> mkhlud...@griddynamics.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Martin,
>>>>>>>>>>> 
>>>>>>>>>>> Please find additional question from me below.
>>>>>>>>>>> 
>>>>>>>>>>> Simone,
>>>>>>>>>>> 
>>>>>>>>>>> I'm sorry for hijacking your thread. The only thing I've heard
>>>>>>>>>>> about it at recent ApacheCon sessions is that Zookeeper is
>>>>>>>>>>> supposed to replicate those files as configs under solr home. And
>>>>>>>>>>> I'm really looking forward to learning how it works with huge
>>>>>>>>>>> files in production.
>>>>>>>>>>> 
>>>>>>>>>>> Thank You, Guys!
>>>>>>>>>>> 
>>>>>>>>>>> On 20.11.2012 at 18:06, "Martin Koch" <m...@issuu.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Mikhail
>>>>>>>>>>>> 
>>>>>>>>>>>> Please see answers below.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>>>>>>>>>>> mkhlud...@griddynamics.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Martin,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you for telling your own "war-story". It's really useful
>>>>>>>>>>>>> for the community.
>>>>>>>>>>>>> The first question might not seem very sensible, but would you
>>>>>>>>>>>>> tell me what blocks searching during EFF reload, when it's
>>>>>>>>>>>>> triggered by handler or by listener?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> We continuously index new documents using CommitWithin to get
>>>>>>>>>>>> regular commits. However, we observed that the EFFs were not
>>>>>>>>>>>> re-read, so we had to do external commits (curl
>>>>>>>>>>>> '.../solr/update?commit=true') to force reload. When this is
>>>>>>>>>>>> done, solr blocks. I can't tell you exactly why it's doing that
>>>>>>>>>>>> (it was related to SOLR-3985).
>>>>>>>>>>>
>>>>>>>>>>> Is there a chance to get a thread dump when they are blocked?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> Well I could try to recreate the situation. But the setup is fairly
>>>>>>>>>> simple: Create a large EFF in a largeish index with many shards.
>>>>>>>>>> Issue a commit, and then try to do a search. Solr will not respond
>>>>>>>>>> to the search before the commit has completed, and this will take a
>>>>>>>>>> long time.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't really get the sentence about sequential commits and the
>>>>>>>>>>>>> number of cores. Do I get it right that the file is replicated
>>>>>>>>>>>>> via Zookeeper? Doesn't it
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Again, this is observed behavior. When we issue a commit on a
>>>>>>>>>>>> system with many solr cores using EFFs, the system blocks for a
>>>>>>>>>>>> long time (15 minutes). We do NOT use zookeeper for anything. The
>>>>>>>>>>>> EFF is a symlink from each core's index dir to the actual file,
>>>>>>>>>>>> which is updated by an external process.
>>>>>>>>>>>
>>>>>>>>>>> Hold on, I asked about Zookeeper because the subject mentions
>>>>>>>>>>> SolrCloud.
>>>>>>>>>>>
>>>>>>>>>>> Do you use SolrCloud, SolrShards, or are these cores just replicas
>>>>>>>>>>> of the same index?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
>>>>>>>>>> I'm a bit unsure about the terminology here, but we've got a single
>>>>>>>>>> index divided into 16 shards. Each shard is hosted in a solr core.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Also, about the symlink - don't you share that file via some NFS?
>>>>>>>>>>>
>>>>>>>>>> No, we generate the EFF on the local solr host (there is only one
>>>>>>>>>> physical host that holds all shards), so there is no need for NFS or
>>>>>>>>>> copying files around. No need for Zookeeper either.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> How many cores do you run per box?
>>>>>>>>>>>
>>>>>>>>>> This box is a 16-virtual core (8 hyperthreaded cores) with 60GB of
>>>>>>>>>> RAM. We run 16 solr cores on this box in Jetty.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Do the boxes have plenty of RAM to cache the filesystem besides
>>>>>>>>>>> the JVM heaps?
>>>>>>>>>>>
>>>>>>>>>> Yes. We've allocated 10GB for jetty, and left the rest for the OS.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> I assume you use 64 bit linux and mmap directory. Please confirm
>>>>>>>>>>> that.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> We use 64-bit linux. I'm not sure about the mmap directory or where
>>>>>>>>>> that would be configured in solr - can you explain that?
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> cause a scalability problem or a long time to reload? Will it
>>>>>>>>>>>>> help if we have, let's say, an ExternalDatabaseField which will
>>>>>>>>>>>>> pull values from jdbc, i.e.
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the possibility of having some fields being retrieved
>>>>>>>>>>>> from an external, dynamically updatable store would be really
>>>>>>>>>>>> interesting. This could be JDBC, something in-memory like redis,
>>>>>>>>>>>> or a NoSql product (e.g. Cassandra).
>>>>>>>>>>>
>>>>>>>>>>> Ok. Let's have it in mind as a possible direction.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Alternatively, an API that would allow updating a single field for
>>>>>>>>>> a document might be an option.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> why all cores can't read these values simultaneously?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Again, this is a solr implementation detail that I can't answer :)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Can you confirm that the IDs in the file are ordered by the index
>>>>>>>>>>>>> term order?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, we sorted the files (standard UNIX sort).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIK it can impact load time.
>>>>>>>>>>>>>
>>>>>>>>>>>> Yes, it does
>>>>>>>>>>>
>>>>>>>>>>> Ok, I've got that you are aware of it, and that your IDs are just
>>>>>>>>>>> strings, not integers.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> Yes, ids are strings.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Regarding your post-query solution: if a query found 10000 docs
>>>>>>>>>>>>> but I need to display only the first page with 100 rows, can you
>>>>>>>>>>>>> tell me whether I need to pull all 10K results to the frontend
>>>>>>>>>>>>> to order them by rank?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> In our architecture, the clients query an API that generates the
>>>>>>>>>>>> SOLR query, retrieves the relevant additional fields that we need,
>>>>>>>>>>>> and returns the relevant JSON to the front-end.
>>>>>>>>>>>>
>>>>>>>>>>>> In our use case, results are returned from SOLR by the 10's, not
>>>>>>>>>>>> by the 1000's, so it is a manageable job. Even so, if solr
>>>>>>>>>>>> returned thousands of results, it would be up to the
>>>>>>>>>>>> implementation of the api to augment only the results that needed
>>>>>>>>>>>> to be returned to the front-end.
>>>>>>>>>>>>
>>>>>>>>>>>> Even so, patching up a JSON structure with 10000 results should be
>>>>>>>>>>>> possible.
>>>>>>>>>>>
>>>>>>>>>>> You are right. I'm concerned anyway because retrieving the whole
>>>>>>>>>>> result is expensive, and not always possible.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> In our case, getting the whole result is almost impossible, because
>>>>>>>>>> that would be millions of documents, and returning the Nth result
>>>>>>>>>> seems to be a quadratic (or worse) operation in SOLR.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd really appreciate it if you could comment on the questions
>>>>>>>>>>>>> above.
>>>>>>>>>>>>> PS: It's time to pitch: how much can
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
>>>>>>>>>>>>> ExternalFileField" help you?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> It looks very interesting :) Does it make it possible to avoid
>>>>>>>>>>>> re-reading the EFF on every commit, and only re-read the values
>>>>>>>>>>>> that have actually changed?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> You don't need a commit (in SOLR-4085) to reload the file content,
>>>>>>>>>>> but after a commit you need to read the whole file and scan all key
>>>>>>>>>>> terms and postings. That's because EFF sits on top of the top-level
>>>>>>>>>>> searcher; it's a Solr-like way. In some future we might have a
>>>>>>>>>>> per-segment EFF; in that case adding a segment will trigger a full
>>>>>>>>>>> file scan, but in the index only that new segment will be scanned.
>>>>>>>>>>> It should be faster. You know, straightforward sharing of internal
>>>>>>>>>>> data structures between different index views/generations is not
>>>>>>>>>>> possible. If you are asking about applying delta changes to the
>>>>>>>>>>> external file, that's something we did ourselves:
>>>>>>>>>>> http://goo.gl/P8GFq . This feature is much more doubtful and vague,
>>>>>>>>>>> although it might be the next contribution after SOLR-4085.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> /Martin
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Solr 4.0 does support using EFFs, but it might not give you what
>>>>>>>>>>>>>> you're hoping for.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We tried using Solr Cloud, and have given up again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The EFF is placed in the parent of the index directory in each
>>>>>>>>>>>>>> core; each core reads the entire EFF and picks out the IDs that
>>>>>>>>>>>>>> it is responsible for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the current 4.0.0 release of solr, solr blocks (doesn't
>>>>>>>>>>>>>> answer queries) while re-reading the EFF. Even worse, it seems
>>>>>>>>>>>>>> that the time to re-read the EFF is multiplied by the number of
>>>>>>>>>>>>>> cores in use (i.e. the EFF is re-read by each core
>>>>>>>>>>>>>> sequentially). The contents of the EFF become active after the
>>>>>>>>>>>>>> first EXTERNAL commit (commitWithin does NOT work here) after
>>>>>>>>>>>>>> the file has been updated.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In our case, the EFF was quite large - around 450MB - and we use
>>>>>>>>>>>>>> 16 shards, so when we triggered an external commit to force
>>>>>>>>>>>>>> re-reading, the whole system would block for several (10-15)
>>>>>>>>>>>>>> minutes. This won't work in a production environment. The reason
>>>>>>>>>>>>>> for the size of the EFF is that we have around 7M documents in
>>>>>>>>>>>>>> the index; each document has a 45 character ID.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We got some help to try to fix the problem so that the re-read
>>>>>>>>>>>>>> of the EFF proceeds in the background (see
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the
>>>>>>>>>>>>>> 4.1 branch). However, even though the re-read proceeds in the
>>>>>>>>>>>>>> background, launching solr now takes at least as long as
>>>>>>>>>>>>>> re-reading the EFFs. Again, this is not good enough for our
>>>>>>>>>>>>>> needs.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The next issue is that you cannot sort on EFF fields (though you
>>>>>>>>>>>>>> can return them as values using &fl=field(my_eff_field)). This
>>>>>>>>>>>>>> is also fixed in the 4.1 branch; see
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-4022 .
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So: Even after these fixes, EFF performance is not that great.
>>>>>>>>>>>>>> Our solution is as follows: The actual value of the popularity
>>>>>>>>>>>>>> measure (say, reads) that we want to report to the user is
>>>>>>>>>>>>>> inserted into the search response post-query by our query
>>>>>>>>>>>>>> front-end. This value will then be the authoritative value at
>>>>>>>>>>>>>> the time of the query. The value of the popularity measure that
>>>>>>>>>>>>>> we use for boosting in the ranking of the search results is only
>>>>>>>>>>>>>> updated when the value has changed enough so that the impact on
>>>>>>>>>>>>>> the boost will be significant (say, more than 2%). This does
>>>>>>>>>>>>>> require frequent re-indexing of the documents that have
>>>>>>>>>>>>>> significant changes in the number of reads, but at least we
>>>>>>>>>>>>>> won't have to update a document if it moves from, say, 1000000
>>>>>>>>>>>>>> to 1000001 reads.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> /Martin Koch - ISSUU - senior systems architect.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni
>>>>>>>>>>>>>> <simo...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> I'm planning to move a quite big Solr index to SolrCloud.
>>>>>>>>>>>>>>> However, in this index, an external file field is used for
>>>>>>>>>>>>>>> popularity ranking.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does SolrCloud support external file fields? How does it cope
>>>>>>>>>>>>>>> with sharding and replication? Where should the external file
>>>>>>>>>>>>>>> be placed now that the index folder is not local but in the
>>>>>>>>>>>>>>> cloud?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there otherwise other best practices to deal with the use
>>>>>>>>>>>>>>> cases external file fields were used for, like
>>>>>>>>>>>>>>> popularity/ranking, in SolrCloud? Custom ValueSources going to
>>>>>>>>>>>>>>> something external?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>> Simone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Sincerely yours
>>>>>>>>>>>>> Mikhail Khludnev
>>>>>>>>>>>>> Principal Engineer,
>>>>>>>>>>>>> Grid Dynamics
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <http://www.griddynamics.com>
>>>>>>>>>>>>> <mkhlud...@griddynamics.com>
>>>>>>>>>>>>> 
