Re: SolrCloud and external file fields

Mikhail Khludnev Wed, 28 Nov 2012 04:25:34 -0800

Martin,
Right: as long as the node is in ZooKeeper, DistributedUpdateProcessor will
broadcast commits to all peers. To hack around this, you can introduce a
dedicated UpdateProcessorChain without DistributedUpdateProcessor and send the
commit to that chain.
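
A minimal sketch of what such a request could look like, assuming a chain has been declared in solrconfig.xml under the hypothetical name "nodistrib". The host, port, and core name are also placeholders; update.chain is the request parameter Solr uses to select an update processor chain. Whether the commit really stays local depends on how that chain is defined, so this only illustrates the shape of the request:

```python
from urllib.parse import urlencode


def commit_url(base_url, core, chain):
    """Build a commit URL that targets one core and one named update chain.

    `chain` is the name of an updateRequestProcessorChain assumed to exist
    in that core's solrconfig.xml (hypothetical here).
    """
    query = urlencode({"commit": "true", "update.chain": chain})
    return f"{base_url}/{core}/update?{query}"


# Hypothetical host, core and chain names, for illustration only.
print(commit_url("http://localhost:8080/solr", "core1", "nodistrib"))
```

The resulting URL could then be passed to curl in the same way as the plain update?commit=true calls quoted below.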
On 28.11.2012 13:16, "Martin Koch" <m...@issuu.com> wrote:


> Mikhail
>
> I haven't experimented further yet. I think that the previous experiment
> of issuing a commit to a specific core proved that all cores get the
> commit, so I don't think that this approach will work.
>
> Thanks,
> /Martin
>
>
> On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> Martin,
>>
>> It's still not clear to me whether you solved the problem completely or
>> partially:
>> Does reducing the number of cores free some resources for searching during
>> commit?
>> Does committing the cores one by one prevent the "freeze"?
>>
>> Thanks
>>
>>
>> On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:
>>
>>> Mikhail
>>>
>>> To avoid freezes we deployed the patches that are now on the 4.1 trunk
>>> (bug 3985). But this wasn't good enough, because SOLR would still take
>>> very long to restart when that was necessary.
>>>
>>> I don't see how we could throw more hardware at the problem without
>>> making it worse, really - the only solution here would be *fewer* shards,
>>> not more.
>>>
>>> IMO it would be ideal if the lucene/solr community could come up with a
>>> good way of updating fields in a document without reindexing. This could
>>> be by linking to some external data store, or in the lucene/solr
>>> internals. If it would make things easier, a good first step would be to
>>> have dynamically updateable numerical fields only.
>>>
>>> /Martin
>>>
>>> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
>>> mkhlud...@griddynamics.com> wrote:
>>>
>>> > Martin,
>>> >
>>> > I don't think solrconfig.xml will shed any more light on this. I've just
>>> > found what I didn't get in your setup: how to explicitly assign a core
>>> > to a collection. Now I understand most of the details after all!
>>> > The ball is in your court: let us know whether you have managed to get
>>> > your cores to commit one by one to avoid the freeze, or whether you could
>>> > eliminate the pauses by allocating more hardware.
>>> > Thanks in advance!
>>> >
>>> >
>>> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
>>> >
>>> > > Mikhail,
>>> > >
>>> > > PSB
>>> > >
>>> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
>>> > > mkhlud...@griddynamics.com> wrote:
>>> > >
>>> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:
>>> > > >
>>> > > > >
>>> > > > > I wasn't aware until now that it is possible to send a commit to one
>>> > > > > core only. What we observed was the effect of curl
>>> > > > > localhost:8080/solr/update?commit=true but perhaps we should
>>> > > > > experiment with solr/coreN/update?commit=true. A quick trial run
>>> > > > > seems to indicate that a commit to a single core causes commits on
>>> > > > > all cores.
>>> > > > >
>>> > > > You should see something like this in the log:
>>> > > > ... SolrCmdDistributor .... Distrib commit to: ...
>>> > > >
>>> > > > Yup, a commit towards a single core results in a commit on all cores.
>>> > >
>>> > >
>>> > > > >
>>> > > > >
>>> > > > > Perhaps I should clarify that we are using SOLR as a black box; we
>>> > > > > do not touch the code at all - we only install the distribution WAR
>>> > > > > file and proceed from there.
>>> > > > >
>>> > > > I still don't understand how you deploy/launch Solr. How many Jettys
>>> > > > do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you
>>> > > > specify the shards= param for every request and distribute updates
>>> > > > yourself? What collections do you create and with which settings?
>>> > > >
>>> > > > We let SOLR do the sharding using one collection with 16 SOLR cores
>>> > > holding one shard each. We launch only one instance of jetty with the
>>> > > following arguments:
>>> > >
>>> > > -DnumShards=16
>>> > > -DzkHost=<zookeeperhost:port>
>>> > > -Xmx10G
>>> > > -Xms10G
>>> > > -Xmn2G
>>> > > -server
>>> > >
>>> > > Would you like to see the solrconfig.xml?
>>> > >
>>> > > /Martin
>>> > >
>>> > >
>>> > > > >
>>> > > > >
>>> > > > > > Also, from my POV, such deployments should start from at least
>>> > > > > > *16* 4-way vboxes; it's more expensive, but they should stay much
>>> > > > > > more available during CPU-consuming operations.
>>> > > > > >
>>> > > > >
>>> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4
>>> > > > > hosts with 16 cores? Or am I misunderstanding something :) ?
>>> > > > >
>>> > > > I prefer to start from 16 hosts with 4 cores each.
>>> > > >
>>> > > >
>>> > > > >
>>> > > > >
>>> > > > > > Other details: if you use a single Jetty for all of them, are you
>>> > > > > > sure that Jetty's thread pool doesn't limit requests? Is it large
>>> > > > > > enough?
>>> > > > > > You have 60G and set -Xmx10G. Are you sure that the total size of
>>> > > > > > the cores' index directories is less than 45G?
>>> > > > > >
>>> > > > > > The total index size is 230 GB, so it won't fit in ram, but we're
>>> > > > > using an SSD disk to minimize disk access time. We have tried
>>> > > > > putting the EFF onto a ram disk, but this didn't have a measurable
>>> > > > > effect.
>>> > > > >
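
As a rough illustration of the memory question above, here is a back-of-the-envelope sketch using the figures quoted in this thread (60 GB RAM, a 10 GB heap, a 230 GB index). It assumes that everything outside the JVM heap is available to the OS page cache, which overstates what is really available:

```python
def page_cache_budget(ram_gb, heap_gb, index_gb):
    """Return (GB left outside the JVM heap, fraction of the index that fits there).

    Simplifying assumption: all non-heap RAM can serve as page cache.
    """
    cache_gb = max(ram_gb - heap_gb, 0)
    fraction = min(cache_gb / index_gb, 1.0)
    return cache_gb, fraction


cache_gb, fraction = page_cache_budget(ram_gb=60, heap_gb=10, index_gb=230)
print(f"{cache_gb} GB outside the heap, roughly {fraction:.0%} of the index")
```

Under that assumption well under a quarter of the index can be cached at once, which is consistent with relying on the SSD for most reads.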
>>> > > > > Thanks,
>>> > > > > /Martin
>>> > > > >
>>> > > > >
>>> > > > > > Thanks
>>> > > > > >
>>> > > > > >
>>> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
>>> > > > > >
>>> > > > > > > Mikhail
>>> > > > > > >
>>> > > > > > > PSB
>>> > > > > > >
>>> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
>>> > > > > > > mkhlud...@griddynamics.com> wrote:
>>> > > > > > >
>>> > > > > > > > Martin,
>>> > > > > > > >
>>> > > > > > > > Please find additional questions from me below.
>>> > > > > > > >
>>> > > > > > > > Simone,
>>> > > > > > > >
>>> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard
>>> > > > > > > > about it at recent ApacheCon sessions is that Zookeeper is
>>> > > > > > > > supposed to replicate those files as configs under solr home.
>>> > > > > > > > And I'm really looking forward to learning how it works with
>>> > > > > > > > huge files in production.
>>> > > > > > > >
>>> > > > > > > > Thank You, Guys!
>>> > > > > > > >
>>> > > > > > > > On 20.11.2012 18:06, "Martin Koch" <m...@issuu.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > Hi Mikhail
>>> > > > > > > > >
>>> > > > > > > > > Please see answers below.
>>> > > > > > > > >
>>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>> > > > > > > > > mkhlud...@griddynamics.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > > Martin,
>>> > > > > > > > > >
>>> > > > > > > > > > Thank you for sharing your own "war story". It's really
>>> > > > > > > > > > useful for the community.
>>> > > > > > > > > > My first question might not seem well informed, but would
>>> > > > > > > > > > you tell me what blocks searching during the EFF reload,
>>> > > > > > > > > > when it's triggered by a handler or by a listener?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > We continuously index new documents using CommitWithin to
>>> > > > > > > > > get regular commits. However, we observed that the EFFs were
>>> > > > > > > > > not re-read, so we had to do external commits (curl
>>> > > > > > > > > '.../solr/update?commit=true') to force reload. When this is
>>> > > > > > > > > done, solr blocks. I can't tell you exactly why it's doing
>>> > > > > > > > > that (it was related to SOLR-3985).
>>> > > > > > > >
>>> > > > > > > > Is there a chance to get a thread dump when they are blocked?
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > Well I could try to recreate the situation. But the setup is
>>> > > > > > > fairly simple: Create a large EFF in a largeish index with many
>>> > > > > > > shards. Issue a commit, and then try to do a search. Solr will
>>> > > > > > > not respond to the search before the commit has completed, and
>>> > > > > > > this will take a long time.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I don't really get the sentence about sequential commits
>>> > > > > > > > > > and the number of cores. Do I understand correctly that the
>>> > > > > > > > > > file is replicated via Zookeeper? Doesn't it
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is observed behavior. When we issue a commit on a
>>> > > > > > > > > system with many solr cores using EFFs, the system blocks for
>>> > > > > > > > > a long time (15 minutes).  We do NOT use zookeeper for
>>> > > > > > > > > anything. The EFF is a symlink from each core's index dir to
>>> > > > > > > > > the actual file, which is updated by an external process.
>>> > > > > > > >
>>> > > > > > > > Hold on, I asked about Zookeeper because the subject mentions
>>> > > > > > > > SolrCloud.
>>> > > > > > > >
>>> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just
>>> > > > > > > > replicas of the same index?
>>> > > > > > > >
>>> > > > > > >
>>> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
>>> > > > > > > I'm a bit unsure about the terminology here, but we've got a
>>> > > > > > > single index divided into 16 shards. Each shard is hosted in a
>>> > > > > > > solr core.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > Also, about the symlink - don't you share that file via some NFS?
>>> > > > > > > >
>>> > > > > > > > No, we generate the EFF on the local solr host (there is only
>>> > > > > > > one physical host that holds all shards), so there is no need for
>>> > > > > > > NFS or copying files around. No need for Zookeeper either.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > How many cores do you run per box?
>>> > > > > > > >
>>> > > > > > > This box is a 16-virtual core (8 hyperthreaded cores) with 60GB
>>> > > > > > > of RAM. We run 16 solr cores on this box in Jetty.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides
>>> > > > > > > > the JVM heaps?
>>> > > > > > > >
>>> > > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
>>> > > > > > > OS.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > > I assume you use 64 bit linux and mmap directory. Please
>>> > > > > > > > confirm that.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or
>>> > > > > > > where that would be configured in solr - can you explain that?
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > cause a scalability problem or a long reload time? Would it
>>> > > > > > > > > > help if we had, let's say, an ExternalDatabaseField which
>>> > > > > > > > > > pulls values from JDBC? I.e.
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > I think the possibility of having some fields being retrieved
>>> > > > > > > > > from an external, dynamically updatable store would be really
>>> > > > > > > > > interesting. This could be JDBC, something in-memory like
>>> > > > > > > > > redis, or a NoSql product (e.g. Cassandra).
>>> > > > > > > >
>>> > > > > > > > Ok. Let's have it in mind as a possible direction.
>>> > > > > > > >
>>> > > > > > >
>>> > > > > > > Alternatively, an API that would allow updating a single field
>>> > > > > > > for a document might be an option.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > why can't all cores read these values simultaneously?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is a solr implementation detail that I can't
>>> > > > > > > > > answer :)
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Can you confirm that the IDs in the file are ordered by the
>>> > > > > > > > > > index term order?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > AFAIK it can impact load time.
>>> > > > > > > > > >
>>> > > > > > > > > Yes, it does
>>> > > > > > > >
>>> > > > > > > > Ok, I gather that you are aware of it, and that your IDs are
>>> > > > > > > > just strings, not integers.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > Yes, ids are strings.
>>> > > > > > >
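
To make the file layout discussed above concrete, here is a minimal sketch of producing EFF content with keys in sorted order. It assumes the one-entry-per-line key=value format of ExternalFileField; the function name and sample IDs are made up for illustration. Python's default string sort orders by code point, which may differ from what a locale-aware UNIX sort produces:

```python
def render_eff(values):
    """Render {document id: numeric value} as EFF text, one key=value per line.

    Keys are emitted in sorted (code point) order, on the assumption that
    sorted keys let Solr load the file faster.
    """
    return "".join(f"{key}={values[key]}\n" for key in sorted(values))


# Made-up IDs and read counts, for illustration only.
sample = {"doc-b": 2.0, "doc-a": 1.5}
print(render_eff(sample), end="")
```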
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Regarding your post-query solution: can you tell me, if a
>>> > > > > > > > > > query found 10000 docs but I need to display only the first
>>> > > > > > > > > > page with 100 rows, whether I need to pull all 10K results
>>> > > > > > > > > > to the frontend to order them by the rank?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > In our architecture, the clients query an API that generates
>>> > > > > > > > > the SOLR query, retrieves the relevant additional fields that
>>> > > > > > > > > we need, and returns the relevant JSON to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > In our use case, results are returned from SOLR by the 10's,
>>> > > > > > > > > not by the 1000's, so it is a manageable job. Even so, if solr
>>> > > > > > > > > returned thousands of results, it would be up to the
>>> > > > > > > > > implementation of the api to augment only the results that
>>> > > > > > > > > needed to be returned to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>>> > > > > > > > > should be possible.
>>> > > > > > > >
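
The post-query augmentation described above could be sketched as follows. This is not Martin's actual API code: the field names ("id", "reads") and the in-memory dict standing in for the external popularity store are assumptions made for the example.

```python
def augment_page(docs, current_reads):
    """Attach the authoritative read count to each document on one result page.

    `docs` is the page of results already ranked by Solr; `current_reads`
    maps document id to the latest value from an external store
    (a plain dict here, purely as a stand-in).
    """
    return [dict(doc, reads=current_reads.get(doc["id"], 0)) for doc in docs]


# Illustrative data only.
page = [{"id": "a", "score": 1.2}, {"id": "b", "score": 0.9}]
print(augment_page(page, {"a": 1000001}))
```

Only the documents on the returned page are touched, so the cost depends on the page size rather than on the total number of hits.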
>>> > > > > > > > You are right. I'm concerned anyway because retrieving the
>>> > > > > > > > whole result is expensive, and not always possible.
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > In our case, getting the whole result is almost impossible,
>>> > > > > > > because that would be millions of documents, and returning the Nth
>>> > > > > > > result seems to be a quadratic (or worse) operation in SOLR.
>>> > > > > > >
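
One plausible way to see why deep paging gets expensive in a sharded setup is the simplified cost model below. It assumes every shard has to return its top start + rows candidates so the coordinating node can merge them; it is not a measurement of this installation:

```python
def docs_merged_for_page(start, rows, shards):
    """Candidates the coordinator merges for one page, assuming each shard
    returns its top (start + rows) documents."""
    return shards * (start + rows)


def docs_merged_for_full_scan(total, rows, shards):
    """Total candidates merged when paging through `total` hits page by page."""
    return sum(docs_merged_for_page(start, rows, shards)
               for start in range(0, total, rows))


print(docs_merged_for_page(0, 10, 16))
print(docs_merged_for_page(1_000_000, 10, 16))
```

In this model a single page costs time linear in its offset, so walking through all results page by page grows quadratically with the number of hits.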
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I'd really appreciate it if you commented on the questions
>>> > > > > > > > > > above.
>>> > > > > > > > > > PS: It's time to pitch: how much can
>>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
>>> > > > > > > > > > "Commit-free ExternalFileField" help you?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > It looks very interesting :) Does it make it possible to
>>> > > > > > > > > avoid re-reading the EFF on every commit, and only re-read the
>>> > > > > > > > > values that have actually changed?
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file
>>> > > > > > > > content, but after a commit you need to read the whole file and
>>> > > > > > > > scan all key terms and postings. That's because the EFF sits on
>>> > > > > > > > top of the top-level searcher; it's the Solr-like way.
>>> > > > > > > > In the future we might have a per-segment EFF; in that case
>>> > > > > > > > adding a segment will still trigger a full file scan, but in the
>>> > > > > > > > index only that new segment will be scanned. It should be
>>> > > > > > > > faster. You know, straightforward sharing of internal data
>>> > > > > > > > structures between different index views/generations is not
>>> > > > > > > > possible.
>>> > > > > > > > If you are asking about applying delta changes to the external
>>> > > > > > > > file, that's something we did ourselves: http://goo.gl/P8GFq .
>>> > > > > > > > This feature is much more doubtful and vague, although it might
>>> > > > > > > > be the next contribution after SOLR-4085.
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > /Martin
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
>>> > > > > > > > > >
>>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
>>> > > > > > > > > > > what you're hoping for.
>>> > > > > > > > > > >
>>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>>> > > > > > > > > > >
>>> > > > > > > > > > > The EFF is placed in the parent of the index directory in
>>> > > > > > > > > > > each core; each core reads the entire EFF and picks out the
>>> > > > > > > > > > > IDs that it is responsible for.
>>> > > > > > > > > > >
>>> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
>>> > > > > > > > > > > answer queries) while re-reading the EFF. Even worse, it
>>> > > > > > > > > > > seems that the time to re-read the EFF is multiplied by the
>>> > > > > > > > > > > number of cores in use (i.e. the EFF is re-read by each core
>>> > > > > > > > > > > sequentially). The contents of the EFF become active after
>>> > > > > > > > > > > the first EXTERNAL commit (commitWithin does NOT work here)
>>> > > > > > > > > > > after the file has been updated.
>>> > > > > > > > > > >
>>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and we
>>> > > > > > > > > > > use 16 shards, so when we triggered an external commit to
>>> > > > > > > > > > > force re-reading, the whole system would block for several
>>> > > > > > > > > > > (10-15) minutes. This won't work in a production
>>> > > > > > > > > > > environment. The reason for the size of the EFF is that we
>>> > > > > > > > > > > have around 7M documents in the index; each document has a
>>> > > > > > > > > > > 45 character ID.
>>> > > > > > > > > > >
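
The figures above can be sanity-checked with some quick arithmetic. This sketch assumes 1 MB = 10^6 bytes and that the 10-15 minute freeze is spread evenly over 16 cores re-reading the file one after another, as described above; both are simplifications:

```python
def bytes_per_entry(file_bytes, num_docs):
    """Average size of one EFF line (id, '=', value, newline)."""
    return file_bytes / num_docs


def per_core_reload_seconds(total_block_minutes, cores):
    """Reload time per core if the cores re-read the file sequentially."""
    return total_block_minutes * 60 / cores


print(bytes_per_entry(450_000_000, 7_000_000))
print(per_core_reload_seconds(10, 16), per_core_reload_seconds(15, 16))
```

Roughly 64 bytes per line fits a 45-character ID plus a separator, a value and a newline, and the freeze works out to somewhat under a minute per core.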
>>> > > > > > > > > > > We got some help to try to fix the problem so that the
>>> > > > > > > > > > > re-read of the EFF proceeds in the background (see
>>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-3985 for a fix on
>>> > > > > > > > > > > the 4.1 branch). However, even though the re-read proceeds
>>> > > > > > > > > > > in the background, the time required to launch solr now
>>> > > > > > > > > > > takes at least as long as re-reading the EFFs. Again, this
>>> > > > > > > > > > > is not good enough for our needs.
>>> > > > > > > > > > >
>>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields (though
>>> > > > > > > > > > > you can return them as values using
>>> > > > > > > > > > > &fl=field(my_eff_field)). This is also fixed in the 4.1
>>> > > > > > > > > > > branch; see https://issues.apache.org/jira/browse/SOLR-4022.
>>> > > > > > > > > > >
>>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that
>>> > > > > > > > > > > great. Our solution is as follows: The actual value of the
>>> > > > > > > > > > > popularity measure (say, reads) that we want to report to
>>> > > > > > > > > > > the user is inserted into the search response post-query by
>>> > > > > > > > > > > our query front-end. This value will then be the
>>> > > > > > > > > > > authoritative value at the time of the query. The value of
>>> > > > > > > > > > > the popularity measure that we use for boosting in the
>>> > > > > > > > > > > ranking of the search results is only updated when the value
>>> > > > > > > > > > > has changed enough so that the impact on the boost will be
>>> > > > > > > > > > > significant (say, more than 2%). This does require frequent
>>> > > > > > > > > > > re-indexing of the documents that have significant changes
>>> > > > > > > > > > > in the number of reads, but at least we won't have to update
>>> > > > > > > > > > > a document if it moves from, say, 1000000 to 1000001 reads.
>>> > > > > > > > > > >
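
The re-indexing rule described above could be sketched like this. The message talks about the impact on the boost; this example uses the relative change in the raw read count as a stand-in for that, which is an assumption, as are the function name and the 2% default:

```python
def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """Decide whether a document's indexed popularity value is stale enough
    to justify re-indexing it.

    Assumption: relative change in the raw read count approximates the
    change in the boost.
    """
    if indexed_reads == 0:
        return current_reads > 0
    return abs(current_reads - indexed_reads) / indexed_reads > threshold


print(needs_reindex(1_000_000, 1_000_001))
print(needs_reindex(1_000, 1_030))
```

With a rule like this, a document going from 1000000 to 1000001 reads is left alone, while one going from 1000 to 1030 reads is queued for re-indexing.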
>>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>>> > > > > > > > > > >
>>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
>>> > > > > > > > > > >
>>> > > > > > > > > > > > Hi all,
>>> > > > > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
>>> > > > > > > > > > > > However, in this index, an external file field is used for
>>> > > > > > > > > > > > popularity ranking.
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Does SolrCloud support external file fields? How does it
>>> > > > > > > > > > > > cope with sharding and replication? Where should the
>>> > > > > > > > > > > > external file be placed now that the index folder is not
>>> > > > > > > > > > > > local but in the cloud?
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Are there otherwise other best practices to deal with the
>>> > > > > > > > > > > > use cases external file fields were used for, like
>>> > > > > > > > > > > > popularity/ranking, in SolrCloud? Custom ValueSources going
>>> > > > > > > > > > > > to something external?
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Thanks in advance,
>>> > > > > > > > > > > > Simone
>>> > > > > > > > > > > >
>>> > > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > --
>>> > > > > > > > > > Sincerely yours
>>> > > > > > > > > > Mikhail Khludnev
>>> > > > > > > > > > Principal Engineer,
>>> > > > > > > > > > Grid Dynamics
>>> > > > > > > > > >
>>> > > > > > > > > > <http://www.griddynamics.com>
>>> > > > > > > > > >  <mkhlud...@griddynamics.com>
>>> > > > > > > > > >
>>> > > > > > > >  20.11.2012 18:06 пользователь "Martin Koch" <
>>> m...@issuu.com>
>>> > > > > написал:
>>> > > > > > > >
>>> > > > > > > > > Hi Mikhail
>>> > > > > > > > >
>>> > > > > > > > > Please see answers below.
>>> > > > > > > > >
>>> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
>>> > > > > > > > > mkhlud...@griddynamics.com> wrote:
>>> > > > > > > > >
>>> > > > > > > > > > Martin,
>>> > > > > > > > > >
>>> > > > > > > > > > Thank you for telling your own "war-story". It's really
>>> > > useful
>>> > > > > for
>>> > > > > > > > > > community.
>>> > > > > > > > > > The first question might seems not really conscious,
>>> but
>>> > > would
>>> > > > > you
>>> > > > > > > tell
>>> > > > > > > > > me
>>> > > > > > > > > > what blocks searching during EFF reload, when it's
>>> > triggered
>>> > > by
>>> > > > > > > handler
>>> > > > > > > > > or
>>> > > > > > > > > > by listener?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > We continuously index new documents using CommitWithin
>>> to get
>>> > > > > regular
>>> > > > > > > > > commits. However, we observed that the EFFs were not
>>> re-read,
>>> > > so
>>> > > > we
>>> > > > > > had
>>> > > > > > > > to
>>> > > > > > > > > do external commits (curl '.../solr/update?commit=true')
>>> to
>>> > > force
>>> > > > > > > reload.
>>> > > > > > > > > When this is done, solr blocks. I can't tell you exactly
>>> why
>>> > > it's
>>> > > > > > doing
>>> > > > > > > > > that (it was related to SOLR-3985).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I don't really get the sentence about sequential
>>> commits
>>> > and
>>> > > > > number
>>> > > > > > > of
>>> > > > > > > > > > cores. Do I get right that file is replicated via
>>> > Zookeeper?
>>> > > > > > Doesn't
>>> > > > > > > it
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is observed behavior. When we issue a commit
>>> on a
>>> > > > > system
>>> > > > > > > > with a
>>> > > > > > > > > system with many solr cores using EFFs, the system blocks
>>> > for a
>>> > > > > long
>>> > > > > > > time
>>> > > > > > > > > (15 minutes).  We do NOT use zookeeper for anything. The
>>> EFF
>>> > > is a
>>> > > > > > > symlink
>>> > > > > > > > > from each cores index dir to the actual file, which is
>>> > updated
>>> > > by
>>> > > > > an
>>> > > > > > > > > external process.
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > causes scalability problem or long time to reload?
>>> Will it
>>> > > help
>>> > > > > if
>>> > > > > > > > we'll
>>> > > > > > > > > > have, let's say ExternalDatabaseField which will pull
>>> > values
>>> > > > from
>>> > > > > > > jdbc.
>>> > > > > > > > > ie.
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > I think the possibility of having some fields being
>>> retrieved
>>> > > > from
>>> > > > > an
>>> > > > > > > > > external, dynamically updatable store would be really
>>> > > > interesting.
>>> > > > > > This
>>> > > > > > > > > could be JDBC, something in-memory like redis, or a NoSql
>>> > > product
>>> > > > > > (e.g.
>>> > > > > > > > > Cassandra).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > why all cores can't read these values simultaneously?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Again, this is a solr implementation detail that I can't
>>> > answer
>>> > > > :)
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Can you confirm that IDs in the file is ordered by the
>>> > index
>>> > > > term
>>> > > > > > > > order?
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > AFAIK it can impact load time.
>>> > > > > > > > > >
>>> > > > > > > > > Yes, it does.
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > Regarding your post-query solution can you tell me if
>>> query
>>> > > > found
>>> > > > > > > 10000
>>> > > > > > > > > > docs, but I need to display only first page with 100
>>> rows,
>>> > > > > whether
>>> > > > > > I
>>> > > > > > > > need
>>> > > > > > > > > > to pull all 10K results to frontend to order them by
>>> the
>>> > > rank?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > In our architecture, the clients query an API that
>>> generates
>>> > > the
>>> > > > > SOLR
>>> > > > > > > > > query, retrieves the relevant additional fields that we
>>> > needs,
>>> > > > and
>>> > > > > > > > returns
>>> > > > > > > > > the relevant JSON to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > In our use case, results are returned from SOLR by the
>>> 10's,
>>> > > not
>>> > > > by
>>> > > > > > the
>>> > > > > > > > > 1000's, so it is a manageable job. Even so, if solr
>>> returned
>>> > > > > > thousands
>>> > > > > > > of
>>> > > > > > > > > results, it would be up to the implementation of the api
>>> to
>>> > > > augment
>>> > > > > > > only
>>> > > > > > > > > the results that needed to be returned to the front-end.
>>> > > > > > > > >
>>> > > > > > > > > Even so, patching up a JSON structure with 10000 results
>>> > should
>>> > > > be
>>> > > > > > > > > possible.
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > > I'm really appreciate if you comment on the questions
>>> > above.
>>> > > > > > > > > > PS: It's time to pitch, how much
>>> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
>>> "Commit-free
>>> > > > > > > > > > ExternalFileField" can help you?
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > It looks very interesting :) Does it make it possible
>>> to
>>> > > avoid
>>> > > > > > > > re-reading
>>> > > > > > > > > the EFF on every commit, and only re-read the values that
>>> > have
>>> > > > > > actually
>>> > > > > > > > > changed?
>>> > > > > > > > >
>>> > > > > > > > > /Martin
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <
>>> > m...@issuu.com>
>>> > > > > > wrote:
>>> > > > > > > > > >
>>> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not
>>> give
>>> > you
>>> > > > > what
>>> > > > > > > > you're
>>> > > > > > > > > > > hoping fore.
>>> > > > > > > > > > >
>>> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
>>> > > > > > > > > > >
>>> > > > > > > > > > > The EFF is placed in the parent of the index
>>> directory in
>>> > > > each
>>> > > > > > > core;
>>> > > > > > > > > each
>>> > > > > > > > > > > core reads the entire EFF and picks out the IDs that
>>> it
>>> > is
>>> > > > > > > > responsible
>>> > > > > > > > > > for.
>>> > > > > > > > > > >
>>> > > > > > > > > > > In the current 4.0.0 release of Solr, Solr blocks
>>> > > > > > > > > > > (doesn't answer queries) while re-reading the EFF. Even
>>> > > > > > > > > > > worse, it seems that the time to re-read the EFF is
>>> > > > > > > > > > > multiplied by the number of cores in use (i.e. the EFF
>>> > > > > > > > > > > is re-read by each core sequentially). The contents of
>>> > > > > > > > > > > the EFF become active after the first EXTERNAL commit
>>> > > > > > > > > > > (commitWithin does NOT work here) after the file has
>>> > > > > > > > > > > been updated.
>>> > > > > > > > > > >
>>> > > > > > > > > > > In our case, the EFF was quite large - around 450MB -
>>> > > > > > > > > > > and we use 16 shards, so when we triggered an external
>>> > > > > > > > > > > commit to force re-reading, the whole system would
>>> > > > > > > > > > > block for several (10-15) minutes. This won't work in a
>>> > > > > > > > > > > production environment. The reason for the size of the
>>> > > > > > > > > > > EFF is that we have around 7M documents in the index;
>>> > > > > > > > > > > each document has a 45 character ID.
>>> > > > > > > > > > >
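A rough back-of-the-envelope check of those numbers. It assumes one
id=value line per document and about 17 characters of value per line
(the value width is a guess, chosen only because it lands near 450MB),
and it assumes that all 16 cores sit on one node and each parses the
whole file. Both function names are invented for this sketch.

```python
def eff_size_bytes(num_docs, id_chars, value_chars):
    # One line per document: id, '=', value, newline.
    return num_docs * (id_chars + 1 + value_chars + 1)


def bytes_parsed_per_commit(file_bytes, num_cores):
    # Every core reads the entire file and keeps only its own IDs.
    return file_bytes * num_cores


size = eff_size_bytes(7_000_000, 45, 17)
total = bytes_parsed_per_commit(size, 16)
print(size / 1e6, "MB per file")
print(total / 1e9, "GB parsed per commit")
```

Under these assumptions the file is about 448 MB and a single commit
makes the node parse roughly 7.2 GB of text, which is consistent with
fewer shards per node helping more than additional ones.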
>>> > > > > > > > > > > We got some help to try to fix the problem so that the
>>> > > > > > > > > > > re-read of the EFF proceeds in the background (see
>>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-3985 for a
>>> > > > > > > > > > > fix on the 4.1 branch). However, even though the
>>> > > > > > > > > > > re-read proceeds in the background, launching Solr now
>>> > > > > > > > > > > takes at least as long as re-reading the EFFs. Again,
>>> > > > > > > > > > > this is not good enough for our needs.
>>> > > > > > > > > > >
>>> > > > > > > > > > > The next issue is that you cannot sort on EFF fields
>>> > > > > > > > > > > (though you can return them as values using
>>> > > > > > > > > > > &fl=field(my_eff_field)). This is also fixed in the 4.1
>>> > > > > > > > > > > branch, see
>>> > > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4022.
>>> > > > > > > > > > >
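To make the fl=field(...) usage concrete, here is a small sketch that
builds such a request URL. The host, port, core name "collection1" and
the assumption that the unique key field is called "id" are placeholders
for illustration; only the field(my_eff_field) part comes from the
message above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, eff_field, query="*:*", rows=10):
    """Build a select URL that returns the EFF value next to each id."""
    params = {
        "q": query,
        # field(<name>) is a function query evaluated per document.
        "fl": "id,field(%s)" % eff_field,
        "rows": rows,
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local core; adjust to the real deployment.
url = build_select_url("http://localhost:8983/solr/collection1",
                       "my_eff_field")
print(url)
```

urlencode takes care of escaping the parentheses and the comma, so the
field list arrives at the server unchanged.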
>>> > > > > > > > > > > So: Even after these fixes, EFF performance is not that
>>> > > > > > > > > > > great. Our solution is as follows: The actual value of
>>> > > > > > > > > > > the popularity measure (say, reads) that we want to
>>> > > > > > > > > > > report to the user is inserted into the search response
>>> > > > > > > > > > > post-query by our query front-end. This value will then
>>> > > > > > > > > > > be the authoritative value at the time of the query.
>>> > > > > > > > > > > The value of the popularity measure that we use for
>>> > > > > > > > > > > boosting in the ranking of the search results is only
>>> > > > > > > > > > > updated when the value has changed enough so that the
>>> > > > > > > > > > > impact on the boost will be significant (say, more than
>>> > > > > > > > > > > 2%). This does require frequent re-indexing of the
>>> > > > > > > > > > > documents that have significant changes in the number
>>> > > > > > > > > > > of reads, but at least we won't have to update a
>>> > > > > > > > > > > document if it moves from, say, 1000000 to 1000001
>>> > > > > > > > > > > reads.
>>> > > > > > > > > > >
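The re-indexing rule described above could look roughly like this. The
boost formula is not given in the thread, so log10(1 + reads) is purely
an assumed stand-in, and the names boost and needs_reindex are invented
for this sketch; only the "more than 2%" threshold comes from the
message.

```python
import math


def boost(reads):
    # Assumed boost function; the real one is not stated in the thread.
    return math.log10(1 + reads)


def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """Return True if the boost moved by more than `threshold` (relative)."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0:
        # No boost indexed yet: any non-zero boost counts as a change.
        return new > 0
    return abs(new - old) / old > threshold


print(needs_reindex(1000000, 1000001))
print(needs_reindex(10, 20))
```

With a logarithmic boost like this one, a heavily read document almost
never crosses the threshold, while a fresh document is re-indexed after
a handful of reads.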
>>> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
>>> > > > > > > > > > >
>>> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni
>>> > > > > > > > > > > <simo...@apache.org> wrote:
>>> > > > > > > > > > >
>>> > > > > > > > > > > > Hi all,
>>> > > > > > > > > > > > I'm planning to move a quite big Solr index to
>>> > > > > > > > > > > > SolrCloud. However, in this index, an external file
>>> > > > > > > > > > > > field is used for popularity ranking.
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Does SolrCloud support external file fields? How does
>>> > > > > > > > > > > > it cope with sharding and replication? Where should
>>> > > > > > > > > > > > the external file be placed now that the index folder
>>> > > > > > > > > > > > is not local but in the cloud?
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Otherwise, are there other best practices for the use
>>> > > > > > > > > > > > cases external file fields were used for, like
>>> > > > > > > > > > > > popularity/ranking, in SolrCloud? Custom ValueSources
>>> > > > > > > > > > > > going to something external?
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Thanks in advance,
>>> > > > > > > > > > > > Simone
>>> > > > > > > > > > > >
>>> > > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > > > --
>>> > > > > > > > > > Sincerely yours
>>> > > > > > > > > > Mikhail Khludnev
>>> > > > > > > > > > Principal Engineer,
>>> > > > > > > > > > Grid Dynamics
>>> > > > > > > > > >
>>> > > > > > > > > > <http://www.griddynamics.com>
>>> > > > > > > > > >  <mkhlud...@griddynamics.com>
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>
