Re: SolrCloud and external file fields

Mikhail Khludnev Tue, 27 Nov 2012 09:26:04 -0800

Martin,

It's still not clear to me whether you solved the problem completely or
partially:
Does reducing the number of cores free up some resources for searching
during a commit?
Does committing the cores one by one prevent the "freeze"?


Thanks


On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch <m...@issuu.com> wrote:

> Mikhail
>
> To avoid freezes we deployed the patches that are now on the 4.1 trunk
> (SOLR-3985). But this wasn't good enough, because SOLR would still take a
> very long time to restart when that was necessary.
>
> I don't see how we could throw more hardware at the problem without making
> it worse, really - the only solution here would be *fewer* shards, not
> more.
>
> IMO it would be ideal if the lucene/solr community could come up with a
> good way of updating fields in a document without reindexing. This could be
> by linking to some external data store, or in the lucene/solr internals. If
> it would make things easier, a good first step would be to have dynamically
> updateable numerical fields only.
>
> /Martin
>
> On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > Martin,
> >
> > I don't think solrconfig.xml would shed any light on this. I've just found
> > what I didn't get in your setup - how cores are explicitly assigned to a
> > collection. Now I understand most of the details, after all!
> > The ball is in your court: let us know whether you have managed to make
> > your cores commit one by one to avoid the freeze, or whether you could
> > eliminate the pauses by allocating more hardware.
> > Thanks in advance!
> >
> >
> > On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch <m...@issuu.com> wrote:
> >
> > > Mikhail,
> > >
> > > PSB
> > >
> > > On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
> > > mkhlud...@griddynamics.com> wrote:
> > >
> > > > On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <m...@issuu.com> wrote:
> > > >
> > > > >
> > > > > I wasn't aware until now that it is possible to send a commit to one
> > > > > core only. What we observed was the effect of curl
> > > > > localhost:8080/solr/update?commit=true but perhaps we should
> > > > > experiment with solr/coreN/update?commit=true. A quick trial run
> > > > > seems to indicate that a commit to a single core causes commits on
> > > > > all cores.
> > > > >
> > > > You should see something like this in the log:
> > > > ... SolrCmdDistributor .... Distrib commit to: ...
> > > >
> > > > Yup, a commit towards a single core results in a commit on all cores.
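
A minimal sketch of the two request patterns discussed above, assuming the host, port and path from the curl example; the core names are hypothetical. Per the reply above, in this SolrCloud setup a commit sent to one core was still distributed to all cores, so this only illustrates the shape of the requests, not a way around the freeze.

```python
from urllib.request import urlopen

BASE = "http://localhost:8080/solr"  # host and port taken from the curl example


def commit_urls(cores, base=BASE):
    """Build one explicit-commit URL per core (core names are hypothetical)."""
    return [f"{base}/{core}/update?commit=true" for core in cores]


def commit_one_by_one(cores, base=BASE, opener=urlopen):
    """Send the commits sequentially, waiting for each response in turn."""
    for url in commit_urls(cores, base):
        with opener(url) as response:
            response.read()


if __name__ == "__main__":
    # Only prints the URLs; call commit_one_by_one() against a live instance.
    for url in commit_urls(["core0", "core1"]):
        print(url)
```

Passing `opener` as a parameter keeps the URL construction checkable without a running Solr instance.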
> > >
> > >
> > > > >
> > > > >
> > > > > Perhaps I should clarify that we are using SOLR as a black box; we do
> > > > > not touch the code at all - we only install the distribution WAR file
> > > > > and proceed from there.
> > > > >
> > > > I still don't understand how you deploy/launch Solr. How many Jettys do
> > > > you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you
> > > > specify the shards= param for every request and distribute updates
> > > > yourself? What collections do you create, and with which settings?
> > > >
> > > We let SOLR do the sharding using one collection with 16 SOLR cores
> > > holding one shard each. We launch only one instance of jetty with the
> > > following arguments:
> > >
> > > -DnumShards=16
> > > -DzkHost=<zookeeperhost:port>
> > > -Xmx10G
> > > -Xms10G
> > > -Xmn2G
> > > -server
> > >
> > > Would you like to see the solrconfig.xml?
> > >
> > > /Martin
> > >
> > >
> > > > >
> > > > >
> > > > > > Also, from my POV such deployments should start from at least *16*
> > > > > > 4-way vboxes. It's more expensive, but should be much more
> > > > > > available during CPU-consuming operations.
> > > > > >
> > > > >
> > > > > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
> > > > > with 16 cores? Or am I misunderstanding something :) ?
> > > > >
> > > > I prefer to start from 16 hosts with 4 cores each.
> > > >
> > > >
> > > > >
> > > > >
> > > > > > Other details: if you use a single jetty for all of them, are you
> > > > > > sure that jetty's threadpool doesn't limit requests? Is it large
> > > > > > enough?
> > > > > > You have 60G and set -Xmx10G. Are you sure that the total size of
> > > > > > the cores' index directories is less than 45G?
> > > > > >
> > > > > The total index size is 230 GB, so it won't fit in RAM, but we're
> > > > > using an SSD disk to minimize disk access time. We have tried putting
> > > > > the EFF onto a RAM disk, but this didn't have a measurable effect.
> > > > >
> > > > > Thanks,
> > > > > /Martin
> > > > >
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <m...@issuu.com> wrote:
> > > > > >
> > > > > > > Mikhail
> > > > > > >
> > > > > > > PSB
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > > > > mkhlud...@griddynamics.com> wrote:
> > > > > > >
> > > > > > > > Martin,
> > > > > > > >
> > > > > > > > Please find additional question from me below.
> > > > > > > >
> > > > > > > > Simone,
> > > > > > > >
> > > > > > > > I'm sorry for hijacking your thread. The only thing I've heard
> > > > > > > > about it at recent ApacheCon sessions is that Zookeeper is
> > > > > > > > supposed to replicate those files as configs under solr home.
> > > > > > > > And I'm really looking forward to learning how it works with
> > > > > > > > huge files in production.
> > > > > > > >
> > > > > > > > Thank You, Guys!
> > > > > > > >
> > > > > > > > On 20.11.2012 18:06, "Martin Koch" <m...@issuu.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Mikhail
> > > > > > > > >
> > > > > > > > > Please see answers below.
> > > > > > > > >
> > > > > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > > > > mkhlud...@griddynamics.com> wrote:
> > > > > > > > >
> > > > > > > > > > Martin,
> > > > > > > > > >
> > > > > > > > > > Thank you for telling your own "war-story". It's really
> > > > > > > > > > useful for the community.
> > > > > > > > > > The first question might seem naive, but would you tell me
> > > > > > > > > > what blocks searching during EFF reload, when it's
> > > > > > > > > > triggered by a handler or by a listener?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > We continuously index new documents using CommitWithin to get
> > > > > > > > > regular commits. However, we observed that the EFFs were not
> > > > > > > > > re-read, so we had to do external commits
> > > > > > > > > (curl '.../solr/update?commit=true') to force a reload.
> > > > > > > > > When this is done, solr blocks. I can't tell you exactly why
> > > > > > > > > it's doing that (it was related to SOLR-3985).
> > > > > > > >
> > > > > > > > Is there a chance to get a thread dump when they are blocked?
> > > > > > > >
> > > > > > > >
> > > > > > > Well, I could try to recreate the situation. But the setup is
> > > > > > > fairly simple: Create a large EFF in a largeish index with many
> > > > > > > shards. Issue a commit, and then try to do a search. Solr will
> > > > > > > not respond to the search before the commit has completed, and
> > > > > > > this will take a long time.
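
To reproduce this, a synthetic EFF is needed. The sketch below assumes the usual ExternalFileField layout of one `id=value` line per document; the function names and the field name `popularity` are made up for illustration. It writes the entries sorted by ID and swaps the file into place in one step, since Solr may read the file while an external process is updating it.

```python
import os
import tempfile


def write_eff(path, values):
    """Write `id=value` lines sorted by id, then atomically replace `path`.

    `values` maps a document id (str) to a numeric value.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".eff-")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        for doc_id in sorted(values):  # Python sorts str by code point
            tmp.write(f"{doc_id}={values[doc_id]}\n")
    os.replace(tmp_path, path)  # readers see either the old or the new file


def read_eff(path):
    """Read the file back as a list of (id, value) string pairs, in file order."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("=", 1)) for line in f]
```

Whether code point order matches the index term order for a given ID field is an assumption to verify against the actual schema.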
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I don't really get the sentence about sequential commits
> > > > > > > > > > and the number of cores. Do I understand correctly that
> > > > > > > > > > the file is replicated via Zookeeper? Doesn't it
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Again, this is observed behavior. When we issue a commit on a
> > > > > > > > > system with many solr cores using EFFs, the system blocks for
> > > > > > > > > a long time (15 minutes). We do NOT use zookeeper for
> > > > > > > > > anything. The EFF is a symlink from each core's index dir to
> > > > > > > > > the actual file, which is updated by an external process.
> > > > > > > >
> > > > > > > > Hold on, I asked about Zookeeper because the subject mentions
> > > > > > > > SolrCloud.
> > > > > > > >
> > > > > > > > Do you use SolrCloud, SolrShards, or are these cores just
> > > > > > > > replicas of the same index?
> > > > > > > >
> > > > > > >
> > > > > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud.
> > > > > > > I'm a bit unsure about the terminology here, but we've got a
> > > > > > > single index divided into 16 shards. Each shard is hosted in a
> > > > > > > solr core.
> > > > > > >
> > > > > > >
> > > > > > > > Also, about the symlink - don't you share that file via some
> > > > > > > > NFS?
> > > > > > > >
> > > > > > > No, we generate the EFF on the local solr host (there is only one
> > > > > > > physical host that holds all shards), so there is no need for NFS
> > > > > > > or copying files around. No need for Zookeeper either.
> > > > > > >
> > > > > > >
> > > > > > > > How many cores do you run per box?
> > > > > > > >
> > > > > > > This box has 16 virtual cores (8 hyperthreaded cores) and 60GB of
> > > > > > > RAM. We run 16 solr cores on this box in Jetty.
> > > > > > >
> > > > > > >
> > > > > > > > Do the boxes have plenty of RAM to cache the filesystem besides
> > > > > > > > the JVM heaps?
> > > > > > > >
> > > > > > > Yes. We've allocated 10GB for jetty, and left the rest for the
> > > > > > > OS.
> > > > > > >
> > > > > > >
> > > > > > > > I assume you use 64 bit linux and mmap directory. Please
> > > > > > > > confirm that.
> > > > > > > >
> > > > > > > >
> > > > > > > We use 64-bit linux. I'm not sure about the mmap directory or
> > > > > > > where that would be configured in solr - can you explain that?
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > cause a scalability problem or a long time to reload? Will
> > > > > > > > > > it help if we have, let's say, an ExternalDatabaseField
> > > > > > > > > > which will pull values from JDBC, i.e.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think the possibility of having some fields being retrieved
> > > > > > > > > from an external, dynamically updatable store would be really
> > > > > > > > > interesting. This could be JDBC, something in-memory like
> > > > > > > > > redis, or a NoSql product (e.g. Cassandra).
> > > > > > > >
> > > > > > > > Ok. Let's have it in mind as a possible direction.
> > > > > > > >
> > > > > > >
> > > > > > > Alternatively, an API that would allow updating a single field
> > > > > > > for a document might be an option.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > why all cores can't read these values simultaneously?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Again, this is a solr implementation detail that I can't
> > answer
> > > > :)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Can you confirm that the IDs in the file are ordered by the
> > > > > > > > > > index term order?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, we sorted the files (standard UNIX sort).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > AFAIK it can impact load time.
> > > > > > > > > >
> > > > > > > > > Yes, it does
> > > > > > > >
> > > > > > > > OK, I see that you are aware of it, and that your IDs are just
> > > > > > > > strings, not integers.
> > > > > > > >
> > > > > > > >
> > > > > > > Yes, ids are strings.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Regarding your post-query solution, can you tell me: if a
> > > > > > > > > > query found 10000 docs, but I need to display only the
> > > > > > > > > > first page with 100 rows, do I need to pull all 10K results
> > > > > > > > > > to the frontend to order them by rank?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > In our architecture, the clients query an API that generates
> > > > > > > > > the SOLR query, retrieves the relevant additional fields that
> > > > > > > > > we need, and returns the relevant JSON to the front-end.
> > > > > > > > >
> > > > > > > > > In our use case, results are returned from SOLR by the 10's,
> > > > > > > > > not by the 1000's, so it is a manageable job. Even so, if
> > > > > > > > > solr returned thousands of results, it would be up to the
> > > > > > > > > implementation of the api to augment only the results that
> > > > > > > > > needed to be returned to the front-end.
> > > > > > > > >
> > > > > > > > > Even so, patching up a JSON structure with 10000 results
> > > > > > > > > should be possible.
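
The post-query step described above could look roughly like this. It is a sketch under stated assumptions: `reads_by_id` stands in for whatever external store holds the current counts, and the function name and document shape (`id`, `title`) are hypothetical. Only the page that goes back to the front-end is augmented.

```python
def augment_page(docs, reads_by_id, start=0, rows=10):
    """Attach the current read count to one page of search results.

    `docs` is the ordered result list from the search; `reads_by_id` maps
    a document id to its latest read count (0 if the id is unknown).
    """
    page = docs[start:start + rows]
    # dict(doc, reads=...) copies each document, so the input is not mutated.
    return [dict(doc, reads=reads_by_id.get(doc["id"], 0)) for doc in page]
```

The cost grows with the page size rather than with the total number of hits, which is why returning results "by the 10's" keeps this manageable.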
> > > > > > > >
> > > > > > > > You are right. I'm concerned anyway, because retrieving the
> > > > > > > > whole result is expensive, and not always possible.
> > > > > > > >
> > > > > > > >
> > > > > > > In our case, getting the whole result is almost impossible,
> > > > > > > because that would be millions of documents, and returning the
> > > > > > > Nth result seems to be a quadratic (or worse) operation in SOLR.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I'd really appreciate it if you could comment on the
> > > > > > > > > > questions above.
> > > > > > > > > > PS: It's time to pitch: how much can
> > > > > > > > > > https://issues.apache.org/jira/browse/SOLR-4085
> > > > > > > > > > "Commit-free ExternalFileField" help you?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > It looks very interesting :) Does it make it possible to
> > > > > > > > > avoid re-reading the EFF on every commit, and only re-read
> > > > > > > > > the values that have actually changed?
> > > > > > > >
> > > > > > > >
> > > > > > > > You don't need a commit (in SOLR-4085) to reload the file
> > > > > > > > content, but after a commit you need to read the whole file and
> > > > > > > > scan all key terms and postings. That's because the EFF sits on
> > > > > > > > top of the top-level searcher; it's the Solr-like way.
> > > > > > > > At some point in the future we might have a per-segment EFF; in
> > > > > > > > that case adding a segment will trigger a full file scan, but
> > > > > > > > in the index only that new segment will be scanned. It should
> > > > > > > > be faster. You know, straightforward sharing of internal data
> > > > > > > > structures between different index views/generations is not
> > > > > > > > possible.
> > > > > > > > If you are asking about applying delta changes to the external
> > > > > > > > file, that's something we did ourselves: http://goo.gl/P8GFq .
> > > > > > > > This feature is much more doubtful and vague, although it might
> > > > > > > > be the next contribution after SOLR-4085.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > /Martin
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Solr 4.0 does support using EFFs, but it might not give
> > > > > > > > > > > you what you're hoping for.
> > > > > > > > > > >
> > > > > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > > > > >
> > > > > > > > > > > The EFF is placed in the parent of the index directory in
> > > > > > > > > > > each core; each core reads the entire EFF and picks out
> > > > > > > > > > > the IDs that it is responsible for.
> > > > > > > > > > >
> > > > > > > > > > > In the current 4.0.0 release of solr, solr blocks
> > > > > > > > > > > (doesn't answer queries) while re-reading the EFF. Even
> > > > > > > > > > > worse, it seems that the time to re-read the EFF is
> > > > > > > > > > > multiplied by the number of cores in use (i.e. the EFF is
> > > > > > > > > > > re-read by each core sequentially). The contents of the
> > > > > > > > > > > EFF become active after the first EXTERNAL commit
> > > > > > > > > > > (commitWithin does NOT work here) after the file has been
> > > > > > > > > > > updated.
> > > > > > > > > > >
> > > > > > > > > > > In our case, the EFF was quite large - around 450MB - and
> > > > > > > > > > > we use 16 shards, so when we triggered an external commit
> > > > > > > > > > > to force re-reading, the whole system would block for
> > > > > > > > > > > several (10-15) minutes. This won't work in a production
> > > > > > > > > > > environment. The reason for the size of the EFF is that
> > > > > > > > > > > we have around 7M documents in the index; each document
> > > > > > > > > > > has a 45 character ID.
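
The figures in this paragraph can be put together as a back-of-the-envelope calculation. The sketch assumes single-byte (ASCII) IDs, 1 MB = 10**6 bytes, and that the reported block time is split evenly across cores that re-read the file strictly one after another; none of these is stated in the thread.

```python
DOCS = 7_000_000          # "around 7M documents"
ID_CHARS = 45             # "45 character ID"
EFF_BYTES = 450 * 10**6   # "around 450MB"
CORES = 16                # one core per shard

id_bytes = DOCS * ID_CHARS              # bytes spent on the IDs alone
bytes_per_line = EFF_BYTES / DOCS       # average line length implied by the size
scanned_per_commit = CORES * EFF_BYTES  # every core reads the entire file

block_seconds = (10 * 60, 15 * 60)      # reported 10-15 minute block
per_core_seconds = tuple(s / CORES for s in block_seconds)

print(id_bytes, round(bytes_per_line, 1), scanned_per_commit, per_core_seconds)
```

Under these assumptions the IDs account for about 315 MB of the file, each line averages roughly 64 bytes, a single external commit makes the host scan about 7.2 GB of text, and each core spends somewhere between 37 and 57 seconds on its pass.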
> > > > > > > > > > >
> > > > > > > > > > > We got some help to try to fix the problem so that the
> > > > > > > > > > > re-read of the EFF proceeds in the background (see here
> > > > > > > > > > > <https://issues.apache.org/jira/browse/SOLR-3985> for a
> > > > > > > > > > > fix on the 4.1 branch). However, even though the re-read
> > > > > > > > > > > proceeds in the background, launching solr now takes at
> > > > > > > > > > > least as long as re-reading the EFFs. Again, this is not
> > > > > > > > > > > good enough for our needs.
> > > > > > > > > > >
> > > > > > > > > > > The next issue is that you cannot sort on EFF fields
> > > > > > > > > > > (though you can return them as values using
> > > > > > > > > > > &fl=field(my_eff_field)). This is also fixed in the 4.1
> > > > > > > > > > > branch here
> > > > > > > > > > > <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > > > > > > >
> > > > > > > > > > > So: Even after these fixes, EFF performance is not that
> > > > > > > > > > > great. Our solution is as follows: The actual value of
> > > > > > > > > > > the popularity measure (say, reads) that we want to
> > > > > > > > > > > report to the user is inserted into the search response
> > > > > > > > > > > post-query by our query front-end. This value will then
> > > > > > > > > > > be the authoritative value at the time of the query. The
> > > > > > > > > > > value of the popularity measure that we use for boosting
> > > > > > > > > > > in the ranking of the search results is only updated when
> > > > > > > > > > > the value has changed enough so that the impact on the
> > > > > > > > > > > boost will be significant (say, more than 2%). This does
> > > > > > > > > > > require frequent re-indexing of the documents that have
> > > > > > > > > > > significant changes in the number of reads, but at least
> > > > > > > > > > > we won't have to update a document if it moves from, say,
> > > > > > > > > > > 1000000 to 1000001 reads.
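
The "only re-index when the boost moves by more than 2%" rule can be sketched as below. The thread does not say how the boost is derived from the read count, so the logarithmic `boost` function here is purely an assumption, as are the function names.

```python
import math


def boost(reads):
    """Hypothetical boost function; the thread does not specify the real one."""
    return math.log10(1 + reads)


def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """True if the boost changed by more than `threshold` (relative change).

    `indexed_reads` is the count stored in the index, `current_reads` the
    authoritative count that the front-end reports post-query.
    """
    old, new = boost(indexed_reads), boost(current_reads)
    if old == 0:
        return new != 0
    return abs(new - old) / old > threshold
```

With this particular boost, a move from 1000000 to 1000001 reads does not trigger a re-index, and neither does a move from 1000 to 1100, while a doubling from 1000 to 2000 does.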
> > > > > > > > > > >
> > > > > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > > I'm planning to move a quite big Solr index to
> > > > > > > > > > > > SolrCloud. However, in this index, an external file
> > > > > > > > > > > > field is used for popularity ranking.
> > > > > > > > > > > >
> > > > > > > > > > > > Does SolrCloud support external file fields? How does
> > > > > > > > > > > > it cope with sharding and replication? Where should the
> > > > > > > > > > > > external file be placed now that the index folder is
> > > > > > > > > > > > not local but in the cloud?
> > > > > > > > > > > >
> > > > > > > > > > > > Are there otherwise other best practices to deal with
> > > > > > > > > > > > the use cases external file fields were used for, like
> > > > > > > > > > > > popularity/ranking, in SolrCloud?
> > > > > > > > > > > > Custom ValueSources going to something external?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks in advance,
> > > > > > > > > > > > Simone
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Sincerely yours
> > > > > > > > > > Mikhail Khludnev
> > > > > > > > > > Principal Engineer,
> > > > > > > > > > Grid Dynamics
> > > > > > > > > >
> > > > > > > > > > <http://www.griddynamics.com>
> > > > > > > > > >  <mkhlud...@griddynamics.com>
> > > > > > > > > >



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
