Re: SolrCloud and external file fields

Mikhail Khludnev Tue, 20 Nov 2012 10:23:19 -0800

Martin,

Please find my additional questions below.


Simone,

I'm sorry for hijacking your thread. The only thing I've heard about it at
the recent ApacheCon sessions is that ZooKeeper is supposed to replicate
those files as configs under the Solr home. I'm really looking forward to
learning how it works with huge files in production.

Thank you, guys!

On 20.11.2012 18:06, "Martin Koch" <m...@issuu.com> wrote:
>
> Hi Mikhail
>
> Please see answers below.
>
> On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > Martin,
> >
> > Thank you for telling your own "war-story". It's really useful for
> > community.
> > The first question might seem not really conscious, but would you tell
> > me what blocks searching during EFF reload, when it's triggered by
> > handler or by listener?
> >
>
> We continuously index new documents using CommitWithin to get regular
> commits. However, we observed that the EFFs were not re-read, so we had to
> do external commits (curl '.../solr/update?commit=true') to force reload.
> When this is done, solr blocks. I can't tell you exactly why it's doing
> that (it was related to SOLR-3985).

Is there a chance to get a thread dump when they are blocked?


>
>
> > I don't really get the sentence about sequential commits and number of
> > cores. Do I get right that file is replicated via Zookeeper? Doesn't it
> >
>
> Again, this is observed behavior. When we issue a commit on a system
> with many solr cores using EFFs, the system blocks for a long time
> (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink
> from each core's index dir to the actual file, which is updated by an
> external process.

Hold on, I asked about ZooKeeper because the subject mentions SolrCloud.

Do you use SolrCloud, Solr shards, or are these cores just replicas of the
same index?
Also, about the symlink: don't you share that file via some NFS?

How many cores do you run per box?

Do the boxes have plenty of RAM to cache the filesystem besides the JVM
heaps?

I assume you use 64-bit Linux and mmap directory. Please confirm that.


>
>
> > causes scalability problem or long time to reload? Will it help if we'll
> > have, let's say ExternalDatabaseField which will pull values from jdbc.
> > ie.
> >
>
> I think the possibility of having some fields being retrieved from an
> external, dynamically updatable store would be really interesting. This
> could be JDBC, something in-memory like redis, or a NoSql product (e.g.
> Cassandra).

OK. Let's keep it in mind as a possible direction.

>
>
> > why all cores can't read these values simultaneously?
> >
>
> Again, this is a solr implementation detail that I can't answer :)
>
>
> > Can you confirm that IDs in the file is ordered by the index term order?
> >
>
> Yes, we sorted the files (standard UNIX sort).
>
>
> > AFAIK it can impact load time.
> >
> Yes, it does

OK, I see that you are aware of it, and that your IDs are just strings, not
integers.
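
For reference, here is a minimal sketch of producing a key-sorted file. It
assumes the usual EFF layout of one key=value pair per line and string keys.
The function name and the file name are made up for illustration. Python
compares strings by code point, which gives the same order as comparing
their UTF-8 bytes; that this is the order that matters for string terms in
the index is an assumption here, not something confirmed in this thread.

```python
def write_external_file(values, path):
    """Write key=value lines ordered by key.

    values: dict mapping document ID (str) to a float value.
    path:   target file; the name is hypothetical, not a Solr convention.
    """
    # sorted() on str keys orders by code point, i.e. the same order as
    # comparing the UTF-8 encoded bytes of the keys.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for doc_id in sorted(values):
            out.write(f"{doc_id}={values[doc_id]}\n")
```

With a plain UNIX sort, the result depends on the locale; running it with
LC_ALL=C is the usual way to get byte order rather than a locale-specific
collation.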


>
>
> > Regarding your post-query solution can you tell me if query found 10000
> > docs, but I need to display only first page with 100 rows, whether I need
> > to pull all 10K results to frontend to order them by the rank?
> >
> >
> In our architecture, the clients query an API that generates the SOLR
> query, retrieves the relevant additional fields that we need, and returns
> the relevant JSON to the front-end.
>
> In our use case, results are returned from SOLR by the 10's, not by the
> 1000's, so it is a manageable job. Even so, if solr returned thousands of
> results, it would be up to the implementation of the api to augment only
> the results that needed to be returned to the front-end.
>
> Even so, patching up a JSON structure with 10000 results should be
> possible.

You are right. I'm concerned anyway, because retrieving the whole result
set is expensive and not always possible.
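
To make the cheap case concrete, here is a rough sketch of augmenting only
the page that is actually displayed. It assumes the ordering has already
been done by Solr, so the external value is needed for display only. The
names augment_page and lookup_reads, and the "id" and "reads" keys, are
hypothetical; lookup_reads stands for whatever batch lookup the API layer
has against its external store.

```python
def augment_page(docs, lookup_reads, start=0, rows=100):
    """Attach the current popularity value to one page of results.

    docs:         results in the order returned by the search, as dicts
                  with an "id" key (hypothetical shape).
    lookup_reads: callable taking a list of IDs and returning a dict
                  {id: reads} (hypothetical external-store lookup).
    """
    page = docs[start:start + rows]
    # One batch lookup for the page, not for the whole result set.
    reads = lookup_reads([doc["id"] for doc in page])
    # Copy each doc so the caller's results are left untouched.
    return [dict(doc, reads=reads.get(doc["id"], 0)) for doc in page]
```

If the final order itself depended on the external value, this shortcut
would no longer apply and the whole result set would have to be pulled and
re-sorted, which is the expensive case.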


>
>
> > I'd really appreciate it if you could comment on the questions above.
> > PS: It's time to pitch, how much
> > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > ExternalFileField" can help you?
> >
> >
> It looks very interesting :) Does it make it possible to avoid re-reading
> the EFF on every commit, and only re-read the values that have actually
> changed?


You don't need a commit (with SOLR-4085) to reload the file content, but
after a commit you still need to read the whole file and scan all key terms
and postings. That's because the EFF sits on top of the top-level searcher;
it's the Solr-like way. At some point in the future we might have a
per-segment EFF. In that case adding a segment will still trigger a full
file scan, but in the index only the new segment will be scanned, which
should be faster. You know, straightforward sharing of internal data
structures between different index views/generations is not possible.
If you are asking about applying delta changes to the external file, that's
something we did ourselves: http://goo.gl/P8GFq . This feature is much more
doubtful and vague, although it might be the next contribution after
SOLR-4085.
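
To illustrate only the general idea of a delta, here is a small sketch that
merges a key-sorted base with a key-sorted set of changes, letting the
changed values win. It is not the implementation behind the link above and
not anything in SOLR-4085; merge_delta and its input shape are assumptions
made for the example.

```python
def merge_delta(base, delta):
    """Merge two key-sorted streams of (key, value) pairs.

    Both inputs must be sorted by key with unique keys. Pairs are yielded
    in key order; when a key appears in both, the delta value wins.
    """
    base_it, delta_it = iter(base), iter(delta)
    b, d = next(base_it, None), next(delta_it, None)
    while b is not None or d is not None:
        if d is None or (b is not None and b[0] < d[0]):
            # Only the base has this key (or the delta is exhausted).
            yield b
            b = next(base_it, None)
        elif b is None or d[0] < b[0]:
            # Key is new in the delta.
            yield d
            d = next(delta_it, None)
        else:
            # Same key on both sides: take the delta value.
            yield d
            b, d = next(base_it, None), next(delta_it, None)
```

Because both inputs are consumed in a single pass, the cost is linear in
the size of the base plus the delta, so this alone does not avoid reading
the whole file.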

>
> /Martin
>
>
> >
> > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <m...@issuu.com> wrote:
> >
> > > Solr 4.0 does support using EFFs, but it might not give you what you're
> > > hoping for.
> > >
> > > We tried using Solr Cloud, and have given up again.
> > >
> > > The EFF is placed in the parent of the index directory in each core; each
> > > core reads the entire EFF and picks out the IDs that it is responsible
> > > for.
> > >
> > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
> > > queries) while re-reading the EFF. Even worse, it seems that the time to
> > > re-read the EFF is multiplied by the number of cores in use (i.e. the EFF
> > > is re-read by each core sequentially). The contents of the EFF become
> > > active after the first EXTERNAL commit (commitWithin does NOT work here)
> > > after the file has been updated.
> > >
> > > In our case, the EFF was quite large - around 450MB - and we use 16
> > > shards, so when we triggered an external commit to force re-reading, the
> > > whole system would block for several (10-15) minutes. This won't work in
> > > a production environment. The reason for the size of the EFF is that we
> > > have around 7M documents in the index; each document has a 45 character
> > > ID.
> > >
> > > We got some help to try to fix the problem so that the re-read of the EFF
> > > proceeds in the background (see
> > > here<https://issues.apache.org/jira/browse/SOLR-3985> for
> > > a fix on the 4.1 branch). However, even though the re-read proceeds in
> > > the background, the time required to launch solr now takes at least as
> > > long as re-reading the EFFs. Again, this is not good enough for our needs.
> > >
> > > The next issue is that you cannot sort on EFF fields (though you can
> > > return them as values using &fl=field(my_eff_field)). This is also fixed
> > > in the 4.1 branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
> > >
> > > So: Even after these fixes, EFF performance is not that great. Our
> > > solution is as follows: The actual value of the popularity measure (say,
> > > reads) that we want to report to the user is inserted into the search
> > > response post-query by our query front-end. This value will then be the
> > > authoritative value at the time of the query. The value of the popularity
> > > measure that we use for boosting in the ranking of the search results is
> > > only updated when the value has changed enough so that the impact on the
> > > boost will be significant (say, more than 2%). This does require frequent
> > > re-indexing of the documents that have significant changes in the number
> > > of reads, but at least we won't have to update a document if it moves
> > > from, say, 1000000 to 1000001 reads.
> > >
> > > /Martin Koch - ISSUU - senior systems architect.
> > >
> > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simo...@apache.org> wrote:
> > >
> > > > Hi all,
> > > > I'm planning to move a quite big Solr index to SolrCloud. However, in
> > > > this index, an external file field is used for popularity ranking.
> > > >
> > > > Does SolrCloud support external file fields? How does it cope with
> > > > sharding and replication? Where should the external file be placed now
> > > > that the index folder is not local but in the cloud?
> > > >
> > > > Are there otherwise other best practices to deal with the use cases
> > > > external file fields were used for, like popularity/ranking, in
> > > > SolrCloud? Custom ValueSources going to something external?
> > > >
> > > > Thanks in advance,
> > > > Simone
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> >  <mkhlud...@griddynamics.com>
> >
