Re: Time-Routed Alias Not Distributing Wrongly Placed Docs

Gus Heck Thu, 29 Nov 2018 21:16:02 -0800

Hi John,

TRAs really do require that you index via the alias. Internally, the code
wraps the Distributed Update Processor with an additional processor that
handles the time routing when (and only when) a TRA alias is detected.
If the alias is not used, none of the TRA code runs (by design, for
performance). TRAs have no capability at all to re-assign docs once they
are indexed, since the process is data driven during update only, with
no internal maintenance threads (again by design). It is not even
supported at this time to change the date by which a document was routed,
via atomic updates for example. One would have to delete and re-index the
document (in that order, waiting for the delete to complete before
re-indexing!). Adding some sort of "fixer thread" is not something that
would make much sense, since we don't want TRAs ever storing documents in
the wrong place to begin with.
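
As a rough sketch of the "delete, then re-index" order described above,
something like the following could be used from an external client. It
relies on Solr's JSON update format posted to /solr/<name>/update with
commit=true, so the delete is committed before the add is sent. The
function names, the base URL, and the injectable "send" callable are
illustrative assumptions, and it also assumes that a delete addressed to
the alias reaches the collection holding the old copy, which is worth
verifying on your Solr version.

```python
import json
import urllib.request


def build_update_request(base_url, alias, payload):
    """Build a POST to the update handler, addressed to the alias (not a collection)."""
    url = f"{base_url}/solr/{alias}/update?commit=true"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def _post(request):
    """Default sender: perform the HTTP call and return the response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def delete_then_reindex(base_url, alias, doc, send=_post):
    """Delete the old copy first, then add the corrected document.

    The add is only sent after the delete call has returned.
    `doc` is a hypothetical document dict that contains an "id" field.
    """
    send(build_update_request(base_url, alias, {"delete": {"id": doc["id"]}}))
    send(build_update_request(base_url, alias, {"add": {"doc": doc}}))
```

Passing a different "send" callable makes it possible to log or dry-run
the two requests before pointing this at a real cluster.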

TRAs are targeted at systems where new data items arrive regularly, can be
routed to the right place up front, and have an immutable timestamp
(typical of IoT readings, logs, or event-based data, for example).
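
To illustrate what deciding the destination "up front" from the timestamp
means, here is a minimal conceptual sketch. It is not Solr's actual
routing code. It assumes monthly intervals and the alias_YYYY-MM-01
naming seen in the collection list in the quoted question; the function
name is made up for illustration.

```python
from datetime import datetime, timezone


def monthly_collection(alias, ts):
    """Return the name of the monthly collection whose interval contains ts.

    Naive timestamps are treated as UTC (an assumption of this sketch).
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc)
    return f"{alias}_{ts.year:04d}-{ts.month:02d}-01"
```

With the names from the question, a document stamped 15 October 2018
would map to coll_2018-10-01. The destination is a pure function of the
timestamp at update time, which is why a document written directly into
coll_2018-07-01 simply stays there.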

If TRAs still sound like they fit your use case, I think you will probably
need to follow up with Lucidworks to get them to add a feature allowing
TRAs as indexing targets (or pursue another solution without limitations
on the indexing target).

Frankly, it's a mystery to me how you even got any docs in the October
collection you list in your question. For anything to have been
distributed, it would have had to go through the alias. How you have
more than one collection is also a mystery, unless perhaps you manually
inserted a doc at some point, causing collection creation?

It's also worth noting that without the routing and maintenance features
tied to the alias, TRAs give very little benefit, and there are other ways
of solving this problem with external solutions. Dave, my co-presenter at
Activate 2018, talks about a couple of other options in the middle section
of our talk:
https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s

The part describing TRAs in detail starts at 14 min, and 17 to 23 min
discusses predecessors and alternatives.

-Gus

On Tue, Nov 27, 2018 at 12:42 PM John Nashorn <nashornj...@gmail.com> wrote:

> Hello Everyone,
> I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5,
> cloud mode). As written in the Solr Manual, TRA expects documents to be
> indexed using its alias name, and not directly into the collections under
> it. Unfortunately, hive-solr doesn't allow using TRA names as indexing
> targets. So what I do is: I index data using the first collection created
> by TRA and expect Solr to distribute my data into its respective collection
> under the hood. This works to some extent, but a big portion of data stays
> in where they were indexed, ie. the first collection of the TRA. For
> example (approximate numbers):
>
> * coll_2018-07-01 => 800.000.000 docs
> * coll_2018-08-01 => 0 docs
> * coll_2018-09-01 => 0 docs
> * coll_2018-10-01 => 150.000.000 docs
> * coll_2018-11-01 => 0 docs
>
> Here, coll_2018-07-01 contains data that should normally be in the other
> four collections.
>
> Is there a way to make TRA scan (somehow intentionally) misplaced data and
> send them to their correct places?
>

-- 
http://www.the111shift.com
