Caching of dynamic external file fields

2018-06-28 Thread Zisis T.
In Solr there's ExternalFileFieldReloader, which is responsible for caching
the contents of external files whenever a new searcher is warmed up.

It happens that I've defined a dynamic field to be used as an
ExternalFileField (the schema snippet was stripped by the list archive).

If you have a look inside ExternalFileFieldReloader, there's a loop over the
explicit schema fields via schema.getFields(), which means that dynamic
external fields do not get cached.

Of course, I only noticed this once those files started getting bigger
(~50MB) and searches started timing out.
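For reference, the stripped snippet was presumably something along these
lines — a guess modeled on the Ref Guide's ExternalFileField example, not the
poster's actual schema (names invented):

```xml
<fieldType name="extfile" keyField="id" defVal="0" stored="false"
           indexed="false" class="solr.ExternalFileField"/>
<dynamicField name="*_file" type="extfile"/>
```

Explicit fields of type extfile would be picked up by ExternalFileFieldReloader;
fields matching only the dynamicField pattern would not.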

Q1 : Is there a specific reason why only explicit fields are cached?
Q2 : I have extended /ExternalFileFieldReloader/ and added the capability to
also cache dynamic external file field contents and it seems to be working
fine. Does anyone think this is useful to make it into Solr codebase?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: External file fields

2018-02-02 Thread Chris Hostetter

: Interesting. I will definitely explore this. Just so I'm clear, we can 
: sort on docValues, but not filter? Is there any situation where external 
: file fields would work better than docValues?

For most field types that support docValues, you can still filter on them 
even if indexed="false" -- but the filtering may not be as efficient 
as with indexed values. For numeric fields you certainly can.

One situation where ExternalFileField would probably be preferable to doing 
in-place updates on docValues is when you know you need to update the value 
for *every* document in your collection in batch -- for large 
collections, looping over every doc and sending an atomic update would 
probably be slower than just replacing the external file.

Another example where I would probably choose external file field over 
docValues is if the "keyField" was not the same as my uniqueKey field ... 
ie: if I have millions of documents, each with a category_id that has a 
cardinality of ~100 categories, I could use 
the category_id field as the keyField to associate every doc with some 
numeric "category_rank" value (that varies only per category). If I 
need/want to tweak 1 of those 100 category_rank values, updating the 
entire external file just to change that 1 value is still probably much 
easier than redundantly putting that category_rank field in every 
doc and sending an atomic update to all ~10K docs that share the 
category_id whose category_rank I want to change.
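For context (an illustration, not from the original mail): a Solr external
file is just key=value lines keyed on keyField, so the ~100-line file Hoss
describes would map each category_id to its category_rank, roughly:

```text
101=1.4
102=0.9
103=2.1
```

Tweaking one category's rank is then a one-line edit to this file.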


: 
: -Original Message-
: From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
: Sent: Friday, February 2, 2018 12:24 PM
: To: solr-user@lucene.apache.org
: Subject: RE: External file fields
: 
: 
: : I did look into updatable docValues, but my understanding is that the
: : field has to be non-indexed (indexed="false"). I need to be able to sort
: : on these values. External field fields are sortable.
: 
: You can absolutely sort on a field that is docValues="true" 
: indexed="false" ... that is much more efficient than sorting on a field 
: that is docValues="false" indexed="true" -- in the latter case Solr has to 
: build a fieldcache (aka: run-time mock docValues) from the indexed values 
: the first time you try to sort on the field after a searcher is opened
: 
: 
: 
: -Hoss
: http://www.lucidworks.com/
: 

-Hoss
http://www.lucidworks.com/


RE: External file fields

2018-02-02 Thread Brian Yee
Interesting. I will definitely explore this. Just so I'm clear, we can sort on 
docValues, but not filter? Is there any situation where external file fields 
would work better than docValues?

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, February 2, 2018 12:24 PM
To: solr-user@lucene.apache.org
Subject: RE: External file fields


: I did look into updatable docValues, but my understanding is that the
: field has to be non-indexed (indexed="false"). I need to be able to sort
: on these values. External field fields are sortable.

You can absolutely sort on a field that is docValues="true" 
indexed="false" ... that is much more efficient than sorting on a field that is 
docValues="false" indexed="true" -- in the latter case Solr has to build a 
fieldcache (aka: run-time mock docValues) from the indexed values the first 
time you try to sort on the field after a searcher is opened



-Hoss
http://www.lucidworks.com/


RE: External file fields

2018-02-02 Thread Chris Hostetter

: I did look into updatable docValues, but my understanding is that the 
: field has to be non-indexed (indexed="false"). I need to be able to sort 
: on these values. External field fields are sortable.

You can absolutely sort on a field that is docValues="true" 
indexed="false" ... that is much more efficient than sorting on a field 
that is docValues="false" indexed="true" -- in the latter case Solr has to 
build a fieldcache (aka: run-time mock docValues) from the indexed values 
the first time you try to sort on the field after a searcher is opened
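As a concrete sketch (field name and type are mine, assuming a Solr 7-style
schema, not something from this thread), the sortable-but-unindexed field
Hoss describes would be declared as:

```xml
<!-- sortable/filterable via docValues; never searched directly, so not indexed -->
<field name="category_rank" type="pfloat" indexed="false" stored="false"
       docValues="true"/>
```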



-Hoss
http://www.lucidworks.com/


Re: External file fields

2018-02-02 Thread Emir Arnautović
Hi Brian,
You should be able to sort on a field that has docValues only.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 2 Feb 2018, at 16:53, Brian Yee <b...@wayfair.com> wrote:
> 
> Hello Erick,
> 
> I did look into updatable docValues, but my understanding is that the field 
> has to be non-indexed (indexed="false"). I need to be able to sort on these 
> values. External field fields are sortable.
> https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates
> 
> 
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Thursday, February 1, 2018 5:00 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: External file fields
> 
> Have you considered updateable docValues?
> 
> Best,
> Erick
> 
> On Thu, Feb 1, 2018 at 10:55 AM, Brian Yee <b...@wayfair.com> wrote:
>> Hello,
>> 
>> I want to use external file field to store frequently changing inventory and 
>> price data. I got a proof of concept working with a mock text file and this 
>> will suit my needs.
>> 
>> What is the best way to keep this file updated in a fast way. Ideally I 
>> would like to read changes from a Kafka queue and write to the file. But it 
>> seems like I would have to open the whole file, read the whole file, find 
>> the line I want to change, and write the whole file for every change. Is 
>> there a better way to do that? That approach seems like it would be 
>> difficult/slow if the file is several million lines long.
>> 
>> Also, once I come up with a way to update the file quickly, what is the best 
>> way to distribute the file to all the different solrcloud nodes in the 
>> correct directory?



RE: External file fields

2018-02-02 Thread Brian Yee
Hello Erick,

I did look into updatable docValues, but my understanding is that the field has 
to be non-indexed (indexed="false"). I need to be able to sort on these values. 
External field fields are sortable.
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, February 1, 2018 5:00 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: External file fields

Have you considered updateable docValues?

Best,
Erick

On Thu, Feb 1, 2018 at 10:55 AM, Brian Yee <b...@wayfair.com> wrote:
> Hello,
>
> I want to use external file field to store frequently changing inventory and 
> price data. I got a proof of concept working with a mock text file and this 
> will suit my needs.
>
> What is the best way to keep this file updated in a fast way. Ideally I would 
> like to read changes from a Kafka queue and write to the file. But it seems 
> like I would have to open the whole file, read the whole file, find the line 
> I want to change, and write the whole file for every change. Is there a 
> better way to do that? That approach seems like it would be difficult/slow if 
> the file is several million lines long.
>
> Also, once I come up with a way to update the file quickly, what is the best 
> way to distribute the file to all the different solrcloud nodes in the 
> correct directory?


Re: External file fields

2018-02-02 Thread Charlie Hull

On 01/02/2018 18:55, Brian Yee wrote:

Hello,

I want to use external file field to store frequently changing inventory and 
price data. I got a proof of concept working with a mock text file and this 
will suit my needs.

What is the best way to keep this file updated in a fast way. Ideally I would 
like to read changes from a Kafka queue and write to the file. But it seems 
like I would have to open the whole file, read the whole file, find the line I 
want to change, and write the whole file for every change. Is there a better 
way to do that? That approach seems like it would be difficult/slow if the file 
is several million lines long.

Also, once I come up with a way to update the file quickly, what is the best 
way to distribute the file to all the different solrcloud nodes in the correct 
directory?

Another approach would be the XJoin plugin we wrote - if you wait a few 
days we should have an updated patch for Solr v6.5 and possibly v7. 
XJoin lets you filter/join/rank Solr results using an external data source.


http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/
http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/


Cheers

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: External file fields

2018-02-02 Thread Emir Arnautović
Maybe you can try or extend Sematext’s Redis parser: 
https://github.com/sematext/solr-redis. The downside of this approach is 
another moving part: Redis.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 1 Feb 2018, at 19:55, Brian Yee  wrote:
> 
> Hello,
> 
> I want to use external file field to store frequently changing inventory and 
> price data. I got a proof of concept working with a mock text file and this 
> will suit my needs.
> 
> What is the best way to keep this file updated in a fast way. Ideally I would 
> like to read changes from a Kafka queue and write to the file. But it seems 
> like I would have to open the whole file, read the whole file, find the line 
> I want to change, and write the whole file for every change. Is there a 
> better way to do that? That approach seems like it would be difficult/slow if 
> the file is several million lines long.
> 
> Also, once I come up with a way to update the file quickly, what is the best 
> way to distribute the file to all the different solrcloud nodes in the 
> correct directory?



Re: External file fields

2018-02-01 Thread Erick Erickson
Have you considered updateable docValues?

Best,
Erick

On Thu, Feb 1, 2018 at 10:55 AM, Brian Yee  wrote:
> Hello,
>
> I want to use external file field to store frequently changing inventory and 
> price data. I got a proof of concept working with a mock text file and this 
> will suit my needs.
>
> What is the best way to keep this file updated in a fast way. Ideally I would 
> like to read changes from a Kafka queue and write to the file. But it seems 
> like I would have to open the whole file, read the whole file, find the line 
> I want to change, and write the whole file for every change. Is there a 
> better way to do that? That approach seems like it would be difficult/slow if 
> the file is several million lines long.
>
> Also, once I come up with a way to update the file quickly, what is the best 
> way to distribute the file to all the different solrcloud nodes in the 
> correct directory?


External file fields

2018-02-01 Thread Brian Yee
Hello,

I want to use external file field to store frequently changing inventory and 
price data. I got a proof of concept working with a mock text file and this 
will suit my needs.

What is the best way to keep this file updated in a fast way. Ideally I would 
like to read changes from a Kafka queue and write to the file. But it seems 
like I would have to open the whole file, read the whole file, find the line I 
want to change, and write the whole file for every change. Is there a better 
way to do that? That approach seems like it would be difficult/slow if the file 
is several million lines long.

Also, once I come up with a way to update the file quickly, what is the best 
way to distribute the file to all the different solrcloud nodes in the correct 
directory?
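For what it's worth, one common pattern for this problem (a sketch under
assumed file names, not something from this thread): keep the current values
in an in-memory dict, merge each batch of updates (e.g. one Kafka poll,
stubbed out here), and atomically rewrite the whole file on a timer rather
than once per change.

```python
# Sketch: batch updates in memory and rewrite the external file in one pass.
import os
import tempfile

def apply_updates(values: dict, updates: dict) -> None:
    """Merge a batch of key -> score updates into the in-memory copy."""
    values.update(updates)

def flush(values: dict, path: str) -> None:
    """Atomically rewrite the file in Solr's key=value external-file format."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        for key in sorted(values):
            f.write(f"{key}={values[key]}\n")
    os.replace(tmp, path)  # rename is atomic: readers see old or new file

values = {"1": 0.5, "2": 1.5}
apply_updates(values, {"2": 2.0, "3": 9.9})   # e.g. one Kafka poll() batch
flush(values, "external_price.txt")
```

The rename via os.replace keeps readers from ever seeing a half-written
file; distributing the result to every SolrCloud node still has to happen
separately, as the question notes.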


Re: Real Time Search and External File Fields

2016-10-10 Thread Mike Lissner
Thanks for the replies. I made the changes so that the external file field
is loaded when a searcher warms up (the solrconfig.xml snippet was stripped
by the list archive).

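For future readers, the listener config Erick described would look roughly
like the following in solrconfig.xml. This is a hedged reconstruction, not
Mike's actual config: the field name external_pagerank comes from later in
the thread, and the sort-by-function form is one way to force the external
file to load.

```xml
<!-- Warm new (and first) searchers with a query that touches the external
     file field, so it loads in the background while the old searcher
     keeps serving queries. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">field(external_pagerank) desc</str>
    </lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">field(external_pagerank) desc</str>
    </lst>
  </arr>
</listener>
```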
Re: Real Time Search and External File Fields

2016-10-09 Thread Shawn Heisey
On 10/8/2016 1:18 PM, Mike Lissner wrote:
> I want to make sure I understand this properly and document this for
> futurepeople that may find this thread. Here's what I interpret your
> advice to be:
> 0. Slacken my auto soft commit interval to something more like a minute. 

Yes, I would do this.  I would also increase autoCommit to something
between one and five minutes, with openSearcher set to false.  There's
nothing *wrong* with 15 seconds for autoCommit, but I want my server to
be doing less work during normal operation.

To answer a question you posed in a later message: Yes, it's common for
users to have a longer interval on autoSoftCommit than autoCommit. 
Remember the mantra in the URL about understanding commits:  Hard
commits are about durability, soft commits are about visibility.  Hard
commits when openSearcher is false are almost always *very* fast, so
it's typically not much of a burden to have them happen more frequently,
and thus have a better data durability guarantee.  Like I said above, I
generally use an autoCommit value between one and five minutes.
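In solrconfig.xml terms, the settings Shawn describes would look something
like this (a sketch; the exact intervals are whatever you can stand):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>300000</maxTime>      <!-- hard commit every 5 min, durability -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>60000</maxTime>       <!-- soft commit every 1 min, visibility -->
  </autoSoftCommit>
</updateHandler>
```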

> I'm a bit confused about the example autowarmcount for the caches, which is
> 0. Why not set this to something higher? I guess it's a RAM utilization vs.
> speed tradeoff? A low number like 16 seems like it'd have minimal impact on
> RAM?

A low autowarmCount is generally chosen for one reason: commit speed. 
If the example configs have it set to zero, I'm sure this was done so
commits would proceed as fast as possible.  Large values can turn
opening a new searcher into a process that can take *minutes*.

On my index shards, the autowarmCount on my filterCache is *four*. 
That's it -- execute only four of the most recent filters in the cache
when a new searcher opens.  That warming *still* sometimes takes as long
as 20 seconds on the larger shards.  The filters used in queries on my
indexes are very large and very complex, and can match millions of
documents.  Pleading with the dev team to decrease query complexity
doesn't help.

On the idea of reusing the external file data when it doesn't change:  I
do not know if this is possible.  I have no idea how Solr and Lucene use
the data found in the external file, so it might be completely necessary
to re-load it every time.  You can open an issue in Jira to explore the
idea, but don't be too surprised if it doesn't go anywhere.

Thanks,
Shawn



Re: Real Time Search and External File Fields

2016-10-08 Thread Erick Erickson
I chose 16 as a place to start. You usually reach diminishing returns
pretty quickly; I feel it's a mistake to set your autowarm counts to, say,
256 (and I've seen this in the thousands) unless you have some proof
that it's useful to bump them higher.

But certainly if you set them to 16 and see spikes just after a searcher
is opened that aren't tolerable, feel free to make them larger.

You've hit on exactly why newSearcher and firstSearcher are there.
The theory behind autowarm counts is that the last N entries are
likely to be useful in the near future. There's no guarantee at all that
this is true and newSearcher/firstSearcher are certain to exercise
what _you_ think is most important.

As for why autowarm counts are set to 0 in the examples, there's no
overarching reason. Certainly if the soft commit interval is 1 second,
autowarming
is largely useless so having it also at 0 makes sense.

Best,
Erick

On Sat, Oct 8, 2016 at 12:31 PM, Walter Underwood  wrote:
> With time-oriented data, you can use an old trick (goes back to Infoseek in 
> 1995).
>
> Make a “today” collection that is very fresh. Nightly, migrate new
> documents to the “not today” collection. The today collection will be
> small and can be updated quickly. The archive collection will be large
> and slow to update, but who cares?
>
> You can also send all docs to both collections and de-dupe.
>
> Every night, you start over with the “today” collection.
>
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Oct 8, 2016, at 12:18 PM, Mike Lissner  
>> wrote:
>>
>> On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson 
>> wrote:
>>
>>> What you haven't mentioned is how often you add new docs. Is it once a
>>> day? Steadily
>>> from 8:00 to 17:00?
>>>
>>
>> Alas, it's a steady trickle during business hours. We're ingesting court
>> documents as they're posted on court websites, then sending alerts as soon
>> as possible.
>>
>>
>>> Whatever, your soft commit really should be longer than your autowarm
>>> interval. Configure
>>> autowarming to reference queries (firstSearcher or newSearcher events
>>> or autowarm
>>> counts in queryResultCache and filterCache. Say 16 in each of these
>>> latter for a start) such
>>> that they cause the external file to load. That _should_ prevent any
>>> queries from being
>>> blocked since the autowarming will happen in the background and while
>>> it's happening
>>> incoming queries will be served by the old searcher.
>>>
>>
>> I want to make sure I understand this properly and document this for future
>> people that may find this thread. Here's what I interpret your advice to be:
>>
>> 0. Slacken my auto soft commit interval to something more like a minute.
>>
>> 1. Set up a query in the newSearcher listener that uses my external file
>> field.
>> 1a. Do the same in firstSearcher if I want newly started solr to warm up
>> before getting queries (this doesn't matter to me, so I'm skipping this).
>>
>> and/or
>>
>> 2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
>> top 16 query results from the previous searcher are regenerated in the new
>> searcher.
>>
>> Doing #1 seems like a safe strategy since it's guaranteed to hit the
>> external file field. #2 feels like a bonus.
>>
>> I'm a bit confused about the example autowarmcount for the caches, which is
>> 0. Why not set this to something higher? I guess it's a RAM utilization vs.
>> speed tradeoff? A low number like 16 seems like it'd have minimal impact on
>> RAM?
>>
>> Thanks for all the great replies and for everything you do for Solr. I
>> truly appreciate your efforts.
>>
>> Mike
>


Re: Real Time Search and External File Fields

2016-10-08 Thread Walter Underwood
With time-oriented data, you can use an old trick (goes back to Infoseek in 
1995).

Make a “today” collection that is very fresh. Nightly, migrate new documents to 
the “not today” collection. The today collection will be small and can be 
updated quickly. The archive collection will be large and slow to update, but 
who cares?

You can also send all docs to both collections and de-dupe.

Every night, you start over with the “today” collection.

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 8, 2016, at 12:18 PM, Mike Lissner  
> wrote:
> 
> On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson 
> wrote:
> 
>> What you haven't mentioned is how often you add new docs. Is it once a
>> day? Steadily
>> from 8:00 to 17:00?
>> 
> 
> Alas, it's a steady trickle during business hours. We're ingesting court
> documents as they're posted on court websites, then sending alerts as soon
> as possible.
> 
> 
>> Whatever, your soft commit really should be longer than your autowarm
>> interval. Configure
>> autowarming to reference queries (firstSearcher or newSearcher events
>> or autowarm
>> counts in queryResultCache and filterCache. Say 16 in each of these
>> latter for a start) such
>> that they cause the external file to load. That _should_ prevent any
>> queries from being
>> blocked since the autowarming will happen in the background and while
>> it's happening
>> incoming queries will be served by the old searcher.
>> 
> 
> I want to make sure I understand this properly and document this for future
> people that may find this thread. Here's what I interpret your advice to be:
> 
> 0. Slacken my auto soft commit interval to something more like a minute.
> 
> 1. Set up a query in the newSearcher listener that uses my external file
> field.
> 1a. Do the same in firstSearcher if I want newly started solr to warm up
> before getting queries (this doesn't matter to me, so I'm skipping this).
> 
> and/or
> 
> 2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
> top 16 query results from the previous searcher are regenerated in the new
> searcher.
> 
> Doing #1 seems like a safe strategy since it's guaranteed to hit the
> external file field. #2 feels like a bonus.
> 
> I'm a bit confused about the example autowarmcount for the caches, which is
> 0. Why not set this to something higher? I guess it's a RAM utilization vs.
> speed tradeoff? A low number like 16 seems like it'd have minimal impact on
> RAM?
> 
> Thanks for all the great replies and for everything you do for Solr. I
> truly appreciate your efforts.
> 
> Mike



Re: Real Time Search and External File Fields

2016-10-08 Thread Mike Lissner
On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson 
wrote:

> What you haven't mentioned is how often you add new docs. Is it once a
> day? Steadily
> from 8:00 to 17:00?
>

Alas, it's a steady trickle during business hours. We're ingesting court
documents as they're posted on court websites, then sending alerts as soon
as possible.


> Whatever, your soft commit really should be longer than your autowarm
> interval. Configure
> autowarming to reference queries (firstSearcher or newSearcher events
> or autowarm
> counts in queryResultCache and filterCache. Say 16 in each of these
> latter for a start) such
> that they cause the external file to load. That _should_ prevent any
> queries from being
> blocked since the autowarming will happen in the background and while
> it's happening
> incoming queries will be served by the old searcher.
>

I want to make sure I understand this properly and document this for future
people that may find this thread. Here's what I interpret your advice to be:

0. Slacken my auto soft commit interval to something more like a minute.

1. Set up a query in the newSearcher listener that uses my external file
field.
1a. Do the same in firstSearcher if I want newly started solr to warm up
before getting queries (this doesn't matter to me, so I'm skipping this).

and/or

2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
top 16 query results from the previous searcher are regenerated in the new
searcher.

Doing #1 seems like a safe strategy since it's guaranteed to hit the
external file field. #2 feels like a bonus.

I'm a bit confused about the example autowarmcount for the caches, which is
0. Why not set this to something higher? I guess it's a RAM utilization vs.
speed tradeoff? A low number like 16 seems like it'd have minimal impact on
RAM?

Thanks for all the great replies and for everything you do for Solr. I
truly appreciate your efforts.

Mike


Re: Real Time Search and External File Fields

2016-10-08 Thread Mike Lissner
On Sat, Oct 8, 2016 at 8:46 AM Shawn Heisey  wrote:

> > Most soft commit
> > documentation talks about setting up soft commits with <maxTime> of about a
> > second.
>
> IMHO any documentation that recommends autoSoftCommit with a maxTime of
> one second is bad documentation, and needs to be fixed.  Where have you
> seen such a recommendation?


You know, I must have made that up, sorry. But the documentation you linked
to (on the Lucidworks blog) and the example file say 15 seconds for hard
commits, so I think that got me thinking that soft commits could be more
frequent.

Should soft commits be less frequent than hard commits
(opensearcher=False)? If so, I didn't find that to be at all clear.


> right now Solr/Lucene has no
> way of knowing that your external file has not changed, so it must read
> the file every time it builds a searcher.


Is it crazy to file a feature request asking that Solr/Lucene keep the
modtime of this file and only reload it if it has changed? Seems like an easy
win.
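The modtime idea can be sketched in a few lines (illustrative Python, not
Solr code; the key=value parsing mirrors the external file format):

```python
# Sketch: re-parse the external file only when its modification time changes.
import os

class CachedExternalFile:
    def __init__(self, path: str):
        self.path = path
        self.mtime = None
        self.values = {}
        self.loads = 0  # how many times we actually re-parsed the file

    def get(self) -> dict:
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:          # changed (or first call): reload
            with open(self.path) as f:
                self.values = dict(line.rstrip("\n").split("=", 1)
                                   for line in f if "=" in line)
            self.mtime = mtime
            self.loads += 1
        return self.values               # unchanged: reuse the cached dict
```

Every searcher open would then pay the parse cost only in the month the file
actually changes, at the price of one stat() call otherwise.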


>  I doubt this feature was
> designed to deal well with an extremely large external file like yours.
>

Perhaps not. It's probably worth mentioning that part of the reason the
file is so large is because pagerank uses very small and accurate floats.
So a typical line is:

1=9.50539603222e-08

Not something smaller like:

1=3.2

Pagerank also provides a value for every item in the index, so that makes
the file long. I'd suspect that anybody with a pagerank boosted index of
moderate size would have a similarly-sized file.


> If the info changes that infrequently, can you just incorporate it
> directly into the index with a standard field, with the info coming in
> as a part of your normal indexing process?


We've considered that, but whenever you re-run pagerank, it updates EVERY
value. So I guess we could try updating every doc in our index whenever we
run pagerank, but that's a nasty solution.


> It seems unlikely that Solr would stop serving queries while setting up
> a new searcher.  The old searcher should continue to serve requests
> until the new searcher is ready.  If this is happening, that definitely
> seems like a bug.
>

I'm positive I've observed this, though you're right, some queries still
seem to come through. Is it possible that queries relying on the field are
stopped while the field is loading? I've observed this two ways:

1. From the front end, things were stalling every time I was doing a hard
commit (opensearcher=true). I had hard commits coming in every ten minutes
via cron job, and sure enough, at ten, twenty, thirty...minutes after every
hour, I'd see stalls.

2. Watching the logs, I saw a flood of queries come through after the line:

Loaded external value source external_pagerank

Some queries were coming through before this line, but I think none of
those queries use the external file field (external_pagerank).

Mike


Re: Real Time Search and External File Fields

2016-10-08 Thread Shawn Heisey
On 10/7/2016 6:19 PM, Mike Lissner wrote:
> Soft commits seem to be exactly the thing for this, but whenever I open a
> new searcher (which soft commits seem to do), the external file is
> reloaded, and all queries are halted until it finishes loading. When I just
> measured, this took about 30 seconds to complete. Most soft commit
> documentation talks about setting up soft commits with <maxTime> of about a
> second.

IMHO any documentation that recommends autoSoftCommit with a maxTime of
one second is bad documentation, and needs to be fixed.  Where have you
seen such a recommendation?  Unless the index is extremely small and has
been thoroughly optimized for NRT (which usually means *no*
autowarming), achieving commit times of less than one second is usually
not possible.  This is the page that usually comes out when people start
talking about commits:

http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

On the topic of one-second commit latency, that page has this to say:
"Set your soft commit interval to as long as you can stand. Don’t listen
to your product manager who says “we need no more than 1 second
latency”. Really. Push back hard and see if the /user/ is best served or
will even notice. Soft commits and NRT are pretty amazing, but they’re
not free."

The kind of intervals for autocommit and autosoftcommit that I like to
see is at LEAST one minute, and preferably longer if you can stand it to
be longer.

> Is there anything I can do to make the external file field not get reloaded
> constantly? It only changes about once a month, and I want to use soft
> commits to power the alerts feature.

Anytime you want changes to show up in your index, you need a new
searcher.  When you're using an external file field, part of that info
will come from that external source, and right now Solr/Lucene has no
way of knowing that your external file has not changed, so it must read
the file every time it builds a searcher.  I doubt this feature was
designed to deal well with an extremely large external file like yours. 
The code looks like it goes line by line reading the file, and although
I'm sure that process has been optimized as far as it can be, it still
takes a lot of time when there are millions of lines.

If the info changes that infrequently, can you just incorporate it
directly into the index with a standard field, with the info coming in
as a part of your normal indexing process?  I'm sure the performance
would be MUCH better if Solr didn't have to reference the external file.

It seems unlikely that Solr would stop serving queries while setting up
a new searcher.  The old searcher should continue to serve requests
until the new searcher is ready.  If this is happening, that definitely
seems like a bug.

Thanks,
Shawn



Re: Real Time Search and External File Fields

2016-10-07 Thread Erick Erickson
bq: Most soft commit
documentation talks about setting up soft commits with <maxTime> of about a
second.

I think this is really a consequence of this being included in the
example configs
for illustrative purposes, personally I never liked this.

There is no one right answer. I've seen soft commit intervals from -1
(never soft commit) to 1 second. The latter means almost all of your
caches are totally useless and might as well be turned off, usually.

What you haven't mentioned is how often you add new docs. Is it once a
day? Steadily
from 8:00 to 17:00? All in three hours in the morning?

Whatever, your soft commit really should be longer than your autowarm
interval. Configure
autowarming to reference queries (firstSearcher or newSearcher events
or autowarm
counts in queryResultCache and filterCache. Say 16 in each of these
latter for a start) such
that they cause the external file to load. That _should_ prevent any
queries from being
blocked since the autowarming will happen in the background and while
it's happening
incoming queries will be served by the old searcher.

Best,
Erick

On Fri, Oct 7, 2016 at 5:19 PM, Mike Lissner
 wrote:
> I have an index of about 4M documents with an external file field
> configured to do boosting based on pagerank scores of each document. The
> pagerank file is about 93MB as of today -- it's pretty big.
>
> Each day, I add about 1,000 new documents to the index, and I need them to
> be available as soon as possible so that I can send out alerts to our users
> about new content (this is Google Alerts, essentially).
>
> Soft commits seem to be exactly the thing for this, but whenever I open a
> new searcher (which soft commits seem to do), the external file is
> reloaded, and all queries are halted until it finishes loading. When I just
> measured, this took about 30 seconds to complete. Most soft commit
> documentation talks about setting up soft commits with <maxTime> of about a
> second.
>
> Is there anything I can do to make the external file field not get reloaded
> constantly? It only changes about once a month, and I want to use soft
> commits to power the alerts feature.
>
> Thanks,
>
> Mike


Real Time Search and External File Fields

2016-10-07 Thread Mike Lissner
I have an index of about 4M documents with an external file field
configured to do boosting based on pagerank scores of each document. The
pagerank file is about 93MB as of today -- it's pretty big.

Each day, I add about 1,000 new documents to the index, and I need them to
be available as soon as possible so that I can send out alerts to our users
about new content (this is Google Alerts, essentially).

Soft commits seem to be exactly the thing for this, but whenever I open a
new searcher (which soft commits seem to do), the external file is
reloaded, and all queries are halted until it finishes loading. When I just
measured, this took about 30 seconds to complete. Most soft commit
documentation talks about setting up soft commits with a <maxTime> of about a
second.

Is there anything I can do to make the external file field not get reloaded
constantly? It only changes about once a month, and I want to use soft
commits to power the alerts feature.

Thanks,

Mike
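For context, the pagerank file Mike describes follows Solr's ExternalFileField format: one `key=value` line per document, in a file named `external_<fieldname>` in the index's data directory. A minimal sketch of producing one; the file name, field, and doc keys here are hypothetical:

```python
def write_external_file(path, scores):
    """Write an ExternalFileField data file: one 'docKey=floatValue'
    line per document. Sorting by key speeds up Solr's load of
    large files."""
    with open(path, "w") as f:
        for doc_id, score in sorted(scores.items()):
            f.write(f"{doc_id}={score}\n")

# Hypothetical pagerank scores keyed by the schema's keyField
write_external_file("external_pagerank", {"doc3": 0.12, "doc1": 4.5})
```

Replacing the whole file and reopening a searcher is then enough for the new values to take effect.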


Re: Question about external file fields

2013-12-06 Thread Stefan Matheis
I guess you refer to this post? 
http://1opensourcelover.wordpress.com/2013/07/02/solr-external-file-fields/

If so .. he already provides at least one possible use case:

*snip*

We use Solr to serve our company’s browse pages. Our browse pages are similar 
to how a typical Stackoverflow tag page looks. That “browse” page has the 
question title (which links to the actual page that contains the question, 
comments and replies), view count, snippet of the question text, questioner’s 
profile info, tags and time information. One thing that can change quite 
frequently on such a page is the view count. I believe Stackoverflow uses Redis 
to keep track of the view counts, but we have to currently manage this in Solr, 
since Solr is our only datastore to serve these browse pages.

The problem before Solr 4.0 was that you could not update a single field in a 
document. You have to form the entire document first (either by querying Solr 
or using an alternate data source which contains all the info), update the view 
count and then post the entire document to Solr. With Solr 4+, you can do 
atomic update of a single field – the Solr server internally handles fetching 
the entire document, updating the field and updating its index. But atomic 
update comes with some caveats – you must store all your Solr fields (other 
than copyFields), which can increase your storage space and enable updateLog, 
which can slow down Solr start-up.

For this specific problem of updating a field more frequently than the rest of 
the document, external file fields (EFFs) can come in quite handy. They have 
one main restriction though – you cannot use them in your queries directly i.e. 
they cannot be used in the q parameter directly. But we will see how we can 
circumvent this problem at least partially using function query hacks.

*/snip*

another case, out of my head, might be product pricing or updates on stock 
count.

- Stefan  


On Thursday, December 5, 2013 at 11:11 PM, yriveiro wrote:

 Hi,
  
 I read this post http://1opensourcelover.wordpress.com/ about EFFs and I
 found it very interesting.

 Can someone give me more use cases about the utility of EFFs?
  
 /Yago
  
  
  
 -
 Best regards
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Question-about-external-file-fields-tp4105213.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
  
  




Question about external file fields

2013-12-05 Thread yriveiro
Hi,

I read this post http://1opensourcelover.wordpress.com/ about EFFs and I
found it very interesting.

Can someone give me more use cases about the utility of EFFs?

/Yago



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-external-file-fields-tp4105213.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replicating files containing external file fields

2013-07-02 Thread Arun Rangarajan
Jack and Erick,
Thanks for your replies. I am able to replicate ext file fields by
specifying the relative paths for each individual file. confFiles in
solrconfig.xml is really long now with lot of ../ and I got 5 ext file
field files. Would be really nice if wild-cards were supported here :-).

About the reloadCache on slave: following
http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
I set up listeners to reload the ext file fields after commits. Since the
slave replicationHandler issues a commit after it replicates the files (as
mentioned in
https://wiki.apache.org/solr/SolrReplication#How_does_the_slave_replicate.3F),
I believe the ext file fields get reloaded to the slave cache after
replication. This is exactly what I was looking for.
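The listeners Arun refers to are the documented ExternalFileFieldReloader event listeners in solrconfig.xml, roughly:

```xml
<!-- solrconfig.xml: reload external file field caches whenever a new
     searcher is opened, e.g. after the post-replication commit -->
<listener event="newSearcher"
          class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher"
          class="org.apache.solr.schema.ExternalFileFieldReloader"/>
```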


On Fri, Jun 28, 2013 at 5:08 PM, Jack Krupansky j...@basetechnology.com wrote:

 Yes, you need to list that EFF file in the confFiles list - only those
 listed files will be replicated.

  <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>

 Oops... sorry, no wildcards... you must list the individual files.

  Technically, the path is supposed to be relative to the Solr collection
  conf directory, so you MAY have to put lots of ../ in the
  path to get to the files, like:

  ../../../../solr-data/List/external_1

  For each file.

 (This is what Erick was referring to.)

 Sorry, I don't have the answer to the reload question at the tip of my
 tongue.


 -- Jack Krupansky

 -Original Message- From: Arun Rangarajan
 Sent: Friday, June 28, 2013 7:42 PM

 To: solr-user@lucene.apache.org
 Subject: Re: Replicating files containing external file fields

 Jack,

 Here is the ReplicationHandler definition from solrconfig.xml:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="enable">${enable.master:false}</str>
      <str name="replicateAfter">startup</str>
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">optimize</str>
      <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml</str>
    </lst>
    <lst name="slave">
      <str name="enable">${enable.slave:false}</str>
      <str name="masterUrl">http://${master.ip}:${master.port}/solr/${solr.core.name}/replication</str>
      <str name="pollInterval">00:01:00</str>
    </lst>
  </requestHandler>

 The confFiles are under the dir:
 /var/solr/application-cores/**List/conf
 and the external file fields are like:
 /var/solr-data/List/external_*

 Should I add
 /var/solr-data/List/external_*
 to confFiles like this?

  <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>


 Also, can you tell me when (or whether) I need to do reloadCache on the
 slave after the ext file fields are replicated?

 Thx.


  On Fri, Jun 28, 2013 at 10:13 AM, Jack Krupansky j...@basetechnology.com
  wrote:

  Show us your confFiles directive. Maybe there is some subtle error in
 the file name.

 -- Jack Krupansky

 -Original Message- From: Arun Rangarajan
 Sent: Friday, June 28, 2013 1:06 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Replicating files containing external file fields


 Erick,
  Thx for your reply. The external file field files are already under
 dataDir specified in solrconfig.xml. They are not getting replicated.
 (Solr version 4.2.1.)


  On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson erickerick...@gmail.com
  wrote:


  Haven't tried this, but I _think_ you can use the

  confFiles trick with relative paths, see:
  http://wiki.apache.org/solr/SolrReplication


 Or just put your EFF files in the data dir?

 Best
 Erick


 On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan
  arunrangara...@gmail.com wrote:

   From https://wiki.apache.org/solr/SolrReplication I understand that

 index
  dir and any files under the conf dir can be replicated to slaves. I 
 want
 to
  know if there is any way the files under the data dir containing 
 external
  file fields can be replicated. These are not replicated by default.
  Currently we are running the ext file field reload script on both the
  master and the slave and then running reloadCache on each server once
 they
  are loaded.
 







Re: Replicating files containing external file fields

2013-06-28 Thread Arun Rangarajan
Erick,
Thx for your reply. The external file field files are already under
dataDir specified in solrconfig.xml. They are not getting replicated.
(Solr version 4.2.1.)


On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson erickerick...@gmail.com wrote:

 Haven't tried this, but I _think_ you can use the
 confFiles trick with relative paths, see:
 http://wiki.apache.org/solr/SolrReplication

 Or just put your EFF files in the data dir?

 Best
 Erick


 On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan
 arunrangara...@gmail.com wrote:

  From https://wiki.apache.org/solr/SolrReplication I understand that
 index
  dir and any files under the conf dir can be replicated to slaves. I want
 to
  know if there is any way the files under the data dir containing external
  file fields can be replicated. These are not replicated by default.
  Currently we are running the ext file field reload script on both the
  master and the slave and then running reloadCache on each server once
 they
  are loaded.
 



Re: Replicating files containing external file fields

2013-06-28 Thread Jack Krupansky
Show us your confFiles directive. Maybe there is some subtle error in the 
file name.


-- Jack Krupansky

-Original Message- 
From: Arun Rangarajan

Sent: Friday, June 28, 2013 1:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Replicating files containing external file fields

Erick,
Thx for your reply. The external file field files are already under
dataDir specified in solrconfig.xml. They are not getting replicated.
(Solr version 4.2.1.)


On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson 
erickerick...@gmail.com wrote:



Haven't tried this, but I _think_ you can use the
confFiles trick with relative paths, see:
http://wiki.apache.org/solr/SolrReplication

Or just put your EFF files in the data dir?

Best
Erick


On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan
arunrangara...@gmail.com wrote:

 From https://wiki.apache.org/solr/SolrReplication I understand that
index
 dir and any files under the conf dir can be replicated to slaves. I want
to
 know if there is any way the files under the data dir containing 
 external

 file fields can be replicated. These are not replicated by default.
 Currently we are running the ext file field reload script on both the
 master and the slave and then running reloadCache on each server once
they
 are loaded.






Re: Replicating files containing external file fields

2013-06-28 Thread Arun Rangarajan
Jack,

Here is the ReplicationHandler definition from solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://${master.ip}:${master.port}/solr/${solr.core.name}/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>

The confFiles are under the dir:
/var/solr/application-cores/List/conf
and the external file fields are like:
/var/solr-data/List/external_*

Should I add
/var/solr-data/List/external_*
to confFiles like this?

<str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>


Also, can you tell me when (or whether) I need to do reloadCache on the
slave after the ext file fields are replicated?

Thx.


On Fri, Jun 28, 2013 at 10:13 AM, Jack Krupansky j...@basetechnology.com wrote:

 Show us your confFiles directive. Maybe there is some subtle error in
 the file name.

 -- Jack Krupansky

 -Original Message- From: Arun Rangarajan
 Sent: Friday, June 28, 2013 1:06 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Replicating files containing external file fields


 Erick,
  Thx for your reply. The external file field files are already under
 dataDir specified in solrconfig.xml. They are not getting replicated.
 (Solr version 4.2.1.)


 On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson erickerick...@gmail.com
  wrote:

  Haven't tried this, but I _think_ you can use the
 confFiles trick with relative paths, see:
  http://wiki.apache.org/solr/SolrReplication

 Or just put your EFF files in the data dir?

 Best
 Erick


 On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan
  arunrangara...@gmail.com wrote:

   From https://wiki.apache.org/solr/SolrReplication I understand that
 index
  dir and any files under the conf dir can be replicated to slaves. I want
 to
  know if there is any way the files under the data dir containing 
 external
  file fields can be replicated. These are not replicated by default.
  Currently we are running the ext file field reload script on both the
  master and the slave and then running reloadCache on each server once
 they
  are loaded.
 





Re: Replicating files containing external file fields

2013-06-28 Thread Jack Krupansky
Yes, you need to list that EFF file in the confFiles list - only those 
listed files will be replicated.


<str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>

Oops... sorry, no wildcards... you must list the individual files.

Technically, the path is supposed to be relative to the Solr collection 
conf directory, so you MAY have to put lots of ../ in the 
path to get to the files, like:


../../../../solr-data/List/external_1

For each file.

(This is what Erick was referring to.)

Sorry, I don't have the answer to the reload question at the tip of my 
tongue.


-- Jack Krupansky

-Original Message- 
From: Arun Rangarajan

Sent: Friday, June 28, 2013 7:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Replicating files containing external file fields

Jack,

Here is the ReplicationHandler definition from solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://${master.ip}:${master.port}/solr/${solr.core.name}/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>

The confFiles are under the dir:
/var/solr/application-cores/List/conf
and the external file fields are like:
/var/solr-data/List/external_*

Should I add
/var/solr-data/List/external_*
to confFiles like this?

<str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>


Also, can you tell me when (or whether) I need to do reloadCache on the
slave after the ext file fields are replicated?

Thx.


On Fri, Jun 28, 2013 at 10:13 AM, Jack Krupansky 
j...@basetechnology.com wrote:



Show us your confFiles directive. Maybe there is some subtle error in
the file name.

-- Jack Krupansky

-Original Message- From: Arun Rangarajan
Sent: Friday, June 28, 2013 1:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Replicating files containing external file fields


Erick,
Thx for your reply. The external file field files are already under
dataDir specified in solrconfig.xml. They are not getting replicated.
(Solr version 4.2.1.)


On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson erickerick...@gmail.com
wrote:

 Haven't tried this, but I _think_ you can use the

confFiles trick with relative paths, see:
http://wiki.apache.org/solr/SolrReplication

Or just put your EFF files in the data dir?

Best
Erick


On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan
arunrangara...@gmail.com wrote:

 From https://wiki.apache.org/solr/SolrReplication I understand that

index
 dir and any files under the conf dir can be replicated to slaves. I 
 want

to
 know if there is any way the files under the data dir containing 
external
 file fields can be replicated. These are not replicated by default.
 Currently we are running the ext file field reload script on both the
 master and the slave and then running reloadCache on each server once
they
 are loaded.









Re: Replicating files containing external file fields

2013-06-27 Thread Erick Erickson
Haven't tried this, but I _think_ you can use the
confFiles trick with relative paths, see:
http://wiki.apache.org/solr/SolrReplication

Or just put your EFF files in the data dir?

Best
Erick


On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan
arunrangara...@gmail.com wrote:

 From https://wiki.apache.org/solr/SolrReplication I understand that index
 dir and any files under the conf dir can be replicated to slaves. I want to
 know if there is any way the files under the data dir containing external
 file fields can be replicated. These are not replicated by default.
 Currently we are running the ext file field reload script on both the
 master and the slave and then running reloadCache on each server once they
 are loaded.



Replicating files containing external file fields

2013-06-26 Thread Arun Rangarajan
From https://wiki.apache.org/solr/SolrReplication I understand that index
dir and any files under the conf dir can be replicated to slaves. I want to
know if there is any way the files under the data dir containing external
file fields can be replicated. These are not replicated by default.
Currently we are running the ext file field reload script on both the
master and the slave and then running reloadCache on each server once they
are loaded.