SolrDeletionPolicy & Core Reload

2021-01-02 Thread John Davis
Hi,

Does a core reload pick up changes to SolrDeletionPolicy in solrconfig.xml,
or does the Solr server need to be restarted?

And what would be the best way to check the current values of
SolrDeletionPolicy (e.g. maxCommitsToKeep, maxCommitAge) from the Solr admin
console?
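
(For inspecting the current values, one option, offered as a hedged SolrJ sketch
rather than a definitive answer: pull the effective config via the Config API.
The core name and URL are placeholders, and an explicitly configured
deletionPolicy should show up under indexConfig in the response.)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class ShowEffectiveConfig {
  public static void main(String[] args) throws Exception {
    // Placeholder core name "mycore"; point this at the core in question.
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      GenericSolrRequest req =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/config", new ModifiableSolrParams());
      NamedList<Object> rsp = client.request(req);
      // The "config" entry mirrors the effective solrconfig.xml, indexConfig included.
      System.out.println(rsp.get("config"));
    }
  }
}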

Thank you.


Blocking certain queries

2020-02-03 Thread John Davis
Hello,

Is there a way to block certain queries in Solr? For example, if there is a
delete for *:* or a known query that causes problems, can these be blocked
at the Solr server layer?
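
(One possible approach at the server layer, purely a hedged sketch rather than a
built-in Solr feature: a custom update request processor that rejects
delete-by-query *:*. The class name here is made up, and it would still need to
be registered in an updateRequestProcessorChain in solrconfig.xml.)

import java.io.IOException;
import org.apache.solr.common.SolrException;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.DeleteUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class BlockDeleteAllFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processDelete(DeleteUpdateCommand cmd) throws IOException {
        // Reject the match-everything delete; pass everything else through.
        if ("*:*".equals(cmd.query)) {
          throw new SolrException(SolrException.ErrorCode.FORBIDDEN,
              "delete-by-query *:* is blocked on this server");
        }
        super.processDelete(cmd);
      }
    };
  }
}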


Solr Payloads

2019-09-20 Thread John Davis
We are using a Solr payload field and noticed the values extracted using
payload() sometimes don't match the value stored in the field. Is there a
lossy encoding for the payload value?

fq=payload_field:*, fl=payload_field,payload(payload_field, 573131)

"payload_field": "573131|*1568263581*",
"payload(payload_field, 573131)": *1568263550*
  ...
"payload_field": "573131|1568263582",
"payload(payload_field, 573131)": 1568263550


Field definition: (field type XML stripped by the list archive)
John


Re: Enabling/disabling docValues

2019-06-11 Thread John Davis
There is no way to match case-insensitively without a TextField + no
tokenization. It's a long-standing limitation of not being able to apply any
analyzers with str fields.

Thanks for pointing out the re-index page; I've seen it. However, sometimes
it is hard to re-index in a reasonable amount of time & resources, and if
we empower power users to understand the system better it will help them
make more informed tradeoffs.

On Tue, Jun 11, 2019 at 6:52 AM Gus Heck  wrote:

> On Mon, Jun 10, 2019 at 10:53 PM John Davis 
> wrote:
>
> > You have made many assumptions which might not always be realistic a)
> > TextField is always tokenized
>
>
> Well, you could of course change configuration or code to do something else
> but this would be a very odd and misleading thing to do and we would expect
> you to have mentioned it.
>
>
> > b) Users care about precise counts and
>
>
> This is indeed use case dependent if you are talking about approximately
> correct (150 vs 152 etc), but it's pretty reasonable to say that gross
> errors (75 vs 153 or 0 vs 5 etc) more or less make faceting pointless.
>
>
> > c) Users have the luxury or ability to do a full re-index anytime.
>
>
> This is a state of affairs we consistently advise against. The reason we
> give the advice is precisely because one cannot change the schema out from
> under an existing index safely without rewriting the index. Without
> extremely careful design on your side (not using certain features and high
> storage requirements), your index will not retain enough information to
> remake itself. Therefore, it is a long-standing bad practice to not have
> a separate canonical copy of the data and a means to re-index it (or a
> design where only the very most recent data is important, and a copy of
> that). There is a whole page dedicated to reindexing in the ref guide:
> https://lucene.apache.org/solr/guide/8_0/reindexing.html Here's a relevant
> bit from the current version:
>
> `There is no process in Solr for programmatically reindexing data. When we
> say "reindex", we mean, literally, "index it again". However you got the
> data into the index the first time, you will run that process again. It is
> strongly recommended that Solr users index their data in a repeatable,
> consistent way, so that the process can be easily repeated when the need
> for reindexing arises.`
>
>
> The ref guide has lots of nice info, maybe you should read it rather than
> snubbing one of the nicest and most knowledgeable committers on the project
> (who is helping you for free) by haughtily saying you'll go ask someone
> else... And if you've been left with this situation (no ability to reindex)
> by your predecessor you have our deepest sympathies, but it still doesn't
> change the fact that you need to break it to management that your predecessor
> has lost the data required to maintain the system and you still need to
> re-index whatever you can salvage somehow, or start fresh.
>
> When Erick is saying you shouldn't be asking that question... >90% of the
> time you really shouldn't be, and if you do pursue it, you'll just waste a
> lot of your own time.
>
>
> > On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson  >
> > wrote:
> >
> > > bq. Does lucene look at %docs in each state, or the first doc or
> > something
> > > else?
> > >
> > > Frankly I don’t care since no matter what, the results of faceting
> mixed
> > > definitions is not useful.
> > >
> > > tl;dr;
> > >
> > > “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> > > means just what I choose it to mean — neither more nor less.’
> > >
> > > So “undefined" in this case means “I don’t see any value at all in
> > chasing
> > > that info down” ;).
> > >
> > > Changing from regular text to SortableText means that the results will
> be
> > > inaccurate no matter what. For example, I have a doc with the value “my
> > dog
> > > has fleas”. When NOT using SortableText, there are multiple tokens so
> > facet
> > > counts would be:
> > >
> > > my (1)
> > > dog (1)
> > > has (1)
> > > fleas (1)
> > >
> > > But for SortableText will be:
> > >
> > > my dog has fleas (1)
> > >
> > > Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> > > doc1 was  indexed before switching to SortableText and doc2 after.
> > > Presumably  the output you want is:
> > >
> > > my dog has fleas (1)
> > > my cat has fleas (1)
> > >
> > 

Re: Enabling/disabling docValues

2019-06-10 Thread John Davis
You have made many assumptions which might not always be realistic: a)
TextField is always tokenized, b) users care about precise counts, and c)
users have the luxury or ability to do a full re-index anytime. These are
real issues and there is no black/white solution. I will ask the Lucene
folks about the actual implementation.

On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson 
wrote:

> bq. Does lucene look at %docs in each state, or the first doc or something
> else?
>
> Frankly I don’t care since no matter what, the results of faceting mixed
> definitions is not useful.
>
> tl;dr;
>
> “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> means just what I choose it to mean — neither more nor less.’
>
> So “undefined" in this case means “I don’t see any value at all in chasing
> that info down” ;).
>
> Changing from regular text to SortableText means that the results will be
> inaccurate no matter what. For example, I have a doc with the value “my dog
> has fleas”. When NOT using SortableText, there are multiple tokens so facet
> counts would be:
>
> my (1)
> dog (1)
> has (1)
> fleas (1)
>
> But for SortableText will be:
>
> my dog has fleas (1)
>
> Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> doc1 was  indexed before switching to SortableText and doc2 after.
> Presumably  the output you want is:
>
> my dog has fleas (1)
> my cat has fleas (1)
>
> But you can’t get that output.  There are three cases:
>
> 1> Lucene treats all documents as SortableText, faceting on the docValues
> parts. No facets on doc1
>
> my  cat has fleas (1)
>
> 2> Lucene treats all documents as tokenized, faceting on each individual
> token. Faceting is performed on the tokenized content of both,  docValues
> in doc2  ignored
>
> my  (2)
> dog (1)
> has (2)
> fleas (2)
> cat (1)
>
>
> 3> Lucene does the best it can, faceting on the tokens for docs without
> SortableText and docValues if the doc was indexed with Sortable text. doc1
> faceted on tokenized, doc2 on docValues
>
> my  (1)
> dog (1)
> has (1)
> fleas (1)
> my cat has fleas (1)
>
> Since none of those cases is what I want, there’s no point I can see in
> chasing down what actually happens….
>
> Best,
> Erick
>
> P.S. I _think_ Lucene tries to use the definition from the first segment,
> but since the lists of segments to be merged don’t look at the field
> definitions at all, whether the first segment in the list has
> SortableText or not will not be predictable in a general way even within a
> single run.
>
>
> > On Jun 9, 2019, at 6:53 PM, John Davis 
> wrote:
> >
> > Understood, however code is rarely random/undefined. Does lucene look at
> %
> > docs in each state, or the first doc or something else?
> >
> > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson 
> > wrote:
> >
> >> It’s basically undefined. When segments are merged that have dissimilar
> >> definitions like this what can Lucene do? Consider:
> >>
> >> Faceting on a text (not sortable) means that each individual token in
> the
> >> index is uninverted on the Java heap and the facets are computed for
> each
> >> individual term.
> >>
> >> Faceting on a SortableText field just has a single term per document,
> and
> >> that in the docValues structures as opposed to the inverted index.
> >>
> >> Now you change the value and start indexing. At some point a segment
> >> containing no docValues is merged with a segment containing docValues
> for
> >> the field. The resulting mixed segment is in this state. If you facet on
> >> the field, should the docs without docValues have each individual term
> >> counted? Or just the SortableText values in the docValues structure?
> >> Neither one is right.
> >>
> >> Also remember that Lucene has no notion of schema. That’s entirely
> imposed
> >> on Lucene by Solr carefully constructing low-level analysis chains.
> >>
> >> So I’d _strongly_ recommend you re-index your corpus to a new collection
> >> with the current definition, then perhaps use CREATEALIAS to seamlessly
> >> switch.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 9, 2019, at 12:50 PM, John Davis 
> >> wrote:
> >>>
> >>> Hi there,
> >>> We recently changed a field from TextField + no docValues to
> >>> SortableTextField which has docValues enabled by default. Once I did
> >> this I
> >>> do not see any facet values for the field. I know that once all the
> docs
> >>> are re-indexed facets should work again, however can someone clarify
> the
> >>> current logic of lucene/solr how facets will be computed when schema is
> >>> changed from no docValues to docValues and vice-versa?
> >>>
> >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> >>> 2. Once certain fraction of docs are re-indexed, those facets will be
> >>> returned?
> >>> 3. Something else?
> >>>
> >>>
> >>> Varun
> >>
> >>
>
>


Re: Enabling/disabling docValues

2019-06-09 Thread John Davis
Understood, however code is rarely random/undefined. Does lucene look at %
docs in each state, or the first doc or something else?

On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson 
wrote:

> It’s basically undefined. When segments are merged that have dissimilar
> definitions like this what can Lucene do? Consider:
>
> Faceting on a text (not sortable) means that each individual token in the
> index is uninverted on the Java heap and the facets are computed for each
> individual term.
>
> Faceting on a SortableText field just has a single term per document, and
> that in the docValues structures as opposed to the inverted index.
>
> Now you change the value and start indexing. At some point a segment
> containing no docValues is merged with a segment containing docValues for
> the field. The resulting mixed segment is in this state. If you facet on
> the field, should the docs without docValues have each individual term
> counted? Or just the SortableText values in the docValues structure?
> Neither one is right.
>
> Also remember that Lucene has no notion of schema. That’s entirely imposed
> on Lucene by Solr carefully constructing low-level analysis chains.
>
> So I’d _strongly_ recommend you re-index your corpus to a new collection
> with the current definition, then perhaps use CREATEALIAS to seamlessly
> switch.
>
> Best,
> Erick
>
> > On Jun 9, 2019, at 12:50 PM, John Davis 
> wrote:
> >
> > Hi there,
> > We recently changed a field from TextField + no docValues to
> > SortableTextField which has docValues enabled by default. Once I did
> this I
> > do not see any facet values for the field. I know that once all the docs
> > are re-indexed facets should work again, however can someone clarify the
> > current logic of lucene/solr how facets will be computed when schema is
> > changed from no docValues to docValues and vice-versa?
> >
> > 1. Until ALL the docs are re-indexed, no facets will be returned?
> > 2. Once certain fraction of docs are re-indexed, those facets will be
> > returned?
> > 3. Something else?
> >
> >
> > Varun
>
>


Enabling/disabling docValues

2019-06-09 Thread John Davis
Hi there,
We recently changed a field from TextField + no docValues to
SortableTextField which has docValues enabled by default. Once I did this I
do not see any facet values for the field. I know that once all the docs
are re-indexed facets should work again, however can someone clarify the
current logic of lucene/solr how facets will be computed when schema is
changed from no docValues to docValues and vice-versa?

1. Until ALL the docs are re-indexed, no facets will be returned?
2. Once certain fraction of docs are re-indexed, those facets will be
returned?
3. Something else?


Varun


Re: Solr Heap Usage

2019-06-07 Thread John Davis
What would be the best way to understand where heap is being used?

On Tue, Jun 4, 2019 at 9:31 PM Greg Harris  wrote:

> Just a couple of points I’d make here. I did some testing a while back in
> which, if no commit is made (hard or soft), there are internal memory
> structures holding tlogs, and it will continue to get worse the more docs
> that come in. I don’t know if that’s changed in further versions. I’d
> recommend doing commits with some amount of frequency in indexing heavy
> apps, otherwise you are likely to have heap issues. I personally would
> advocate for some of the points already made. There are too many variables
> going on here and ways to modify stuff to make sizing decisions and think
> you’re doing anything other than a pure guess if you don’t test and
> monitor. I’d advocate for a process in which testing is done regularly to
> figure out questions like number of shards/replicas, heap size, memory etc.
> Hard data, good process and regular testing will trump guesswork every time
>
> Greg
>
> On Tue, Jun 4, 2019 at 9:22 AM John Davis 
> wrote:
>
> > You might want to test with softcommit of hours vs 5m for heavy indexing
> +
> > light query -- even though there is internal memory structure overhead
> for
> > no soft commits, in our testing a 5m soft commit (via commitWithin) has
> > resulted in a very very large heap usage which I suspect is because of
> > other overhead associated with it.
> >
> > On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson 
> > wrote:
> >
> > > I need to update that, didn’t understand the bits about retaining
> > internal
> > > memory structures at the time.
> > >
> > > > On Jun 4, 2019, at 2:10 AM, John Davis 
> > > wrote:
> > > >
> > > > Erick - These conflict, what's changed?
> > > >
> > > > So if I were going to recommend settings, they’d be something like
> > this:
> > > > Do a hard commit with openSearcher=false every 60 seconds.
> > > > Do a soft commit every 5 minutes.
> > > >
> > > > vs
> > > >
> > > > Index-heavy, Query-light
> > > > Set your soft commit interval quite long, up to the maximum latency
> you
> > > can
> > > > stand for documents to be visible. This could be just a couple of
> > minutes
> > > > or much longer. Maybe even hours with the capability of issuing a
> hard
> > > > commit (openSearcher=true) or soft commit on demand.
> > > >
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <
> erickerick...@gmail.com
> > >
> > > > wrote:
> > > >
> > > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > > >>> across all of them to "batch updates" and not commit as long as
> > > possible?
> > > >>
> > > >> Of course it’s more complicated than that ;)….
> > > >>
> > > >> But to start, yes, I urge you to batch. Here’s some stats:
> > > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > > >>
> > > >> Note that at about 100 docs/batch you hit diminishing returns.
> > > _However_,
> > > >> that test was run on a single shard collection, so if you have 10
> > shards
> > > >> you’d
> > > >> have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> just
> > > >> don’t
> > > >> send one at a time. And there are the usual gotchas if your
> documents
> > > are
> > > >> 1M .vs. 1K.
> > > >>
> > > >> About committing. No, don’t hold off as long as possible. When you
> > > commit,
> > > >> segments are merged. _However_, the default 100M internal buffer
> size
> > > means
> > > >> that segments are written anyway even if you don’t hit a commit
> point
> > > when
> > > >> you have 100M of index data, and merges happen anyway. So you won’t
> > save
> > > >> anything on merging by holding off commits.
> > > >> And you’ll incur penalties. Here’s more than you want to know about
> > > >> commits:
> > > >>
> > > >>
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-

Re: Solr Heap Usage

2019-06-04 Thread John Davis
You might want to test with softcommit of hours vs 5m for heavy indexing +
light query -- even though there is internal memory structure overhead for
no soft commits, in our testing a 5m soft commit (via commitWithin) has
resulted in a very very large heap usage which I suspect is because of
other overhead associated with it.

On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson 
wrote:

> I need to update that, didn’t understand the bits about retaining internal
> memory structures at the time.
>
> > On Jun 4, 2019, at 2:10 AM, John Davis 
> wrote:
> >
> > Erick - These conflict, what's changed?
> >
> > So if I were going to recommend settings, they’d be something like this:
> > Do a hard commit with openSearcher=false every 60 seconds.
> > Do a soft commit every 5 minutes.
> >
> > vs
> >
> > Index-heavy, Query-light
> > Set your soft commit interval quite long, up to the maximum latency you
> can
> > stand for documents to be visible. This could be just a couple of minutes
> > or much longer. Maybe even hours with the capability of issuing a hard
> > commit (openSearcher=true) or soft commit on demand.
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> >
> >
> >
> > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson 
> > wrote:
> >
> >>> I've looked through SolrJ, DIH and others -- is the bottomline
> >>> across all of them to "batch updates" and not commit as long as
> possible?
> >>
> >> Of course it’s more complicated than that ;)….
> >>
> >> But to start, yes, I urge you to batch. Here’s some stats:
> >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> >>
> >> Note that at about 100 docs/batch you hit diminishing returns.
> _However_,
> >> that test was run on a single shard collection, so if you have 10 shards
> >> you’d
> >> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> >> don’t
> >> send one at a time. And there are the usual gotchas if your documents
> are
> >> 1M .vs. 1K.
> >>
> >> About committing. No, don’t hold off as long as possible. When you
> commit,
> >> segments are merged. _However_, the default 100M internal buffer size
> means
> >> that segments are written anyway even if you don’t hit a commit point
> when
> >> you have 100M of index data, and merges happen anyway. So you won’t save
> >> anything on merging by holding off commits.
> >> And you’ll incur penalties. Here’s more than you want to know about
> >> commits:
> >>
> >>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >>
> >> But some key take-aways… If for some reason Solr abnormally
> >> terminates, the accumulated documents since the last hard
> >> commit are replayed. So say you don’t commit for an hour of
> >> furious indexing and someone does a “kill -9”. When you restart
> >> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> >> with openSearcher=false aren’t all that expensive. I usually set mine
> >> for a minute and forget about it.
> >>
> >> Transaction logs hold a window, _not_ the entire set of operations
> >> since time began. When you do a hard commit, the current tlog is
> >> closed and a new one opened and ones that are “too old” are deleted. If
> >> you never commit you have a huge transaction log to no good purpose.
> >>
> >> Also, while indexing, in order to accommodate “Real Time Get”, all
> >> the docs indexed since the last searcher was opened have a pointer
> >> kept in memory. So if you _never_ open a new searcher, that internal
> >> structure can get quite large. So in bulk-indexing operations, I
> >> suggest you open a searcher every so often.
> >>
> >> Opening a new searcher isn’t terribly expensive if you have no
> autowarming
> >> going on. Autowarming as defined in solrconfig.xml in filterCache,
> >> queryResultCache
> >> etc.
> >>
> >> So if I were going to recommend settings, they’d be something like this:
> >> Do a hard commit with openSearcher=false every 60 seconds.
> >> Do a soft commit every 5 minutes.
> >>
> >> I’d actually be surprised if you were able to measure differences
> between
> >> those settings and just hard commit with openSearcher=true every 60
> >> seconds and soft commit at -1 (never)…
> >

Re: Solr Heap Usage

2019-06-04 Thread John Davis
Erick - These conflict, what's changed?

So if I were going to recommend settings, they’d be something like this:
Do a hard commit with openSearcher=false every 60 seconds.
Do a soft commit every 5 minutes.

vs

Index-heavy, Query-light
Set your soft commit interval quite long, up to the maximum latency you can
stand for documents to be visible. This could be just a couple of minutes
or much longer. Maybe even hours with the capability of issuing a hard
commit (openSearcher=true) or soft commit on demand.
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/




On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson 
wrote:

> > I've looked through SolrJ, DIH and others -- is the bottomline
> > across all of them to "batch updates" and not commit as long as possible?
>
> Of course it’s more complicated than that ;)….
>
> But to start, yes, I urge you to batch. Here’s some stats:
> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>
> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> that test was run on a single shard collection, so if you have 10 shards
> you’d
> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> don’t
> send one at a time. And there are the usual gotchas if your documents are
> 1M .vs. 1K.
>
> About committing. No, don’t hold off as long as possible. When you commit,
> segments are merged. _However_, the default 100M internal buffer size means
> that segments are written anyway even if you don’t hit a commit point when
> you have 100M of index data, and merges happen anyway. So you won’t save
> anything on merging by holding off commits.
> And you’ll incur penalties. Here’s more than you want to know about
> commits:
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> But some key take-aways… If for some reason Solr abnormally
> terminates, the accumulated documents since the last hard
> commit are replayed. So say you don’t commit for an hour of
> furious indexing and someone does a “kill -9”. When you restart
> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> with openSearcher=false aren’t all that expensive. I usually set mine
> for a minute and forget about it.
>
> Transaction logs hold a window, _not_ the entire set of operations
> since time began. When you do a hard commit, the current tlog is
> closed and a new one opened and ones that are “too old” are deleted. If
> you never commit you have a huge transaction log to no good purpose.
>
> Also, while indexing, in order to accommodate “Real Time Get”, all
> the docs indexed since the last searcher was opened have a pointer
> kept in memory. So if you _never_ open a new searcher, that internal
> structure can get quite large. So in bulk-indexing operations, I
> suggest you open a searcher every so often.
>
> Opening a new searcher isn’t terribly expensive if you have no autowarming
> going on. Autowarming as defined in solrconfig.xml in filterCache,
> queryResultCache
> etc.
>
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
>
> I’d actually be surprised if you were able to measure differences between
> those settings and just hard commit with openSearcher=true every 60
> seconds and soft commit at -1 (never)…
>
> Best,
> Erick
>
> > On Jun 2, 2019, at 3:35 PM, John Davis 
> wrote:
> >
> > If we assume there is no query load then effectively this boils down to
> > most effective way for adding a large number of documents to the solr
> > index. I've looked through SolrJ, DIH and others -- is the bottomline
> > across all of them to "batch updates" and not commit as long as possible?
> >
> > On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson 
> > wrote:
> >
> >> Oh, there are about a zillion reasons ;).
> >>
> >> First of all, most tools that show heap usage also count uncollected
> >> garbage. So your 10G could actually be much less “live” data. Quick way
> to
> >> test is to attach jconsole to the running Solr and hit the button that
> >> forces a full GC.
> >>
> >> Another way is to reduce your heap when you start Solr (on a test system
> >> of course) until bad stuff happens, if you reduce it to very close to
> what
> >> Solr needs, you’ll get slower as more and more cycles are spent on GC,
> if
> >> you reduce it a little more you’ll get OOMs.
> >>
> >> You can take heap dumps of course to see where all the memory is being
> >>

Adding Multiple JSON Documents

2019-06-02 Thread John Davis
Hi there,

I was looking at the Solr documentation for indexing multiple documents via
JSON and noticed an inconsistency in the docs.

Should the POST URL be /update/json/docs instead of just /update? It does
look like the former works, but will both work just fine?

https://lucene.apache.org/solr/guide/7_3/uploading-data-with-index-handlers.html#adding-multiple-json-documents
Adding Multiple JSON Documents


Adding multiple documents at one time via JSON can be done via a JSON Array
of JSON Objects, where each object represents a document:

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/my_collection/update' --data-binary '[
  {"id": "1", "title": "Doc 1"},
  {"id": "2", "title": "Doc 2"}]'


Re: Solr Heap Usage

2019-06-02 Thread John Davis
If we assume there is no query load then effectively this boils down to the
most effective way of adding a large number of documents to the Solr index.
I've looked through SolrJ, DIH and others -- is the bottom line across all
of them to "batch updates" and not commit as long as possible?
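
(A minimal SolrJ sketch of the batching approach discussed in this thread: send a
few hundred documents per request, leave routine commits to the
autoCommit/autoSoftCommit settings in solrconfig.xml, and issue one explicit
commit at the end of the run. The URL, collection name, batch size and fields are
placeholders.)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 100_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title", "Doc " + i);
        batch.add(doc);
        if (batch.size() == 500) {   // a few hundred docs per request, not one at a time
          client.add(batch);         // no commit here; autoCommit handles durability
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);
      }
      client.commit();               // one explicit commit at the end of the bulk load
    }
  }
}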

On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson 
wrote:

> Oh, there are about a zillion reasons ;).
>
> First of all, most tools that show heap usage also count uncollected
> garbage. So your 10G could actually be much less “live” data. Quick way to
> test is to attach jconsole to the running Solr and hit the button that
> forces a full GC.
>
> Another way is to reduce your heap when you start Solr (on a test system
> of course) until bad stuff happens, if you reduce it to very close to what
> Solr needs, you’ll get slower as more and more cycles are spent on GC, if
> you reduce it a little more you’ll get OOMs.
>
> You can take heap dumps of course to see where all the memory is being
> used, but that’s tricky as it also includes garbage.
>
> I’ve seen cache sizes (filterCache in particular) be something that uses
> lots of memory, but that requires queries to be fired. Each filterCache
> entry can take up to roughly maxDoc/8 bytes + overhead….
>
> A classic error is to sort, group or facet on a docValues=false field.
> Starting with Solr 7.6, you can add an option to fields to throw an error
> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>
> In short, there’s not enough information until you dive in and test
> bunches of stuff to tell.
>
> Best,
> Erick
>
>
> > On Jun 2, 2019, at 2:22 AM, John Davis 
> wrote:
> >
> > This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> > index.My hypothesis was merging segments was trying to read it all but if
> > that's not the case I am out of ideas. The one caveat is we are trying to
> > add the documents quickly (~1g an hour) but if lucene does write 100m
> > segments and does streaming merge it shouldn't matter?
> >
> > On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood 
> > wrote:
> >
> >>> On May 31, 2019, at 11:27 PM, John Davis 
> >> wrote:
> >>>
> >>> 2. Merging segments - does solr load the entire segment in memory or
> >> chunks
> >>> of it? if later how large are these chunks
> >>
> >> No, it does not read the entire segment into memory.
> >>
> >> A fundamental part of the Lucene design is streaming posting lists into
> >> memory and processing them sequentially. The same amount of memory is
> >> needed for small or large segments. Each posting list is in document-id
> >> order. The merge is a merge of sorted lists, writing a new posting list
> in
> >> document-id order.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
>
>


Re: Solr Heap Usage

2019-06-02 Thread John Davis
This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
index. My hypothesis was merging segments was trying to read it all but if
that's not the case I am out of ideas. The one caveat is we are trying to
add the documents quickly (~1g an hour) but if lucene does write 100m
segments and does streaming merge it shouldn't matter?

On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood 
wrote:

> > On May 31, 2019, at 11:27 PM, John Davis 
> wrote:
> >
> > 2. Merging segments - does solr load the entire segment in memory or
> chunks
> > of it? if later how large are these chunks
>
> No, it does not read the entire segment into memory.
>
> A fundamental part of the Lucene design is streaming posting lists into
> memory and processing them sequentially. The same amount of memory is
> needed for small or large segments. Each posting list is in document-id
> order. The merge is a merge of sorted lists, writing a new posting list in
> document-id order.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Solr Heap Usage

2019-06-01 Thread John Davis
I've read a bunch of the wikis on Solr heap usage and wanted to confirm my
understanding of what Solr uses the heap for:

1. Indexing new documents - until committed? If not, how long are the new
documents kept in the heap?

2. Merging segments - does Solr load the entire segment in memory or chunks
of it? If the latter, how large are these chunks?

3. Queries, facets, caches - anything else major?

John


Re: Facet count incorrect

2019-05-23 Thread John Davis
Reindexing to an alias is not always easy if it requires 2x resources. Just to
be clear, the issues you mentioned are mostly around faceting, because we
haven't seen any other search/retrieval issues. Or is that not accurate?
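
(A minimal SolrJ sketch of the reindex-into-a-new-collection-plus-alias approach
Erick describes below; collection, configset and alias names are placeholders,
and the actual reindexing step is not shown.)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AliasSwitch {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // 1. Create a new collection that uses the corrected schema/configset.
      CollectionAdminRequest.createCollection("products_v2", "products_conf_v2", 2, 2)
          .process(client);

      // 2. Reindex from the canonical source into products_v2 (not shown).

      // 3. Point the alias the application queries at the new collection.
      CollectionAdminRequest.createAlias("products", "products_v2").process(client);
    }
  }
}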

On Wed, May 22, 2019 at 5:12 PM Erick Erickson 
wrote:

> 1> I strongly recommend you re-index into a new collection and switch to
> it with a collection alias rather than try to re-index all the docs.
> Segment merging with the same field with dissimilar definitions is not
> guaranteed to do the right thing.
>
> 2> No. There a few (very few) things that don’t require starting fresh.
> You can do some things like add a lowercasefilter, add or remove a field
> totally and the like. Even then you’ll go through a period of mixed-up
> results until the reindex is complete. But changing the type, changing from
> multiValued to singleValued or vice versa (particularly with docValues)
> etc. are all “fraught”.
>
> My usual reply is “if you’re going to reindex everything anyway, why not
> just do it to a new collection and alias when you’re done?” It’s much safer.
>
> Best,
> Erick
>
> > On May 22, 2019, at 3:06 PM, John Davis 
> wrote:
> >
> > Hi there -
> > Our facet counts are incorrect for a particular field and I suspect it is
> > because we changed the type of the field from StrField to TextField. Two
> > questions:
> >
> > 1. If we do re-index all the documents in the index, would these counts
> get
> > fixed?
> > 2. Is there a "safe" way of changing field types that generally works?
> >
> > *Old type:*
> >   > docValues="true" multiValued="true"/>
> >
> > *New type:*
> >   > omitNorms="true" omitTermFreqAndPositions="true" indexed="true"
> > stored="true" positionIncrementGap="100" sortMissingLast="true"
> > multiValued="true">
> > 
> >  
> >  
> >
> >  
>
>


Facet count incorrect

2019-05-22 Thread John Davis
Hi there -
Our facet counts are incorrect for a particular field and I suspect it is
because we changed the type of the field from StrField to TextField. Two
questions:

1. If we do re-index all the documents in the index, would these counts get
fixed?
2. Is there a "safe" way of changing field types that generally works?

*Old type:* (field definition XML stripped by the list archive)

*New type:* (field type XML stripped by the list archive)


Re: Optimizing fq query performance

2019-04-18 Thread John Davis
FYI
https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488

On Thu, Apr 18, 2019 at 7:24 AM Shawn Heisey  wrote:

> On 4/17/2019 11:49 PM, John Davis wrote:
> > I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
> > *] doesn't seem materially different compared to has_field:1. If no one
> > knows why Lucene optimizes one but not another, it's not clear whether it
> > even optimizes one to be sure.
>
> Queries using a boolean field will be even faster than the all-inclusive
> range query ... but they require work at index time to function
> properly.  If you can do it this way, that's definitely preferred.  I
> was providing you with something that would work even without the
> separate boolean field.
>
> If the cardinality of the field you're searching is very low (only a few
> possible values for that field across the whole index) then a wildcard
> query can be fast.  It is only when the cardinality is high that the
> wildcard query is slow.  Still, it is better to use the range query for
> determining whether the field exists, unless you have a separate boolean
> field for that purpose, in which case the boolean query will be a little
> bit faster.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene optimizes one but not the other, it's not clear whether it
even optimizes one to be sure.

On Wed, Apr 17, 2019 at 4:27 PM Shawn Heisey  wrote:

> On 4/17/2019 1:21 PM, John Davis wrote:
> > If what you describe is the case for range query [* TO *], why would
> lucene
> > not optimize field:* similar way?
>
> I don't know.  Low level lucene operation is a mystery to me.
>
> I have seen first-hand that the range query is MUCH faster than the
> wildcard query.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
If what you describe is the case for the range query [* TO *], why would
Lucene not optimize field:* in a similar way?

On Wed, Apr 17, 2019 at 10:36 AM Shawn Heisey  wrote:

> On 4/17/2019 10:51 AM, John Davis wrote:
> > Can you clarify why field:[* TO *] is lot more efficient than field:*
>
> It's a range query.  For every document, Lucene just has to answer two
> questions -- is the value more than any possible value and is the value
> less than any possible value.  The answer will be yes if the field
> exists, and no if it doesn't.  With one million documents, there are two
> million questions that Lucene has to answer.  Which probably seems like
> a lot ... but keep reading.  (Side note:  It wouldn't surprise me if
> Lucene has an optimization specifically for the all inclusive range such
> that it actually only asks one question, not two)
>
> With a wildcard query, there are as many questions as there are values
> in the field.  Every question is asked for every single document.  So if
> you have a million documents and there are three hundred thousand
> different values contained in the field across the whole index, that's
> 300 billion questions.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
Can you clarify why field:[* TO *] is a lot more efficient than field:*?

On Sun, Apr 14, 2019 at 12:14 PM Shawn Heisey  wrote:

> On 4/13/2019 12:58 PM, John Davis wrote:
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
>
> All filters cover the entire index, unless the query parser that you're
> using implements the PostFilter interface, the filter cost is set high
> enough, and caching is disabled.  All three of those conditions must be
> met in order for a filter to only run on results instead of the entire
> index.
>
> http://yonik.com/advanced-filter-caching-in-solr/
> https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> Most query parsers don't implement the PostFilter interface.  The lucene
> and edismax parsers do not implement PostFilter.  Unless you've
> specified the query parser in the fq parameter, it will use the lucene
> query parser, and it cannot be a PostFilter.
>
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
>
> If the point of the "field1:*" query clause is "make sure field1 exists
> in the document" then you would be a lot better off with this query clause:
>
> field1:[* TO *]
>
> This is an all-inclusive range query.  It works with all field types
> where I have tried it, and that includes TextField types.   It will be a
> lot more efficient than the wildcard query.
>
> Here's what happens with "field1:*".  If the cardinality of field1 is
> ten million different values, then the query that gets constructed for
> Lucene will literally contain ten million values.  And every single one
> of them will need to be compared to every document.  That's a LOT of
> comparisons.  Wildcard queries are normally very slow.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-14 Thread John Davis
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)

This feels like something could be optimized internally by tracking
existence of the field in a doc instead of making users index yet another
field to track existence?

BTW does this same behavior apply for tlong fields too where the value
might be more continuous vs discrete strings?
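
(A minimal SolrJ sketch of the alternatives discussed in this thread: split the
filters so each fq caches independently, use the all-inclusive range query for
the existence check, or query a dedicated exists/boolean field populated at
index time. Field names are placeholders.)

import org.apache.solr.client.solrj.SolrQuery;

public class FqVariants {
  public static void main(String[] args) {
    // Original, slow form: one fq containing a wildcard existence check.
    SolrQuery slow = new SolrQuery("*:*");
    slow.addFilterQuery("field1:* AND field2:value");

    // Split fqs: each clause is cached and reused on its own, and the
    // all-inclusive range query is much cheaper than field1:*.
    SolrQuery split = new SolrQuery("*:*");
    split.addFilterQuery("field1:[* TO *]");
    split.addFilterQuery("field2:value");

    // Fastest option: a dedicated flag (e.g. has_field1) set during indexing.
    SolrQuery flagged = new SolrQuery("*:*");
    flagged.addFilterQuery("has_field1:1");
    flagged.addFilterQuery("field2:value");
  }
}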

On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley  wrote:

> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a common
> pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
>
> -Yonik
>
> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
> wrote:
>
> > Hi there,
> >
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
> >
> > If we assume that the result set is identical between the two queries and
> > field1 is in general more frequent in the index, we noticed query1 takes
> > 100x longer than query2. In case it matters field1 is of type tlongs
> while
> > field2 is a string.
> >
> > Any tips for optimizing this?
> >
> > John
> >
>


Optimizing fq query performance

2019-04-13 Thread John Davis
Hi there,

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.
Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'

If we assume that the result set is identical between the two queries and
field1 is in general more frequent in the index, we noticed query1 takes
100x longer than query2. In case it matters field1 is of type tlongs while
field2 is a string.

Any tips for optimizing this?

John


Re: What causes new searcher to be created?

2019-03-10 Thread John Davis
We do add commitWithin=XX when indexing updates; I take it that triggers a
new searcher when the commit is made? I was under the wrong impression that
autoCommit openSearcher=false would control those too.
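
(A tiny SolrJ sketch of the distinction, as I understand it from this thread:
commitWithin on an add behaves like an external commit request, so a searcher
gets opened and the docs become visible, while openSearcher=false only governs
the automatic hard commits configured in solrconfig.xml. The core name and
interval are placeholders.)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "42");
      client.add(doc, 30_000);  // commitWithin 30s: a searcher opens within ~30s
      // client.add(doc);       // without commitWithin, visibility waits for an
      //                        // explicit or soft commit that opens a searcher
    }
  }
}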

On Sat, Mar 9, 2019 at 9:00 PM Erick Erickson 
wrote:

> Nothing should be opening new searchers in that case unless
> the commit is happening from outside. “Outside” here is a SorlJ
> program that either commits or specifies a commitWithin for an
> add. By default, post.jar also issues a commit at the end.
>
> I’d look at whatever is adding new documents to the system. Does
> your Solr log show any updates and what are the parameters if so?
>
> BTW, the setting for hard commit openSearcher=false _only_ applies
> to autocommits. The default behavior of an explicit commit from
> elsewhere will open a new searcher.
>
> > My assumption is that until a new searcher is created all the
> > newly indexed docs will not be visible
>
> This should be the case. So regardless of what the admin says, _can_
> you see newly indexed documents?
>
> Best,
> Erick
>
> > On Mar 9, 2019, at 7:24 PM, John Davis 
> wrote:
> >
> > Hi there,
> > I couldn't find an answer to this in the docs: if openSearcher is set to
> > false in the autocommit with no softcommits, what triggers a new one to
> be
> > created? My assumption is that until a new searcher is created all the
> > newly indexed docs will not be visible. Based on the solr admin console I
> > do see a new one being created every few minutes but I could not find the
> > parameter that controls it.
> >
> > John
>
>


What causes new searcher to be created?

2019-03-09 Thread John Davis
Hi there,
I couldn't find an answer to this in the docs: if openSearcher is set to
false in the autocommit with no softcommits, what triggers a new one to be
created? My assumption is that until a new searcher is created all the
newly indexed docs will not be visible. Based on the solr admin console I
do see a new one being created every few minutes but I could not find the
parameter that controls it.

John


Ignored fields and copyfield

2018-08-06 Thread John Davis
Hi there,
If a field is set as "ignored" (indexed=false, stored=false) can it be used
for another field as part of copyfield directive which might index/store it.

John


Index size by document fields

2018-08-04 Thread John Davis
Hi,
Is there a way to monitor the size of the index broken down by individual
fields across documents? I understand there are different parts - the
inverted index and the stored fields - and an estimate would be a good start.

Thanks
John


Re: Sort by payload value

2018-05-25 Thread John Davis
Hi Erick - Solr is tokenizing correctly, as you can see it returns the payload
field value along with the full payload and they match on the particular
field. The field does have a lowercase filter, as you can see in the
definition. Changing it to a single-word query doesn't fix it either.

On Fri, May 25, 2018 at 8:22 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> My first guess (and it's a total guess) is that you either have a case
> problem or
> you're tokenizing the string. Does your field definition lower-case the
> tokens?
> If it's a string type then certainly not.
>
> Quick test would be to try your query with a value that matches case
> and has no spaces,
> maybe "Portals". If that gives you the correct sort then you have a
> place to start
>
> Adding debug=query will help a bit, although it won't show you the
> guts of the payload
> calcs.
>
> FYI, ties are broken by the internal Lucene doc ID. If the theory that
> you are getting no matches holds, then your sort order is determined by
> this value, which you don't really have much access to.
>
> Best,
> Erick
>
> On Thu, May 24, 2018 at 7:29 PM, John Davis <johndavis925...@gmail.com>
> wrote:
> > Hello,
> >
> > We are trying to use payload values as described in [1] and are running
> > into issues when issuing *sort by* payload value.  Would appreciate any
> > pointers to what we might be doing wrong. We are running solr 6.6.0.
> >
> > * Here's the payload value definition:
> >
> > indexed="true"
> > class="solr.TextField">
> >   
> >> pattern="[A-Za-z0-9][^|]*[|][0-9.]+" group="0"/>
> >> encoder="float"/>
> >   
> >   
> >   
> >
> > * Query with sort by does not return documents sorted by the payload
> value:
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":82,
> > "params":{
> >   "q":"*:*",
> >   "indent":"on",
> >   "fl":"industry_value,${indexp}",
> > *  "indexp":"payload(industry_value, 'internet services', 0)",*
> >   "fq":["{!frange l=0.1}${indexp}",
> > "industry_value:*"],
> > *  "sort":"${indexp} asc",*
> >   "rows":"10",
> >   "wt":"json"}},
> >   "response":{"numFound":102668,"start":0,"docs":[
> >   {
> > "industry_value":"Startup|13.3890410959
> Collaboration|12.3863013699
> > Document Management|12.3863013699 Chat|12.3863013699 Video
> > Conferencing|12.3863013699 Finance|1.0 Payments|1.0 Internet|1.0 Internet
> > Services|1.0 Top Companies|1.0",
> >
> > "payload(industry_value, 'internet services', 0)":*1.0*},
> >
> >   {
> > "industry_value":"Hardware|16.7616438356 Messaging and
> > Telecommunications|6.71780821918 Mobility|6.71780821918
> > Startup|6.71780821918 Analytics|6.71780821918 Development
> > Platforms|6.71780821918 Mobile Commerce|6.71780821918 Mobile
> > Security|6.71780821918 Privacy and Security|6.71780821918 Information
> > Security|6.71780821918 Cyber Security|6.71780821918 Finance|6.71780821918
> > Collaboration|6.71780821918 Enterprise|6.71780821918
> > Messaging|6.71780821918 Internet Services|6.71780821918 Information
> > Technology|6.71780821918 Contact Management|6.71780821918
> > Mobile|6.71780821918 Mobile Enterprise|6.71780821918 Data
> > Security|6.71780821918 Data and Analytics|6.71780821918
> > Security|6.71780821918",
> >
> > "payload(industry_value, 'internet services', 0)":*6.7178082*},
> >
> >   {
> > "industry_value":"Startup|4.46301369863
> Advertising|1.24657534247
> > Content and Publishing|0.917808219178 Internet|0.917808219178 Social
> Media
> > Platforms|0.917808219178 Content Discovery|0.917808219178 Media and
> > Entertainment|0.917808219178 Social Media|0.917808219178 Sales and
> > Marketing|0.917808219178 Internet Services|0.917808219178 Advertising
> > Platforms|0.917808219178 Social Media Management|0.917808219178
> > Mobile|0.328767123288 Food and Beverage|0.252054794521 Real
> > Estate|0.252054794521 Consumer Goods|0.252054794521 FMCG|0.252054794521
> > Home Services|0.252054794521 Consumer|0.252054794521
> > Enterprise|0.167123287671",
> >
> > "payload(industry_value, 'internet services', 0)":*0.91780823*},
> >
> > {
> > "industry_value":"Startup|8.55068493151 Media and
> > Entertainment|5.54794520548 Transportation|5.54794520548
> > Ticketing|5.54794520548 Travel|5.54794520548 Travel and
> > Tourism|5.54794520548 Events|5.54794520548 Cloud Computing|2.33698630137
> > Collaboration|2.33698630137 Platforms|2.33698630137
> > Enterprise|2.33698630137 Internet Services|2.33698630137 Top
> > Companies|2.33698630137 Developer Tools|2.33698630137 Operating
> > Systems|2.33698630137 Search|1.83287671233 Internet|1.83287671233
> > Technology|1.83287671233 Portals|1.83287671233 Email|1.83287671233
> > Photography|1.83287671233",
> >
> > "payload(industry_value, 'internet services', 0)":*2.3369863*},
> >
> >
> > [1] https://lucidworks.com/2017/09/14/solr-payloads/
>


Sort by payload value

2018-05-24 Thread John Davis
Hello,

We are trying to use payload values as described in [1] and are running
into issues when issuing *sort by* payload value.  Would appreciate any
pointers to what we might be doing wrong. We are running solr 6.6.0.

* Here's the payload value definition: (field type XML stripped by the list archive)

* Query with sort by does not return documents sorted by the payload value:

{
  "responseHeader":{
"status":0,
"QTime":82,
"params":{
  "q":"*:*",
  "indent":"on",
  "fl":"industry_value,${indexp}",
*  "indexp":"payload(industry_value, 'internet services', 0)",*
  "fq":["{!frange l=0.1}${indexp}",
"industry_value:*"],
*  "sort":"${indexp} asc",*
  "rows":"10",
  "wt":"json"}},
  "response":{"numFound":102668,"start":0,"docs":[
  {
"industry_value":"Startup|13.3890410959 Collaboration|12.3863013699
Document Management|12.3863013699 Chat|12.3863013699 Video
Conferencing|12.3863013699 Finance|1.0 Payments|1.0 Internet|1.0 Internet
Services|1.0 Top Companies|1.0",

"payload(industry_value, 'internet services', 0)":*1.0*},

  {
"industry_value":"Hardware|16.7616438356 Messaging and
Telecommunications|6.71780821918 Mobility|6.71780821918
Startup|6.71780821918 Analytics|6.71780821918 Development
Platforms|6.71780821918 Mobile Commerce|6.71780821918 Mobile
Security|6.71780821918 Privacy and Security|6.71780821918 Information
Security|6.71780821918 Cyber Security|6.71780821918 Finance|6.71780821918
Collaboration|6.71780821918 Enterprise|6.71780821918
Messaging|6.71780821918 Internet Services|6.71780821918 Information
Technology|6.71780821918 Contact Management|6.71780821918
Mobile|6.71780821918 Mobile Enterprise|6.71780821918 Data
Security|6.71780821918 Data and Analytics|6.71780821918
Security|6.71780821918",

"payload(industry_value, 'internet services', 0)":*6.7178082*},

  {
"industry_value":"Startup|4.46301369863 Advertising|1.24657534247
Content and Publishing|0.917808219178 Internet|0.917808219178 Social Media
Platforms|0.917808219178 Content Discovery|0.917808219178 Media and
Entertainment|0.917808219178 Social Media|0.917808219178 Sales and
Marketing|0.917808219178 Internet Services|0.917808219178 Advertising
Platforms|0.917808219178 Social Media Management|0.917808219178
Mobile|0.328767123288 Food and Beverage|0.252054794521 Real
Estate|0.252054794521 Consumer Goods|0.252054794521 FMCG|0.252054794521
Home Services|0.252054794521 Consumer|0.252054794521
Enterprise|0.167123287671",

"payload(industry_value, 'internet services', 0)":*0.91780823*},

{
"industry_value":"Startup|8.55068493151 Media and
Entertainment|5.54794520548 Transportation|5.54794520548
Ticketing|5.54794520548 Travel|5.54794520548 Travel and
Tourism|5.54794520548 Events|5.54794520548 Cloud Computing|2.33698630137
Collaboration|2.33698630137 Platforms|2.33698630137
Enterprise|2.33698630137 Internet Services|2.33698630137 Top
Companies|2.33698630137 Developer Tools|2.33698630137 Operating
Systems|2.33698630137 Search|1.83287671233 Internet|1.83287671233
Technology|1.83287671233 Portals|1.83287671233 Email|1.83287671233
Photography|1.83287671233",

"payload(industry_value, 'internet services', 0)":*2.3369863*},


[1] https://lucidworks.com/2017/09/14/solr-payloads/


Solr needs a restart to recover from "No space left on device"

2018-02-06 Thread John Davis
Hi there!

We ran out of disk on our Solr instance. However, even after cleaning up the
disk, the Solr server did not realize that there was free disk available. It
only got fixed after a restart.

Is this a known issue? Or are there workarounds that don't require a
restart?

Thanks
John


Matching within list fields

2018-01-29 Thread John Davis
Hi there!

We have a use case where we'd like to search within a list field; however,
the search should not match across different elements in the list field --
all terms should match a single element in the list.

For example, if the field is a list of comments on a product, the search
should be able to find a comment that matches all the terms.

Short of creating separate documents for each element in the list, is there
any other efficient way of accomplishing this?
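
(One common workaround, offered as an assumption rather than something settled in
this thread: give the multivalued field a large positionIncrementGap in the
schema and use a proximity query whose slop is smaller than the gap, so a match
can never span two list elements. The field name, gap and slop below are
placeholders, and the slop still has to be large enough to cover terms within a
single element. A tiny SolrJ sketch:)

import org.apache.solr.client.solrj.SolrQuery;

public class WithinElementSearch {
  public static void main(String[] args) {
    // With positionIncrementGap=1000 on "comments", a slop of 50 keeps all
    // three terms inside a single comment value.
    SolrQuery q = new SolrQuery("comments:\"battery life great\"~50");
  }
}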

Thanks
John


Re: SolrCloud

2017-12-15 Thread John Davis
Thanks Erick. I agree SolrCloud is better than master/slave; however, we
have some questions about managing replicas separately vs with SolrCloud.
For example, how much overhead do SolrCloud nodes have wrt memory/CPU/disk
in order to be able to sync pending index updates to other replicas? What
monitoring and safeguards are in place out of the box so that too many
pending updates for unreachable replicas don't make the live ones fall over,
or a new replica doesn't overwhelm an existing replica?

Of course everything works great when things are running well, but when
things go south our first priority would be for Solr not to fall over.
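
(A minimal SolrJ sketch of the ADDREPLICA/DELETEREPLICA calls Erick mentions
below; collection, shard and replica names are placeholders.)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AdjustCapacity {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Add a replica to shard1 to increase query capacity (a target node can
      // also be specified on the request).
      CollectionAdminRequest.addReplicaToShard("mycollection", "shard1").process(client);

      // Symmetrically, drop a replica when there is too much capacity.
      CollectionAdminRequest.deleteReplica("mycollection", "shard1", "core_node5").process(client);
    }
  }
}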

On Fri, Dec 15, 2017 at 9:41 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> The main advantage in SolrCloud in your setup is HA/DR. You say you
> have multiple replicas and shards. Either you have to index to each
> replica separately or you use master/slave replication. In either case
> you have to manage and fix the case where some node goes down. If
> you're using master/slave, if the master goes down you need to get in
> there and fix it, reassign the master, make config changes, restart
> Solr to pick them up, make sure you pick up any missed updates and all
> that.
>
> in SolrCloud that is managed for you. Plus, let's say you want to
> increase QPS capacity. In SolrCloud all you do is use the collections
> API ADDREPLICA command and you're done. It gets created (and you can
> specify exactly what node if you want), the index gets copied, new
> updates are automatically routed to it and it starts serving requests
> when it's synchronized all automagically. Symmetrically you can
> DELETEREPLICA if you have too much capacity.
>
> The price here is you have to get comfortable with maintaining
> ZooKeeper admittedly.
>
> Also in the 7x world you have different types of replicas, TLOG, PULL
> and NRT that combine some of the features of master/slave with
> SolrCloud.
>
> Generally my rule of thumb is the minute you get beyond a single shard
> you should move to SolrCloud. If all your data fits in one Solr core
> then it's less clear-cut, master/slave can work just fine. It Depends
> (tm) of course.
>
> Your use case is "implicit" (being renamed "manual") routing when you
> create your Solr collection. There are pros and cons here, but that's
> beyond the scope of your question. Your infrastructure should port
> pretty directly to SolrCloud. The short form is that all your indexing
> and/or querying is happening on a single node when using manual
> routing rather than in parallel. Of course executing parallel
> sub-queries imposes its own overhead.
>
> If your use-case for having these on a single shard is to segregate
> the data by some set (say users), you might want to consider just
> using separate _collections_ in SolrCloud where old_shard ==
> new_collection, basically all your routing is the same. You can create
> aliases pointing to multiple collections or specify multiple
> collections on the query, don't know if that fits your use case or not
> though.
>
>
> Best,
> Erick
>
> On Fri, Dec 15, 2017 at 9:03 AM, John Davis <johndavis925...@gmail.com>
> wrote:
> > Hello,
> > We are thinking about migrating to SolrCloud. Our current setup is:
> > 1. Multiple replicas and shards.
> > 2. Each query typically hits a single shard only.
> > 3. We have an external system that assigns a document to a shard based on
> > its origin and is also used by solr clients when querying to find the
> > correct shard to query.
> >
> > It looks like the biggest advantage of SolrCloud is #3 - to route
> document
> > to the correct shard & replicas when indexing and to route query
> similarly.
> > Given we already have a fairly reliable system to do this, are there
> other
> > benefits from migrating to SolrCloud?
> >
> > Thanks,
> > John
>
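
For reference, the ADDREPLICA / DELETEREPLICA calls Erick mentions are plain
Collections API requests. A minimal sketch, where the collection name, shard
name, node name and replica name are all assumptions:

  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1"

  # optionally pin the new replica to a specific node
  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=192.168.1.5:8983_solr"

  curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node3"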


SolrCloud

2017-12-15 Thread John Davis
Hello,
We are thinking about migrating to SolrCloud. Our current setup is:
1. Multiple replicas and shards.
2. Each query typically hits a single shard only.
3. We have an external system that assigns a document to a shard based on
its origin and is also used by solr clients when querying to find the
correct shard to query.

It looks like the biggest advantage of SolrCloud is #3 - to route a document
to the correct shard & replicas when indexing and to route queries similarly.
Given we already have a fairly reliable system to do this, are there other
benefits from migrating to SolrCloud?

Thanks,
John


Solr index size statistics

2017-12-02 Thread John Davis
Hello,
Is there a way to get index size statistics for a given Solr instance? For
example, broken down by each field stored or indexed. The only things I know
of are running du on the index data files and getting counts per
indexed/stored field; however, fields can differ quite a bit in size.

Thanks
John
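
For reference, one rough way to see where the space goes (per index structure
rather than per field) is to group the Lucene segment files in the core's
data/index directory by extension, since each extension maps to one structure:
.fdt/.fdx are stored fields, .tim/.tip the term dictionary, .doc/.pos postings
and positions, .dvd/.dvm docValues, .nvd/.nvm norms, .tvd/.tvx term vectors.
A sketch, assuming the core lives under /var/solr/data/mycore:

  cd /var/solr/data/mycore/data/index
  ls -l | awk 'NF>8 {n=split($NF,a,"."); s[a[n]]+=$5} END {for (e in s) printf "%8.1f MB  .%s\n", s[e]/1048576, e}' | sort -rn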


Re: Facets based on sampling

2017-10-24 Thread John Davis
On Tue, Oct 24, 2017 at 8:37 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq:  It is a bit surprising why facet computation
>  is so slow even when the query matches hundreds of docs.
>
> The number of terms in the field over all docs also comes into play.
> Say you're faceting over a field that has 100,000,000 unique values
> across all docs, that's a lot of bookkeeping.
>
>
100M unique values might exist across all docs, but unless the faceting
implementation is really naive I cannot see how that comes into play
when the query matches only a fraction of those.



> Best,
> Erick
>
>
> On Tue, Oct 24, 2017 at 1:08 AM, Emir Arnautović
> <emir.arnauto...@sematext.com> wrote:
> > Hi John,
> > Did you mean “docValues don’t work for analysed fields” since it works
> for multivalue string (or other supported types) fields. What you need to
> do is to convert your analysed field to multivalue string field - that
> requires changes in indexing flow.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 23 Oct 2017, at 21:08, John Davis <johndavis925...@gmail.com> wrote:
> >>
> >> Docvalues don't work for multivalued fields. I just started a separate
> >> thread with more debug info. It is a bit surprising why facet
> computation
> >> is so slow even when the query matches hundreds of docs.
> >>
> >> On Mon, Oct 23, 2017 at 6:53 AM, alessandro.benedetti <
> a.benede...@sease.io>
> >> wrote:
> >>
> >>> Hi John,
> >>> first of all, I may state the obvious, but have you tried docValues ?
> >>>
> >>> Apart from that a friend of mine ( Diego Ceccarelli) was discussing a
> >>> probabilistic implementation similar to the hyperloglog[1] to
> approximate
> >>> facets counting.
> >>> I didn't have time to take a look in details / implement anything yet.
> >>> But it is on our To Do list :)
> >>> He may add some info here.
> >>>
> >>> Cheers
> >>>
> >>>
> >>>
> >>>
> >>> [1]
> >>> > https://blog.yld.io/2017/04/19/hyperloglog-a-probabilistic-data-structure/
> >>>
> >>>
> >>>
> >>> -
> >>> ---
> >>> Alessandro Benedetti
> >>> Search Consultant, R&D Software Engineer, Director
> >>> Sease Ltd. - www.sease.io
> >>> --
> >>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >>>
> >
>
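
For reference, Emir's suggestion above boils down to maintaining an
untokenized, docValues-backed copy of the analysed field and faceting on that
copy instead. A sketch, assuming the analysed field is named "school" (note
that copyField copies the original input value, not the analysed tokens, and
that this needs a reindex):

  <field name="school_str" type="string" indexed="true" stored="false"
         multiValued="true" docValues="true"/>
  <copyField source="school" dest="school_str"/>

  facet.field=school_str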


Re: Facets based on sampling

2017-10-23 Thread John Davis
Docvalues don't work for multivalued fields. I just started a separate
thread with more debug info. It is a bit surprising why facet computation
is so slow even when the query matches hundreds of docs.

On Mon, Oct 23, 2017 at 6:53 AM, alessandro.benedetti 
wrote:

> Hi John,
> first of all, I may state the obvious, but have you tried docValues ?
>
> Apart from that a friend of mine ( Diego Ceccarelli) was discussing a
> probabilistic implementation similar to the hyperloglog[1] to approximate
> facets counting.
> I didn't have time to take a look in details / implement anything yet.
> But it is on our To Do list :)
> He may add some info here.
>
> Cheers
>
>
>
>
> [1]
> https://blog.yld.io/2017/04/19/hyperloglog-a-probabilistic-data-structure/
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Really slow facet performance in 6.6

2017-10-23 Thread John Davis
Hello,

We are seeing really slow facet performance with the new Solr release. This is
on an index of 2M documents. A few things we've tried:

1. facet.method=uif, however that didn't help much (the facet fields have
docValues=false since they are multi-valued). Debug info below.

2. Changing the query (q=) that selects which documents to compute facets on
didn't help a lot, except that repeating the same query was fast, presumably
due to exact cache hits.

Sample debug info:

"timing": {
  "prepare": {
    "debug": { "time": 0.0 },
    "expand": { "time": 0.0 },
    "facet": { "time": 0.0 },
    "facet_module": { "time": 0.0 },
    "highlight": { "time": 0.0 },
    "mlt": { "time": 0.0 },
    "query": { "time": 0.0 },
    "stats": { "time": 0.0 },
    "terms": { "time": 0.0 },
    "time": 0.0
  },
  "process": {
    "debug": { "time": 87.0 },
    "expand": { "time": 0.0 },
    "facet": { "time": 9814.0 },
    "facet_module": { "time": 0.0 },
    "highlight": { "time": 0.0 },
    "mlt": { "time": 0.0 },
    "query": { "time": 20.0 },
    "stats": { "time": 0.0 },
    "terms": { "time": 0.0 },
    "time": 9922.0
  },
  "time": 9923.0
}
},

"facet-debug": {
  "elapse": 8310,
  "sub-facet": [
    {
      "action": "field facet",
      "elapse": 8310,
      "maxThreads": 2,
      "processor": "SimpleFacets",
      "sub-facet": [
        {},
        {
          "appliedMethod": "UIF",
          "field": "school",
          "inputDocSetSize": 476,
          "requestedMethod": "UIF"
        },
        {
          "appliedMethod": "UIF",
          "elapse": 2575,
          "field": "work",
          "inputDocSetSize": 476,
          "requestedMethod": "UIF"
        },
        {
          "appliedMethod": "UIF",
          "elapse": 8310,
          "field": "level",
          "inputDocSetSize": 476,
          "requestedMethod": "UIF"
        }
      ]
    }
  ]
}

Thanks
John
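
For reference, the request shape behind the debug output above would be
something like the following; the field names are taken from the debug output,
the rest of the parameters are assumptions. facet.method=uif asks for the
UnInvertedField implementation, and facet.threads lets the listed facet fields
be computed in parallel (the "maxThreads": 2 above suggests facet.threads=2):

  /select?q=QUERY_STRING&rows=0&facet=true
      &facet.field=school&facet.field=work&facet.field=level
      &facet.method=uif&facet.threads=2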


Re: Facets based on sampling

2017-10-20 Thread John Davis
Hi Yonik,
Any update on sampling-based facets? The current faceting is really slow
for fields with high cardinality, even with facet.method=uif. Are there
alternative workarounds to only look at N docs when computing facets?

On Fri, Nov 4, 2016 at 4:43 PM, Yonik Seeley <ysee...@gmail.com> wrote:

> Sampling has been on my TODO list for the JSON Facet API.
> How much it would help depends on where the bottlenecks are, but that
> in conjunction with a hashing approach to collection (assuming field
> cardinality is high) should definitely help.
>
> -Yonik
>
>
> On Fri, Nov 4, 2016 at 3:02 PM, John Davis <johndavis925...@gmail.com>
> wrote:
> > Hi,
> > I am trying to improve the performance of queries with facets. I
> understand
> > that for queries with high facet cardinality and a large number of results the
> > current facet computation algorithms can be slow as they are trying to
> loop
> > across all docs and facet values.
> >
> > Does there exist an option to compute facets by just looking at the top-n
> > results instead of all of them or a sample of results based on some query
> > parameters? I couldn't find one and if it does not exist, has this come
> up
> > before? This would definitely not be a precise facet count but using
> > reasonable sampling algorithms we should be able to extrapolate well.
> >
> > Thank you in advance for any advice!
> >
> > John
>
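
For reference, the JSON Facet API Yonik refers to is driven through the JSON
request body; there is no sampling parameter, this is only the request shape.
A minimal sketch, assuming a collection named "mycollection" and a facet field
named "school":

  curl http://localhost:8983/solr/mycollection/query -d '
  {
    "query": "QUERY_STRING",
    "limit": 0,
    "facet": {
      "schools": { "type": "terms", "field": "school", "limit": 20 }
    }
  }'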


Schemaless detecting multivalued fields

2017-10-19 Thread John Davis
Hi,
I know about the schemaless configuration defaulting to multivalued fields
of the corresponding type.

I was just wondering if there is a way to first detect whether the incoming
value is a list or a singleton, and based on that pick the corresponding type.
Ideally, if the value is a long then use tlong, while if it is a list of longs
then use tlongs.

Thanks!
John
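
For reference, the behaviour described above comes from the
AddSchemaFieldsUpdateProcessorFactory in the data-driven configset: its type
mappings are keyed on the Java class of the value and always point at the
multiValued field type, roughly like this (a from-memory sketch, not the exact
shipped config):

  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">strings</str>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Long</str>
      <str name="valueClass">java.lang.Integer</str>
      <str name="fieldType">tlongs</str>
    </lst>
  </processor>

Distinguishing a singleton from a list would presumably have to happen before
this processor runs, since the mapping itself only looks at the value class.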


Re: Empty facets on TextField

2017-01-06 Thread John Davis
We've hit this issue again since Solr defaults new fields to the string type,
which has docValues. Changing those to be lowercased text does not remove the
docValues and breaks faceting. Is there a way to remove docValues for a field
w/o starting fresh?

On Tue, Oct 18, 2016 at 8:19 PM, Yonik Seeley <ysee...@gmail.com> wrote:

> Actually, a delete-by-query of *:* may also be hit-or-miss on replicas
> in a solr cloud setup because of reorders.
> If it does work, you should see something in the logs at the INFO
> level like "REMOVING ALL DOCUMENTS FROM INDEX"
>
> -Yonik
>
> On Tue, Oct 18, 2016 at 11:02 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> > A delete-by-query of *:* may do it (because it special cases to
> > removing the index).
> > The underlying issue is when lucene merges a segment without docvalues
> > with a segment that has them.
> > -Yonik
> >
> >
> > On Tue, Oct 18, 2016 at 10:09 PM, John Davis <johndavis925...@gmail.com>
> wrote:
> >> Thanks. Is there a way around to not starting fresh and forcing the
> reindex
> >> to remove docValues?
> >>
> >> On Tue, Oct 18, 2016 at 6:56 PM, Yonik Seeley <ysee...@gmail.com>
> wrote:
> >>>
> >>> This sounds like you didn't actually start fresh, but just reindexed
> your
> >>> data.
> >>> This would mean that docValues would still exist in the index for this
> >>> field (just with no values), and that normal faceting would use those.
> >>> Forcing facet.method=enum forces the use of the index instead of
> >>> docvalues (or the fieldcache if the field is configured w/o
> >>> docvalues).
> >>>
> >>> -Yonik
> >>>
> >>> On Tue, Oct 18, 2016 at 9:43 PM, John Davis <johndavis925...@gmail.com
> >
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > I have converted one of my fields from StrField to TextField and am
> not
> >>> > getting back any facets for that field. Here's the exact
> configuration
> >>> > of
> >>> > the TextField. I have tested it with 6.2.0 on a fresh instance and it
> >>> > repros consistently. From reading through past archives and
> >>> > documentation,
> >>> > it feels like this should just work. I would appreciate any input.
> >>> >
> >>> > <fieldType name="..." class="solr.TextField"
> >>> > omitTermFreqAndPositions="true" indexed="true" stored="true"
> >>> > positionIncrementGap="100" sortMissingLast="true" multiValued="true">
> >>> >   <analyzer>
> >>> >     [tokenizer and filter definitions stripped by the list archive]
> >>> >   </analyzer>
> >>> > </fieldType>
> >>> >
> >>> > Search query:
> >>> > /select/?facet.field=FACET_FIELD_NAME&facet=on&indent=on&q=QUERY_STRING&wt=json
> >>> >
> >>> > Interestingly facets are returned if I change facet.method to enum
> >>> > instead
> >>> > of default fc.
> >>> >
> >>> > John
> >>
> >>
>
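
For reference, the delete-by-query Yonik describes is a single update request.
A sketch, assuming a core named "mycore":

  curl "http://localhost:8983/solr/mycore/update?commit=true" \
       -H "Content-Type: text/xml" \
       --data-binary "<delete><query>*:*</query></delete>"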


Facets based on sampling

2016-11-04 Thread John Davis
Hi,
I am trying to improve the performance of queries with facets. I understand
that for queries with high facet cardinality and a large number of results the
current facet computation algorithms can be slow as they are trying to loop
across all docs and facet values.

Is there an option to compute facets by looking at just the top-n results
instead of all of them, or at a sample of results based on some query
parameters? I couldn't find one, and if it does not exist, has this come up
before? This would definitely not be a precise facet count, but with
reasonable sampling algorithms we should be able to extrapolate well.

Thank you in advance for any advice!

John


Re: Empty facets on TextField

2016-10-18 Thread John Davis
Thanks. Is there a way to avoid starting fresh and instead force the reindex
to remove docValues?

On Tue, Oct 18, 2016 at 6:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:

> This sounds like you didn't actually start fresh, but just reindexed your
> data.
> This would mean that docValues would still exist in the index for this
> field (just with no values), and that normal faceting would use those.
> Forcing facet.method=enum forces the use of the index instead of
> docvalues (or the fieldcache if the field is configured w/o
> docvalues).
>
> -Yonik
>
> On Tue, Oct 18, 2016 at 9:43 PM, John Davis <johndavis925...@gmail.com>
> wrote:
> > Hi,
> >
> > I have converted one of my fields from StrField to TextField and am not
> > getting back any facets for that field. Here's the exact configuration of
> > the TextField. I have tested it with 6.2.0 on a fresh instance and it
> > repros consistently. From reading through past archives and
> documentation,
> > it feels like this should just work. I would appreciate any input.
> >
> > <fieldType name="..." class="solr.TextField"
> > omitTermFreqAndPositions="true" indexed="true" stored="true"
> > positionIncrementGap="100" sortMissingLast="true" multiValued="true">
> >   <analyzer>
> >     [tokenizer and filter definitions stripped by the list archive]
> >   </analyzer>
> > </fieldType>
> >
> > Search query:
> > /select/?facet.field=FACET_FIELD_NAME&facet=on&indent=on&q=QUERY_STRING&wt=json
> >
> > Interestingly facets are returned if I change facet.method to enum
> instead
> > of default fc.
> >
> > John
>


Empty facets on TextField

2016-10-18 Thread John Davis
Hi,

I have converted one of my fields from StrField to TextField and am not
getting back any facets for that field. Here's the exact configuration of
the TextField. I have tested it with 6.2.0 on a fresh instance and it
repros consistently. From reading through past archives and documentation,
it feels like this should just work. I would appreciate any input.

<fieldType name="..." class="solr.TextField" omitTermFreqAndPositions="true"
indexed="true" stored="true" positionIncrementGap="100" sortMissingLast="true"
multiValued="true">
  <analyzer>
    [tokenizer and filter definitions stripped by the list archive]
  </analyzer>
</fieldType>

Search query:
/select/?facet.field=FACET_FIELD_NAME&facet=on&indent=on&q=QUERY_STRING&wt=json

Interestingly facets are returned if I change facet.method to enum instead
of default fc.

John