Re: A question about solr filter cache

2020-02-17 Thread Hongxu Ma
@Vadim Ivanov

Thank you!

From: Vadim Ivanov 
Sent: Tuesday, February 18, 2020 15:27
To: solr-user@lucene.apache.org 
Subject: RE: A question about solr filter cache

Hi!
Yes, it may depend on the Solr version.
The Solr 8.3 Admin filterCache page stats look like:

stats:
CACHE.searcher.filterCache.cleanupThread:false
CACHE.searcher.filterCache.cumulative_evictions:0
CACHE.searcher.filterCache.cumulative_hitratio:0.94
CACHE.searcher.filterCache.cumulative_hits:198
CACHE.searcher.filterCache.cumulative_idleEvictions:0
CACHE.searcher.filterCache.cumulative_inserts:12
CACHE.searcher.filterCache.cumulative_lookups:210
CACHE.searcher.filterCache.evictions:0
CACHE.searcher.filterCache.hitratio:1
CACHE.searcher.filterCache.hits:84
CACHE.searcher.filterCache.idleEvictions:0
CACHE.searcher.filterCache.inserts:0
CACHE.searcher.filterCache.lookups:84
CACHE.searcher.filterCache.maxRamMB:-1
CACHE.searcher.filterCache.ramBytesUsed:70768
CACHE.searcher.filterCache.size:12
CACHE.searcher.filterCache.warmupTime:1

> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Tuesday, February 18, 2020 5:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: A question about solr filter cache
>
> @Erick Erickson and @Mikhail Khludnev
>
> got it, the explanation is very clear.
>
> Thank you for your help.
> 
> From: Hongxu Ma 
> Sent: Tuesday, February 18, 2020 10:22
> To: Vadim Ivanov ; solr-
> u...@lucene.apache.org 
> Subject: Re: A question about solr filter cache
>
> Thank you @Vadim Ivanov
> I know that admin page, but I cannot find the memory usage of the filter cache
> (it only has "CACHE.searcher.filterCache.size", which I think is the number of
> used slots in the filterCache).
> 
> Here is my output (Solr version 7.3.1):
>
> filterCache
>
>   * class: org.apache.solr.search.FastLRUCache
>   * description: Concurrent LRU Cache(maxSize=512, initialSize=512, minSize=460,
>     acceptableSize=486, cleanupThread=false)
>   * stats:
>       CACHE.searcher.filterCache.cumulative_evictions: 0
>       CACHE.searcher.filterCache.cumulative_hitratio: 0.5
>       CACHE.searcher.filterCache.cumulative_hits: 1
>       CACHE.searcher.filterCache.cumulative_inserts: 1
>       CACHE.searcher.filterCache.cumulative_lookups: 2
>       CACHE.searcher.filterCache.evictions: 0
>       CACHE.searcher.filterCache.hitratio: 0.5
>       CACHE.searcher.filterCache.hits: 1
>       CACHE.searcher.filterCache.inserts: 1
>       CACHE.searcher.filterCache.lookups: 2
>       CACHE.searcher.filterCache.size: 1
>       CACHE.searcher.filterCache.warmupTime: 0
>
>
>
> 
> From: Vadim Ivanov 
> Sent: Monday, February 17, 2020 17:51
> To: solr-user@lucene.apache.org 
> Subject: RE: A question about solr filter cache
>
> You can easily check the amount of RAM used by a core's filterCache in the Admin UI:
> Choose core - Plugins/Stats - Cache - filterCache. It shows useful information
> on configuration, statistics, and current RAM usage by the filter cache, as well as
> some examples of current filter caches in RAM. A core with, for example, 10 mln
> docs uses about 1.3 MB of RAM for every filterCache entry.
>
>
> > -Original Message-
> > From: Hongxu Ma [mailto:inte...@outlook.com]
> > Sent: Monday, February 17, 2020 12:13 PM
> > To: solr-user@lucene.apache.org
> > Subject: A question about solr filter cache
> >
> > Hi
> > I want to know the internal of solr filter cache, especially its
> > memory
> usage.
> >
> > I googled some pages:
> > https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> > https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.h
> > tml
> > (Erick Erickson's answer)
> >
> > All of them said its structure is: fq => a bitmap (total doc number
> > bits),
> but I
> > think it's not so simple, reason:
> > Given total doc number is 1 billion, each filter cache entry will use
> nearly
> > 1GB(10/8 bit), it's too big and very easy to make solr OOM (I
> > have
> a
> > 1 billion doc cluster, looks it works well)
> >
> > And I also checked solr node, but cannot find the details (only saw
> > using DocSets structure)
> >
> > So far, I guess:
> >
> >   *   degenerate into an doc id array/list when the bitmap is sparse
> >   *   using some compressed bitmap, e.g. roaring bitmaps
> >
> > which one is correct? or another answer, thanks you very much!
>




RE: A question about solr filter cache

2020-02-17 Thread Vadim Ivanov
Hi!
Yes, it may depend on the Solr version.
The Solr 8.3 Admin filterCache page stats look like:

stats:
CACHE.searcher.filterCache.cleanupThread:false
CACHE.searcher.filterCache.cumulative_evictions:0
CACHE.searcher.filterCache.cumulative_hitratio:0.94
CACHE.searcher.filterCache.cumulative_hits:198
CACHE.searcher.filterCache.cumulative_idleEvictions:0
CACHE.searcher.filterCache.cumulative_inserts:12
CACHE.searcher.filterCache.cumulative_lookups:210
CACHE.searcher.filterCache.evictions:0
CACHE.searcher.filterCache.hitratio:1
CACHE.searcher.filterCache.hits:84
CACHE.searcher.filterCache.idleEvictions:0
CACHE.searcher.filterCache.inserts:0
CACHE.searcher.filterCache.lookups:84
CACHE.searcher.filterCache.maxRamMB:-1
CACHE.searcher.filterCache.ramBytesUsed:70768
CACHE.searcher.filterCache.size:12
CACHE.searcher.filterCache.warmupTime:1
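
For completeness: the same figures, including ramBytesUsed, can also be pulled over
HTTP from the Metrics API on recent Solr versions. A minimal sketch, assuming Solr is
listening on localhost:8983 (adjust host and port as needed):

curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE.searcher.filterCache'

The response lists the filterCache metrics per core registry, so it is an easy way to
compare RAM usage across all cores at once.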

> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Tuesday, February 18, 2020 5:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: A question about solr filter cache
> 
> @Erick Erickson and @Mikhail Khludnev
> 
> got it, the explanation is very clear.
> 
> Thank you for your help.
> 
> From: Hongxu Ma 
> Sent: Tuesday, February 18, 2020 10:22
> To: Vadim Ivanov ; solr-
> u...@lucene.apache.org 
> Subject: Re: A question about solr filter cache
> 
> Thank you @Vadim Ivanov
> I know that admin page, but I cannot find the memory usage of the filter cache
> (it only has "CACHE.searcher.filterCache.size", which I think is the number of
> used slots in the filterCache).
> 
> Here is my output (Solr version 7.3.1):
> 
> filterCache
>
>   * class: org.apache.solr.search.FastLRUCache
>   * description: Concurrent LRU Cache(maxSize=512, initialSize=512, minSize=460,
>     acceptableSize=486, cleanupThread=false)
>   * stats:
>       CACHE.searcher.filterCache.cumulative_evictions: 0
>       CACHE.searcher.filterCache.cumulative_hitratio: 0.5
>       CACHE.searcher.filterCache.cumulative_hits: 1
>       CACHE.searcher.filterCache.cumulative_inserts: 1
>       CACHE.searcher.filterCache.cumulative_lookups: 2
>       CACHE.searcher.filterCache.evictions: 0
>       CACHE.searcher.filterCache.hitratio: 0.5
>       CACHE.searcher.filterCache.hits: 1
>       CACHE.searcher.filterCache.inserts: 1
>       CACHE.searcher.filterCache.lookups: 2
>       CACHE.searcher.filterCache.size: 1
>       CACHE.searcher.filterCache.warmupTime: 0
> 
> 
> 
> 
> From: Vadim Ivanov 
> Sent: Monday, February 17, 2020 17:51
> To: solr-user@lucene.apache.org 
> Subject: RE: A question about solr filter cache
> 
> You can easily check the amount of RAM used by a core's filterCache in the Admin UI:
> Choose core - Plugins/Stats - Cache - filterCache. It shows useful information
> on configuration, statistics, and current RAM usage by the filter cache, as well as
> some examples of current filter caches in RAM. A core with, for example, 10 mln
> docs uses about 1.3 MB of RAM for every filterCache entry.
> 
> 
> > -Original Message-
> > From: Hongxu Ma [mailto:inte...@outlook.com]
> > Sent: Monday, February 17, 2020 12:13 PM
> > To: solr-user@lucene.apache.org
> > Subject: A question about solr filter cache
> >
> > Hi
> > I want to know the internal of solr filter cache, especially its
> > memory
> usage.
> >
> > I googled some pages:
> > https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> > https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.h
> > tml
> > (Erick Erickson's answer)
> >
> > All of them said its structure is: fq => a bitmap (total doc number
> > bits),
> but I
> > think it's not so simple, reason:
> > Given total doc number is 1 billion, each filter cache entry will use
> nearly
> > 1GB(10/8 bit), it's too big and very easy to make solr OOM (I
> > have
> a
> > 1 billion doc cluster, looks it works well)
> >
> > And I also checked solr node, but cannot find the details (only saw
> > using DocSets structure)
> >
> > So far, I guess:
> >
> >   *   degenerate into an doc id array/list when the bitmap is sparse
> >   *   using some compressed bitmap, e.g. roaring bitmaps
> >
> > which one is correct? or another answer, thanks you very much!
> 




Re: A question about solr filter cache

2020-02-17 Thread Hongxu Ma
@Erick Erickson and @Mikhail Khludnev

got it, the explanation is very clear.

Thank you for your help.

From: Hongxu Ma 
Sent: Tuesday, February 18, 2020 10:22
To: Vadim Ivanov ; 
solr-user@lucene.apache.org 
Subject: Re: A question about solr filter cache

Thank you @Vadim Ivanov
I know that admin page, but I cannot find the memory usage of the filter cache
(it only has "CACHE.searcher.filterCache.size", which I think is the number of
used slots in the filterCache).

Here is my output (Solr version 7.3.1):

filterCache

  * class: org.apache.solr.search.FastLRUCache
  * description: Concurrent LRU Cache(maxSize=512, initialSize=512, minSize=460,
    acceptableSize=486, cleanupThread=false)
  * stats:
      CACHE.searcher.filterCache.cumulative_evictions: 0
      CACHE.searcher.filterCache.cumulative_hitratio: 0.5
      CACHE.searcher.filterCache.cumulative_hits: 1
      CACHE.searcher.filterCache.cumulative_inserts: 1
      CACHE.searcher.filterCache.cumulative_lookups: 2
      CACHE.searcher.filterCache.evictions: 0
      CACHE.searcher.filterCache.hitratio: 0.5
      CACHE.searcher.filterCache.hits: 1
      CACHE.searcher.filterCache.inserts: 1
      CACHE.searcher.filterCache.lookups: 2
      CACHE.searcher.filterCache.size: 1
      CACHE.searcher.filterCache.warmupTime: 0




From: Vadim Ivanov 
Sent: Monday, February 17, 2020 17:51
To: solr-user@lucene.apache.org 
Subject: RE: A question about solr filter cache

You can easily check the amount of RAM used by a core's filterCache in the Admin UI:
Choose core - Plugins/Stats - Cache - filterCache.
It shows useful information on configuration, statistics, and current RAM
usage by the filter cache,
as well as some examples of current filter caches in RAM.
A core with, for example, 10 mln docs uses about 1.3 MB of RAM for every filterCache entry.


> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Monday, February 17, 2020 12:13 PM
> To: solr-user@lucene.apache.org
> Subject: A question about solr filter cache
>
> Hi
> I want to understand the internals of the Solr filter cache, especially its
> memory usage.
>
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
>
> All of them said its structure is: fq => a bitmap (total doc number of bits), but I
> think it's not so simple. Reason:
> Given a total doc number of 1 billion, each filter cache entry would use nearly
> 1 Gbit (10^9 bits, ~120 MB); that is too big and could easily make Solr OOM (but I
> have a 1-billion-doc cluster, and it seems to work well)
>
> And I also checked the Solr code, but cannot find the details (I only saw that it
> uses the DocSet structure)
>
> So far, I guess:
>
>   *   it degenerates into a doc id array/list when the bitmap is sparse
>   *   it uses some compressed bitmap, e.g. roaring bitmaps
>
> Which one is correct? Or is there another answer? Thank you very much!




Re: A question about solr filter cache

2020-02-17 Thread Hongxu Ma
Thank you @Vadim Ivanov
I know that admin page, but I cannot find the memory usage of the filter cache
(it only has "CACHE.searcher.filterCache.size", which I think is the number of
used slots in the filterCache).

Here is my output (Solr version 7.3.1):

filterCache

  * class: org.apache.solr.search.FastLRUCache
  * description: Concurrent LRU Cache(maxSize=512, initialSize=512, minSize=460,
    acceptableSize=486, cleanupThread=false)
  * stats:
      CACHE.searcher.filterCache.cumulative_evictions: 0
      CACHE.searcher.filterCache.cumulative_hitratio: 0.5
      CACHE.searcher.filterCache.cumulative_hits: 1
      CACHE.searcher.filterCache.cumulative_inserts: 1
      CACHE.searcher.filterCache.cumulative_lookups: 2
      CACHE.searcher.filterCache.evictions: 0
      CACHE.searcher.filterCache.hitratio: 0.5
      CACHE.searcher.filterCache.hits: 1
      CACHE.searcher.filterCache.inserts: 1
      CACHE.searcher.filterCache.lookups: 2
      CACHE.searcher.filterCache.size: 1
      CACHE.searcher.filterCache.warmupTime: 0




From: Vadim Ivanov 
Sent: Monday, February 17, 2020 17:51
To: solr-user@lucene.apache.org 
Subject: RE: A question about solr filter cache

You can easily check the amount of RAM used by a core's filterCache in the Admin UI:
Choose core - Plugins/Stats - Cache - filterCache.
It shows useful information on configuration, statistics, and current RAM
usage by the filter cache,
as well as some examples of current filter caches in RAM.
A core with, for example, 10 mln docs uses about 1.3 MB of RAM for every filterCache entry.


> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Monday, February 17, 2020 12:13 PM
> To: solr-user@lucene.apache.org
> Subject: A question about solr filter cache
>
> Hi
> I want to understand the internals of the Solr filter cache, especially its
> memory usage.
>
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
>
> All of them said its structure is: fq => a bitmap (total doc number of bits), but I
> think it's not so simple. Reason:
> Given a total doc number of 1 billion, each filter cache entry would use nearly
> 1 Gbit (10^9 bits, ~120 MB); that is too big and could easily make Solr OOM (but I
> have a 1-billion-doc cluster, and it seems to work well)
>
> And I also checked the Solr code, but cannot find the details (I only saw that it
> uses the DocSet structure)
>
> So far, I guess:
>
>   *   it degenerates into a doc id array/list when the bitmap is sparse
>   *   it uses some compressed bitmap, e.g. roaring bitmaps
>
> Which one is correct? Or is there another answer? Thank you very much!




Best Practises around relevance tuning per query

2020-02-17 Thread Ashwin Ramesh
Hi,

We are in the process of applying a scoring model to our search results. In
particular, we would like to add scores for documents per query and user
context.

For example, we want to have a score from 500 to 1 for the top 500
documents for the query “dog” for users who speak US English.

We believe it becomes infeasible to store these scores in Solr because we
want to update the scores regularly, and the number of scores increases
rapidly with increased user attributes.

One solution we explored was to store these scores in a secondary data
store, and use this at Solr query time with a boost function such as:

`bf=mul(termfreq(id,'ID-1'),500) mul(termfreq(id,'ID-2'),499) …
mul(termfreq(id,'ID-500'),1)`

We have over a hundred thousand documents in one Solr collection, and about
fifty million in another Solr collection. We have some queries for which
roughly 80% of the results match, although this is an edge case. We wanted
to know the worst case performance, so we tested with such a query. For
both of these collections we found a message similar to the following
in the Solr cloud logs (tested on a laptop):

Elapsed time: 5020. Exceeded allowed search time: 5000 ms.

We then tried using the following boost, which seemed simpler:

`boost=if(query($qq), 10, 1)&qq=id:(ID-1 OR ID-2 OR … OR ID-500)`

We then saw the following in the Solr cloud logs:

`The request took too long to iterate over terms.`

All responses above took over 5000 milliseconds to return.

We are considering Solr’s re-ranker, but I don’t know how we would use this
without pushing all the query-context-document scores to Solr.
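
For reference, this is roughly what a re-rank request would look like; a minimal
sketch only, with a placeholder collection name and document IDs, and it still
assumes the high-value IDs for the user context are known at query time:

curl 'http://localhost:8983/solr/my_collection/select' \
  --data-urlencode 'q=dog' \
  --data-urlencode 'fq=featureA:foo' \
  --data-urlencode 'fq=featureB:bar' \
  --data-urlencode 'rq={!rerank reRankQuery=$rqq reRankDocs=500 reRankWeight=3}' \
  --data-urlencode 'rqq=id:(ID-1 OR ID-2 OR ID-3)' \
  --data-urlencode 'fl=id,score'

The rerank query is applied only to the top reRankDocs results of the main query,
which keeps the expensive per-document boosting off the full result set.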


The alternative solution that we are currently considering involves
invoking multiple solr queries.

This means we would make a request to solr to fetch the top N results (id,
score) for the query. E.g. q=dog, fq=featureA:foo, fq=featureB:bar, limit=N.

Another request would be made using a filter query with a set of doc ids
that we know are high value for the user’s query. E.g. q=*:*,
fq=featureA:foo, fq=featureB:bar, fq=id:(d1, d2, d3), limit=N.

We would then do a reranking phase in our service layer.

Do you have any suggestions for known patterns of how we can store and
retrieve scores per user context and query?

Regards,
Ash & Spirit.


Re: Metadata info on Stored Fields

2020-02-17 Thread Edward Ribeiro
Sorry, my fault.

I overlooked this excerpt of yours: "do I get the file name included in each
snippet fragment - this again needs exploring on my end". No, the solution
I proposed doesn't address that. :(

Edward

On Mon, Feb 17, 2020, 14:03, Srijan  wrote:

> You know what, I think I missed a major description in my earlier email. I
> want to be able to return additional data from stored fields alongside the
> snippets during highlighting. In this case, the filename where this snippet
> came from. Not sure your approach would address that.
>
> On Mon, Feb 17, 2020, 10:44 Edward Ribeiro 
> wrote:
>
> > Hi,
> >
> > You may try to create two kinds of docs forming a parent-child
> relationship
> > without nesting. Like
> >
> > 
> > 894
> > parent
> >
> > ...
> > 
> >
> > 
> > 3213
> > child
> > 894
> > xxx
> >  portion of file 1
> >  remaining portion of file 1
> > ...
> > 
> >
> > Then you can add metadata for each child doc. The search can be done on
> > child docs but if you need to group you can use the join query parser (it
> > has some limitations though) or grouping by parent_id.
> >
> > Cheers,
> > Edward
> >
> >
> > Em seg, 17 de fev de 2020 12:25, Srijan  escreveu:
> >
> > > Hi,
> > >
> > > I have a data model where the operational "Object" can have one or more
> > > files attached. Indexing these objects in Solr means indexing all
> > metadata
> > > info and the contents of the files. For file contents what I have right
> > now
> > > is a single multi-valued field (for each locale)
> > >
> > > Example:
> > > 
> > > xxx
> > > yyy
> > >  portion of file 1
> > >  remaining portion of file 1
> > >  portion of file 2
> > >  contents from file 2 again...
> > > ...
> > > 
> > >
> > > Search is easy and everything's been working fine. We recently
> introduced
> > > highlighting functionality on these file content fields. Again,
> straight
> > > forward use-case. Next requirement is where things get a little tricky.
> > We
> > > want to be able to return the name of the file ( generalizing this - or
> > > some other metadata info related to the file content field). If our
> data
> > > model had a 1:1 relation between our operational object and the file it
> > > contains, the file name would have been just another field on the main
> > doc
> > > but unfortunately that's not the case - each file content field could
> > > belong to any file.
> > >
> > > There are a couple of potential solutions I have been thinking of:
> > > 1. Use nested docs to preserve the logical grouping of file content and
> > the
> > > file info where this content is coming from. This could potentially
> work
> > > but I haven't done any testing yet (I know highlighting doesn't work on
> > > nested docs for example)
> > >
> > > 2. Encode the file name in the file content fields themselves. The file
> > > name will be removed during indexing but will be stored. How do I get
> the
> > > file name included in each snippet fragment - this again needs
> exploring
> > on
> > > my end
> > >
> > > Another approach I have been thinking is extending the StoredField to
> > also
> > > store additional meta data information. So basically when a stored
> field
> > is
> > > retrieved, or a fragment is returned, I also have additional
> information
> > > associated with the stored field. Can someone tell me this is a
> terrible
> > > idea and I should not be pursuing.
> > >
> > > Is there something else I can try?
> > >
> > > Thanks a lot,
> > > Srijan
> > >
> >
>


Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Make phrases into single tokens at indexing and query time. Let the engine do
the rest of the work.

For example, “subunits of the army” can become “subunitsofthearmy” or 
“subunits_of_the_army”.
We used patterns to choose phrases, so “word word”, “word glue word”, or “word 
glue glue word”
could become phrases.

Nutch did something like this, but used it for filtering down the candidates 
for matching,
then used regular Lucene scoring for ranking.

The Infoseek Ultra index used these phrase terms but did not store positions.

The idea came from early DNA search engines.
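
A rough modern-Solr approximation of the same idea (a sketch only, not the
pattern-based glue-word selection described above; the field type name is a
placeholder) is a shingle filter that joins neighbouring tokens with "_", added
via the Schema API:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
    "name": "text_phrases",
    "class": "solr.TextField",
    "analyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.LowerCaseFilterFactory" },
        { "class": "solr.ShingleFilterFactory",
          "minShingleSize": 2, "maxShingleSize": 4,
          "tokenSeparator": "_", "outputUnigrams": true }
      ]
    }
  }
}' http://localhost:8983/solr/techproducts/schema

With maxShingleSize=4, "subunits of the army" indexes tokens such as
"subunits_of", "subunits_of_the", and "subunits_of_the_army" alongside the
single words.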

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:53 AM, David Hastings  
> wrote:
> 
> interesting, i cant seem to find anything on Phrase IDF, dont suppose you
> have a link or two i could look at by chance?
> 
> On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
> wrote:
> 
>> At Infoseek, we used “glue words” to build phrase tokens. It was really
>> effective.
>> Phrase IDF is powerful stuff.
>> 
>> Luckily for you, the patent on that has expired. :-)
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 10:46 AM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>>> 
>>> i use stop words for building shingles into "interesting phrases" for my
>>> machine teacher/students, so i wouldnt say theres no reason, however my
>> use
>>> case is very specific.  Otherwise yeah, theyre gone for all practical
>>> reasons/search scenarios.
>>> 
>>> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
>>> wrote:
>>> 
 Why are you using stopwords? I would need a really, really good reason
>> to
 use those.
 
 Stopwords are an obsolete technique from 16-bit processors. I’ve never
 used them and
 I’ve been a search engineer since 1997.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
 wrote:
> 
> Hi
> 
> I've run into an issue with creating a Managed Stopwords list that has
 the
> same name as a previously deleted list. Going through the same flow
>> with
> Managed Synonyms doesn't result in this unexpected behaviour. Am I
 missing
> something or did I discover a bug in Solr?
> 
> On a newly started solr with the techproducts core:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl
 http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> The second PUT request results in a status 500 with error
> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> 
> Similar requests for synonyms work fine, no matter how many times I
 repeat
> the CREATE/DELETE/RELOAD cycle:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl
 http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> 
> Reloading after creating the Stopwords list but not after deleting it
 works
> without error too on a fresh techproducts core (you'll have to remove
>> the
> directory from disk and create the core again after running the
>> previous
> commands).
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl
 http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X DELETE
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
Interesting, I can't seem to find anything on Phrase IDF. Don't suppose you
have a link or two I could look at by chance?

On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
wrote:

> At Infoseek, we used “glue words” to build phrase tokens. It was really
> effective.
> Phrase IDF is powerful stuff.
>
> Luckily for you, the patent on that has expired. :-)
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 10:46 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > i use stop words for building shingles into "interesting phrases" for my
> > machine teacher/students, so i wouldnt say theres no reason, however my
> use
> > case is very specific.  Otherwise yeah, theyre gone for all practical
> > reasons/search scenarios.
> >
> > On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> > wrote:
> >
> >> Why are you using stopwords? I would need a really, really good reason
> to
> >> use those.
> >>
> >> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> >> used them and
> >> I’ve been a search engineer since 1997.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> >> wrote:
> >>>
> >>> Hi
> >>>
> >>> I've run into an issue with creating a Managed Stopwords list that has
> >> the
> >>> same name as a previously deleted list. Going through the same flow
> with
> >>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
> >> missing
> >>> something or did I discover a bug in Solr?
> >>>
> >>> On a newly started solr with the techproducts core:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>>
> >>> The second PUT request results in a status 500 with error
> >>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >>>
> >>> Similar requests for synonyms work fine, no matter how many times I
> >> repeat
> >>> the CREATE/DELETE/RELOAD cycle:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl -X DELETE
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>>
> >>> Reloading after creating the Stopwords list but not after deleting it
> >> works
> >>> without error too on a fresh techproducts core (you'll have to remove
> the
> >>> directory from disk and create the core again after running the
> previous
> >>> commands).
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>>
> >>> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> >>> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the
> cycle
> >>> can be completed twice. (Again, on a freshly created techproducts
> core.)
> >>> Only the third attempt to create a list results in an error. Synonyms
> can
> >>> still be created and deleted repeatedly after this.
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
At Infoseek, we used “glue words” to build phrase tokens. It was really 
effective.
Phrase IDF is powerful stuff.

Luckily for you, the patent on that has expired. :-)

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:46 AM, David Hastings  
> wrote:
> 
> i use stop words for building shingles into "interesting phrases" for my
> machine teacher/students, so i wouldnt say theres no reason, however my use
> case is very specific.  Otherwise yeah, theyre gone for all practical
> reasons/search scenarios.
> 
> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> wrote:
> 
>> Why are you using stopwords? I would need a really, really good reason to
>> use those.
>> 
>> Stopwords are an obsolete technique from 16-bit processors. I’ve never
>> used them and
>> I’ve been a search engineer since 1997.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
>> wrote:
>>> 
>>> Hi
>>> 
>>> I've run into an issue with creating a Managed Stopwords list that has
>> the
>>> same name as a previously deleted list. Going through the same flow with
>>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
>> missing
>>> something or did I discover a bug in Solr?
>>> 
>>> On a newly started solr with the techproducts core:
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> 
>>> The second PUT request results in a status 500 with error
>>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
>>> 
>>> Similar requests for synonyms work fine, no matter how many times I
>> repeat
>>> the CREATE/DELETE/RELOAD cycle:
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl -X DELETE
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> 
>>> Reloading after creating the Stopwords list but not after deleting it
>> works
>>> without error too on a fresh techproducts core (you'll have to remove the
>>> directory from disk and create the core again after running the previous
>>> commands).
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> 
>>> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
>>> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
>>> can be completed twice. (Again, on a freshly created techproducts core.)
>>> Only the third attempt to create a list results in an error. Synonyms can
>>> still be created and deleted repeatedly after this.
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl -X DELETE
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
I use stop words for building shingles into "interesting phrases" for my
machine teacher/students, so I wouldn't say there's no reason; however, my use
case is very specific. Otherwise yeah, they're gone for all practical
reasons/search scenarios.

On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
wrote:

> Why are you using stopwords? I would need a really, really good reason to
> use those.
>
> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> used them and
> I’ve been a search engineer since 1997.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> wrote:
> >
> > Hi
> >
> > I've run into an issue with creating a Managed Stopwords list that has
> the
> > same name as a previously deleted list. Going through the same flow with
> > Managed Synonyms doesn't result in this unexpected behaviour. Am I
> missing
> > something or did I discover a bug in Solr?
> >
> > On a newly started solr with the techproducts core:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > The second PUT request results in a status 500 with error
> > msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >
> > Similar requests for synonyms work fine, no matter how many times I
> repeat
> > the CREATE/DELETE/RELOAD cycle:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >
> > Reloading after creating the Stopwords list but not after deleting it
> works
> > without error too on a fresh techproducts core (you'll have to remove the
> > directory from disk and create the core again after running the previous
> > commands).
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> > CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> > can be completed twice. (Again, on a freshly created techproducts core.)
> > Only the third attempt to create a list results in an error. Synonyms can
> > still be created and deleted repeatedly after this.
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json'

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Why are you using stopwords? I would need a really, really good reason to use 
those.

Stopwords are an obsolete technique from 16-bit processors. I’ve never used 
them and
I’ve been a search engineer since 1997.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 7:31 AM, Thomas Corthals  wrote:
> 
> Hi
> 
> I've run into an issue with creating a Managed Stopwords list that has the
> same name as a previously deleted list. Going through the same flow with
> Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
> something or did I discover a bug in Solr?
> 
> On a newly started solr with the techproducts core:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> The second PUT request results in a status 500 with error
> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> 
> Similar requests for synonyms work fine, no matter how many times I repeat
> the CREATE/DELETE/RELOAD cycle:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> 
> Reloading after creating the Stopwords list but not after deleting it works
> without error too on a fresh techproducts core (you'll have to remove the
> directory from disk and create the core again after running the previous
> commands).
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> can be completed twice. (Again, on a freshly created techproducts core.)
> Only the third attempt to create a list results in an error. Synonyms can
> still be created and deleted repeatedly after this.
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> htt

Re: Metadata info on Stored Fields

2020-02-17 Thread Srijan
You know what, I think I missed a major description in my earlier email. I
want to be able to return additional data from stored fields alongside the
snippets during highlighting. In this case, the filename where this snippet
came from. Not sure your approach would address that.

On Mon, Feb 17, 2020, 10:44 Edward Ribeiro  wrote:

> Hi,
>
> You may try to create two kinds of docs forming a parent-child relationship
> without nesting. Like
>
> 
> 894
> parent
>
> ...
> 
>
> 
> 3213
> child
> 894
> xxx
>  portion of file 1
>  remaining portion of file 1
> ...
> 
>
> Then you can add metadata for each child doc. The search can be done on
> child docs but if you need to group you can use the join query parser (it
> has some limitations though) or grouping by parent_id.
>
> Cheers,
> Edward
>
>
> On Mon, Feb 17, 2020, 12:25, Srijan  wrote:
>
> > Hi,
> >
> > I have a data model where the operational "Object" can have one or more
> > files attached. Indexing these objects in Solr means indexing all
> metadata
> > info and the contents of the files. For file contents what I have right
> now
> > is a single multi-valued field (for each locale)
> >
> > Example:
> > 
> > xxx
> > yyy
> >  portion of file 1
> >  remaining portion of file 1
> >  portion of file 2
> >  contents from file 2 again...
> > ...
> > 
> >
> > Search is easy and everything's been working fine. We recently introduced
> > highlighting functionality on these file content fields. Again, straight
> > forward use-case. Next requirement is where things get a little tricky.
> We
> > want to be able to return the name of the file ( generalizing this - or
> > some other metadata info related to the file content field). If our data
> > model had a 1:1 relation between our operational object and the file it
> > contains, the file name would have been just another field on the main
> doc
> > but unfortunately that's not the case - each file content field could
> > belong to any file.
> >
> > There are a couple of potential solutions I have been thinking of:
> > 1. Use nested docs to preserve the logical grouping of file content and
> the
> > file info where this content is coming from. This could potentially work
> > but I haven't done any testing yet (I know highlighting doesn't work on
> > nested docs for example)
> >
> > 2. Encode the file name in the file content fields themselves. The file
> > name will be removed during indexing but will be stored. How do I get the
> > file name included in each snippet fragment - this again needs exploring
> on
> > my end
> >
> > Another approach I have been thinking is extending the StoredField to
> also
> > store additional meta data information. So basically when a stored field
> is
> > retrieved, or a fragment is returned, I also have additional information
> > associated with the stored field. Can someone tell me this is a terrible
> > idea and I should not be pursuing.
> >
> > Is there something else I can try?
> >
> > Thanks a lot,
> > Srijan
> >
>


Re: Metadata info on Stored Fields

2020-02-17 Thread Edward Ribeiro
Hi,

You may try to create two kinds of docs forming a parent-child relationship
without nesting. Like


894
parent

...



3213
child
894
xxx
 portion of file 1
 remaining portion of file 1
...


Then you can add metadata for each child doc. The search can be done on
child docs but if you need to group you can use the join query parser (it
has some limitations though) or grouping by parent_id.
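
A minimal query-side sketch of both options, with placeholder collection and
field names (content_en, parent_id):

# group matching child docs (file contents) by the parent they belong to
curl 'http://localhost:8983/solr/mycollection/select' \
  --data-urlencode 'q=content_en:snippet' \
  --data-urlencode 'group=true' \
  --data-urlencode 'group.field=parent_id'

# or join from matching children back to their parent documents
curl 'http://localhost:8983/solr/mycollection/select' \
  --data-urlencode 'q={!join from=parent_id to=id}content_en:snippet'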

Cheers,
Edward


On Mon, Feb 17, 2020, 12:25, Srijan  wrote:

> Hi,
>
> I have a data model where the operational "Object" can have one or more
> files attached. Indexing these objects in Solr means indexing all metadata
> info and the contents of the files. For file contents what I have right now
> is a single multi-valued field (for each locale)
>
> Example:
> 
> xxx
> yyy
>  portion of file 1
>  remaining portion of file 1
>  portion of file 2
>  contents from file 2 again...
> ...
> 
>
> Search is easy and everything's been working fine. We recently introduced
> highlighting functionality on these file content fields. Again, straight
> forward use-case. Next requirement is where things get a little tricky. We
> want to be able to return the name of the file ( generalizing this - or
> some other metadata info related to the file content field). If our data
> model had a 1:1 relation between our operational object and the file it
> contains, the file name would have been just another field on the main doc
> but unfortunately that's not the case - each file content field could
> belong to any file.
>
> There are a couple of potential solutions I have been thinking of:
> 1. Use nested docs to preserve the logical grouping of file content and the
> file info where this content is coming from. This could potentially work
> but I haven't done any testing yet (I know highlighting doesn't work on
> nested docs for example)
>
> 2. Encode the file name in the file content fields themselves. The file
> name will be removed during indexing but will be stored. How do I get the
> file name included in each snippet fragment - this again needs exploring on
> my end
>
> Another approach I have been thinking is extending the StoredField to also
> store additional meta data information. So basically when a stored field is
> retrieved, or a fragment is returned, I also have additional information
> associated with the stored field. Can someone tell me this is a terrible
> idea and I should not be pursuing.
>
> Is there something else I can try?
>
> Thanks a lot,
> Srijan
>


Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Thomas Corthals
Hi

I've run into an issue with creating a Managed Stopwords list that has the
same name as a previously deleted list. Going through the same flow with
Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
something or did I discover a bug in Solr?

On a newly started solr with the techproducts core:

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

The second PUT request results in a status 500 with error
msg "java.util.LinkedHashMap cannot be cast to java.util.List".

Similar requests for synonyms work fine, no matter how many times I repeat
the CREATE/DELETE/RELOAD cycle:

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap

Reloading after creating the Stopwords list but not after deleting it works
without error too on a fresh techproducts core (you'll have to remove the
directory from disk and create the core again after running the previous
commands).

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
can be completed twice. (Again, on a freshly created techproducts core.)
Only the third attempt to create a list results in an error. Synonyms can
still be created and deleted repeatedly after this.

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

The same successes/errors occur when running each cycle against a different
core if the cores share the same configset.

Any ideas on what might be going wrong?


Metadata info on Stored Fields

2020-02-17 Thread Srijan
Hi,

I have a data model where the operational "Object" can have one or more
files attached. Indexing these objects in Solr means indexing all metadata
info and the contents of the files. For file contents what I have right now
is a single multi-valued field (for each locale)

Example:

xxx
yyy
 portion of file 1
 remaining portion of file 1
 portion of file 2
 contents from file 2 again...
...


Search is easy and everything's been working fine. We recently introduced
highlighting functionality on these file content fields. Again, a straightforward
use case. The next requirement is where things get a little tricky. We
want to be able to return the name of the file ( generalizing this - or
some other metadata info related to the file content field). If our data
model had a 1:1 relation between our operational object and the file it
contains, the file name would have been just another field on the main doc
but unfortunately that's not the case - each file content field could
belong to any file.

There are a couple of potential solutions I have been thinking of:
1. Use nested docs to preserve the logical grouping of file content and the
file info the content is coming from. This could potentially work, but I
haven't done any testing yet (I know highlighting doesn't work on nested
docs, for example). A rough indexing sketch of this option follows after
this list.

2. Encode the file name in the file content fields themselves. The file
name would be removed during analysis but kept in the stored value. How to
get the file name included in each snippet fragment is something I still
need to explore.
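
For what it's worth, here is a minimal SolrJ sketch of option 1. The core
name and the field names (file_name, file_content) are assumptions, and it
also assumes the schema supports nested documents (the _root_ field). It only
illustrates the nested-document grouping; whether highlighting can be made to
cooperate is exactly the open question above.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NestedFileIndexer {
    public static void main(String[] args) throws Exception {
        // Core name and field names are hypothetical, for illustration only.
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {

            SolrInputDocument parent = new SolrInputDocument();
            parent.addField("id", "obj-1");
            parent.addField("title", "xxx");

            // One child document per attached file keeps the file name grouped
            // with the content values that came from that file.
            SolrInputDocument file1 = new SolrInputDocument();
            file1.addField("id", "obj-1-file-1");
            file1.addField("file_name", "report.pdf");
            file1.addField("file_content", "portion of file 1");
            file1.addField("file_content", "remaining portion of file 1");
            parent.addChildDocument(file1);

            SolrInputDocument file2 = new SolrInputDocument();
            file2.addField("id", "obj-1-file-2");
            file2.addField("file_name", "notes.txt");
            file2.addField("file_content", "portion of file 2");
            parent.addChildDocument(file2);

            client.add(parent);
            client.commit();
        }
    }
}

Retrieving the matching file name would then presumably go through a block
join query plus the [child] doc transformer, which is the part that still
needs testing against highlighting.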

Another approach I have been thinking of is extending StoredField to also
store additional metadata information. So basically, when a stored field is
retrieved, or a fragment is returned, I would also have additional
information associated with the stored field. Can someone tell me if this is
a terrible idea that I should not be pursuing?

Is there something else I can try?

Thanks a lot,
Srijan


Re: A question about solr filter cache

2020-02-17 Thread Erick Erickson
That’s the upper limit of a filter cache entry (maxDoc/8 bytes). For low
numbers of hits, more space-efficient structures are used. Specifically, a
list of doc IDs is kept. So say you have an fq clause that marks 10 docs:
the filterCache entry is closer to 40 bytes + sizeof(query object) etc.

Still, it’s what you have to be prepared for.

filterCache is local to the core. So if you have 8 replicas they’d each have
128M docs or so, and each filterCache entry would be bounded by about
128M/8 bytes.
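
As a back-of-envelope sketch of those bounds, here is a tiny Java snippet
using the numbers in this thread (1 billion docs split across 8 cores). The
512-entry cache size is the common default and an assumption here, and the
~4 bytes per doc ID for small entries is likewise an illustrative assumption.

public class FilterCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 1_000_000_000L;            // total docs in this thread's example
        int cores = 8;                           // docs split across 8 cores

        long bitmapWholeIndex = maxDoc / 8;                 // worst-case entry, one core holding everything
        long bitmapPerCore    = (maxDoc / cores) / 8;       // worst-case entry per core (~128M docs each)
        long smallEntry       = 10 * 4;                     // fq matching 10 docs, kept as int doc IDs

        System.out.printf("bitmap entry, whole index: ~%d MB%n", bitmapWholeIndex / (1024 * 1024));
        System.out.printf("bitmap entry, per core:    ~%d MB%n", bitmapPerCore / (1024 * 1024));
        System.out.printf("10-doc entry:              ~%d bytes + query object overhead%n", smallEntry);
        // A full 512-entry cache of worst-case bitmap entries on one core:
        System.out.printf("512 full bitmaps per core: ~%d GB%n",
                512 * bitmapPerCore / (1024L * 1024 * 1024));
    }
}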

Checking the filterCache via the admin UI is a way to find current usage,
but be sure it’s full. The memory is allocated as needed, not up front.

All that said, you’re certainly right, the filterCache can lead to OOMs.
What I try to emphasize to people is that they cannot allocate huge
filterCaches without considering the memory implications...

Best,
Erick

> On Feb 17, 2020, at 4:51 AM, Vadim Ivanov  
> wrote:
> 
> You can easily check amount of RAM used by core filterCache in Admin UI:
> Choose core - Plugins/Stats - Cache - filterCache
> It shows useful information on configuration, statistics and current RAM
> usage by filter cache,
> as well as some examples of current filtercaches in RAM
> Core, for ex, with 10 mln docs uses 1.3 MB of Ram for every filterCache
> 
> 
>> -Original Message-
>> From: Hongxu Ma [mailto:inte...@outlook.com]
>> Sent: Monday, February 17, 2020 12:13 PM
>> To: solr-user@lucene.apache.org
>> Subject: A question about solr filter cache
>> 
>> Hi
>> I want to know the internal of solr filter cache, especially its memory usage.
>> 
>> I googled some pages:
>> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
>> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
>> (Erick Erickson's answer)
>> 
>> All of them said its structure is: fq => a bitmap (total doc number bits),
>> but I think it's not so simple, reason:
>> Given total doc number is 1 billion, each filter cache entry will use
>> nearly 1GB (10^9/8 bit), it's too big and very easy to make solr OOM
>> (I have a 1 billion doc cluster, looks like it works well)
>> 
>> And I also checked solr node, but cannot find the details (only saw using
>> DocSets structure)
>> 
>> So far, I guess:
>> 
>>  *   degenerate into a doc id array/list when the bitmap is sparse
>>  *   using some compressed bitmap, e.g. roaring bitmaps
>> 
>> which one is correct? or another answer, thank you very much!
> 
> 



RE: A question about solr filter cache

2020-02-17 Thread Vadim Ivanov
You can easily check the amount of RAM used by a core's filterCache in the Admin UI:
Choose core - Plugins/Stats - Cache - filterCache
It shows useful information on configuration, statistics and current RAM
usage by the filter cache, as well as some examples of current filterCaches in RAM.
A core with, for example, 10 mln docs uses 1.3 MB of RAM for every filterCache entry.


> -Original Message-
> From: Hongxu Ma [mailto:inte...@outlook.com]
> Sent: Monday, February 17, 2020 12:13 PM
> To: solr-user@lucene.apache.org
> Subject: A question about solr filter cache
> 
> Hi
> I want to know the internal of solr filter cache, especially its memory usage.
> 
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
> 
> All of them said its structure is: fq => a bitmap (total doc number bits),
> but I think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use
> nearly 1GB (10^9/8 bit), it's too big and very easy to make solr OOM
> (I have a 1 billion doc cluster, looks like it works well)
> 
> And I also checked solr node, but cannot find the details (only saw using
> DocSets structure)
> 
> So far, I guess:
> 
>   *   degenerate into a doc id array/list when the bitmap is sparse
>   *   using some compressed bitmap, e.g. roaring bitmaps
> 
> which one is correct? or another answer, thank you very much!




Re: A question about solr filter cache

2020-02-17 Thread Mikhail Khludnev
Hello,
The former; see
https://github.com/apache/lucene-solr/blob/188f620208012ba1d726b743c5934abf01988d57/solr/core/src/java/org/apache/solr/search/DocSetCollector.java#L84
More efficient sets (roaring and/or Elias-Fano, iirc) are present in Lucene,
but not yet being used in Solr.
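
To make "the former" concrete, here is a simplified, hypothetical Java sketch
of the idea behind DocSetCollector: matching doc IDs are buffered in a small
int array, and only when the count passes a threshold is the set upgraded to
a full bitmap. The real Solr code uses its own DocSet classes and a
maxDoc-based threshold, so treat this purely as an illustration.

import java.util.Arrays;
import java.util.BitSet;

// Two representations a cached filter can take: a plain array of doc IDs
// while the set is small, and a bitmap once it grows past a threshold.
// Not the actual Solr implementation.
public class SmallOrBitmapDocSet {
    private final int maxDoc;
    private final int threshold;       // switch point; maxDoc >> 6 is an assumed heuristic
    private int[] ids = new int[16];
    private int count = 0;
    private BitSet bits = null;        // lazily created on upgrade

    public SmallOrBitmapDocSet(int maxDoc) {
        this.maxDoc = maxDoc;
        this.threshold = Math.max(1, maxDoc >> 6);
    }

    public void collect(int docId) {
        if (bits != null) {
            bits.set(docId);
            return;
        }
        if (count == threshold) {       // upgrade: copy the collected ids into a bitmap
            bits = new BitSet(maxDoc);
            for (int i = 0; i < count; i++) bits.set(ids[i]);
            bits.set(docId);
            return;
        }
        if (count == ids.length) ids = Arrays.copyOf(ids, ids.length * 2);
        ids[count++] = docId;
    }

    // Rough memory cost, mirroring the numbers discussed in this thread.
    public long approxSizeInBytes() {
        return bits != null ? maxDoc / 8 : count * 4L;
    }
}

So in the sparse case the cost really is a few bytes per matching doc, which
is why a 1-billion-doc index does not automatically mean ~125 MB for every
cached filter.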

On Mon, Feb 17, 2020 at 1:13 AM Hongxu Ma  wrote:

> Hi
> I want to know the internal of solr filter cache, especially its memory
> usage.
>
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html
> (Erick Erickson's answer)
>
> All of them said its structure is: fq => a bitmap (total doc number bits),
> but I think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use
> nearly 1GB (10^9/8 bit), it's too big and very easy to make solr OOM
> (I have a 1 billion doc cluster, looks like it works well)
>
> And I also checked solr node, but cannot find the details (only saw using
> DocSets structure)
>
> So far, I guess:
>
>   *   degenerate into a doc id array/list when the bitmap is sparse
>   *   using some compressed bitmap, e.g. roaring bitmaps
>
> which one is correct? or another answer, thank you very much!
>
>

-- 
Sincerely yours
Mikhail Khludnev


Re: A question about solr filter cache

2020-02-17 Thread Nicolas Franck
If 1GB were enough to make Solr go out of memory via the filter query cache,
then it would have already happened during the initial upload of the
Solr documents. Imagine the amount of memory you need for one billion
documents...
A filter cache would be the least of your problems: 1GB is small in comparison
to the entire Solr index.

> On 17 Feb 2020, at 10:13, Hongxu Ma  wrote:
> 
> Hi
> I want to know the internal of solr filter cache, especially its memory usage.
> 
> I googled some pages:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html 
> (Erick Erickson's answer)
> 
> All of them said its structure is: fq => a bitmap (total doc number bits), 
> but I think it's not so simple, reason:
> Given total doc number is 1 billion, each filter cache entry will use nearly 
> 1GB (10^9/8 bit), it's too big and very easy to make solr OOM (I have a 
> 1 billion doc cluster, looks like it works well)
> 
> And I also checked solr node, but cannot find the details (only saw using 
> DocSets structure)
> 
> So far, I guess:
> 
>  *   degenerate into a doc id array/list when the bitmap is sparse
>  *   using some compressed bitmap, e.g. roaring bitmaps
> 
> which one is correct? or another answer, thank you very much!
> 



A question about solr filter cache

2020-02-17 Thread Hongxu Ma
Hi
I want to know the internal of solr filter cache, especially its memory usage.

I googled some pages:
https://teaspoon-consulting.com/articles/solr-cache-tuning.html
https://lucene.472066.n3.nabble.com/Solr-Filter-Cache-Size-td4120912.html 
(Erick Erickson's answer)

All of them said its structure is: fq => a bitmap (total doc number bits), but 
I think it's not so simple, reason:
Given total doc number is 1 billion, each filter cache entry will use nearly 
1GB (10^9/8 bit), it's too big and very easy to make solr OOM (I have a 1 
billion doc cluster, looks like it works well)

And I also checked solr node, but cannot find the details (only saw using 
DocSets structure)

So far, I guess:

  *   degenerate into a doc id array/list when the bitmap is sparse
  *   using some compressed bitmap, e.g. roaring bitmaps

which one is correct? or another answer, thank you very much!