Re: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Tim Casey
People usually want to do some analysis during index time.  This analysis
should be considered 'expensive' compared to any single query run.  You
can think of it as indexing spread over an 86,400-second day vs a 200 ms
query time.

Normally, you want to index as honestly as possible.  That is, you want to
take what you are given and put it in the index the way it comes.  You do
this with a particular analyzer.  This produces a token stream, which is
then indexed.  (Solr does things that are way more complicated now, like two
tokens at the same index position and so on.  But this is a simple model to
give a foundational explanation.)

On the query side you can try all kinds of crazy things to find what you
want.  You can build synonyms at this point and query for them all.  You
can stem words and query for the stems, and so on.  You can build distance
queries: two words nearish to each other.
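
As a rough sketch (the field name 'body' is made up here), a proximity query
and a hand-expanded synonym query look something like:

  q=body:"baby sitter"~3
  q=body:(babysitter OR "baby sitter")

None of that touches the index; you can change it on every query.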

If you produce more tokens at index time, you are increasing the overall
number of documents returned, and assuming a single set of documents is the
desired search result, this will result in lower precision.  You will not
always be able to find the thing you want in the fixed set of early query
results, and the only way to fix that is at index time, by reindexing.  It is
much easier to make this kind of adjustment at query time: instead of
stemming at index time, make the query more exact, hopefully increasing
precision.

This difference in cost means that, over the lifetime of a search universe,
systems tend towards more complex queries and less complex indexing.

I would recommend avoiding indexing tricks for this reason.  If they are
required, and I am sure they sometimes are, then you may want to segment the
index in such a way that you can trade over-generation against the required
recall.  So, segment the differences by field.  Put time tokens in a time
field, for instance, so you don't get people named 'June' while searching
for the month 'jun'.
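
A minimal sketch of that kind of segmentation (field names are invented, the
type is the stock text_general from the example schema):

  <field name="body"      type="text_general" indexed="true" stored="true"/>
  <field name="body_time" type="text_general" indexed="true" stored="false"/>

  q=body_time:jun        finds the month
  q=body:june            finds people named June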

tim



On Thu, Sep 10, 2020 at 10:08 AM Walter Underwood 
wrote:

> It is very common for us to do more processing in the index analysis
> chain. In general, we do that when we want additional terms in the index to
> be searchable. Some examples:
>
> * synonyms: If the book title is “EMT” add “Emergency Medical Technician”.
> * ngrams: For prefix matching, generate all edge ngrams, for example for
> “french” add “f”, “fr” “fre”, “fren”, and “frenc”.
> * shingles: Make pairs, so the query “babysitter” can match “baby sitter”.
> * split on delimiters: break up compounds, so “baby sitter” can match
> “baby-sitter”. Do this before shingles and you get matches for
> “babysitter”, “baby-sitter”, and “baby sitter”.
> * remove HTML: we rarely see HTML in queries, but we never know when
> someone will get clever with the source text, sigh.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Sep 10, 2020, at 9:48 AM, Erick Erickson 
> wrote:
> >
> > When you want to do something different at index and query time. There,
> an answer that’s almost, but not quite, completely useless while being
> accurate ;)
> >
> > A concrete example is synonyms as have been mentioned. Say you have an
> index-time synonym definition of
> > A,B,C
> >
> > These three tokens will be “stacked” in the index wherever any of them
> are found.
> > A query "q=field:B” would find a document with any of the three tokens
> in the original. It would be wasteful for the query to be transformed into
> “q=field:(A B C)”…
> >
> > And take a very close look at WordDelimiterGraphFilterFactory. I’m
> pretty sure you’ll find the parameters are different. Say the parameters
> for the input 123-456-7890 cause WDGFF to add
> > 123, 456, 7890, 1234567890 to the index. Again, at query time you don’t
> need to repeat and have all of those tokens in the search itself.
> >
> > Best,
> > Erick
> >
> >> On Sep 10, 2020, at 12:41 PM, Alexandre Rafalovitch 
> wrote:
> >>
> >> There are a lot of different use cases and the separate analyzers for
> >> indexing and query is part of the Solr power. For example, you could
> >> apply ngram during indexing time to generate multiple substrings. But
> >> you don't want to do that during the query, because otherwise you are
> >> matching on 'shared prefix' instead of on what user entered. Thinking
> >> phone number directory where people may enter any suffix and you want
> >> to match it.
> >> See for example
> >>
> https://www.slideshare.net/arafalov/rapid-solr-schema-development-phone-directory
> >> , starting slide 16 onwards.
> >>
> >> Or, for non-production but fun use case:
> >>
> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
> >> (search phonetically mapped Thai text in English).
> >>
> >> Similarly, you may want to apply synonyms at query time only if you
> >> want to avoid diluting some relevancy. Or at index time to normalize
> >> spelling and help relevancy.
> >>
> >> Or you may want to be doing some accent folding for sorting or
> >> faceting (which uses indexed 

Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
What I have done for this in the past is calculate the expected value of
a symbol within a universe, then calculate the difference between the
expected value and the actual value at the time you see the symbol.  Take the
difference and use the most surprising symbols, in rank order from most
surprising to least surprising, dropping lower-frequency/unique values.
This is a fairly length-independent way to get to interesting tokens.

Most calculations around stop words are very difficult to maintain and
handle.  You can have 7 English stop words easily enough.  Then you go to a
larger set, say 30ish, then another larger set, say 150.  The problem is that
as you remove stop words, you remove some meaning.  You will see an example of
this when you want to know the difference between 'a noun' and 'the noun'.
Now that we have covered English and chosen the optimal set of stop words
for a particular set of text, a new language comes around.  Eventually the
stop words become a contributing factor of error.  The other reason not to
use stop words is that a corpus is usually a form of golden egg.  You might be
able to reindex it, but the cost is usually not free.  It is generally
better to have an honest index and allow the post-analysis to change.  This
way you can change it 10 times a day and no one will care.

If you are interested in a word cloud I would suspect people have done a
reasonable job around this using a solr index already.

tim

On Fri, May 15, 2020 at 1:48 PM A Adel  wrote:

> Yes, significant terms have been calculated but they have the anomaly or
> relative shift nature rather than the high frequency, as suggested also by
> the blog post. So, it looks that adding a preprocessing step upstream in an
> additional field makes more sense in this case. The text is intrinsically
> not straightforward to parse (short free text) using mainstream NLP though.
>
> A.
>
> On Fri, May 15, 2020, 8:43 PM Walter Underwood 
> wrote:
>
> > Right. I might use NLP to pull out noun phrases and entities. Entities
> are
> > essential noun phrases with proper nouns.
> >
> > Put those in a separate field and build the word cloud from that.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> > >
> > > You may want something more like "significant terms" - terms
> > statistically
> > > significant in a document. Possibly not just based on doc freq
> > >
> > > https://saumitra.me/blog/solr-significant-terms/
> > >
> > > On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:
> > >
> > >> Hi Walter,
> > >>
> > >> Thank you for your explanation, I understand the point and agree with
> > you.
> > >> However, the use case at hand is building a word cloud based on
> faceting
> > >> the multilingual text field (very simple) which in case of not using
> > stop
> > >> words returns many generic terms, articles, etc. If stop words filter
> is
> > >> not used, is there any other/better technique to be used instead to
> > build a
> > >> meaningful word cloud?
> > >>
> > >>
> > >> On Fri, May 15, 2020, 5:20 PM Walter Underwood  >
> > >> wrote:
> > >>
> > >>> Just don’t use stop words. That will give much better relevance and
> > works
> > >>> for all languages.
> > >>>
> > >>> Stop words are an obsolete hack from the days of search engines
> running
> > >>> on 16 bit CPUs. They save space by throwing away important
> information.
> > >>>
> > >>> The classic example is “to be or not to be”, which is made up
> entirely
> > of
> > >>> stop words. Remove them and it is impossible to search for that
> phrase.
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> >  On May 14, 2020, at 10:47 PM, A Adel  wrote:
> > 
> >  Hi - Is there a way to configure stop words to be dynamic for each
> > >>> document
> >  based on the language detected of a multilingual text field?
> Combining
> > >>> all
> >  languages stop words in one set is a possibility however it
> introduces
> >  false positives for some language combinations, such as German and
> > >>> English.
> >  Thanks, A.
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > *Doug Turnbull **| CTO* | OpenSource Connections
> > > , LLC | 240.476.9983
> > > Author: Relevant Search ; Contributor:
> *AI
> > > Powered Search *
> > > This e-mail and all contents, including attachments, is considered to
> be
> > > Company Confidential unless explicitly stated otherwise, regardless
> > > of whether attachments are marked as such.
> >
> >
>


Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
You do not need stop words to do what you need to do.  For one thing, stop
words require segmentation on a phrase-by-phrase basis in some cases.
That is, especially in places like Europe, there is a lot of mixed
language. (Your mileage may vary :).

In order to do what you want, you really need to look at the statistical
value of all of the symbols in the universe of consideration.  Find the
relevant terms, throw out common terms and anything with a frequency below
5.  This is also language-independent, slang-independent, and fairly
medium-independent.  If you need a more refined space, you can build the
symbol space from bigrams.
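
If the end goal is a word cloud, the raw counts are easy to pull out of an
existing index with plain term faceting (the field name is just an example);
the frequency cutoff and the surprise ranking can then be applied on the
client side:

  q=*:*&rows=0&facet=true&facet.field=text_field&facet.limit=500&facet.mincount=5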

If I ever write a book the title is going to be "The The".  I hope it has
multi-lingual translations.  Although, at this point, it is a very short
book :/

tim

On Fri, May 15, 2020 at 11:43 AM Walter Underwood 
wrote:

> Right. I might use NLP to pull out noun phrases and entities. Entities are
> essential noun phrases with proper nouns.
>
> Put those in a separate field and build the word cloud from that.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > You may want something more like "significant terms" - terms
> statistically
> > significant in a document. Possibly not just based on doc freq
> >
> > https://saumitra.me/blog/solr-significant-terms/
> >
> > On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:
> >
> >> Hi Walter,
> >>
> >> Thank you for your explanation, I understand the point and agree with
> you.
> >> However, the use case at hand is building a word cloud based on faceting
> >> the multilingual text field (very simple) which in case of not using
> stop
> >> words returns many generic terms, articles, etc. If stop words filter is
> >> not used, is there any other/better technique to be used instead to
> build a
> >> meaningful word cloud?
> >>
> >>
> >> On Fri, May 15, 2020, 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Just don’t use stop words. That will give much better relevance and
> works
> >>> for all languages.
> >>>
> >>> Stop words are an obsolete hack from the days of search engines running
> >>> on 16 bit CPUs. They save space by throwing away important information.
> >>>
> >>> The classic example is “to be or not to be”, which is made up entirely
> of
> >>> stop words. Remove them and it is impossible to search for that phrase.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
>  On May 14, 2020, at 10:47 PM, A Adel  wrote:
> 
>  Hi - Is there a way to configure stop words to be dynamic for each
> >>> document
>  based on the language detected of a multilingual text field? Combining
> >>> all
>  languages stop words in one set is a possibility however it introduces
>  false positives for some language combinations, such as German and
> >>> English.
>  Thanks, A.
> >>>
> >>>
> >>
> >
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > , LLC | 240.476.9983
> > Author: Relevant Search ; Contributor: *AI
> > Powered Search *
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless
> > of whether attachments are marked as such.
>
>


Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Tim Casey
Walter,

When you do the query, what is the sort of the results?
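
For reference, a cursorMark walk is normally driven by a sort that ends on
the uniqueKey, so the cursor has a stable tiebreaker, roughly:

  q=*:*&rows=1000&sort=id asc&cursorMark=*

and each response's nextCursorMark is passed back as cursorMark on the
following request.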

tim

On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood 
wrote:

> I’ll back up a bit, since it is sort of an X/Y problem.
>
> I have an index with four shards and 17 million documents. I want to dump
> all the docs in JSON, label each one with a classifier, then load them back
> in with the labels. This is a one-time (or rare) bootstrap of the
> classified data. This will unblock testing and relevance work while we get
> the classifier hooked into the indexing pipeline.
>
> Because I’m dumping all the fields, we can’t rely on docValues.
>
> It is OK if it takes a few hours.
>
> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
> using cursorMark and distributed search. Median response time is 10 s. CPU
> usage is about 1%.
>
> It is all pretty grubby and it seems like there could be a better way.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 10, 2020, at 3:39 PM, Erick Erickson 
> wrote:
> >
> > Any field that’s unique per doc would do, but yeah, that’s usually an ID.
> >
> > Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
> firing
> > at individual replicas. Each replica should have multiple UUIDs that
> start with 0-f.
> >
> > Unless I misunderstand and you’re just firing off, say, 16 threads at
> the entire
> > collection rather than individual shards which would work too. But for
> individual
> > shards I think you need to look for all possible IDs...
> >
> > Erick
> >
> >> On Feb 10, 2020, at 5:37 PM, Walter Underwood 
> wrote:
> >>
> >>
> >>> On Feb 10, 2020, at 2:24 PM, Walter Underwood 
> wrote:
> >>>
> >>> Not sure if range queries work on a UUID field, ...
> >>
> >> A search for id:0* took 260 ms, so it looks like they work just fine.
> I’ll try separate queries for 0-f.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >
>
>


Re: Position search

2019-10-16 Thread Tim Casey
Adi,

If you are looking for something specific, you might want to try something
different.  Before you search 'the end of a document', you might
think about segmenting the document and searching specific segments.  At
the end of a lot of things like email there will be signatures.  Those are
fairly standard language: although mostly the same in meaning, they do differ
in specific wording.  They are a common segment.

If you are searching something like research papers, then you would be
thinking about the conclusion (?), bibliography (?).  It does not matter,
but there will be specific segments.

I think you will find the last N tokens of a document have some odd
categories within the search results.  I might guess you have a different
purpose in mind.  Either way, you would likely do better to segment what
you are searching.
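
A rough sketch of what I mean by segmenting (field names are invented): split
each document into segments before indexing, give each segment its own field,
and then point the query at the segment you care about:

  <field name="body"       type="text_general" indexed="true" stored="true"/>
  <field name="signature"  type="text_general" indexed="true" stored="true"/>
  <field name="conclusion" type="text_general" indexed="true" stored="true"/>

  q=conclusion:(whatever terms you are after)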

tim

On Mon, Oct 14, 2019 at 11:25 PM Kaminski, Adi 
wrote:

> Hi,
> What's the recommended way to search in Solr (assuming 8.2 is used) for
> specific terms/phrases/expressions while limiting the search from position
> perspective.
> For example to search only in the first/last 100 words of the document ?
>
> Is there any built-in functionality for that ?
>
> Thanks in advance,
> Adi
>
>
> This electronic message may contain proprietary and confidential
> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
> information is intended to be for the use of the individual(s) or
> entity(ies) named above. If you are not the intended recipient (or
> authorized to receive this e-mail for the intended recipient), you may not
> use, copy, disclose or distribute to anyone this message or any information
> contained in this message. If you have received this electronic message in
> error, please notify us by replying to this e-mail.
>


Re: Position search

2019-10-15 Thread Tim Casey
If this is about a normalized query, I would put the normalization text
into a specific field.  The reason for this is that you may want to search the
overall text during any form of expansion phase of searching for data.
That is, maybe you want to know the context up to the 120th word.  At
least you have both.
Also, you may want to note which normalized fields were truncated or were
simply too small. This would give some guidance as to the bias of the
normalization.  If 95% of the fields were not truncated, there is a chance
you are not doing a good job of normalizing because you have a set of
particularly short messages.  So I would expect a small set of side fields
recording this.  This would allow you to carry the measures along with the
data.
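
A small sketch of the kind of side fields I mean (names are made up):

  <field name="body"            type="text_general" indexed="true" stored="true"/>
  <field name="body_first100"   type="text_general" indexed="true" stored="false"/>
  <field name="body_truncated"  type="boolean"      indexed="true" stored="true"/>
  <field name="body_word_count" type="pint"         indexed="true" stored="true"/>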

tim

On Tue, Oct 15, 2019 at 12:19 PM Alexandre Rafalovitch 
wrote:

> Is the 100 words a hard boundary or a soft one?
>
> If it is a hard one (always 100 words), the easiest is probably copy
> field and in the (unstored) copy, trim off whatever you don't want to
> search. Possibly using regular expressions. Of course, "what's a word"
> is an important question here.
>
> Similarly, you could do that with Update Request Processors and
> clone/process field even before it hits the schema. Then you could
> store the extract for highlighting purposes.
>
> Regards,
>Alex.
>
> On Tue, 15 Oct 2019 at 02:25, Kaminski, Adi 
> wrote:
> >
> > Hi,
> > What's the recommended way to search in Solr (assuming 8.2 is used) for
> specific terms/phrases/expressions while limiting the search from position
> perspective.
> > For example to search only in the first/last 100 words of the document ?
> >
> > Is there any built-in functionality for that ?
> >
> > Thanks in advance,
> > Adi
> >
> >
> > This electronic message may contain proprietary and confidential
> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
> information is intended to be for the use of the individual(s) or
> entity(ies) named above. If you are not the intended recipient (or
> authorized to receive this e-mail for the intended recipient), you may not
> use, copy, disclose or distribute to anyone this message or any information
> contained in this message. If you have received this electronic message in
> error, please notify us by replying to this e-mail.
>


Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread Tim Casey
https://stackoverflow.com/questions/48348312/solr-7-how-to-do-full-text-search-w-geo-spatial-search
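
The short version of what that link covers, adapted to the field and points
in the quoted mail (untested against that schema, and assuming geodist()
already works for adminLatLon as in the single-point case): the sort belongs
at the top level of the request rather than inside the geofilt local params,
and "nearer of two points" can be expressed as a function.  Line breaks added
for readability:

  q=david
  &fq=_query_:"{!geofilt sfield=adminLatLon pt=33.0198431,-96.6988856 d=80}"
      OR _query_:"{!geofilt sfield=adminLatLon pt=50.2171726,8.265894 d=80}"
  &sort=min(geodist(adminLatLon,33.0198431,-96.6988856),geodist(adminLatLon,50.2171726,8.265894)) asc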


On Mon, Sep 30, 2019 at 10:31 AM Anushka Gupta <
anushka_gu...@external.mckinsey.com> wrote:

> Hi,
>
> I want to be able to filter on different cities and also sort the results
> based on geoproximity. But sorting doesn’t work:
>
>
> admin_directory_search_geolocation?q=david=({!geofilt+sfield=adminLatLon+pt=33.0198431,-96.6988856+d=80+sort=min(geodist(33.0198431,-96.6988856))})+OR+({!geofilt+sfield=adminLatLon+pt=50.2171726,8.265894+d=80+sort=min(geodist(50.2171726,8.265894))})
>
> Sorting works fine if I add ‘&’ in geofilt condition like :
> q=david={!geofilt=adminLatLon=33.0198431,-96.6988856=80=geodist(33.0198431,-96.6988856)}
>
> But when I combine the two FQs then sorting doesn’t work.
>
> Please help.
>
>
> Best regards,
> Anushka gupta
>
>
>
> From: David Smiley 
> Sent: Friday, September 13, 2019 10:29 PM
> To: Anushka Gupta 
> Subject: [EXT]Re: Need urgent help with Solr spatial search using
> SpatialRecursivePrefixTreeFieldType
>
> Hello,
>
> Please don't email me directly for public help.  CC is okay if you send it
> to solr-user@lucene.apache.org so
> that the Solr community can benefit from my answer or might even answer it.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Sep 11, 2019 at 11:27 AM Anushka Gupta <
> anushka_gu...@external.mckinsey.com> wrote:
> Hello David,
>
> I read a lot of articles of yours regarding Solr spatial search using
> SpatialRecursivePrefixTreeFieldType. But unfortunately it doesn’t work for
> me when I combine filter query with my keyword search.
>
> Solr Version used : Solr 7.1.0
>
> I have declared fields as :
>
>  class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
> maxDistErr="0.001"
> distErrPct="0.025"
> distanceUnits="kilometers"/>
>
>  stored="true"  multiValued="true" />
>
>
> Field values are populated like :
> adminLatLon: [50.2171726,8.265894]
>
> Query is :
>
> /solr/ac3_persons/admin_directory_search_location?q=Idstein=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
> My request handler is :
> admin_directory_search_location
>
> I get results if I do :
>
> /solr/ac3_persons/admin_directory_search_location?q=*:*=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
> But I do not get results when I add any keyword in q.
>
> I am stuck in this issue since last many days. Could you please help with
> the same.
>
>
> Thanks,
> Anushka Gupta
>
> ++
> This email is confidential and may be privileged. If you have received it
> in error, please notify us immediately and then delete it. Please do not
> copy it, disclose its contents or use it for any purpose.
> ++
>
>


Re: Encrypting Solr Index

2019-06-25 Thread Tim Casey
My two cents worth of comment,

For our local lucene indexes we use AES encryption.  We encrypt the blocks
on the way out, decrypt on the way in.
We are using a C version of lucene, not the java version.  But, I suspect
the same methodology could be applied.  This assumes the data at rest is
the attack vector for discovering what is in the inverted index, but it
allows the indexing/querying to be done in the clear.  This would allow
for stemming and the like.

If you have an attack vector in which the indexing/querying are not
trusted, then you have a whole different set of problems.

To do stemming, you need a homomorphic encryption scheme which would allow
per-character/byte queries.  This is a different type of attack vector than
the on-disk encryption.  To me, this implies the query system itself is
untrusted and you are indexing/querying encrypted content.  The first
"thing" people are going to try is to hash a token into a 256-bit value
which becomes the indexable token value.  This leads to the lack of
stemming mentioned in the comments above.  Depending on how keys are handled
and hashes are generated, you can run out of token space in the various
underlying lucene indexes because you have more than 2 million tokens.



On Tue, Jun 25, 2019 at 10:21 AM Ahuja, Sakshi  wrote:

> I am actually looking for the best option so currently doing research on
> it.
> For Window's FS encryption I didn't find a way to use different
> Username/Password. It by default takes window's username/password to
> encrypt and decrypt.
>
> I tried bitlocker too for creating encrypted virtual directory (Which
> allows me to use different credentials) and to keep Solr Index in that but
> somehow Solr Admin was unable to access Index from that encrypted
> directory. Not sure how that is working.
>
> If you have any idea on that - it will work for me. Thanks!
>
> -Original Message-
> From: Jörn Franke 
> Sent: Tuesday, June 25, 2019 12:47 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Encrypting Solr Index
>
> Why does FS encryption does not serve your use case?
>
> Can’t you apply it also for backups etc?
>
> > Am 25.06.2019 um 17:32 schrieb Ahuja, Sakshi :
> >
> > Hi,
> >
> > I am using solr 6.6 and want to encrypt index for security reasons. I
> have tried Windows FS encryption option that works but want to know if solr
> has some inbuilt feature to encrypt index or any good way to encrypt solr
> index?
> >
> > Thanks,
> > Sakshi
>


Re: Solr query with long query

2019-05-30 Thread Tim Casey
Venkat,

There is another way to do this.  If you have a category of "thing" you are
attempting to filter over, then you create a query and tag the documents
with this category.  So, create a 'categories' field and append 'thing' to
the field, updating the field if need be.  (Be wary of over-generation if
one of the categories turns out to be 'thin'.)

Then in the filter query you can query over a category, or simply require a
category:thing to be in the query.
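
A rough sketch with an invented 'categories' field: tag at index time (an
atomic update works if the documents already exist), then filter on the tag:

  [ {"id": "some-doc-id", "categories": {"add": "thing"}} ]

  fq=categories:thing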

tim


On Thu, May 30, 2019 at 3:33 PM Shawn Heisey  wrote:

> On 5/30/2019 4:13 PM, Venkateswarlu Bommineni wrote:
> > Thank you guys for quick response.
> >
> > I was able to query solr by sending 1500 products using solrJ with http
> > post method.
> >
> > But I had to change maxBooleanClauses to 4096 from default 1024.
> >
> > But I wanted to check with you guys that, will there be any performance
> > issues by changing maxBooleanClauses to 4096.
> >   Please help me here.
>
> Lucene sets the limit to 1024 basically to act as a short-circuit to
> protect the system against bugs in user code causing the query engine to
> explode by sending thousands of clauses when that wasn't what was intended.
>
> Using a large number of clauses *can* cause performance issues.  But if
> you need to use them, you need to use them, and in that case you'll need
> to increase the limit, and just take the performance hit.
>
> The terms query parser will generally run faster than a large number of
> boolean OR clauses, and return the same results.  But if your query is
> not just OR (also known as SHOULD) clauses, you probably won't be able
> to use the terms parser.
>
>
> https://lucene.apache.org/solr/guide/7_7/other-parsers.html#terms-query-parser
>
> Thanks,
> Shawn
>


Re: Help with multi-lang searches

2018-10-22 Thread Tim Casey
Hi Sambhav,

Calculate the percentage of letter pairs per language in the index.
Given the letter pairs in the incoming token, find the closest "match" for
the languages in the indexes.

Even on a small number of tokens you will get close to the intended
language.  You can also calculate the "source language model" in an
index-neutral way, say from a known corpus of language-specific tokens +
frequency.

Generally this is a tricky thing to do.  Any kind of recall/precision
trade-off requires measuring the results for the given data.  It is hard to ask
for general advice.  Sometimes the language segmentation is not done on a
document (index term here) basis.  But the incoming data is segmented by
something like a paragraph or sentence.  So, there is that as well.

I would expect this to be done where the source document is stored raw.
Then, alongside the document is a set of probable languages.  From there,
you can pivot the results based on the user expectations.

tim

On Mon, Oct 22, 2018 at 11:18 AM Alexandre Rafalovitch 
wrote:

> Additional possibilities:
> 1) omitNorms and maybe omitTermFreqAndPositions for the fields to
> avoid frequency of term mattering
>
> http://lucene.apache.org/solr/guide/7_5/defining-fields.html#optional-field-type-override-properties
> 2) Constant score:
>
> http://lucene.apache.org/solr/guide/7_5/the-standard-query-parser.html#constant-score-with
> 3) If your languages are ranked (English first, Italian after), you
> can boost English field
> 4) https://www.manning.com/books/relevant-search may have some ideas.
> The examples use ES, but also has Solr discussion and Solr has some
> additional capabilities now to match (e.g. eDisMax sow parameter).
>
> Hope it helps,
>Alex.
>
>
>
> On Mon, 22 Oct 2018 at 11:56, Sambhav Kothari (BLOOMBERG/ LONDON)
>  wrote:
> >
> > Hi,
> >
> > We have a problem with searches with multiple languages.
> > Our schema looks something like this:
> >
> > 
> > field_en = English content for field
> >
> > field_es = Spanish
> >
> > field_it = Italian
> >
> > etc.
> > 
> >
> > When a user searches for a keyword, e.g.:
> >
> > "brexit" it can also specify several languages s/he wants to see in the
> response, and the query will be performed on all the fields requested.
> >
> > The issue is that for 'brexit' Italian results are boosted more because
> something like "Brexit" is unlikely to occur in the Italian language and
> the idf shoots up causing less relevant but Italian docs to rank higher
> than the English ones.
> >
> > Is there some way to deal with this problem ?
> >
> > The current solutions we can think of:
> >
> > 1. Create a catchall copyfield and use that to score the docs. (But this
> creates problems when a word is present in another language (for eg
> English) and not in the resulting document language (Italian) (we will have
> to pay also extra disk space of the copyfield and also problems with
> analysis for multiple languages)
> > 2. Create a new scorer called "IDFGroupScorer" wrapping multiple fields
> and computing a aggregate idf (by averaging or computing the min/max)
> across the fields in the group.
> >
> > Any thoughts on any other solutions or any suggestions on how we could
> possibly implement the IDFGroupScorer?
> >
> > Thanks,
> >
> > Sambhav
> >
>


Re: solr crypto mining hack...

2018-08-25 Thread Tim Casey
I am not sure how solr is exactly set up currently, much less on any
specific system.  But, for operations which are largely reading, *maybe*
like a query, you might be able to run on a read-only partition.

A firewall is a lot less work and a good start, like 90% of the problem.

To do this, you bring up the system with two partitions, one read-only and
one read-write.  You chroot into the read-only partition and start the query
server.  This process would only be allowed to run queries and would only
be read-only.  The indexing process, if exposed to the world, would have to
have a firewall in front of it with whitelisting of various parts of the
world.  (Preferably with an ssh-enabled exchange, but security is hard,
let's go shopping.)

This is complicated to set up.  If I recall, we had to build up the used
parts of the OS as a sub-mount and then run there.  However, once it is
mounted read-only, any subprocess in that root would not be allowed to
write.  As a simple example, this type of change requires network logging
and then a whole lot of qualification to get to useful production.

On Sat, Aug 25, 2018 at 7:10 PM Shawn Heisey  wrote:

> On 8/25/2018 12:59 PM, humanitarian wrote:
> > I am struggling to fight an attack where the solr user is being used to
> > create files used for mining cryptocurrencies. The files are being
> > created in the /var/tmp and /tmp folders.
> >
> > It will use 100% of the CPU.
> >
> > I am looking for help in stopping these attacks.
> >
> > All files are created under the solr user.
>
> At least some of what I'm writing is a repeat of what was said in
> SOLR-12700 -- an issue in Jira with a description that's extremely
> similar to the subject of this message.
>
> The Solr server should never be exposed to untrusted parties, especially
> the open Internet.  This is probably our number one recommendation for
> security.  If an attacker cannot reach a server, they cannot compromise it.
>
> There are a lot of possible vectors in Solr that could have been used to
> compromise the system.  Most of the vulnerabilities that have been found
> are in third-party dependencies that Solr utilizes to create certain
> functionality.
>
> This is not the first time I've encountered this.  On at least one other
> occasion, a user found weird software on their system running as the
> solr user.  It turned out to be a crypto-mining program.
>
> If you have Solr logs from when your system was compromised, we can
> check them to see if there's anything useful. There may not be anything
> useful.   One of the better logs for tracking this sort of thing is the
> Jetty request log, but this log is not enabled by default in the Solr
> download.  This log will be the only way you can get the IP address
> making requests.
>
> Lock down your Solr server(s) so that only trusted network addresses can
> reach them.  This will need to be done outside of Solr.  The operating
> system will have a firewall available, and your network equipment might
> also have filtering capability.
>
> Thanks,
> Shawn
>
>


Re: Exact Phrase search not returning results.

2018-07-20 Thread Tim Casey
Deepti,

I am going to guess the analyzer part of the .net application is cutting
off the last token.
If you try the queries on the console of the running solr cluster, what do
you get?  If you dump that specific field for all the docs, can you find it
with grep?

tim


On Fri, Jul 20, 2018 at 10:56 AM Krishnan, Deepti (NIH/OD) [C] <
deepti.krish...@nih.gov> wrote:

> Hi,
>
>
>
> We are working on a .net application using Solr. When we initially
> launched the site we were using the 5.5.3 version and last sprint we
> updated it to the 7.3.1 version. Everything is working fine as expected
> except for one feature.
>
>
>
> The exact phrase search does not return any value for some search
> criteria, and this used to work fine with the older version. Based on our
> research, those search terms with stop words and more than one word
> following it are not working.
>
>
>
> The field has been defined as a text_general type in the schema and below
> are the tokenizers and filters used during indexing and querying.
>
>
>
>
>
> Eg.
>
>
>
>- “PROMOTING SCHOOL READINESS AMONG LOW-INCOME FAMILIES” – This works.
>No stop wods
>- “national institutes of health” – This works as well. Notice that
>there is a stop word (of) but only one word following it
>- “Structure of choroid plexus” – Does not work. Notice there are more
>than 2 words following the stop word(of)
>- "Health and Human Services" – This doesn’t work but “Health and
>Human” works.
>
>
>
> Please let me know if there is something I am missing and if something is
> unclear or you need to reach out to me to discuss further.
>
>
>
> Thanks,
>
> Deepti
>
>


Re: Zookeeper 3.4.12 with Solr 6.6.2?

2018-05-22 Thread Tim Casey
We have 3.4.10 and have *tested* 6.6.2 at a functional level.  So far it
works. We have not done any stress/load testing, but would have to do this
prior to release.

On Tue, May 22, 2018 at 9:44 AM, Walter Underwood 
wrote:

> Is anybody running Zookeeper 3.4.12 with Solr 6.6.2? Is that a recommended
> combination? Not recommended?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Date Query Confusion

2018-05-17 Thread Tim Casey
A simple date range query does not really represent how people query over
times and dates.  If you want any form of date query beyond a single
range, then a special field allowing tokenized queries will be the only way
to find documents.

A query for 'every tuesday in november of 2017' would have to be written as
an OR clause over a set of date ranges.  A tokenized date field would just
have to query for "+nov +tues +2017".  How you choose to tokenize a date
into a field will determine the types of queries you can run over the data.
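
To make the first option concrete, 'every Tuesday in November of 2017' as an
OR clause over date ranges (assuming a field like creation_date; all on one
line in practice) would be roughly:

  fq=creation_date:([2017-11-07T00:00:00Z TO 2017-11-08T00:00:00Z}
      OR [2017-11-14T00:00:00Z TO 2017-11-15T00:00:00Z}
      OR [2017-11-21T00:00:00Z TO 2017-11-22T00:00:00Z}
      OR [2017-11-28T00:00:00Z TO 2017-11-29T00:00:00Z})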

Another part of this: querying for a date range when the source material
has date ranges built into it is kind of odd.  But it occurs.  If you query
from noon to 1p, does that include meeting notes which started at 11:30a but
went for an hour?  You have to choose what to do.
tim

On Thu, May 17, 2018 at 6:11 AM, Terry Steichen  wrote:

> To me, one of the more frustrating things I've encountered in Solr is
> working with date fields.  Supposedly, according to the documentation,
> this is straightforward.  But in my experience, it is anything but
> that.  In particular, I've found that the abbreviated forms of date
> queries, don't work as described.
>
> If I create a query like creation_date: [2016-10-01 To 2016-11-01], it
> will produce a set of documents produced in the month of November 2016.
> That's the good news.
>
> But, the abbreviated date queries (described in Solr documentation
> )
> don't work.  Tried creation_date: 2016-11.  That's supposed to match
> documents with any November 2016 date.  But actually produces:
> |"Invalid Date String:'2016-11'|
>
> And Solr doesn't seem to let me sort on a date field.  Tried
> creation_date asc  Produced: "can not sort on multivalued field:
> creation_date"
>
> In the AdminUI, if you go to the schema option for my collection, and
> examine creation_date it show it to be:
> org.apache.solr.schema.TrieDateField  (This was automatically chosen by
> the managed-schema)
>
> In that same AdminUI display, if I click "Load Term Info" I get a list
> of dates, but when I click on one, it transforms it into a different
> query form: {!term f=creation_date}2016-10-26T07:59:09.824Z  But this
> query still produces 0 hits (even though the listing says it should
> produce dozens of hits).
>
> I imagine that I'm missing something basic here.  But I have no idea
> what.  Any thoughts would be MOST welcome.
>
> PS: I'm using Solr 6.6.0.
>


Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Tim Casey
For small documents, TFIDFSimilarity will weight towards the shorter ones.
Another way to say this: if your documents are 5-10 terms, the 5-term titles
are going to win.
You might think about having a per-token, or per-token-pair, weight.  I would
be surprised if there was not something similar out there.  This is a common
issue with any short text.
I guess I would think of this as TFICF, where the CF is the corpus
frequency.  You also might want to weight inversely proportionally to the age
of the title; older is less important.  This is assuming people are doing
searches within some time cluster; newer is more likely.
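
For the age part, one common trick (assuming a date field such as
release_date and an edismax handler; the constant is roughly one over a year
in milliseconds) is a multiplicative recency boost like:

  boost=recip(ms(NOW/HOUR,release_date),3.16e-11,1,1)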

For some obvious advice, things you probably already know: this kind of
search needs some hard measurement before you know how to tune it.  You
need to find a reasonable annotated representation.  So, take the
previous month's searches where there is a chain of successive searches.  If
you weighted things differently, would you shorten the length of the chain?
Can you get the click-throughs to happen sooner?

Anyway, just my 2 cents


On Wed, Jan 31, 2018 at 6:38 PM, Sravan Kumar  wrote:

>
> @Walter: We have 6 fields declared in schema.xml for title each with
> different type of analyzer. One without processing symbols, other stemmed
> and other removing  symbols, etc. So, if we have separate fields for each
> alias it will be that many times the number of final fields declared in
> schema.xml. And we exactly do not know what is the maximum number of
> aliases a movie can have.
> @Walter: I will try this but isn’t there any other way  where I can tweak ?
>
> @eric: will try this. But it will work only for exact matches.
>
>
> > On Jan 31, 2018, at 10:39 PM, Erick Erickson 
> wrote:
> >
> > Or use a boost for the phrase, something like
> > "beauty and the beast"^5
> >
> >> On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood <
> wun...@wunderwood.org> wrote:
> >> You can use a separate field for title aliases. That is what I did for
> Netflix search.
> >>
> >> Why disable idf? Disabling tf for titles can be a good idea, for
> example the movie “New York, New York” is not twice as much about New York
> as some other film that just lists it once.
> >>
> >> Also, consider using a popularity score as a boost.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Jan 31, 2018, at 4:38 AM, Sravan Kumar  wrote:
> >>>
> >>> Hi,
> >>> We are using solr for our movie title search.
> >>>
> >>>
> >>> As it is "title search", this should be treated different than the
> normal
> >>> document search.
> >>> Hence, we use a modified version of TFIDFSimilarity with the following
> >>> changes.
> >>> -  disabled TF & IDF and will only have 1 as value.
> >>> -  disabled norms by specifying omitNorms as true for all the fields.
> >>>
> >>> There are 6 fields with different analyzers and we make use of
> different
> >>> weights in edismax's qf & pf parameters to match tokens & boost
> phrases.
> >>>
> >>> But, movies could have aliases and have multiple titles. So, we made
> the
> >>> fields multivalued.
> >>>
> >>> Now, consider the following four documents
> >>> 1>  "Beauty and the Beast"
> >>> 2>  "The Real Beauty and the Beast"
> >>> 3>  "Beauty and the Beast", "La bella y la bestia"
> >>> 4>  "Beauty and the Beast"
> >>>
> >>> Note: Document 3 has two titles in it.
> >>>
> >>> So, for a query "Beauty and the Beast" and with the above
> configuration all
> >>> the documents receive same score. But 1,3,4 should have got same score
> and
> >>> document 2 lesser than others.
> >>>
> >>> To solve this, we followed what is suggested in the following thread:
> >>> http://lucene.472066.n3.nabble.com/Influencing-scores-
> on-values-in-multiValue-fields-td1791651.html
> >>>
> >>> Now, the fields which are used to boost are made to use Norms. And for
> >>> matching norms are disabled. This is to make sure that exact & near
> exact
> >>> matches are rewarded.
> >>>
> >>> But, for the same query, we get the following results.
> >>> query: "Beauty & the Beast"
> >>> Search Results:
> >>> 1>  "Beauty and the Beast"
> >>> 4>  "Beauty and the Beast"
> >>> 2>  "The Real Beauty and the Beast"
> >>> 3>  "Beauty and the Beast", "La bella y la bestia"
> >>>
> >>> Clearly, the changes have solved only a part of the problem. The
> document 3
> >>> should be ranked/scored higher than document 2.
> >>>
> >>> This is because lucene considers the total field length across all the
> >>> values in a multivalued field for normalization.
> >>>
> >>> How do we handle this scenario and make sure that in multivalued
> fields the
> >>> normalization is taken care of?
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Sravan
> >>
>


Re: Howto search for § character

2017-12-07 Thread Tim Casey
At my last company we ended up writing a custom analyzer to handle
punctuation.  But this was for Lucene 2 or 3.  That analyzer was carried
forward as we upgraded and was used for all human-derived text.

Although now there are way better analyzers and way better ways to hook
them up, as noted above by Erick, we really cared about how this was done
and all of the work put into the analyzer paid off.

I would expect there to be an analyzer which would maintain punctuation
tokens for search.  One of the issues which comes up is whether you want
multiple runs of punctuation to be a single token or separate tokens.  So
what happens to "§!" or "§?" or "?§", and in the case of things like
text/email what happens to "§".
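
A minimal sketch of a field type that keeps punctuation such as '§' attached
to the token (whitespace-only tokenization plus lowercasing); whether that
trade-off is acceptable depends on the rest of the content:

  <fieldType name="text_keep_punct" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

A field of this type keeps "§45" as a single token, so queries like
field:§45 or the prefix form field:§* have something to match.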

In any event, my 2 pence worth

tim

On Thu, Dec 7, 2017 at 10:00 AM, Shawn Heisey  wrote:

> On 12/7/2017 9:37 AM, Bernd Schmidt wrote:
> > Indeed, I saw in the analysis tab of the solr admin that the § char will
> be removed when using type text_general.
> > But in this use case we want to make a full text search like
> "_text_:§45" or "_text_:§*" to find words starting with §.
> > We need a text field here, not a string field!
> > What is your recommended way to deal with it?
> > Is it possible to remove the word break behaviour for the  § char?
> > Or is the best way to encode all § chars when indexing and searching?
>
> This character is classified by Unicode as punctuation:
>
> http://www.fileformat.info/info/unicode/char/00a7/index.htm
>
> Almost any example field type for full-text search that you're likely to
> encounter is going to be designed to split on punctuation and remove it
> from the token stream.  That's one of the most common things that
> full-text search engines do.
>
> You're going to need to design a new analysis chain that *doesn't* do
> this, apply the fieldType containing that analysis to your field,
> restart/reload, and reindex.
>
> Designing analysis chains is an art form, and tends to be one of the
> hardest parts of setting up a production Solr install.  It took me at
> least a month of almost constant work to settle on the schema design for
> the indexes that I maintain.  All of the "solr.TextField" types in my
> schema are completely custom -- none of the analysis chains in Solr
> examples are in that schema.
>
> Thanks,
> Shawn
>
>


Re: Java profiler?

2017-12-06 Thread Tim Casey
I really like Profiler.  It takes a little bit of set up, but it works.

tim

On Wed, Dec 6, 2017 at 2:04 AM, Peter Sturge  wrote:

> Hi,
> We've been using JProfiler (www.ej-technologies.com) for years now.
> Without a doubt, the most comprehensive and useful profiler for java.
> Works very well, supports remote profiling and includes some very neat heap
> walking/gc profiling.
> Peter
>
>
> On Tue, Dec 5, 2017 at 3:21 PM, Walter Underwood 
> wrote:
>
> > Anybody have a favorite profiler to use with Solr? I’ve been asked to look
> > at why our queries are slow on a detail level.
> >
> > Personally, I think they are slow because they are so long, up to 40
> terms.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >
>


Re: Solr query help

2017-08-18 Thread Tim Casey
You can add a ~3 to the phrase to allow the order to be reversed, but you
will get extra hits.  Maybe it is a ~4; I can never remember on phrases and
reversals.  I usually just try it.

Alternatively, you can create a custom query field for what you need from
dates.  For example, if you want to search by queries like "fourth
tuesday", you need to have 'tuesday' in the query, and it is better to have
" 4 tuesday " as part of the field.

Instead of a phrase query, you do +2017 +(04 03) +(01 02 03 04 05 06 07 08
09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31),
which does all the days in March and April.  A more complicated nested query
would do more complicated date ranges.

I don't know if there is a way to get repeating date range queries, like
the fourth tuesday for all months in a year.  The date support is usually
about querying a specified range at a time.

tim

On Fri, Aug 18, 2017 at 11:19 AM, Webster Homer 
wrote:

> What field types are you using for your dates?
> Have a look at:
> https://cwiki.apache.org/confluence/display/solr/Working+with+Dates
>
> On Thu, Aug 17, 2017 at 10:08 AM, Nawab Zada Asad Iqbal 
> wrote:
>
> > Hi Krishna
> >
> > I haven't used date range queries myself. But if Solr only supports a
> > particular date format, you can write a thin client for queries, which
> will
> > convert the date to solr's format and query solr.
> >
> > Nawab
> >
> > On Thu, Aug 17, 2017 at 7:36 AM, chiru s  wrote:
> >
> > > Hello guys
> > >
> > > I am working on Apache solr and I am stuck with a use case.
> > >
> > >
> > > The input data will be in the documents like 2017/03/15 in 1st
> document,
> > >
> > > 2017/04/15 in 2nd doc,
> > >
> > > 2017/05/15 in 3rd doc,
> > >
> > > 2017/06/15 in 4th doc so on
> > >
> > > But while fetching the data it should fetch like 03/15/2017 for the
> first
> > > doc and so on.
> > >
> > > My requirement is like this ..
> > >
> > >
> > > The data is like above and when I do an fq with name:[2017/03/15 TO
> > > 2017/05/15] it fetches me the 1st three documents.. but the need the
> data
> > > as 03/15/2017 instead of 2017/03/15.
> > >
> > >
> > > I tried solr.pattetnReplaceCharFilterFactory but it doesn't seem
> > working..
> > >
> > > Can you please help on the above.
> > >
> > >
> > > Thanks in advance
> > >
> > >
> > > Krishna...
> > >
> >
>
> --
>
>
> This message and any attachment are confidential and may be privileged or
> otherwise protected from disclosure. If you are not the intended recipient,
> you must not copy this message or attachment or disclose the contents to
> any other person. If you have received this transmission in error, please
> notify the sender immediately and delete the message and any attachment
> from your system. Merck KGaA, Darmstadt, Germany and any of its
> subsidiaries do not accept liability for any omissions or errors in this
> message which may arise as a result of E-Mail-transmission or for damages
> resulting from any unauthorized changes of the content of this message and
> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> subsidiaries do not guarantee that this message is free of viruses and does
> not accept liability for any damages caused by any virus transmitted
> therewith.
>
> Click http://www.emdgroup.com/disclaimer to access the German, French,
> Spanish and Portuguese versions of this disclaimer.
>


Re: Arabic words search in solr

2017-08-02 Thread Tim Casey
There should be a way to use a phrasal query for the specific names.
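
For example, requiring the phrase (or at least making both words mandatory)
instead of relying on the default OR, against the field from the quoted mail:

  q=bizNameAr:"شرطة ازكي"
  q=bizNameAr:(+شرطة +ازكي)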

On Wed, Aug 2, 2017 at 2:15 PM, Phil Scadden  wrote:

> Hopefully changing to default AND solves your problem. If so, I would be
> quite interested in what your index config looks like in the end. I also
> have upcoming need to index Arabic words.
>
> -Original Message-
> From: mohanmca01 [mailto:mohanmc...@gmail.com]
> Sent: Thursday, 3 August 2017 12:58 a.m.
> To: solr-user@lucene.apache.org
> Subject: RE: Arabic words search in solr
>
> Hi Phil Scadden,
>
>  Thank you for your reply,
>
> we tried your suggested solution by removing hyphen while indexing, but it
> was getting wrong results. i was searching for "شرطة ازكي" and it was
> showing me the result that am looking for, plus irrelevant result which
> either have the first or second word that i have typed while searching.
>
> First word: شرطة
> Second Word: ازكي
>
> results that we are getting:
>
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 3,
> "params": {
>   "indent": "true",
>   "q": "bizNameAr:(شرطة ازكي)",
>   "_": "1501678260335",
>   "wt": "json"
> }
>   },
>   "response": {
> "numFound": 444,
> "start": 0,
> "docs": [
>   {
> "id": "28107",
> "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -
> - مركز شرطة إزكي",
> "_version_": 1574621132849414100
>   },
>   {
> "id": "13937",
> "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
> "_version_": 157462113219720
>   },
>   {
> "id": "15914",
> "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
> "_version_": 1574621132344000500
>   },
>   {
> "id": "20639",
> "bizNameAr": "سحائب ازكي للتجارة",
> "_version_": 1574621132574687200
>   },
>   {
> "id": "25108",
> "bizNameAr": "المستشفيات -  - مستشفى إزكي",
> "_version_": 1574621132737216500
>   },
>   {
> "id": "27629",
> "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
> "_version_": 1574621132833685500
>   },
>   {
> "id": "36351",
> "bizNameAr": "طوارئ الكهرباء - إزكي",
> "_version_": 157462113318391
>   },
>   {
> "id": "61235",
> "bizNameAr": "اضواء ازكي للتجارة",
> "_version_": 1574621133785792500
>   },
>   {
> "id": "66821",
> "bizNameAr": "أطلال إزكي للتجارة",
> "_version_": 1574621133915816000
>   },
>   {
> "id": "67011",
> "bizNameAr": "بنك ظفار - فرع ازكي",
> "_version_": 1574621133920010200
>   }
> ]
>   }
> }
>
> Actually we were expecting only the below result, since it has both of the
> words that we typed while searching:
>
>   {
> "id": "28107",
> "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -
> - مركز شرطة إزكي",
> "_version_": 1574621132849414100
>   },
>
>
> Configuration:
>
> In schema.xml we configured as below:
>
> 
>
>
>  positionIncrementGap="100">
>   
> 
>  words="lang/stopwords_ar.txt" />
> 
> 
> 
> 
>  replacement="ئ"/>
>  replacement=""/>
>   
> 
>
>
> Thanks,
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> Notice: This email and any attachments are confidential and may not be
> used, published or redistributed without the prior written consent of the
> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
> received in error please destroy and immediately notify GNS Science. Do not
> copy or disclose the contents.
>


Re: Spatial Search based on the amount of docs, not the distance

2017-06-22 Thread Tim Casey
deniz,

I was going to add something here.  The reason what you want is probably
hard to do is that you are asking solr, which stores documents, to
return documents using an attribute of document pairs.  As a thought
exercise only, if you stored record pairs as a single document, you could
probably query it directly.  That is, if you have d1 and d2 and you are
querying around d1 and ordering by distance, then you could get this
directly from a document representing the record pair.  I don't think this is
practical, because it is an n^2 store.

Since the n^2 problem is there, people are going to suggest some heuristic
which avoids this problem.  What Erick is suggesting is down this path.
Query around a point and sort by distance taking the top K results.  The
result is taking a linear slice of the n^2 distance attribute.
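
In Solr terms that linear slice is just the top K of a distance sort,
something like (field and point values are placeholders):

  q=*:*&sfield=coords&pt=45.15,-93.85&d=5&fq={!geofilt}&sort=geodist() asc&rows=100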

tim



On Wed, Jun 21, 2017 at 7:50 PM, Erick Erickson 
wrote:

> Would it serve to sort by distance? True, if you matched a zillion
> documents within a 1km radius you'd still perform the distance calcs, but
> the result would be a manageable number.
>
> I have to ask "Why to you care?". Is this an efficiency question (i.e. you
> want to keep Solr from having to do expensive work) or is it a question of
> having to get hits at all? It's at least possible that the solution for one
> is not the solution for the other.
>
> Best,
> Erick
>
> On Wed, Jun 21, 2017 at 5:32 PM, deniz  wrote:
>
> > it is for sure possible to use d value for limiting the distance,
> however,
> > it
> > might not be very efficient, as some of the coords may not have any docs
> > around for a large value of d... so it is hard to determine a default
> value
> > for d.
> >
> > though it sounds like havinga default d and gradual increments on its
> value
> > might be a work around for top K results...
> >
> >
> >
> >
> >
> > -
> > Zeki ama calismiyor... Calissa yapar...
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/Spatial-Search-based-on-the-amount-of-docs-not-the-distance-
> > tp4342108p4342258.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: model building

2017-03-21 Thread Tim Casey
Joe,

To do this correctly, soundly, you will need to sample the data and mark
them as threatening or neutral.  You can probably expand on this quite a
bit, but that would be a good start.  You can then draw another set of
samples and see how you did.  You use one to train and one to validate.

What you are doing is probably just noise, from a model point of view, and
it will probably not make too much difference how you index/query/model
through the noise.

I don't mean this critically, just plainly.  Effectively, the less
mathematically correct the process, the more anecdotal the result.

tim


On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein  wrote:

> I've only tested with the training data in it's own collection, but it was
> designed for multiple training sets in the same collection.
>
> I suspect you're training set is too small to get a reliable model from.
> The training sets we tested with were considerably larger.
>
> All the idfs_ds values being the same seems odd though. The idfs_ds in
> particular were designed to be accurate when there are multiple training
> sets in the same collection.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
> > If I put the training data into its own collection and use q="*:*", then
> > it works correctly.  Is that a requirement?
> > Thank you.
> >
> > -Joe
> >
> >
> >
> > On 3/20/2017 3:47 PM, Joe Obernberger wrote:
> >
> >> I'm trying to build a model using tweets.  I've manually tagged 30
> tweets
> >> as threatening, and 50 random tweets as non-threatening.  When I build
> the
> >> model with:
> >>
> >> update(models2, batchSize="50",
> >>  train(UNCLASS,
> >>   features(UNCLASS,
> >>  q="ProfileID:PROFCLUST1",
> >>  featureSet="threatFeatures3",
> >>  field="ClusterText",
> >>  outcome="out_i",
> >>  positiveLabel=1,
> >>  numTerms=250),
> >>   q="ProfileID:PROFCLUST1",
> >>   name="threatModel3",
> >>   field="ClusterText",
> >>   outcome="out_i",
> >>   maxIterations="100"))
> >>
> >> It appears to work, but all the idfs_ds values are identical. The
> >> terms_ss values look reasonable, but nearly all the weights_ds are 1.0.
> >> For out_i it is -1 for non-threatening tweets and +1 for
> >> threatening tweets.  I'm trying to follow along with Joel Bernstein's
> >> excellent post here:
> >> http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-s
> >> ystem-with-solrs.html
> >>
> >> Tips?
> >>
> >> Thank you!
> >>
> >> -Joe
> >>
> >>
> >
>


Re: query rewriting

2017-03-07 Thread Tim Casey
Hendrik,

I would recommend sticking as closely as possible to the query syntax as it
exists in lucene.

However, if you do your own query parsing and build-up, you can use a Lucene
Query object.  I don't know where this bolts into solr, exactly.  But I
have done this extensively with lucene.  The reason was to combine two
distinct portions of content into one unified query language.  Also, we did
some remapping of field names into a normalized user experience.  This
meant the field names could be exposed in the UI, independent of the
metadata of the underlying content.  For what I did, the source content
could be vastly different from one index to another.  Usually this is not
the case.

You end up building or/and query phrases, then passing them off to the
query engine.  If you do this, you can also optimize and add boost terms
under specific circumstances.  If there is a set of required terms/phrases,
then you can add boost terms or remove non-required terms without any loss
to the overall result set.  This changes the order in which items are
returned, so it may affect the user's perception of recall, but it can be
justified for specific reasons.
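
A small sketch of what that build-up can look like on the Lucene side.  The
field-name remapping and the boosted phrase are illustrative, not anything
solr gives you out of the box:

    import java.util.Collections;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class RewriteExample {
        // UI-facing field names mapped to the fields actually in the index
        static final Map<String, String> FIELD_MAP =
            Collections.singletonMap("title", "title_txt");

        public static Query build() {
            String field = FIELD_MAP.getOrDefault("title", "title_txt");

            BooleanQuery.Builder b = new BooleanQuery.Builder();
            // required term: every hit must contain it
            b.add(new TermQuery(new Term(field, "solr")),
                  BooleanClause.Occur.MUST);
            // optional boosted phrase: reorders results without shrinking the set
            b.add(new BoostQuery(new PhraseQuery(field, "query", "rewriting"), 2.0f),
                  BooleanClause.Occur.SHOULD);
            return b.build();
        }
    }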

tim

On Sun, Mar 5, 2017 at 11:40 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> I would like to dynamically modify a query, for example by replacing a
> field name with a different one. Given how complex the query parsing is it
> does look error prone to duplicate that so I would like to work on the
> Lucene Query object model instead. The subclasses of Query look relatively
> simple and easy to rewrite on the Lucene side but on the Solr side this
> does not seem to be the case. Any suggestions on how this could be done?
>
> thanks,
> Hendrik
>


Re: Question about best way to architect a Solr application with many data sources

2017-02-22 Thread Tim Casey
I would possibly extend this a bit further.  There is the source, then the
'normalized' version of the data, then the indexed version.
Sometimes you realize you missed something in the normalized view and you
have to go back to the actual source.

This becomes more likely as the number of data sources grows.  I would
expect the "DB" version of the data to be the normalized view.
It is also possible that the DB holds the raw bytes of the source, which are
then transformed into a normalized view.  Indexing always happens from
the normalized view.  In this scheme, there is frequently a way to mark
what failed normalization so you can go back and recapture the data for a
re-index.

Also, if you are dealing with timely data, being able to reindex helps
remove stale information from the search index.  In the pipeline of
captured source -> normalized -> analyzed -> information, where 'analyzed'
is the indexing step here, what you do with the data over a year or more
becomes part of the thinking.
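
A minimal SolrJ sketch of indexing from the normalized view, with a pointer
back to the raw capture and a flag for records that failed normalization.
The collection name, field names, and S3-style path are all placeholders:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexNormalized {
        public static void main(String[] args) throws Exception {
            SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/listings").build();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "craigslist-12345");        // stable id from the source system
            doc.addField("source_s", "craigslist");         // which feed/scraper produced it
            doc.addField("raw_ref_s", "s3://raw/craigslist/12345.html"); // raw capture for re-index
            doc.addField("normalized_b", true);             // false => recapture before re-indexing
            doc.addField("title_txt", "2014 Honda Civic LX");
            doc.addField("price_f", 9500.0f);

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }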



On Tue, Feb 21, 2017 at 8:24 PM, Walter Underwood 
wrote:

> Reindexing is exactly why you want the Single Source of Truth to be in a
> repository outside of Solr.
>
> For our slowly-changing data sets, we have an intermediate JSONL batch.
> That is created from the source repositories and saved in Amazon S3. Then
> we load it into Solr nightly. That allows us to reload whenever we need to,
> like loading prod data in test or moving search to a different Amazon
> region.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Feb 21, 2017, at 7:34 PM, Erick Erickson 
> wrote:
> >
> > Dave:
> >
> > Oh, I agree that a DB is a perfectly valid place to store the data and
> > you're absolutely right that it allows better interaction than flat
> > files; you can ask questions of an RDBMS that you can't easily ask the
> > disk ;). Storing to disk is an alternative if you're unwilling to deal
> > with a DB is all.
> >
> > But the main point is you'll change your schema sometime and have to
> > re-index. Having the data you're indexing stored locally in whatever
> > form will allow much faster turn-around rather than re-crawling. Of
> > course it'll result in out of date data so you'll have to refresh
> > somehow sometime.
> >
> > Erick
> >
> > On Tue, Feb 21, 2017 at 6:07 PM, Dave 
> wrote:
> >> Ha I think I went to one of your training seminars in NYC maybe 4 years
> ago Eric. I'm going to have to respectfully disagree about the rdbms.  It's
> such a well known data format that you could hire a high school programmer
> to help with the db end if you knew how to flatten it to solr. Besides it's
> easy to visualize and interact with the data before it goes to solr. A
> Json/Nosql format would work just as well, but I really think a database
> has its place in a scenario like this
> >>
> >>> On Feb 21, 2017, at 8:20 PM, Erick Erickson 
> wrote:
> >>>
> >>> I'll add that I _guarantee_ you'll want to re-index the data as you
> >>> change your schema
> >>> and the like. You'll be able to do that much more quickly if the data
> >>> is stored locally somehow.
> >>>
> >>> A RDBMS is not necessary however. You could simply store the data on
> >>> disk in some format
> >>> you could re-read and send to Solr.
> >>>
> >>> Best,
> >>> Erick
> >>>
>  On Tue, Feb 21, 2017 at 5:17 PM, Dave 
> wrote:
>  B is a better option long term. Solr is meant for retrieving flat
> data, fast, not hierarchical. That's what a database is for and trust me
> you would rather have a real database on the end point.  Each tool has a
> purpose, solr can never replace a relational database, and a relational
> database could not replace solr. Start with the slow model (database) for
> control/display and enhance with the fast model (solr) for retrieval/search
> 
> 
> 
> > On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
> >
> > To learn how to properly use Solr, I'm building a little experimental
> > project with it to search for used car listings.
> >
> > Car listings appear in a variety of different places ... central places
> > like Craigslist and also many individual used-car dealership websites.
> >
> > I am wondering, should I:
> >
> > (a) deploy a Solr search engine and build individual indexers for
> every
> > type of web site I want to find listings on?
> >
> > or
> >
> > (b) build my own database to store car listings, and then build
> services
> > that scrape data from different sites and feed entries into the
> database;
> > then point my Solr search to my database, one simple source of
> listings?
> >
> > My concerns are:
> >
> > With (a) ... I have to be smart enough to understand all those
> different
> > data sources and remove/update 

Re: Chegg is looking for a search engineer

2013-11-18 Thread Tim Casey
I have been chasing the chegg recruiters.  I expect to hear back from Glenn
sometime tomorrow.

tim


On Mon, Nov 18, 2013 at 6:37 PM, Walter Underwood wun...@wunderwood.org wrote:

 I work at Chegg.com and I really like it, but we have more search work
 than I can do by myself, so we are hiring a senior software engineer for
 search. The search services include: textbooks (rental and purchase),
 user-generated homework QA, expert-written textbook solutions, search
 within e-books, customer support FAQ, and schools and scholarships for
 Zinch.com. Most of our search services are on Solr.

 http://www.chegg.com/jobs/listings/?jvi=oAQGXfwN,Job

 If you'd like to know a lot more about Chegg's business, you can read the
 S1 that we filed recently in preparation for our IPO or you can follow us
 as CHGG on the New York Stock Exchange.

 wunder
 --
 Walter Underwood
 wun...@wunderwood.org
 Search Guy
 chegg.com