[job-post] Looking for developers and managers - OpenSource @AWS

2020-04-21 Thread Anirudha Jadhav
Hi Folks,

I have worked with Lucene and Solr since 1.4, and lately I have been involved in
building on core Lucene work for ML, SQL, and other query engines.

Want to work on 100% open source software while working at Amazon? Let's chat!
We have backend SQL engine development (https://lnkd.in/dGA96vv) and frontend
React (https://lnkd.in/dtq89dk) data visualization engineer roles for my
team in Seattle. Or submit PRs and get your code running on AWS! GitHub:
https://lnkd.in/dCdrdAh

-- 
Anirudha P. Jadhav


Re: Is Banana deprecated?

2020-04-21 Thread Cassandra Targett
Banana is a fork of a very old Kibana version (Kibana 3.x) developed by 
Lucidworks. It’s technically out of scope for this list, as the Solr community 
has nothing to do with maintaining it.

(Full disclosure, I work at Lucidworks. However, I’m on a different team and 
have no idea about Banana’s development cycle/roadmap.)

Personally, I think it’s fine for some use cases, but many users have had 
problems with queries bogging down their Solr instances and causing overall 
slowness. This is because behind every panel is a Solr query, so to draw every 
panel a new query is issued. If you have a complex dashboard with even 
semi-complex queries it adds load, possibly a lot of load. It might be fine for 
you, though, depending on what you use it for and how much data you are working 
with.

I will say there are more up-to-date Solr integrations actively maintained by 
the Solr community that may satisfy similar needs:

If you’re looking for something like log analytics, take a look at using 
streaming expressions for this: 
https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/logs.adoc.
 This approach can be adapted for whatever kind of data you have and want to 
visualize.

If you want to track metrics, integrating with something like Prometheus & 
Grafana might be better: 
https://lucene.apache.org/solr/guide/monitoring-solr-with-prometheus-and-grafana.html.

Hope it helps -
Cassandra
On Apr 16, 2020, 6:41 PM -0500, S G , wrote:
> Hello,
>
> I still see releases happening on it:
> https://github.com/lucidworks/banana/pull/355
>
> So is it something recommended for production use?
>
> Regards,
> SG


Re: How upgrade to Solr 8 impact performance

2020-04-21 Thread Natarajan, Rajeswari
Any other experiences with Solr 7 to Solr 8 upgrade performance? Please share.

Thanks,
Rajeswari

On 4/15/20, 4:00 PM, "Paras Lehana"  wrote:

In January, we upgraded Solr from version 6 to 8, skipping all versions in
between.

The hardware and Solr configurations were kept the same, but we still saw
response times degrade by 30-50%. We had exceptional query times of around
25 ms with Solr 6 and are now hovering around 36 ms.

Since response times under 50 ms are very good even for Auto-Suggest, we
have not tried any changes to address this. Nevertheless, you can try using
Caffeine Cache. Looking forward to reading community input as well.
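
(For reference, switching the caches over in solrconfig.xml looks roughly like
the snippet below; the sizes and autowarm counts are placeholders to tune for
your own index, not our production values:)

<query>
  <filterCache class="solr.CaffeineCache" size="512" autowarmCount="0"/>
  <queryResultCache class="solr.CaffeineCache" size="512" autowarmCount="0"/>
  <documentCache class="solr.CaffeineCache" size="512" autowarmCount="0"/>
</query>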



On Thu, 16 Apr 2020 at 01:34, ChienHuaWang  wrote:

> Does anyone have experience upgrading an application from Solr 7.x to 8.x?
> How is the query performance?
> We found slightly slower response times from the application with Solr 8,
> based on current measurements, and are still looking into the details.
> But I am wondering whether anyone has had a similar experience. Is that
> something we should expect with Solr 8.x?
>
> Please kindly share, thanks.
>
> Regards,
> ChienHua
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, *Auto-Suggest*,
IndiaMART InterMESH Ltd,

11th Floor, Tower 2, Assotech Business Cresterra,
Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305

Mob.: +91-9560911996
Work: 0120-4056700 | Extn:
*1196*



Re: "SolrCore Initialization Failures" error message appears briefly in Solr 8.5.1 Admin UI

2020-04-21 Thread Colvin Cowie
From a (very) brief googling, it seems like using the ng-cloak attribute is
the right way to fix this, and it certainly seems to work for me.
https://issues.apache.org/jira/browse/SOLR-14422
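
(Roughly what that means in practice; this is just a sketch, not the actual
Admin UI markup, and the element and scope names are made up:)

<!-- Hide the error panel until AngularJS has compiled it, so the placeholder
     never flashes on page load. -->
<div class="alert alert-danger" ng-cloak ng-show="initFailures.length">
  SolrCore Initialization Failures: {{ initFailures | json }}
</div>

<!-- Only needed if angular.js is loaded late: the stock CSS rule that backs
     ng-cloak. -->
<style>
  [ng-cloak], [data-ng-cloak], .ng-cloak { display: none !important; }
</style>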

On Mon, 20 Apr 2020 at 18:12, Colvin Cowie 
wrote:

> Sorry if this has already been raised, but I didn't see it.
>
> When loading / refreshing the Admin UI in 8.5.1, it briefly but *visibly*
> shows a placeholder for the "SolrCore Initialization Failures" error
> message, with a lot of redness. It looks like there is a real problem.
> Obviously the message then disappears, and it can be ignored.
> However, if I was a first time user, it would not give me confidence that
> everything is okay. In a way, an error message that appears briefly then
> disappears before I can finish reading it is worse than one which just
> stays there.
>
> Here's a screenshot of what I mean
> https://drive.google.com/open?id=1eK4HNprEuEua08_UwtEoDQuRwFgqbGjU
> and a gif:
> https://drive.google.com/open?id=1Rw3z03MzAqFpfZFU4uVv4G158vk66QVx
>
> I assume that this is connected to the UI updates discussed in
> https://issues.apache.org/jira/browse/SOLR-14359
>
> Cheers,
> Colvin
>


Re: solr as a general search engine

2020-04-21 Thread Jan Høydahl
To follow up on Charlie’s points.

Looks like your primary source is a web or site crawl with Nutch. Once you are in 
the territory of unstructured text mixed with PDF/Word docs, spread across 
multiple subdomains, and perhaps lots of old «garbage» content, then you are 
looking at a very different search problem than clean, purely structured DB 
search.

You need to deal with everything from HTML cleansing, false last-updated dates 
from various webservers, bad or missing metadata, content using other 
terminology than users of your search, strange old documents popping up to the 
top of your result list for apparently no good reason (other than perhaps some 
IDF boost of a word in title) etc etc. If you go multi lingual you face even 
more challenges.

And from where do you collect «page rank» data, i.e. the authority of a page? 
From where do you collect link text? Do you have enough quality link texts to 
even start boosting on them? What should be deemed a landing page and how much 
boost to assign to it versus a page that fits the content very well by hundreds 
of keyword matches….

The first thing to understand is that it will cost you considerable time and 
skill to build and maintain such a web crawl index.
Next, you must realize you’ll never achieve «Google quality» within a typical 
budget.
But that does not mean it’s never a good idea to build such a search inhouse. 
Over time you will be able to address many of the issues you face, but some of 
the issues may require custom tooling.

I’m not familiar with the commercial systems on the market today. Some of them 
may of course have a toolbox that gets you to an acceptable level much more 
quickly. But then, when you hit the wall with what the product can do, you are 
likely stuck :)

Jan


> 21. apr. 2020 kl. 15:13 skrev Charlie Hull :
> 
> Hi Matt,
> 
> On 21/04/2020 13:41, matthew sporleder wrote:
>> Sorry for the vague question and I appreciate the book recommendations
>> -- I actually think I am mostly confused about suggest vs spellcheck
>> vs morelikethis as they relate to what I referred to as "expected"
>> behavior (like from a typed-in search bar).
> Suggest - here's some results that might match based on what you've typed so 
> far (usually powered by a behind-the-scenes search of the index with some 
> restrictions). Note the difference between this and autocompletion, which 
> suggests complete search terms from the index based on the partial word 
> you've typed so far.
> Spellcheck - The word you typed isn't anywhere in the index, so I've used an 
> edit distance algorithm to suggest a few words you might have meant that are 
> in the index (note this isn't spelling correction as the engine doesn't 
> necessarily have the corrected form in its index)
> Morelikethis - here's some results that share some characteristics with the 
> document you're looking at, e.g. they're indexed by some of the same terms
>> 
>> For reference we have been using solr as search in some form for
>> almost 10 years and it's always been great in finding things based on
>> clear keywords, programmatic-type discovery, a nosql/distributed k:v
>> (actually really really good at this) but has always fallen short
>> (imho and also our fault, obviously) in the "typed in a search query"
>> experience.
> I'm guessing you're bumping into the problem that most people type very 
> little into a search bar, and expect the engine to magically know what they 
> meant. It doesn't of course, so it has to suggest some ways for the user to 
> tell it more specific information - facets for example, or some of the 
> features above.
>> 
>> We are in the midst of re-developing our internal content ranking
>> system and it has me grasping on how to *really* elevate our game in
>> terms of giving an excellent human-driven discovery vs our current
>> behavior of: "here is everything we have that contains those words,
>> minus ones I took out".
> 
> I think you need to look at several angles:
> 
> - What defines a 'good' result in your world/for your content?
> - Who judges this? How do you record this? Human/clicks/both?
> - What Solr features *could* help - and how are you going to test that they 
> actually do using the two lines above?
> 
> We think that building up this measurement-driven, experimental process is 
> absolutely key to improving relevance.
> 
> Cheers
> 
> Charlie
> 
>> 
>> 
>> On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull  wrote:
>>> Hi Matt,
>>> 
>>> Are you looking for a good, general purpose schema and config for Solr?
>>> Well, there's the problem: you need to define what you mean by general
>>> purpose. Every search application will have its own requirements and
>>> they'll be slightly different to every other application. Yes, there
>>> will be some commonalities too. I guess by "as a human might expect one
>>> to behave" you mean "a bit like how Google works" but unfortunately
>>> Google is a poor example: you won't have Google's money or staff or
>>> 

Re: solr as a general search engine

2020-04-21 Thread Charlie Hull

Hi Matt,

On 21/04/2020 13:41, matthew sporleder wrote:

> Sorry for the vague question and I appreciate the book recommendations
> -- I actually think I am mostly confused about suggest vs spellcheck
> vs morelikethis as they relate to what I referred to as "expected"
> behavior (like from a typed-in search bar).
Suggest - here's some results that might match based on what you've 
typed so far (usually powered by a behind-the-scenes search of the index 
with some restrictions). Note the difference between this and 
autocompletion, which suggests complete search terms from the index 
based on the partial word you've typed so far.
Spellcheck - The word you typed isn't anywhere in the index, so I've 
used an edit distance algorithm to suggest a few words you might have 
meant that are in the index (note this isn't spelling correction as the 
engine doesn't necessarily have the corrected form in its index)
Morelikethis - here's some results that share some characteristics with 
the document you're looking at, e.g. they're indexed by some of the same 
terms
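
(To make the first two concrete: a minimal suggester setup in solrconfig.xml
looks roughly like the snippet below, and spellcheck is wired up the same way
via solr.SpellCheckComponent. The component name, field, and lookup choices are
placeholders, not a recommendation for your schema.)

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title_txt</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>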


> For reference we have been using solr as search in some form for
> almost 10 years and it's always been great in finding things based on
> clear keywords, programmatic-type discovery, a nosql/distributed k:v
> (actually really really good at this) but has always fallen short
> (imho and also our fault, obviously) in the "typed in a search query"
> experience.
I'm guessing you're bumping into the problem that most people type very 
little into a search bar, and expect the engine to magically know what 
they meant. It doesn't of course, so it has to suggest some ways for the 
user to tell it more specific information - facets for example, or some 
of the features above.


> We are in the midst of re-developing our internal content ranking
> system and it has me grasping on how to *really* elevate our game in
> terms of giving an excellent human-driven discovery vs our current
> behavior of: "here is everything we have that contains those words,
> minus ones I took out".


I think you need to look at several angles:

- What defines a 'good' result in your world/for your content?
- Who judges this? How do you record this? Human/clicks/both?
- What Solr features *could* help - and how are you going to test that 
they actually do using the two lines above?


We think that building up this measurement-driven, experimental process 
is absolutely key to improving relevance.


Cheers

Charlie




On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull  wrote:

Hi Matt,

Are you looking for a good, general purpose schema and config for Solr?
Well, there's the problem: you need to define what you mean by general
purpose. Every search application will have its own requirements and
they'll be slightly different to every other application. Yes, there
will be some commonalities too. I guess by "as a human might expect one
to behave" you mean "a bit like how Google works" but unfortunately
Google is a poor example: you won't have Google's money or staff or
platform in your company, nor are you likely to be building a
massive-scale web search engine, so at best you can just take
inspiration from it, not replicate it.

In practice, what a lot of people do is start with an example setup
(perhaps from one of the examples supplied with Solr, e.g.
'techproducts') and adapt it: or they might start with the Solr
configset provided by another framework, e.g. Drupal (yay! Pink
Ponies!). Unfortunately the standard example configsets are littered
with comments that say things like 'Here is how you *could* do XYZ but
please don't actually attempt it this way' and other config sections
that if you un-comment them may just get you into further trouble. It's
grown rather than been built, and to my mind there's a good argument for
starting with an absolutely minimal Solr configset and only adding
things in as you need them and understand them (see
https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html
for some background and a great presentation from Alex Rafalovitch on
the examples).

You're also going to need some background on *why* all these features
should be used, and for that I'd recommend my colleague Doug's book
Relevant Search https://www.manning.com/books/relevant-search - or maybe
our training (quick plug: we're running some online training in a couple
of weeks
https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )

Hope this helps,

Cheers

Charlie

On 20/04/2020 23:43, matthew sporleder wrote:

Is there a comprehensive/big set of tips for making solr into a
search-engine as a human would expect one to behave?  I poked around
in the nutch github for a minute and found this:
https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
   but I was wondering if I was missing a very obvious document
somewhere.

I guess I'm looking for things like:
use suggester here, use spelling there, use DocValues around here, DIY

Re: solr as a general search engine

2020-04-21 Thread matthew sporleder
Sorry for the vague question and I appreciate the book recommendations
-- I actually think I am mostly confused about suggest vs spellcheck
vs morelikethis as they relate to what I referred to as "expected"
behavior (like from a typed-in search bar).

For reference we have been using solr as search in some form for
almost 10 years and it's always been great in finding things based on
clear keywords, programmatic-type discovery, a nosql/distributed k:v
(actually really really good at this) but has always fallen short
(imho and also our fault, obviously) in the "typed in a search query"
experience.

We are in the midst of re-developing our internal content ranking
system and it has me grasping on how to *really* elevate our game in
terms of giving an excellent human-driven discovery vs our current
behavior of: "here is everything we have that contains those words,
minus ones I took out".





On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull  wrote:
>
> Hi Matt,
>
> Are you looking for a good, general purpose schema and config for Solr?
> Well, there's the problem: you need to define what you mean by general
> purpose. Every search application will have its own requirements and
> they'll be slightly different to every other application. Yes, there
> will be some commonalities too. I guess by "as a human might expect one
> to behave" you mean "a bit like how Google works" but unfortunately
> Google is a poor example: you won't have Google's money or staff or
> platform in your company, nor are you likely to be building a
> massive-scale web search engine, so at best you can just take
> inspiration from it, not replicate it.
>
> In practice, what a lot of people do is start with an example setup
> (perhaps from one of the examples supplied with Solr, e.g.
> 'techproducts') and adapt it: or they might start with the Solr
> configset provided by another framework, e.g. Drupal (yay! Pink
> Ponies!). Unfortunately the standard example configsets are littered
> with comments that say things like 'Here is how you *could* do XYZ but
> please don't actually attempt it this way' and other config sections
> that if you un-comment them may just get you into further trouble. It's
> grown rather than been built, and to my mind there's a good argument for
> starting with an absolutely minimal Solr configset and only adding
> things in as you need them and understand them (see
> https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html
> for some background and a great presentation from Alex Rafalovitch on
> the examples).
>
> You're also going to need some background on *why* all these features
> should be used, and for that I'd recommend my colleague Doug's book
> Relevant Search https://www.manning.com/books/relevant-search - or maybe
> our training (quick plug: we're running some online training in a couple
> of weeks
> https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )
>
> Hope this helps,
>
> Cheers
>
> Charlie
>
> On 20/04/2020 23:43, matthew sporleder wrote:
> > Is there a comprehensive/big set of tips for making solr into a
> > search-engine as a human would expect one to behave?  I poked around
> > in the nutch github for a minute and found this:
> > https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
> >   but I was wondering if I was missing a very obvious document
> > somewhere.
> >
> > I guess I'm looking for things like:
> > use suggester here, use spelling there, use DocValues around here, DIY
> > pagerank, etc
> >
> > Thanks,
> > Matt
>
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>


Payloads

2020-04-21 Thread Vincenzo D'Amore
Hi All,


Still struggling with payloads. To understand my problem better, I've created
a minimal reproducible example.

Basically I have a multivalued field with payloads with this schema
configuration:
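
(The schema XML did not survive the list archive; the sketch below reconstructs
the kind of field I mean, consistent with the queries further down. The type
name and analyzer details are assumptions, not the exact original.)

<fieldType name="delimited_payloads_float" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>

<field name="multipayload" type="delimited_payloads_float" indexed="true"
       stored="true" multiValued="true"/>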

The field is populated with data like this:

<add>
  <doc>
    <field name="id">1</field>
    <field name="multipayload">A:1 B:2 C:3 D:4</field>
    <field name="multipayload">A:0.1 B:0.2 E:5 F:6</field>
    <field name="multipayload">E:0.5 F:0.6</field>
  </doc>
</add>


I want to be able to query the multipayload field with any number of tokens in
any order, and get as a result the SUM of the payload values of those tokens,
but only for the rows (values) of the multipayload field that contain all the
tokens of the query (basically an AND condition on the row). For example:



   1. I run the query having B F A as clauses, I expect to obtain a match
   on the second row for doc with id=1, and so a score of 0.2 + 0.1 + 6 = 6.3
   2. I run the query having F E as clauses, I expect to obtain a match on
   the second and the third row for doc with id=1 and thus a score of (6 + 5)
   + (0.6 + 0.5) = 12.1
   3. I run the query having A F as clauses, I expect to have no match and
   thus a score of 0.0



I tried to use a query like this:



http://localhost:8983/solr/test/select?debugQuery=true&q={!payload_score
f=multipayload v=$pl func=sum includeSpanScore=false
operator=phrase}&pl=__MY_CLAUSES__



The results I obtain are:



   1. B F A: No results
   2. F E:  6.5 (resulting from matches on row #2: 6 and row #3: 0.5) – as a
   result of the span query, I presume
   3. E F:  12.1 (as expected, but only because “by chance” the sequence
   matches as a phrase on rows #2 and #3)
   4. A F: No results (as expected)



Looking into Solr payloads code (

https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/util/PayloadUtils.java#L139

), I see that:



   - There are only two options: OR and phrase, while I think my case needs an
   AND operator
   - The phrase option has a hardwired distance of 0 for the span query:
   query = new SpanNearQuery(terms.toArray(new SpanTermQuery[terms.size()]),
   0, true);



I think that a phrase query with a huge distance (e.g. 100) could behave like
an AND query, but I'm just guessing. In any case, to suit my use case I think
I'd need either an AND option or the ability to define the span behaviour of
the phrase query more flexibly.
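
Something like the sketch below is the kind of flexibility I mean (not tested
against PayloadUtils; the class and method names are mine, and only the
SpanNearQuery call mirrors the code linked above):

import java.util.List;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class AndLikePayloadSpan {
    // Build an "AND-like" span over the query tokens: a large slop plus
    // inOrder=false matches the tokens in any order. The slop value (100 here,
    // as suggested above) would need tuning against the field's
    // positionIncrementGap so a match stays within a single value.
    static SpanQuery build(List<SpanTermQuery> terms) {
        return new SpanNearQuery(terms.toArray(new SpanTermQuery[0]), 100, false);
    }
}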



Even if my case is quite specific, I think the current implementation of the
phrase option is also not well suited to the more general case of weights
associated with part-of-speech classes, which is in my opinion a more classic
use of payloads, where for example I want to deboost adjectives relative to
nouns:



   - a *race horse* is a *horse* that runs in races
   - a *horse race* is a *race* for horses



In general, it seems to me that the absence of an AND option and the hardwired
phrase slop of 0 are quite limiting.


Thanks in advance for your time,

Vincenzo

-- 
Vincenzo D'Amore


Re: solr as a general search engine

2020-04-21 Thread Charlie Hull

Hi Matt,

Are you looking for a good, general purpose schema and config for Solr? 
Well, there's the problem: you need to define what you mean by general 
purpose. Every search application will have its own requirements and 
they'll be slightly different to every other application. Yes, there 
will be some commonalities too. I guess by "as a human might expect one 
to behave" you mean "a bit like how Google works" but unfortunately 
Google is a poor example: you won't have Google's money or staff or 
platform in your company, nor are you likely to be building a 
massive-scale web search engine, so at best you can just take 
inspiration from it, not replicate it.


In practice, what a lot of people do is start with an example setup 
(perhaps from one of the examples supplied with Solr, e.g. 
'techproducts') and adapt it: or they might start with the Solr 
configset provided by another framework, e.g. Drupal (yay! Pink 
Ponies!). Unfortunately the standard example configsets are littered 
with comments that say things like 'Here is how you *could* do XYZ but 
please don't actually attempt it this way' and other config sections 
that if you un-comment them may just get you into further trouble. It's 
grown rather than been built, and to my mind there's a good argument for 
starting with an absolutely minimal Solr configset and only adding 
things in as you need them and understand them (see 
https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html 
for some background and a great presentation from Alex Rafalovitch on 
the examples).


You're also going to need some background on *why* all these features 
should be used, and for that I'd recommend my colleague Doug's book 
Relevant Search https://www.manning.com/books/relevant-search - or maybe 
our training (quick plug: we're running some online training in a couple 
of weeks 
https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )


Hope this helps,

Cheers

Charlie

On 20/04/2020 23:43, matthew sporleder wrote:

Is there a comprehensive/big set of tips for making solr into a
search-engine as a human would expect one to behave?  I poked around
in the nutch github for a minute and found this:
https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
  but I was wondering if I was missing a very obvious document
somewhere.

I guess I'm looking for things like:
use suggester here, use spelling there, use DocValues around here, DIY
pagerank, etc

Thanks,
Matt



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com