Re: Questions about Solr Search

2020-07-02 Thread Doug Turnbull
I think it's better to think of Solr as a piece of infrastructure or
component for you to build these things, rather than a product that has a
lot of capabilities for some specific use case.

So you can find 'lego pieces' to build some of these things, but with Solr
you need to build these things yourself. You're trading off the targeted
features you'll find in a search product against the depth of configurability
and pluggability of open source search. With Solr you should expect a big
engineering investment, and getting to know the internals, to use it most
effectively.

On topics 2 & 3, you might be interested in AI Powered Search, which has a
strong NLP component: http://aipoweredsearch.com

-Doug

On Thu, Jul 2, 2020 at 10:26 AM Gautam K  wrote:

> Dear Team,
>
> Hope you all are doing well.
>
> Can you please help with the following question? We are using Solr search
> in our organisation and are now checking whether Solr provides search
> capabilities like Google Enterprise Search (Google Knowledge Graph Search).
>
> 1. Does Solr Search provide Voice Search like Google?
> 2. Does Solr Search provide NLP Search (Natural Language Processing)?
> 3. Does Solr have all the capabilities which Google Knowledge Graph
> provides, like those below?
>
>
>- Getting a ranked list of the most notable entities that match
>certain criteria.
>- Predictively completing entities in a search box.
>- Annotating/organizing content using the Knowledge Graph entities.
>
>
> *Your help will be highly appreciated.*
>
> Many thanks
> Gautam Kanaujia
> India
>


-- 
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
Powered Search <http://aipoweredsearch.com>*
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Welcome Mayya Sharipova as Lucene/Solr committer

2020-06-08 Thread Doug Turnbull
What a great person to have as a committer - congrats Mayya!

On Mon, Jun 8, 2020 at 1:30 PM Eric Pugh 
wrote:

> Congratulations!  Welcome!
>
> On Jun 8, 2020, at 1:26 PM, Steve Rowe  wrote:
>
> Congrats and welcome, Mayya!
>
> --
> Steve
>
> On Jun 8, 2020, at 12:58 PM, jim ferenczi  wrote:
>
> Hi all,
>
> Please join me in welcoming Mayya Sharipova as the latest Lucene/Solr
> committer.
> Mayya, it's tradition for you to introduce yourself with a brief bio.
>
> Congratulations and Welcome!
>
> Jim
>
>
>
> ___
> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
> | http://www.opensourceconnections.com | My Free/Busy
> <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>
>



Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-14 Thread Doug Turnbull
enance as a separate component. The learning curve for people
>> > coming to each project separately is going to be gentler than trying
>> > to dive into the combined codebase.
>> >
>> > 5) Mailing lists, build servers. Mailing lists for users are already
>> > separated. I think this is yet another indication that Solr is
>> > something more than a component within Lucene. It is perceived as an
>> > independent entity and used as an independent product. I would really
>> > like to have separate mailing lists for these two projects (this
>> > includes build and test results) as it would make life easier: if your
>> > focus is more on Lucene (or Solr), you would only need to track half
>> > of the current traffic.
>> >
>> >
>> > As I already mentioned, the discussion among PMC members highlighted
>> > some initial concerns and reasons why the project should perhaps
>> > remain glued together. These are outlined below with some of the
>> > counter-arguments presented under each concern to avoid repetition of
>> > the same content from the PMC mailing list (they’re copied from the
>> > private discussion list).
>> >
>> > 1) Both projects may gradually split their ways after the separation
>> > and even develop “against” each other like it used to be before the
>> > merge.
>> >
>> > Whether this is a legitimate concern is hard to tell. If Solr goes TLP
>> > then all existing Lucene committers will automatically become Solr
>> > committers (unless they opt not to) so there will be both procedural
>> > ways to prevent this from happening (vetoes) as well as common-sense
>> > reasons to just cooperate.
>> >
>> > 2) Some people like parallel version numbering (concurrent Solr and
>> > Lucene releases) as it gives instant clarity which Solr version uses
>> > which version of Lucene.
>> >
>> > This can still be done on the Solr side (it is Solr’s decision to adopt
>> > any versioning scheme the project feels comfortable with). I
>> > personally (DW) think this kind of versioning is actually more
>> > confusing than helpful; Solr should have its own cadence of releases
>> > driven by features, not sub-component changes. If the “backwards
>> > compatibility” is a factor then a solution might be to sync on major
>> > version releases only (e.g., this is how Elasticsearch is handling
>> > this).
>> >
>> > 3) Solr tests are the first “battlefield” test zone for Lucene changes
>> > - if it becomes TLP this part will be gone.
>> >
>> > Yes, true. But realistically Solr will have to adopt some kind of
>> > snapshot-based dependency on Lucene anyway (whether as a git submodule
>> > or a maven snapshot dependency). So if there are bugs in Lucene they
>> > will still be detected by Solr tests (and fairly early).
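
The snapshot-dependency option mentioned here might be sketched, on the Solr side, as a Maven dependency like the following. This is a hypothetical illustration only; the version shown is an example, not actual project configuration:

```xml
<!-- Hypothetical sketch: Solr consuming a Lucene snapshot build.
     The version is illustrative, not real project configuration. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>9.0.0-SNAPSHOT</version>
</dependency>
```

A git submodule pinned to a Lucene commit would serve the same purpose, trading Maven's snapshot resolution for an explicit, reviewable version bump.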
>> >
>> > 4) Why split now if we merged in the first place?
>> >
>> > Some of you may wonder why split the project that was initially
>> > *merged* from two independent codebases (around 10 years ago). In
>> > short, there was a lot of code duplication and interaction between
>> > Solr and Lucene back then, with patches flying back and forth.
>> > Integration into a single codebase seemed like a great idea to clean
>> > things up and make things easier. In many ways this is exactly what
>> > did happen: we have cleaned up code dependencies and reusable
>> > components (on Lucene side) consumed by not just Solr but also other
>> > projects (downstream from Lucene).
>> >
>> > The situation we find ourselves in now is different from what it was
>> > before: recent and ongoing development for the most part falls within
>> > Solr or Lucene exclusively.
>> >
>> >
>> > This e-mail is for discussing the idea and presenting arguments/
>> > counter-arguments for or against the split. It will be followed by a
>> > separate VOTE thread e-mail next Monday. If the vote passes then there
>> > are many questions about how this process should be arranged and
>> > orchestrated. There are past examples even within Lucene [1] that we
>> > can learn from, and there are people who know how to do it - the
>> > actual process is of lesser concern at the moment; what we mostly want
>> > to do is to reach out to you, signal the idea, and ask for your
>> > opinion. Let us know what you think.
>> >
>> > [1]
>> https://lists.apache.org/thread.html/15bf2dc6d6ccd25459f8a43f0122751eedd3834caa31705f790844d7%401270142638%40%3Cuser.nutch.apache.org%3E
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Anshum Gupta
>




Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-13 Thread Doug Turnbull
Jason, I hear your arguments, and I think of them as arguments FOR a split.

This might sound a bit harsh, but maybe Lucene devs helping with Solr has
let Solr off the hook a bit too much? I actually like the fact that the
split forces Solr to figure out its own situation and focus on
its problems.

Regardless of the split or not, Solr is going to sink or swim based on the
efforts of Solr committers, not Lucene committers. I don't think Lucene
committers are going to be the ones to really address the systemic issues
with Solr. If anything, I imagine their contributions would be at the "let me
fix this so the code compiles" level of maintenance.

"Falling behind Lucene" is counterbalanced, for me, by the question "Should
Solr be on cutting-edge Lucene?"

I'd be OK with a stable, robust Solr that got 1-2 major versions behind
Lucene, but was rock-solid with a lower barrier to entry...

On Wed, May 13, 2020 at 10:07 AM Jason Gerlowski 
wrote:

> Wanted to add my two cents to the mix, though I'm a little late as the
> vote has already progressed pretty far.
>
> I'm against a split.  From the points raised, I agree that Lucene has
> much to gain.  But Solr has a lot to lose.
>
> Lucene devs would be freed from keeping Solr usage up to date.  That's
> a great improvement for Lucene itself.  But that burden doesn't
> disappear - it's just being moved to a different (smaller) group of
> committers - who by definition don't know Lucene as well, and are less
> suited to the task.  (Lucene devs still might help post-split, but
> given that avoiding this burden is one of the arguments made above for
> a split, it seems unwise to assume how much this generosity will
> continue.)
>
> One likely result is that Solr will fall behind Lucene. Possibly
> permanently behind.  Lucene folks are doing great work to improve
> perf, add features etc. so falling behind is a Very Bad Thing.  To
> Solr, Lucene is not the same as Jetty or Jackson which Solr can fall
> behind on without significant detriment.  Lucene and the core search
> functionality it offers is what brings people to Solr (or Elastic).
> Putting ourselves in a position to fall behind on Lucene does a huge
> disservice to our users, and loses Solr one of its greatest
> advantages.
>
> I hope that in the case of a split, the Solr community would rise to
> the occasion and prevent this.  But my personal judgement is that it's
> unlikely.  I hate to be negative, and I hope to be proven wrong, but
> that's how things look to me.  We (Solr folks) have a bad track record
> of addressing things with less-tangible, less-sellable benefits.  Take
> our ongoing test flakiness woes and SolrCloud instability issues as
> examples: both are serious threats to the project, both have been
> around for years, and both are here to stay for the foreseeable
> future.
>
> If conditions were different in a way that made "falling behind" less
> likely, I'd be all for a split.  But given (1) our recent track record
> of addressing these sort of issues, (2) our test flakiness which will
> make identifying "Lucene snapshot upgrade" bugs exceedingly difficult,
> and (3) the current economic conditions which may make it harder for
> committers to negotiate time from their employers to work on Lucene
> updates...now seems like a bad time to attempt a split.  It will harm
> Solr more than it helps Lucene.
>
> On Tue, May 12, 2020 at 3:37 PM Namgyu Kim  wrote:
> >
> > It's hard to make a decision because there are pros and cons.
> > Basically, I agree with separating, but there are some questions.
> > So I won't vote right now.
> >
> > 1) Release version
> > Currently, versions of Lucene and Solr are aligned, how will they be
> managed in the future?
> > Other people took Elasticsearch as an example... But it was an
> independent project from the beginning.
> > So there is no problem with the Lucene version. (Elasticsearch 7.7 and
> Lucene 8.5.1)
> > I'm sure that if we make Solr an independent project, it will create
> cracks in the version structure (like Lucene 8.6.2 and Solr 8.9.1).
> > But it's also strange to suddenly start a new versioning for Solr (Solr
> 1.0).
> > Of course it's a matter of adaptation, but it's likely to cause some
> confusion for existing users.
> >
> > 2) Complementary relationship
> > When Lucene and Solr are built together, Solr can always stay on the
> latest Lucene.
> > In my personal opinion, that's a great advantage of Solr,
> > because Solr doesn't have to suffer from Lucene API changes and always has
> the latest library.
> > But that will be difficult if Solr becomes independent.
> > If Solr tracks the master branch of Lucene on separate
> repository(project), can it always check and reflect Lu

Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-12 Thread Doug Turnbull
)
> relationship with the Lucene community as an involved and vested consumer.
> >>
> >> Erik
> >>
> >
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>



Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-05 Thread Doug Turnbull
Personally I feel the burden of proof should not be on why they should be
split up, but the other way around: "what arguments can be made for keeping
them together?"

I would be curious if people can make the argument for keeping them
together...

-Doug

On Tue, May 5, 2020 at 10:29 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Mon, May 4, 2020 at 5:28 PM Gézapeti Cseh  wrote:
>
> I think separating the git repository and even the release schedules could
>> be done under the same TLP.
>>
> It would solve most of the technical issues reflected in the first mail
>> and there would be more time and data to
>>
>
> Hmm that is technically true, and in fact that is the way it was before 10
> years ago: Solr was a sub-project of Apache Lucene.
>
> But that is not the proposal here.
>
> Lucene and Solr have become such major efforts, in developers and users
> eyes and keyboard effort/time, that they really are very different entities
> now.  TLP makes sense to me for each project.
>
>>
>
>> see if creating Apache Solr again is something the PMC would want to do
>>
>
> Hmm, just to clarify, this is not an "again" sort of situation: Solr was
> not a top-level project before.  It was and still is a sub-project of
> Apache Lucene.
>
> And the proposal is to now split it out as its own (new) top-level
> project, Apache Solr.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>




Re: Welcome Eric Pugh as a Lucene/Solr committer

2020-04-07 Thread Doug Turnbull
Eric, great work! Congrats!

Yes we need to see a pic of that quilt... ;)

On Tue, Apr 7, 2020 at 4:40 PM Mikhail Khludnev  wrote:

> Welcome, Eric.
>
> On Tue, Apr 7, 2020 at 4:57 PM Eric Pugh 
> wrote:
>
>> Thank you everyone!  I’ll keep it short, otherwise this will be a very
>> long email… ;-).
>>
>> I was first introduced to Solr and Lucene by Erik Hatcher, and today I
>> wonder what my life would be like if he hadn’t taken the time to show me
>> some cool code he was working on and explained to me the way to change the
>> world was through open source contributions!
>>
>> I co-founded OpenSource Connections (http://o19s.com) along with Scott
>> Stults and Jason Hull in 2005.  We found our niche in Solr consulting after
>> I went to the first LuceneRevolution and got inspired (complete with Jerry
>> Maguire style manifesto shared with the company). Through consulting, I get
>> to help onboard organizations into the Solr community - a thriving, healthy
>> ASF is very near & dear to my heart.
>>
>> I’ve been around this community for a long time, with my first JIRA being
>> three digits: SOLR-284.  Today, I’m still contributing to Apache Tika. I’ve
>> gotten to meet and spend some significant time with Tim Allison from that
>> project and learned a LOT about text!
>>
>> I was in the right place at the right time and was able to join David
>> Smiley as co-author on the first Solr book; we went on to do a total of
>> three editions of that book.  Phew!
>>
>> Once I got to sit on stage as a judge for Stump the Chump, it was Erick,
>> Erik, and Eric ;-)
>>
>> After doing Solr for a good while, I got lucky and met Doug Turnbull on
>> the sidewalk one day because he had on a t-shirt that said “My code doesn’t
>> have bugs, it has unexpected features”.   Couple of years later he and
>> fellow colleague John Berryman published Relevant Search and today I’m
>> working in the fascinating intersection of people, Search, and Data Science
>> helping build smarter search experiences as a Relevance Strategist. I'm
>> excited about bringing relevance use cases 'down to earth'. I also steward
>> OSC's contributions to the open source tool Quepid to help fulfill that
>> goal.
>>
>> Oh, and I’ve got a stack of LuceneRevolution and related conference
>> t-shirts that my mother turned into a fantastic quilt ;-).
>>
>> Eric
>>
>>
>>
>> On Apr 6, 2020, at 9:39 PM, Shalin Shekhar Mangar 
>> wrote:
>>
>> Congratulations and welcome Eric!
>>
>> On Mon, Apr 6, 2020 at 5:51 PM Jan Høydahl  wrote:
>>
>>> Hi all,
>>>
>>> Please join me in welcoming Eric Pugh as the latest Lucene/Solr
>>> committer!
>>>
>>> Eric has been part of the Solr community for over a decade, as a code
>>> contributor, book author, company founder, blogger and mailing list
>>> contributor! We look forward to his future contributions!
>>>
>>> Congratulations and welcome! It is a tradition to introduce yourself
>>> with a brief bio, Eric.
>>>
>>> Jan Høydahl
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>> ___
>> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
>> | http://www.opensourceconnections.com | My Free/Busy
>> <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless
>> of whether attachments are marked as such.
>>
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>




Re: Change solr/lucene Readme file format

2020-01-22 Thread Doug Turnbull
I think this got lost in the holidays. I wanted to bump this contribution,
as I feel Markdown is the standard format devs expect for READMEs
these days. (And the files were close to Markdown anyway.)

Or if the project doesn't want this contribution, I feel we should at least
let Pinkesh (with his 1st contribution) know that this isn't something the
project wants, and close the PR.

Best!
-Doug

On Thu, Nov 14, 2019 at 12:51 AM Man with No Name 
wrote:

> Hey guys,
> I have created a PR <https://github.com/apache/lucene-solr/pull/908> on
> this, please have a look to see if that's helpful.
>
> Thanks:
> Pinkesh Sharma
>
> On Sun, Nov 10, 2019 at 11:29 AM Uwe Schindler  wrote:
>
>> Hi,
>>
>> When building the documentation (ant documentation), all readme files
>> included in the documentation are parsed as markdown (see flexmark task in
>> ant) and converted to html. This works well, although not everything is
>> markdown. If you have a plain readme file, it will still parse as valid
>> markdown and the HTML output looks fine, so Erick's problem with markdown
>> isn't one.
>>
>> Uwe
>>
>> Am November 10, 2019 4:00:21 PM UTC schrieb Marcus Eagan <
>> marcusea...@gmail.com>:
>>>
>>> Most README files in contemporary open source projects are Markdown
>>> because of the formatting features. I personally favor convention over ease
>>> of use in this case.
>>>
>>> Marcus Eagan
>>>
>>> On Sun, Nov 10, 2019, 8:58 AM Erick Erickson 
>>> wrote:
>>>
>>>> Personally I’d make them text files. The last thing I want to do is
>>>> make reading/updating these have a barrier to entry. We should save
>>>> formatting for the ref guide and/or Wiki.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> > On Nov 10, 2019, at 1:01 AM, Man with No Name <
>>>> pinkeshsharm...@gmail.com> wrote:
>>>> >
>>>> > Hey folks,
>>>> > I have been looking into the solr/lucene source code, and the first
>>>> thing that caught my eye was the different README files. All the files had
>>>> different file and text formats. What do you guys think about converting all
>>>> the READMEs to Markdown rather than text files, with a standard template?
>>>> >
>>>> >
>>>> > --
>>>> > Regards:
>>>> > Pinkesh Sharma
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>>
>
>
> --
> Regards:
> Pinkesh Sharma
>




Re: Commit / Code Review Policy

2019-12-03 Thread Doug Turnbull
n the Yetus or ZK model with a 72 hour timeout
> is a reasonable compromise, especially because a hard shift from CTR to RTC
> would need a corresponding culture shift that may not happen immediately.
>
> Mike
>
> On Mon, Dec 2, 2019 at 11:19 PM David Smiley 
> wrote:
>
>> https://cwiki.apache.org/confluence/display/LUCENE/Commit+Policy+-+DRAFT
>>
>> Updated:
>> * Suggested new title
>> * Emphasizing "Guidelines" instead of policy
>> * Defines lazy-consensus
>> * Added [PENDING DISCUSSION] to other topics for now
>>
>> Question:
>> * Are we agreeable to my definition of "minor"?
>> * Do we agree we don't even need a JIRA issue for "minor" things?
>> * Do we agree we don't even need a CHANGES.txt entry for "minor" things?
>> Of course it's ultimately up to the committer's discretion but I ask as a
>> general guideline.  If we can imagine some counter examples then they might
>> be good candidates to add to the doc.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Mon, Dec 2, 2019 at 10:15 PM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>>
>>> > Why should I ask for your review? It's not even your code thats
>>> running anymore, its the hackers code :)
>>>
>>> Haha! +1 on moving ahead with RCEs and other security issues without
>>> needing to wait for reviews. Waiting for reviews (esp. if no one has enough
>>> bandwidth for quick reviews) for such crucial issues can risk dragging
>>> those issues on and on needlessly. Reviews can happen after commit too, if
>>> people have the time.
>>>
>>> On Tue, 3 Dec, 2019, 6:51 AM Robert Muir,  wrote:
>>>
>>>>
>>>>
>>>> On Mon, Dec 2, 2019 at 3:33 PM David Smiley 
>>>> wrote:
>>>>
>>>>>
>>>>> Rob wrote:
>>>>>
>>>>>> Why should I wait weeks/months for some explicit review
>>>>>>
>>>>> Ask for a review, which as this document says is really just a LGTM
>>>>> threshold of approval, not even a real code review.  Given your reputation
>>>>> of writing quality code, this isn't going to be an issue for you.  If it's
>>>>> taking multiple weeks for anyone then we have a problem to fix -- and at
>>>>> present we do in Solr.  Explicitly encouraging mere approvals (as the
>>>>> document says) should help a little.  Establishing that we want this
>>>>> standard of conduct as this document says (even if not mandatory) will 
>>>>> also
>>>>> help -- "you scratch my back, I scratch yours".  But I think we should do
>>>>> even more...
>>>>>
>>>>>
>>>>  Why should I ask for your review? It's not even your code thats
>>>> running anymore, its the hackers code :)
>>>>
>>>>
>>>>



Re: Renaming SolrCloud

2019-10-14 Thread Doug Turnbull
I very much agree with normalizing on one mode of running Solr.

So long as the 'cluster mode' hello world is easier than having to think a
lot about ZooKeeper and other hard things. One reason people use standalone
mode is that it's as simple as "point '/bin/solr' at a config directory and
go". If there's just cluster mode, it should all be dead simple, to help
newbies play around with Solr without having to think that hard.

-Doug

On Mon, Oct 14, 2019 at 12:36 PM Houston Putman 
wrote:

> Jan,
>
> I agree strongly with your last point. And in case you haven't seen it
> before, there is a solr k8s operator, with a growing community, under
> development at https://github.com/bloomberg/solr-operator.
>
> I agree that taking control of the solr docker images could be a good
> idea. That way, it could have larger involvement from the community and
> grow more organically with changes in Solr itself.
>
> - Houston
>
> On Tue, Oct 8, 2019 at 8:25 PM Noble Paul  wrote:
>
>> Why even "cluster mode" or "cloud mode"?
>>
>> Solr, by default, should use the cluster mode. So in all our
>> documentation, we should use just "Solr" and it should refer to the
>> "cluster mode of Solr".
>>
>> Wherever we don't mean "cluster mode", it should be explicitly qualified
>> as "standalone Solr".
>>
>> On Wed, Oct 2, 2019 at 1:24 PM David Smiley 
>> wrote:
>> >
>> > I hear you and sympathize but "SolrCloud" has been used long enough
>> that I doubt the trouble is worth it.  I guess that makes me "+0".  That
>> said, I think it wouldn't hurt to formalize "standalone mode" as-such and
>> perhaps say more explicitly that SolrCloud == "cluster mode" even if we
>> don't eliminate SolrCloud terminology.
>> >
>> > And as SolrCloud ... errr... "cluster mode" I mean, gains in usage
>> relative to "standalone mode", perhaps we can reference SolrCloud less
>> often and sorta assume that and instead make exceptions in documentation to
>> standalone mode specifics where we call that out as such.  It's a loose
>> idea; I don't have an example in mind.
>> >
>> > Similar to the above notion, maybe "CloudSolrClient" could be more
>> invisible without renaming it.  Imagine SolrClient.createFromZooKeeper()
>> etc. static methods that instantiate CloudSolrClient by default.  Just a
>> thought.
>> >
>> > ~ David Smiley
>> > Apache Lucene/Solr Search Developer
>> > http://www.linkedin.com/in/davidwsmiley
>> >
>> >
>> > On Mon, Sep 30, 2019 at 11:19 AM Shawn Heisey 
>> wrote:
>> >>
>> >> On 9/30/2019 6:59 AM, Ishan Chattopadhyaya wrote:
>> >> > I propose that we rename SolrCloud mode to "cluster mode" such that
>> >> > there shall be "Apache Solr", running in either "standalone mode" or
>> >> > "cluster mode". We can effect this renaming 9.0 onwards, if we have
>> >> > consensus.
>> >> >
>> >> > I am open to any other proposal as well, so long as we drop the
>> "cloud"
>> >> > in the name.
>> >>
>> >> I see your point, but I think that "cloud" is so entrenched in the
>> >> overall consciousness of the software that changing it will not be
>> easy.
>> >>
>> >> Maybe it might be something we could accomplish slowly, over the rest
>> of
>> >> 8.0's lifetime and the entire 9.0 lifetime.  Begin changing the
>> >> terminology we use in communication, start shifting documentation and
>> >> code, with a hard cutover in a later major version, perhaps 10.0 or
>> 11.0.
>> >>
>> >> The level of effort involved would be considerable, whether it happens
>> >> quickly or slowly.  It might be the kind of thing we just don't want to
>> >> try and do.
>> >>
>> >> I'm not opposed to the idea, and I might even be able to help, but it's
>> >> going to need a lot of buy-in from those of us who work on Solr.
>> >>
>> >> Thanks,
>> >> Shawn
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>>
>> --
>> -
>> Noble Paul
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>



Re: Separate dev mailing list for automated mails?

2019-08-07 Thread Doug Turnbull
+1 - Just two days ago I created a filter to send [JENKINS] emails
elsewhere... I don't want to completely unsubscribe from Lucene development
emails, but the traffic here is a bit overwhelming and it's hard to see the
signal in the noise sometimes (high recall, low precision you might say!)

On Wed, Aug 7, 2019 at 5:27 PM Noble Paul  wrote:

> +1
>
> The mailing list is sending so many mails that it has become difficult to
> catch up.
>
> On Thu, Aug 8, 2019 at 12:26 AM Michael Sokolov 
> wrote:
> >
> > big +1 -- I'm also curious why the subject lines of many automated
> > emails (from Jira?) start with [CREATED] even though they are
> > generated by comments or other kinds of updates (not creating a new
> > issue). Overall, I think we have way too much comment spam. In
> > particular Github comments are so poorly formatted in email (at least
> > in gmail?) as to be almost unreadable - I think because they always
> > include the complete comment history. I wonder if there is a way to
> > neaten them up (especially the subject lines, so you can scan
> > quickly)?
> >
> > On Tue, Aug 6, 2019 at 7:17 PM Jan Høydahl 
> wrote:
> > >
> > > Hi
> > >
> > > The mail volume on dev@ is fairly high, between 2500-3500/month.
> > > To break down the numbers last month, see
> https://lists.apache.org/trends.html?dev@lucene.apache.org:lte=1M:
> > >
> > > Top 10 participants:
> > > -GitBox: 420 emails
> > > -ASF subversion and git services (JIRA): 351 emails
> > > -Apache Jenkins Server: 261 emails
> > > -Policeman Jenkins Server: 234 emails
> > > -Munendra S N (JIRA): 134 emails
> > > -Joel Bernstein (JIRA): 84 emails
> > > -Tomoko Uchida (JIRA): 77 emails
> > > -Jan Høydahl (JIRA): 52 emails
> > > -Andrzej Bialecki (JIRA): 47 emails
> > > -Adrien Grand (JIRA): 46 emails
> > >
> > > I have especially noticed how every single GitHub PR review comment
> triggers its own email instead of one email per review session.
> > > Also, every commit/push triggers an email since a bot adds a comment
> to JIRA for it.
> > >
> > > Personally I think the ratio of notifications vs human emails is a bit
> too high. I fear external devs who just want to follow the project may get
> overwhelmed and unsubscribe.
> > > One suggestion is therefore to add a new list where detailed JIRA
> comments and Github comments / reviews go. All committers should of course
> subscribe!
> > > I saw the Zookeeper project have a notifications@ list for GitHub
> comments and issues@ for JIRA comments (Except the first [Created] email
> for a JIRA will also go to dev@)
> > > The Maven project follows the same scheme and they also send Jenkins
> mails to the notifications@ list. The Cassandra project seems to divert
> all jira comments to the commits@ list.
> > > The HBase project keeps only [Created]/[Resolved] mails on dev@
> and all others from Jira/GH on the issues@ list and Jenkins mails on a
> separate builds@ list.
> > >
> > > Is it time we did something similar? I propose a single new
> notifications@ list for everything JIRA, GitHub and Jenkins but keep
> [Created|Resolved] mails on dev@
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > >
> > >
> > >
> >
> >
>
>
> --
> -
> Noble Paul
>
>
>



[jira] [Commented] (LUCENE-8841) Explore Relevance Based Performance Benchmarks

2019-06-08 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859165#comment-16859165
 ] 

Doug Turnbull commented on LUCENE-8841:
---

Big +1, though I suspect it would be very hard! This could be an Apache project 
in and of itself...

One challenge is that the range of use cases Lucene is used for is tremendously 
diverse: from job search, to e-commerce, to legal search, to enterprise search, 
to news search, to Web search, and everything in between and outside the box. 
You wouldn't want a situation, for example, where you only have an e-commerce 
test set, and so end up harming enterprise search users because of decisions 
made optimizing an e-commerce set. 

Another challenge is getting reliable relevance judgments. Teams go deep into 
developing their methodology for creating a golden set of judgments. This of 
course can be a very domain-specific and challenging problem. There's no 
obvious one-size-fits-all approach. Some teams use human judges, others 
crowdsource, others rely heavily on analytics. Some have access to conversion 
data, others don't. You have all sorts of biases to contend with in every 
situation. 
And the judgments evolve over time. (today's most relevant iPhone isn't the 
same as 2 years ago). So getting it right takes a lot of energy and time from 
mature search orgs. So what judgments/data you choose isn't clear if you want 
to cover a broad range of use cases.

I think the best case is to partner with some organizations that are willing to 
open up this data alongside their corpus. Where we could validate and feel good 
about the methodology they use in generating judgments. You'd need to update 
the relevance judgments and corpus over time. There are of course TREC and 
other academic datasets; that's one data point. Some folks I know at Wikipedia 
have 
talked about this. But you'd want some more commercial datasets (corpus + 
judgments).

But partnering with orgs would also have limits, as this stuff has very high 
value to companies... But perhaps they'd be incentivized to open up their 
data if Lucene was going to make decisions with it that helped them?!?
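To make the evaluation side concrete: a small, self-contained Python sketch 
(illustrative only — not Lucene code; the 0-3 grading scale, the 
unjudged-as-zero convention, and all doc ids are assumptions) of how a golden 
judgment set scores a ranking with NDCG:

```python
import math

def dcg(gains):
    """Discounted cumulative gain over a ranked list of graded judgments."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, judgments, k=10):
    """ranking: doc ids in retrieved order; judgments: doc id -> grade (0-3).
    Unjudged documents are treated as grade 0 (one common convention)."""
    gains = [judgments.get(doc, 0) for doc in ranking[:k]]
    ideal = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0

# A tiny "golden set" for one query, however the judgments were sourced
# (human judges, crowdsourcing, click models, conversions):
judged = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}
print(round(ndcg(["d2", "d1", "d5", "d4"], judged), 3))  # → 0.908
```

Tracking a number like this per query set across Lucene versions is the 
regression signal the issue asks for — the hard part remains sourcing and 
maintaining the judgments.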

 

> Explore Relevance Based Performance Benchmarks
> --
>
> Key: LUCENE-8841
> URL: https://issues.apache.org/jira/browse/LUCENE-8841
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While discussing improvements in relevance of fuzzy queries with [~jimczi], 
> the topic of how to measure impact of changes to relevance of common queries 
> came up. While a non-trivial effort, having such a benchmark will allow us to 
> measure the impact of potential changes and also catch regressions well in 
> time.
>  
> This Jira tracks ideas and efforts in that direction



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




Re: Vector based store and ANN

2019-03-02 Thread Doug Turnbull
>- Advanced document retrieval: Using a numerical vector representation
>of a document, we could improve the search result
>- Nearest neighbor queries: discovering the nearest neighbors to a
>given query could also benefit from these ANN algorithms (although doesn’t
>necessarily need the vector based index)
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
> Thanks,
>
> Pedram
>
> *From:* J. Delgado 
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) 
> *Subject:* Re: Vector based store and ANN
>
> Lucene’s scoring function (which I believe is okapi BM25
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
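(For readers following along, the nearest-neighbor framing can be made concrete 
with a brute-force sketch — plain Python, illustrative only, not a Lucene API; 
ANN structures such as HNSW approximate this exhaustive scan to avoid its 
linear cost. All vectors and ids below are made up.)

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn(query, docs, k=2):
    """Exact top-k by cosine over docs (id -> vector): the baseline an
    approximate nearest-neighbor index trades away for speed."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
print(knn([1.0, 0.05], vectors))  # → ['doc_a', 'doc_b']
```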
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand  wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
>  wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
>
> --
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley
>  | Book: http://www.solrenterprisesearchserver.com
>
>
>



Re: Vector based store and ANN

2019-03-02 Thread Doug Turnbull
or a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
>  wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
>
> --
>
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
>
> LinkedIn: http://linkedin.com/in/davidwsmiley
>  | Book: http://www.solrenterprisesearchserver.com
>
>
>




Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-28 Thread Doug Turnbull
I like that idea, Alan. The trick is that for QueryBuilder's 'newSynonymQuery'
to be useful in that context, you need to pass terms with metadata down to the
subclass. This is what I started working on a few weeks ago:

https://github.com/o19s/lucene-solr/commit/0fc3930671ef002cfbb5e3d52b6f8edc3715bf14

I don't think it's as simple as overriding analyzeBoolean/analyzeMultiBoolean
as Rob suggests, as there's also analyzeGraphBoolean, which would also need to
collect this metadata. I wouldn't want to copy-paste all this code into a
subclass just to add one token attribute.
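A sketch of the intent (plain Python with hypothetical names — Token, WEIGHTS,
and build_weighted_clauses are not Lucene APIs): if each stacked token carried
its type down to the query builder, a subclass could map types to boosts
instead of treating all stacked terms identically:

```python
from collections import namedtuple

# Hypothetical token-with-metadata shape; "type" plays the role of
# Lucene's TypeAttribute (ORIGINAL / SYNONYM / HYPONYM / ...).
Token = namedtuple("Token", ["term", "type"])

# Assumed boost policy — purely illustrative numbers:
WEIGHTS = {"ORIGINAL": 1.0, "SYNONYM": 0.8, "HYPONYM": 0.6}

def build_weighted_clauses(tokens):
    """Map each stacked token to a (term, boost) clause — the kind of
    decision newSynonymQuery could make if metadata were passed down."""
    return [(t.term, WEIGHTS.get(t.type, 1.0)) for t in tokens]

stacked = [Token("laptop", "ORIGINAL"), Token("notebook", "SYNONYM"),
           Token("macbook", "HYPONYM")]
print(build_weighted_clauses(stacked))
# → [('laptop', 1.0), ('notebook', 0.8), ('macbook', 0.6)]
```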

-Doug



On Wed, Nov 28, 2018 at 12:25 PM Alan Woodward  wrote:

> I think we can expose this information now with a small tweak to the
> SynonymGraphFilter, using the already-existing TypeAttribute.
>
> SGF is hard-coded to set the type attribute to “SYNONYM” on all tokens
> that it inserts into the stream.  It should be simple to add another
> constructor parameter allowing users to change this; then you can chain
> synonym filters, one for each type of expansion you want: synonym, hyponym,
> hypernym, whatever, each setting the type attribute differently.
>
> > On 28 Nov 2018, at 15:59, Michael Gibney 
> wrote:
> >
> > I think the objection to "boosting" in token filters isn't because it
> > is "too much", but rather because it breaks the abstraction of the
> > analysis chain to directly target scoring (as implied by
> > characterizing as "boosting").
> >
> > That said, I'm sympathetic to an approach that would establish an
> > Attribute to expose the kind of information that would be useful in
> > the context of synonyms (or other sorts of derived tokens discussed
> > here, where it could be useful to express information about token
> > derivation). Such an Attribute would not be directly related to
> > scoring/boosting, but would be related to analysis per se, (e.g.,
> > source token text, thesaurus, degree of confidence, etc.); support
> > could be selectively implemented by TokenFilters, and optionally
> > leveraged by query builders (e.g., translated to boosts) or even
> > recorded to index Payloads by a final custom analysis component 
> >
> > "You can look at any attribute on the tokenstream you want", "rely on
> > abstract attributes (type, ...) then it should be easy to sub-class
> > the query builder to access them".  Obviously that works iff analysis
> > components record the relevant information in attributes on the
> > tokenstream, which I think they currently don't (for much of the
> > information that has been discussed here) ... and I know of no
> > standard way to express the relevant information on the tokenstream.
> >
> > I can see that such an Attribute would be out of place (too
> > specialized) in the context of the Attributes in lucene/core; but
> > there are lots of more specialized Attributes in the various
> > submodules under lucene/analysis/* (SynonymGraphFilter lives in
> > analysis-common, FWIW). Again, this doesn't strike me as terribly
> > specialized, if one thinks of it more generally as a
> > "derivation/relationship" Attribute.
> >
> >
>
>
>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
There's a lot of different topics here and ideas, so we captured the use
cases we see being discussed here as in this google doc
https://docs.google.com/document/d/1w4G9bEICJ1aarr3l7OodwR5aecPkbFTISOgymErpZfQ/edit#heading=h.pszpx5dpxq7a

Basically, we've seen 5 high-level use cases discussed
- Alt Labels (what SynonymQuery does well now)
- Synonyms (looser synonyms with close meaning that need to be scored
somehow - `notebook,laptop`)
- Taxonomies (hierarchies of concepts/terms `dress shoes\oxfords`)
- Ontologies / Knowledge Graphs (networks of concepts)
- Embeddings (distributed representations of a term)

It's a doc in progress; embeddings need more work and are probably the
hardest thing on the list. There are possibly other use cases as well.

The goal isn't so much to make Lucene implement all of these (it would
create a lot of maintenance headaches to shove this all in), but some of it
is just defining practices / patterns / tools that enable these things in
Lucene-based search. Some may require no work, or some may require
supporting functionality.

-Doug

On Wed, Nov 21, 2018 at 9:23 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassable by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alt strategy (such as some of the scoring systems used in the
> query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass..
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or its a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation needs to really happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi  wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> > is we could implement additional QueryBuilder implementations that
>> provide
>> > such functionality?
>> >
>> > I am not sure, I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders
>> but I
>> > wonder if we can have a simple
>> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
>> could be
>> > a new attribute that token filters would use when they produce stacked
>> > tokens and that the QueryBuilder checks when he builds the
>> SynonymQuery. We
>> > already have a TermFrequencyAttribute to alter the frequency of a term
>> when
>> > indexing so we could have the same mechanism for query term boosting ?
>> >
>> > Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
>> > dturnb...@opensourceconnections.com> a écrit :
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option
>> >> would be to create alternate query builders. I think there are a couple
>> >> of enhancements to the base class that would be nice, such as
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> the
>> >> type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> plugins.
>> >> As we often are in scenarios at OpenSource Connections where people
>> won't
>> >> be able to use a plugin. In this case alternate expansions around
>> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>> have
>> >> using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is
>> >> we could implement additional QueryBuilder implementations that provide
>> >> such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> >> wrote:
>

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
I agree there is a tension between analysis and query parser
responsibilities (or external to how queries are constructed). I wonder
what you'd think of making QueryBuilder more easily subclassable by passing
more term metadata to newSynonymQuery (such as types etc). This would let
you select an alt strategy (such as some of the scoring systems used in the
query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
something with a term labeled a hyponym/hypernym in a QueryBuilder
subclass..

-Doug

On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:

> I don't think we should put scoring stuff into the analysis chain like
> this. It already has a laundry list of responsibilities.
>
> Analysis chain can tell you the term is stacked or its a certain type
> or occurs a certain number of times, but it shouldn't be supplying
> things such as floating point boosts. That kind of scoring
> manipulation needs to really happen in query parsing/somewhere else.
>
> On 11/20/18, jim ferenczi  wrote:
> > Sorry for the late reply,
> >
> >> So perhaps one way forward to contribute this sort of thing into Lucene
> > is we could implement additional QueryBuilder implementations that
> provide
> > such functionality?
> >
> > I am not sure, I mentioned Solr and ES because I thought it was about
> > adding taxonomies and complex expansion mechanisms to query builders but
> I
> > wonder if we can have a simple
> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
> be
> > a new attribute that token filters would use when they produce stacked
> > tokens and that the QueryBuilder checks when he builds the SynonymQuery.
> We
> > already have a TermFrequencyAttribute to alter the frequency of a term
> when
> > indexing so we could have the same mechanism for query term boosting ?
> >
> > Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
> > dturnb...@opensourceconnections.com> a écrit :
> >
> >> Thanks Jim
> >>
> >> Yeah, now that I think about it - I agree that perhaps the simplest
> >> option
> >> would be to create alternate query builders. I think there are a couple
> >> of enhancements to the base class that would be nice, such as
> >> - Some additional token attributes passed to newSynonymQuery, such as
> the
> >> type (was this a synonym or hyponym or something else...)
> >> - The ability to differentiate between the original query term and the
> >> generated synonym terms
> >> - Consistent support for phrases
> >>
> >> I think part of my goal too is to help people without the use of
> plugins.
> >> As we often are in scenarios at OpenSource Connections where people
> won't
> >> be able to use a plugin. In this case alternate expansions around
> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
> >> using Solr/Lucene/ES.
> >>
> >> So perhaps one way forward to contribute this sort of thing into Lucene
> >> is
> >> we could implement additional QueryBuilder implementations that provide
> >> such functionality?
> >>
> >> Thanks
> >> -Doug
> >>
> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
> >> wrote:
> >>
> >>> You can easily customize the query that is used for synonyms in a
> custom
> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
> >>> intended for subclasses that wish to customize the generated queries."
> so
> >>> I
> >>> don't think we need to do anything there. I agree that it is sometimes
> >>> better to use something different than the SynonymQuery but in the
> >>> general
> >>> case it works as expected and can be combined with other terms
> >>> naturally.
> >>> The kind of customization you want to achieve could be done in a plugin
> >>> (or
> >>> in Solr or ES) that extends the QueryBuilder, you can also use custom
> >>> token
> >>> filters and alter the query the way you want. My point here is that the
> >>> QueryBuilder should remain simple, you can add the complexity you want
> in
> >>> a
> >>> subclass.
> >>> However I think there is another area we need to fix, the scoring of
> >>> multi-terms synonyms is broken (compared to the SynonymQuery) and could
> >>> be
> >>> improved so we need something similar than the SynonymQuery that
> handles
> >>> multi phrases.
> >>>
> >>>
> >>

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
Alessandro, reading your post I realized I made a mistake: you'd need to go
both up and down the hierarchy when blending. When a user
searches for dress shoes, going down a level (or two) is just as important.
If a user searches for 'dress shoes' you also need hyponym terms.

This works out if you do an index time expansion (child terms get parent
terms injected) but doesn't work out if you want a 100% query time blending.

In this case, I think I would revise my blending idea to

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for the term 'wingtips' blended with all child terms
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

In this case, I don't *think* need any special weighting, as the true doc
freq of each concept recreates the priority ordering you guys came up with.
That's pretty neat!

-Doug

On Wed, Nov 21, 2018 at 7:20 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Great thoughts Jim - +1 to your idea
>
> One brainstorm I had is that taxonomies have a kind of 'ideal scoring' that I
> think would lead to a different blending strategy for taxonomies than
> synonyms.
>
> If you have a taxonomy:
>
> \shoes\dress_shoes\oxfords
> \shoes\dress_shoes\wingtips
> \shoes\lazy_shoes\loafers
> \shoes\lazy_shoes\sketchers
>
> This taxonomy states - if a document mentions 'oxfords', it's also
> discussing the concept of dress shoes. If it only mentions 'wingtips' it
> also is discussing dress shoes.
>
> Thus ideally, the true document frequency of the parent concept 'dress
> shoes' is the combination of the children. This is the number of documents
> that discuss this concept.
>
> You can repeat this for grandparent concepts. The number of documents with
> 'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
> sketchers, and the like...
>
> We have implemented this idea at index time, with index-time semantic
> expansion to inject the parent concepts. (manually put dress_shoes into
> documents that just mention wingtips). This is mentioned in this blog post
> https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
>  and
> conference talk https://www.youtube.com/watch?v=90F30PS-884 This is
> annoying and requires reindexing. Though it's the most accurate.
>
> BUT I think a blended query-time query would capture the same semantics.
> You basically want to score a taxonomy like the following. Imagine a user
> query of wingtips, you could imagine 3 should clauses that blend at
> different levels
>
> - Search for the term 'wingtips' (lowest doc freq, smallest set)
> - Search for parent & sibling concepts (the set of all dress shoes)
> - Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
> highest df)
>
> text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
> Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
> text:loafers, ...)
>
> Right now this can be accomplished by just issuing 3 SHOULD queries with 3
> different query-time analyzers each with different synonym expansions
> (exact user term, child => parent/sibling, child => parent, grandparent,
> etc...). And maybe it should stay that way.
>
> But this is why I think it's a 'yes AND', yes I think it would be a great
> addition to have synonym weighting. AND I think there are blending
> strategies that are specific to the use case.
>
> -Doug
>
>
>
> On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov 
> wrote:
>
>> This is a great idea. It would also be compelling to modify the term
>> frequency using this deboosting so that stacked indexed terms can be
>> weighted according to their closeness to the original term.
>>
>> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi >
> Sorry for the late reply,
>>>
>>> > So perhaps one way forward to contribute this sort of thing into
>>> Lucene is we could implement additional QueryBuilder implementations that
>>> provide such functionality?
>>>
>>> I am not sure, I mentioned Solr and ES because I thought it was about
>>> adding taxonomies and complex expansion mechanisms to query builders but I
>>> wonder if we can have a simple
>>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>>> be a new attribute that token filters would use when they produce stacked
>>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>>> already have a TermFrequencyAttribute to alter the frequency of a term when
>>> indexing so we could have the same mechani

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
Great thoughts Jim - +1 to your idea

One brainstorm I had is that taxonomies have a kind of 'ideal scoring' that I
think would lead to a different blending strategy for taxonomies than
synonyms.

If you have a taxonomy:

\shoes\dress_shoes\oxfords
\shoes\dress_shoes\wingtips
\shoes\lazy_shoes\loafers
\shoes\lazy_shoes\sketchers

This taxonomy states - if a document mentions 'oxfords', it's also
discussing the concept of dress shoes. If it only mentions 'wingtips' it
also is discussing dress shoes.

Thus ideally, the true document frequency of the parent concept 'dress
shoes' is the combination of the children. This is the number of documents
that discuss this concept.

You can repeat this for grandparent concepts. The number of documents with
'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
sketchers, and the like...

We have implemented this idea at index time, with index-time semantic
expansion to inject the parent concepts. (manually put dress_shoes into
documents that just mention wingtips). This is mentioned in this blog post
https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
and
conference talk https://www.youtube.com/watch?v=90F30PS-884 This is
annoying and requires reindexing. Though it's the most accurate.
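That index-time expansion can be sketched as follows (plain Python, not the
actual implementation; the taxonomy paths mirror the example above and the
document shape is hypothetical):

```python
# Map each leaf term to its ancestor concepts, per the taxonomy above:
TAXONOMY = {
    "oxfords":   ["shoes", "dress_shoes"],
    "wingtips":  ["shoes", "dress_shoes"],
    "loafers":   ["shoes", "lazy_shoes"],
    "sketchers": ["shoes", "lazy_shoes"],
}

def expand(doc_terms):
    """Terms to index: the originals plus every ancestor concept, so a
    parent concept's doc freq becomes the union of its children's."""
    expanded = set(doc_terms)
    for term in doc_terms:
        expanded.update(TAXONOMY.get(term, []))
    return expanded

print(sorted(expand({"wingtips"})))  # → ['dress_shoes', 'shoes', 'wingtips']
```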

BUT I think a blended query-time query would capture the same semantics.
You basically want to score a taxonomy like the following. Imagine a user
query of wingtips, you could imagine 3 should clauses that blend at
different levels

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
text:loafers, ...)

Right now this can be accomplished by just issuing 3 SHOULD queries with 3
different query-time analyzers each with different synonym expansions
(exact user term, child => parent/sibling, child => parent, grandparent,
etc...). And maybe it should stay that way.

But this is why I think it's a 'yes AND', yes I think it would be a great
addition to have synonym weighting. AND I think there are blending
strategies that are specific to the use case.

-Doug



On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov  wrote:

> This is a great idea. It would also be compelling to modify the term
> frequency using this deboosting so that stacked indexed terms can be
> weighted according to their closeness to the original term.
>
> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi wrote:
>> Sorry for the late reply,
>>
>> > So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> I am not sure, I mentioned Solr and ES because I thought it was about
>> adding taxonomies and complex expansion mechanisms to query builders but I
>> wonder if we can have a simple
>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>> be a new attribute that token filters would use when they produce stacked
>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>> already have a TermFrequencyAttribute to alter the frequency of a term when
>> indexing so we could have the same mechanism for query term boosting ?
>>
>> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
> Thanks Jim
>>>
>>> Yeah, now that I think about it - I agree that perhaps the simplest
>>> option would be to create alternate query builders. I think there's a couple
>>> of enhancements to the base class that would be nice, such as
>>> - Some additional token attributes passed to newSynonymQuery, such as
>>> the type (was this a synonym or hyponym or something else...)
>>> - The ability to differentiate between the original query term and the
>>> generated synonym terms
>>> - Consistent support for phrases
>>>
>>> I think part of my goal too is to help people without the use of
>>> plugins, as we are often in scenarios at OpenSource Connections where
>>> people won't be able to use a plugin. In this case alternate expansions
>>> around hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>>> have using Solr/Lucene/ES.
>>>
>>> So perhaps one way forward to contribute this sort of thing into Lucene
>>> is we could implement additional QueryBuilder implementations that provide
>>> such functionality?
>>>
>>> Thanks
>>> -Doug
>>>

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-17 Thread Doug Turnbull
Thanks Jim

Yeah, now that I think about it - I agree that perhaps the simplest option
would be to create alternate query builders. I think there's a couple of
enhancements to the base class that would be nice, such as
- Some additional token attributes passed to newSynonymQuery, such as the
type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the
generated synonym terms
- Consistent support for phrases

I think part of my goal too is to help people without the use of plugins,
as we are often in scenarios at OpenSource Connections where people won't
be able to use a plugin. In this case alternate expansions around
hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is
we could implement additional QueryBuilder implementations that provide
such functionality?

Thanks
-Doug

On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi  wrote:

> You can easily customize the query that is used for synonyms in a custom
> QueryBuilder. The javadocs of the *newSynonymQuery* say "This is
> intended for subclasses that wish to customize the generated queries." so I
> don't think we need to do anything there. I agree that it is sometimes
> better to use something different than the SynonymQuery but in the general
> case it works as expected and can be combined with other terms naturally.
> The kind of customization you want to achieve could be done in a plugin (or
> in Solr or ES) that extends the QueryBuilder, you can also use custom token
> filters and alter the query the way you want. My point here is that the
> QueryBuilder should remain simple, you can add the complexity you want in a
> subclass.
> However I think there is another area we need to fix: the scoring of
> multi-term synonyms is broken (compared to the SynonymQuery) and could be
> improved, so we need something similar to the SynonymQuery that handles
> multi-term phrases.
>
>
> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Yes that is another good area (there are many). Although of course
>> embeddings have their own challenges and complexities. (they often capture
>> shared context, but not shared meaning).
>>
>> It's a data point though of something we'd want to include in such a
>> framework, though not sure where it would go on the roadmap...
>>
>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
>> wrote:
>>
>>> What about the use of word embeddings (see
>>>
>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>> to compute word similarity?
>>>
>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I wanted to open up a discussion about a change to the usage of
>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>> can address other cases where related terms occupy the same position but
>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>> ambiguous terms, and other query expansion situations).
>>>>
>>>>
>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>> the pattern of clients jamming any related term into a synonyms file and
>>>> being surprised with odd results. I like the idea of enforcing "synonyms"
>>>> means exactly-the-same in Lucene-land. It's an easy thing to tell a client
>>>> and set up simple patterns. So for synonyms, I think leaving SynonymQuery in
>>>> place works great.
>>>>
>>>> But I feel if that's the rule, we need to open up discussion of other
>>>> methods of scoring conceptual 'related term' relationships that usually
>>>> come up in the context of query expansion. This paper (
>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>> surveys the current thinking for scoring various query expansion scenarios
>>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>>> systems (khakis aren't trousers, they're a kind-of trouser).
>>>>
>>>>
>>>> The cool thing is many of the ideas in this paper seem doable with
>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>> filter that injected some scoring based on how related it really is to
>>>> the original query term using Jaccard, Dice, or other methods called
>>>> out in this paper.

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread Doug Turnbull
Yes that is another good area (there are many). Although of course
embeddings have their own challenges and complexities. (they often capture
shared context, but not shared meaning).

It's a data point though of something we'd want to include in such a
framework, though not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
wrote:

> What about the use of word embeddings (see
>
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have) the
>> pattern of clients jamming any related term into a synonyms file and being
>> surprised with odd results. I like the idea of enforcing "synonyms" means
>> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
>> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
>> place works great.
>>
>> But I feel if that's the rule, we need to open up discussion of other
>> methods of scoring conceptual 'related term' relationships that usually
>> come up in the context of query expansion. This paper (
>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
>> the current thinking for scoring various query expansion scenarios like
>> those we deal with in the messy, ambiguous uses of synonyms in prod systems
>> (khakis aren't trousers, they're a kind-of trouser).
>>
>>
>> The cool thing is many of the ideas in this paper seem doable with
>> existing Lucene index stats. So one might imagine a 'related terms' token
>> filter that injected some scoring based on how related it really is to
>> the original query term using Jaccard, Dice, or other methods called out in
>> this paper.
>>
>>
>> Another insightful set of research is this article on concept scoring (
>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>> ), which prioritizes related terms by connectedness and other factors.
>>
>> Needless to say, it's an open area how two terms someone has asserted are
>> related to a query term 'should be' scored. It's one of those things that
>> likely will forever depend on a number of domain and application specific
>> factors. It's possibly a big opportunity of improvement for Lucene - but
>> likely is about putting the right framework in place to allow for good
>> default set of query-expansion scoring scenarios with options for
>> customization.
>>
>> What I'm proposing is:
>>
>>
>> - Submit a small patch that restricts SynonymQuery to tokens of type
>>   "SYNONYM" in the same posn, which allows some short-term work to be done
>>   with the current Lucene QueryBuilder. Any additional non-synonym terms
>>   would be appended as a boolean query for now.
>> - Begin work on alternate 'related-term' scoring systems that also key
>>   off the token type in QueryBuilder to create custom scoring using
>>   built-in term stats. The possibilities here are endless, up to weighted
>>   related terms (ie Alessandro's patch), feeding back Rocchio relevance
>>   feedback, etc.
>>
>>
>> I'm curious what folks would think of a patch for bullet one followed by
>> other patches down the road for additional functionality?
>>
>> (related to discussion in this Elasticsearch PR
>>
>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>> )
>>
>> --
>> CTO, OpenSource Connections
>> Author, Relevant Search
>> http://o19s.com/doug
>>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread Doug Turnbull
Hey folks,

I wanted to open up a discussion about a change to the usage of
SynonymQuery. The goal here is to have a broader library of queries that
can address other cases where related terms occupy the same position but
don't have the same meaning (such as hypernyms, hyponyms, meronyms,
ambiguous terms, and other query expansion situations).


I bring this up because we've noticed (as I'm sure many of you have) the
pattern of clients jamming any related term into a synonyms file and being
surprised with odd results. I like the idea of enforcing "synonyms" means
exactly-the-same in Lucene-land. It's an easy thing to tell a client and
setup simple patterns. So for synonyms, I think leaving SynonymQuery in
place works great.

But I feel if that's the rule, we need to open up discussion of other
methods of scoring conceptual 'related term' relationships that usually
come up in the context of query expansion. This paper (
https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
the current thinking for scoring various query expansion scenarios like
those we deal with in the messy, ambiguous uses of synonyms in prod systems
(khakis aren't trousers, they're a kind-of trouser).


The cool thing is many of the ideas in this paper seem doable with existing
Lucene index stats. So one might imagine a 'related terms' token filter
that injected some scoring based on how related it really is to the
original query term using Jaccard, Dice, or other methods called out in
this paper.
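A sketch of that idea: score how related a candidate term is to the original query term by the overlap of their posting sets (sets of matching doc ids), using the Jaccard and Dice coefficients the paper surveys. The postings here are made up for illustration.

```python
# Relatedness of an expansion term to the query term, measured as posting-set
# overlap. Higher overlap suggests the terms co-occur in the same documents
# and deserve a larger expansion weight. Toy postings, not real index stats.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

postings = {
    "khakis":   {1, 2, 3, 4},
    "trousers": {3, 4, 5, 6, 7, 8},
    "pants":    {2, 3, 4, 5, 6},
}

q = postings["khakis"]
for term in ("trousers", "pants"):
    print(term, round(jaccard(q, postings[term]), 3),
          round(dice(q, postings[term]), 3))
# 'pants' scores higher than 'trousers' on both measures, so it would get
# the larger weight as an expansion of 'khakis'.
```

In Lucene terms, the doc-id sets would come from index statistics (or be approximated from doc freq and co-occurrence counts), which is why these measures seem doable with what the index already stores.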


Another insightful set of research is this article on concept scoring (
https://usabilityetc.com/articles/information-retrieval-concept-matching/),
which prioritizes related terms by connectedness and other factors.

Needless to say, it's an open area how two terms someone has asserted are
related to a query term 'should be' scored. It's one of those things that
likely will forever depend on a number of domain and application specific
factors. It's possibly a big opportunity of improvement for Lucene - but
likely is about putting the right framework in place to allow for good
default set of query-expansion scoring scenarios with options for
customization.

What I'm proposing is:


   - Submit a small patch that restricts SynonymQuery to tokens of type
     "SYNONYM" in the same posn, which allows some short-term work to be done
     with the current Lucene QueryBuilder. Any additional non-synonym terms
     would be appended as a boolean query for now.
   - Begin work on alternate 'related-term' scoring systems that also key off
     the token type in QueryBuilder to create custom scoring using built-in
     term stats. The possibilities here are endless, up to weighted related
     terms (ie Alessandro's patch), feeding back Rocchio relevance feedback,
     etc.


I'm curious what folks would think of a patch for bullet one followed by
other patches down the road for additional functionality?

(related to discussion in this Elasticsearch PR

https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)

-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-15 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688424#comment-16688424
 ] 

Doug Turnbull commented on LUCENE-8563:
---

Ah... I assumed "Adrien has his performance hat on" which probably colored my 
perception of the issue

Ah yeah my mistake I see that now, I think your strategy makes sense now and 
helps with scoring comparability across queries. :+1: to your approach with the 
LegacyBM25 implementation then!

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted, and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-15 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688382#comment-16688382
 ] 

Doug Turnbull commented on LUCENE-8563:
---

Thanks [~jpountz] - My feeling is that if Lucene has something called "BM25 
Similarity" it should match the traditional definition of BM25, and 
shouldn't be deprecated. But if we want to create a faster version and make it 
the default, I think that would be great.

Or if you want to call the current (what you call legacy) 
"ClassicBM25Similarity" instead of legacy... 

I just don't feel it should be deprecated. As an IR person, I would be 
surprised if I was new to Lucene, looked up BM25 and it wasn't actually BM25...




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-14 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687268#comment-16687268
 ] 

Doug Turnbull commented on LUCENE-8563:
---

I feel perhaps one way forward is to create a second (default?) similarity - 
FastBM25Similarity? ConstantCeilingBM25Similarity? and leave in place the 
current BM25 similarity as an optional similarity to configure. There may be 
existing practices around tuning BM25 similarity at many places where writing a 
similarity plugin is not an option




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-12 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684091#comment-16684091
 ] 

Doug Turnbull commented on LUCENE-8563:
---

For the sake of this discussion, here's a desmos graph with BM25 with/without 
k1 in the numerator 

https://www.desmos.com/calculator/cklb27fcn9 
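To make the discussion concrete, here is a small sketch of the issue's formula with and without the (k1+1) factor (idf, k1, and norm values are made up; boost is taken as 1):

```python
# The two BM25 variants under discussion, following the issue's formula
# score = idf * (k1+1) * tf / (tf + norm), where norm already folds in k1
# and length normalization. For a single term the (k1+1) factor is a
# constant multiplier: ordering is unchanged, only the score scale differs.
def bm25(tf, idf, k1=1.2, norm=1.2, with_k1_plus_1=True):
    numerator = (k1 + 1) * tf if with_k1_plus_1 else tf
    return idf * numerator / (tf + norm)

tfs = [1, 2, 5, 10]
print([round(bm25(tf, idf=3.0), 3) for tf in tfs])
print([round(bm25(tf, idf=3.0, with_k1_plus_1=False), 3) for tf in tfs])
# Every score in the first list is exactly (k1+1) = 2.2x the corresponding
# score in the second; the ranking by tf is identical.
```

This is the sense in which dropping (k1+1) "doesn't modify ordering" for one field; the multi-field case below is where it gets interesting.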




[jira] [Comment Edited] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-12 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684080#comment-16684080
 ] 

Doug Turnbull edited comment on LUCENE-8563 at 11/12/18 5:01 PM:
-

It would modify ordering when dealing with multiple fields. Consider one field 
with a different k1 than another because the impact of term frequency is 
calibrated differently. If one calibrates one field to saturate term freq 
faster, and another slower, then ordering would be impacted


was (Author: softwaredoug):
It would modify ordering when dealing with multiple fields. Consider one field 
with a different k1 than another because the impact of term frequency is 
calibrated differently. If one calibrates one field to saturate term freq 
faster, and another slower, then ordering would be impacted

Additionally, currently k1=0 is the only way to disable term frequency without 
also disabling positions.




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-12 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684080#comment-16684080
 ] 

Doug Turnbull commented on LUCENE-8563:
---

It would modify ordering when dealing with multiple fields. Consider one field 
with a different k1 than another because the impact of term frequency is 
calibrated differently. If one calibrates one field to saturate term freq 
faster, and another slower, then ordering would be impacted

Additionally, currently k1=0 is the only way to disable term frequency without 
also disabling positions.
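A sketch of the ordering argument: once k1 differs per field, (k1+1) is no longer one global constant, so dropping it from the numerator can flip the order of documents that match in different fields. The field names, k1 values, and term frequencies below are made up, with idf and length norm held at 1 for clarity.

```python
# Per-field BM25-style score with and without the (k1+1) numerator factor.
def field_score(tf, k1, with_k1_plus_1):
    if tf == 0:
        return 0.0
    numerator = (k1 + 1) * tf if with_k1_plus_1 else tf
    return numerator / (tf + k1)

def doc_score(tf_by_field, k1_by_field, with_k1_plus_1):
    # Sum of per-field scores, as in a simple multi-field query.
    return sum(field_score(tf_by_field[f], k1, with_k1_plus_1)
               for f, k1 in k1_by_field.items())

k1s = {"title": 0.1, "body": 2.0}   # title calibrated to saturate tf quickly
doc_a = {"title": 5, "body": 0}     # matches only in title
doc_b = {"title": 0, "body": 5}     # matches only in body

print(doc_score(doc_a, k1s, True) > doc_score(doc_b, k1s, True))    # False
print(doc_score(doc_a, k1s, False) > doc_score(doc_b, k1s, False))  # True
```

With (k1+1) in place the body match wins; without it the title match wins, which is exactly the ordering change described in the comment.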




[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2018-11-10 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682427#comment-16682427
 ] 

Doug Turnbull commented on SOLR-12238:
--

What can we do to get this functionality into Solr? (My vote would be to make 
Alessandro a committer so he can stop bugging you guys :) )

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style :
> PICK_BEST_BOOST_BY_PAYLOAD -> build a Disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
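A sketch of parsing the payload-weighted synonym syntax shown above. The '|' delimiter and default boost of 1.0 are taken from the example; the parser itself is illustrative, not Solr's implementation.

```python
# Parse one payload-weighted synonym rule, e.g.
#   "tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9"
# Each right-hand term carries an optional boost after the '|' delimiter,
# defaulting to 1.0 when absent (as with "leopard" in the example above).
def parse_rule(line, delimiter="|"):
    lhs, rhs = (part.strip() for part in line.split("=>"))
    expansions = []
    for raw in rhs.split(","):
        term, _, boost = raw.strip().partition(delimiter)
        expansions.append((term, float(boost) if boost else 1.0))
    return lhs, expansions

rule = parse_rule("tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9")
print(rule)
# ('tiger', [('tiger', 1.0), ('Big_Cat', 0.8), ('Shere_Khan', 0.9)])
```

The parsed boosts are what PICK_BEST_BOOST_BY_PAYLOAD and AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD would apply to the expanded query clauses.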






[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2018-05-04 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463830#comment-16463830
 ] 

Doug Turnbull commented on SOLR-12238:
--

Just want to say I've been watching this feature and 

+1 - great feature!

Exactly the kind of thing I was hoping to see after much of [~ehatcher]'s great 
payload work :)

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style:
> PICK_BEST_BOOST_BY_PAYLOAD -> build a Disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
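As a rough illustration of the two modalities, here is a small Python sketch of how payload-boosted synonym clauses might combine for a single document. The functions, term scores, and document are invented for illustration; this is not Lucene's actual scoring code.

```python
# Hypothetical sketch of the two payload-boost modalities; clause scores
# and payload weights are invented for illustration.

def pick_best_boost_by_payload(term_scores, payloads):
    # Disjunction (dismax): only the best payload-boosted clause counts.
    return max(score * payloads.get(term, 1.0)
               for term, score in term_scores.items())

def as_distinct_terms_boost_by_payload(term_scores, payloads):
    # Boolean query: payload-boosted clause scores stack.
    return sum(score * payloads.get(term, 1.0)
               for term, score in term_scores.items())

# Expansion from the synonym.txt example above:
# tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
payloads = {"tiger": 1.0, "Big_Cat": 0.8, "Shere_Khan": 0.9}
term_scores = {"tiger": 2.0, "Big_Cat": 3.0}  # raw clause scores for one doc

best = pick_best_boost_by_payload(term_scores, payloads)
stacked = as_distinct_terms_boost_by_payload(term_scores, payloads)
assert abs(best - 2.4) < 1e-9     # Big_Cat clause wins: 3.0 * 0.8
assert abs(stacked - 4.4) < 1e-9  # 2.0 * 1.0 + 3.0 * 0.8
```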



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (LUCENE-7996) Should we require positive scores?

2017-12-05 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278893#comment-16278893
 ] 

Doug Turnbull commented on LUCENE-7996:
---

Just FYI for upstream impact, LTR models tend to output negative scores. For 
example Ranklib gradient boosting models range from -100 to 100. Of course this 
can be changed by always adding 100 to the score, but there's appeal in seeing 
the expected score from an LTR query being identical to the score you'd get 
from the model if you ran it outside of Solr/Elasticsearch.
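The "adding 100" workaround above can be sketched as a simple score shift. The lower bound is an assumption for illustration (Ranklib gradient boosting models are described above as ranging roughly -100 to 100); ranking order is preserved, but, as noted, the scores no longer match the raw model output.

```python
# Sketch of the shift workaround: offset model scores into positive
# territory. MODEL_MIN is an assumed lower bound, not a Ranklib constant.

MODEL_MIN = -100.0  # assumed lower bound of the model's score range

def to_positive(model_score):
    return model_score - MODEL_MIN

doc_scores = {"d1": -42.5, "d2": 13.0, "d3": -100.0}
shifted = {doc: to_positive(s) for doc, s in doc_scores.items()}

assert all(s >= 0 for s in shifted.values())
# The relative ordering of documents is unchanged:
assert sorted(doc_scores, key=doc_scores.get) == sorted(shifted, key=shifted.get)
```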

> Should we require positive scores?
> --
>
> Key: LUCENE-7996
> URL: https://issues.apache.org/jira/browse/LUCENE-7996
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7996.patch, LUCENE-7996.patch, LUCENE-7996.patch
>
>
> Having worked on MAXSCORE recently, things would be simpler if we required 
> that scores are positive. Practically, this would mean 
>  - forbidding/fixing similarities that may produce negative scores (we have 
> some of them)
>  - forbidding things like negative boosts
> So I'd be curious to have opinions on whether this would be a sane requirement 
> or whether we need to be able to cope with negative scores, e.g. because some 
> similarities that we want to support produce negative scores by design.






[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-12-04 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277462#comment-16277462
 ] 

Doug Turnbull commented on SOLR-11662:
--

Thanks for helping with the change David!

I would probably personally do something like that. However, I tend to 
restructure most synonyms into a taxonomy. Many people aren't aware of 
hypernymy/hyponymy. It's not uncommon to see, at an e-commerce client for 
example, a synonym entry like `pants,khakis` with another line that's 
`pants,jeans`, which of course creates an unintentional equivalence between 
jeans and khakis. Even when these are mixed in with true synonyms, I tend to 
restructure the whole thing as a taxonomy.

Some people avoid this at query time by expanding the query and relying on 
the "as_distinct_terms" behavior, which biases toward exact matches:

pants => jeans,pants,khakis
jeans => jeans,pants
khakis => jeans,khakis

A search for pants here shows a mix of different kinds of pants (khakis and 
jeans roughly equal).
A search for jeans puts jeans first (low doc freq), followed by various kinds 
of pants (high doc freq).
A search for khakis puts khakis first, followed by various kinds of non-jean 
pants.

I tend to think of synonyms as hyponyms of a canonical name for an idea. For 
jeans, for example, I might expand to:

blue_jeans => blue_jeans,jeans,pants
denim_jeans => denim_jeans,jeans,pants

With multiple analyzer chains, I might recommend controlling how loose the 
search is with different analyzer chains. For example, one could see forcing a 
strong boost for conceptually similar items. Or limiting the semantic expansion 
so that blue_jeans, for example, only expands up to the jeans level.

There's quite a lot of "it depends". The example above presupposes that pants 
have a higher doc freq than jeans, which may not be the case without a similar 
index-time expansion.
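The doc-freq intuition above can be sketched in a few lines. All numbers here are invented for illustration; real Lucene idf and term scoring are more involved.

```python
# Toy sketch: with "as_distinct_terms"-style query expansion, rarer (more
# specific) terms get higher idf, so exact matches rank first.
import math

N = 10_000  # total docs in the toy index
doc_freq = {"pants": 1000, "jeans": 200, "khakis": 150}  # invented counts

def idf(term):
    return math.log(N / doc_freq[term])

def score(doc_terms, expanded_query):
    # Boolean OR over the expanded terms; matching clauses stack.
    return sum(idf(t) for t in expanded_query if t in doc_terms)

jeans_query = ["jeans", "pants"]  # the expansion jeans => jeans,pants
jeans_doc = {"jeans", "pants"}
khaki_doc = {"khakis", "pants"}

# A search for jeans puts jeans-documents first (low doc freq wins):
assert score(jeans_doc, jeans_query) > score(khaki_doc, jeans_query)
```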


> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
>Assignee: David Smiley
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType name="text_general" scoreOverlaps="pick_best" class="solr.TextField" positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; fits most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from underlying terms blended.
> *pick_best*
> For a given document, score using the best scoring synonym (ie dismax over 
> generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, so that term scores reflect each term's 
> specificity.
> IE this query time expansion
> tabby => tabby, cat, animal
> Searching "text", generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship, but lets 
> scores stack, so documents with more of tabby, cat, or animal score better, 
> with a bias toward the term with the highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> IE this query time expansion
> tabby => tabby, cat, animal
> Searching "text", generates the boolean query (text:tabby text:cat 
> text:animal)
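As a hedged toy sketch (invented numbers, not Lucene's actual formulas), the three scoreOverlaps settings can be contrasted by how they combine per-clause scores for one document, given the expansion tabby => tabby, cat, animal:

```python
# Toy contrast of the three scoreOverlaps modes; clause scores are invented.

clause_scores = {"tabby": 3.0, "cat": 1.2, "animal": 0.4}

# pick_best: dismax over generated terms; the single best clause wins.
pick_best = max(clause_scores.values())

# as_distinct_terms: boolean OR; clause scores stack, biased toward the
# highest-scoring (most specific) term.
as_distinct_terms = sum(clause_scores.values())

# as_one_term (SynonymQuery): all clauses score as one pseudo-term with a
# blended document frequency, so no single synonym dominates; modeled here
# as an average purely for illustration.
as_one_term = sum(clause_scores.values()) / len(clause_scores)

assert pick_best == 3.0
assert as_one_term < pick_best < as_distinct_terms
```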






[jira] [Commented] (SOLR-11698) Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, etc)

2017-12-02 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275741#comment-16275741
 ] 

Doug Turnbull commented on SOLR-11698:
--

I'm considering adding query time config to field aliases for this 
functionality. It builds on an existing feature, and seems to be the least 
error-prone to implement as edismax's query parser is already alias aware. This 
seems to be simpler than adding a whole new "config" idea.

As an example, to override autoGeneratePhraseQueries for a field "text" one 
would write

{code}
qf=text text_autophrase^10&f.text_autophrase.qf=text&f.text_autophrase.autoGeneratePhraseQueries=true
{code}

Similarly, if we had a query-overridable field type setting for analyzer we 
could write

{code}
qf=text text_synonym_autophrase^10
&f.text_synonym_autophrase.qf=text
&f.text_synonym_autophrase.autoGeneratePhraseQueries=true
&f.text_synonym_autophrase.queryAnalyzer=with_synonyms
{code}
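The alias mechanism above amounts to per-alias parameter lookup with a fallback to the fieldType default. Here is an illustrative sketch (a hypothetical helper, not Solr source code) of resolving `f.<alias>.<param>`-style overrides:

```python
# Hypothetical resolver for per-alias query-time overrides of the form
# f.<alias>.<param>, falling back to a fieldType default when absent.

def resolve(params, alias, name, default):
    return params.get("f.%s.%s" % (alias, name), default)

params = {
    "qf": "text text_autophrase^10",
    "f.text_autophrase.qf": "text",
    "f.text_autophrase.autoGeneratePhraseQueries": "true",
}

assert resolve(params, "text_autophrase",
               "autoGeneratePhraseQueries", "false") == "true"
# No override for the plain "text" field, so the default applies:
assert resolve(params, "text", "autoGeneratePhraseQueries", "false") == "false"
```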

> Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, 
> etc)
> --
>
> Key: SOLR-11698
> URL: https://issues.apache.org/jira/browse/SOLR-11698
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
>
> This is an issue wrt to [this email 
> chain](http://lucene.472066.n3.nabble.com/Multiple-Query-Time-Analyzers-in-Solr-td4364540.html)
>  created to discuss the ability to change the query time analyzer in Solr, 
> with input from [~solrtrey], [~dsmiley], [~steve_rowe], and [~mkhludnev]
> Specifically, we ended up with the following
> _
> it seems like there's some consensus around
> - Creating multiple named analyzers per field
> - Referencing those analyzers by name at query time somehow
> I would advocate for refactoring edismax (or making a new query parser) that 
> would allow you to specify per-field query configuration. Then I would 
> advocate refactoring some of the flags autoGeneratePhraseQueries, etc to this 
> query-time config. Then we could follow suit using the same syntax to specify 
> the analyzer to use at query time.
> Perhaps more generally these configuration items can stay on the fieldType, 
> but a syntax could allow them to be overridden per field at query time?
> Finally, another requirement I would add would be the ability to specify the 
> same field twice in qf, but configured to be queried two different ways. 
> Perhaps a syntax like qf=title:config1 title:config2? Where config1 and 
> config2 modify fieldType query flags? Like 
> fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms
> This sort of thing would in my opinion help both enhance the power of Solr, 
> but with a more consistent vision around how field-specific query settings 
> could be organized
> _






Re: Solr Ref Guide not building

2017-12-02 Thread Doug Turnbull
For building HTML, the ref guide specifies the following. Here's what I
have with corresponding versions

** Prerequisites: `Ruby` (v2.1 or higher) and the following gems must be
installed:
doug@wiz$~/ws/lucene-solr/solr/solr-ref-guide(mas) $ ruby --version
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]


*** `jekyll`: v3.5, not v4.x. Use `gem install --force --version 3.5.0
jekyll` to force install of Jekyll 3.5.0.
*** `jekyll-asciidoc`: v2.1 or higher. Use `gem install jekyll-asciidoc` to
install.
doug@wiz$~/ws/lucene-solr/solr/solr-ref-guide(mas) $ gem list | grep jekyll
jekyll (3.5.0, 3.0.3)
jekyll-asciidoc (2.1.0)

*** `pygments.rb`: v1.1.2 or higher. Use `gem install pygments.rb` to
install.
doug@wiz$~/ws/lucene-solr/solr/solr-ref-guide(mas) $ gem list | grep
pygments
pygments.rb (1.2.0)

Then I follow the instructions in "Building the Guide"

== Building the Guide
For details on building the ref guide, see `ant -p`.

There are currently four available targets:

* `ant default`: builds both the PDF and HTML versions of the Solr Ref
Guide.
* `ant build-site`: builds only the HTML version.

And run "ant build-site" and get the output from my original email

Thanks,
-Doug


On Sat, Dec 2, 2017 at 1:16 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> I'm facing some errors building the ref-guide as well:
>
> -build-raw-pdf:
> [asciidoctor:convert] Render SolrRefGuide-all.adoc from
> /home/ishan/code/lucene-solr/solr/build/solr-ref-guide/content/pdf to
> /home/ishan/code/lucene-solr/solr/build/solr-ref-guide/pdf-tmp with
> backend=pdf
> [asciidoctor:convert] asciidoctor: ERROR: about-this-guide.adoc: line 1:
> invalid part, must have at least one section (e.g., chapter, appendix, etc.)
> [asciidoctor:convert] asciidoctor: ERROR: solr-glossary.adoc: line 1:
> invalid part, must have at least one section (e.g., chapter, appendix, etc.)
>
>
> --
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Re: Solr Ref Guide not building

2017-12-02 Thread Doug Turnbull
Thanks.

I don’t mind poking around the ref guide config, but I'm following the
readme and building master. I'm hesitant to change config files as part of
my PR, which probably will break the ref guide build for others :)

Best
-Doug


On Sat, Dec 2, 2017 at 7:37 AM Martin Gainty <mgai...@hotmail.com> wrote:

>
> MG>see below
>
>
> --
> *From:* Doug Turnbull <dturnb...@opensourceconnections.com>
> *Sent:* Friday, December 1, 2017 9:17 PM
> *To:* dev@lucene.apache.org
> *Subject:* Solr Ref Guide not building
>
> Hello!
>
> I'm trying to update the Solr Ref guide with my change for SOLR-11662. I
> believe I've installed the required dependencies and double checked the
> README in the solr-ref-guide. Unfortunately, running ant build-site I
> immediately get this error, seemingly on the first adoc file encountered:
>
>
>  [exec] jekyll 3.5.0 | Error:  No header received back.
>  [exec]   Conversion error: Jekyll::AsciiDoc::Converter encountered an
> error while converting 'about-filters.adoc':
> MG>possible ant build snafu
> MG>configuration file:
> /Users/doug/ws/lucene-solr/solr/build/solr-ref-guide/content/_config.yml
> MG>Deprecation: The 'gems' configuration option has been renamed to
> 'plugins'.
> MG>Please update your config file accordingly.
> MG>check output config.yml for 'gems' and sub in 'plugins'
> MG>also check ant input solr-ref-guide/config.yml.template for 'gems'
> instead of 'plugins'
>
> I feel like I must be doing something stupid (I'll assume user error on my
> part). But if there's anything obvious I'm doing wrong, please let me know
>
> A more complete log can be found here
> https://gist.github.com/softwaredoug/36fe87f0d63403e7be22d5a2ff8af073
>
>
> Thanks for any help
>
> -Doug
> --
> Consultant, OpenSource Connections. Contact info at
> http://o19s.com/about-us/doug-turnbull/; Free/Busy (
> http://bit.ly/dougs_cal)
>
> --
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Solr Ref Guide not building

2017-12-01 Thread Doug Turnbull
Hello!

I'm trying to update the Solr Ref guide with my change for SOLR-11662. I
believe I've installed the required dependencies and double checked the
README in the solr-ref-guide. Unfortunately, running ant build-site I
immediately get this error, seemingly on the first adoc file encountered:


 [exec] jekyll 3.5.0 | Error:  No header received back.
 [exec]   Conversion error: Jekyll::AsciiDoc::Converter encountered an
error while converting 'about-filters.adoc':

I feel like I must be doing something stupid (I'll assume user error on my
part). But if there's anything obvious I'm doing wrong, please let me know

A more complete log can be found here
https://gist.github.com/softwaredoug/36fe87f0d63403e7be22d5a2ff8af073

Thanks for any help

-Doug
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


[jira] [Updated] (SOLR-11698) Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, etc)

2017-11-28 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11698:
-
Summary: Query-time per-field query settings (ie analyzers, 
autoGeneratePhraseQueries, etc)  (was: Query-time fieldType query settings (ie 
analyzers, autoGeneratePhraseQueries, etc))

> Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, 
> etc)
> --
>
> Key: SOLR-11698
> URL: https://issues.apache.org/jira/browse/SOLR-11698
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Doug Turnbull
>
> This is an issue wrt to [this email 
> chain](http://lucene.472066.n3.nabble.com/Multiple-Query-Time-Analyzers-in-Solr-td4364540.html)
>  created to discuss the ability to change the query time analyzer in Solr, 
> with input from [~solrtrey], [~dsmiley], [~steve_rowe], and [~mkhludnev]
> Specifically, we ended up with the following
> _
> it seems like there's some consensus around
> - Creating multiple named analyzers per field
> - Referencing those analyzers by name at query time somehow
> I would advocate for refactoring edismax (or making a new query parser) that 
> would allow you to specify per-field query configuration. Then I would 
> advocate refactoring some of the flags autoGeneratePhraseQueries, etc to this 
> query-time config. Then we could follow suit using the same syntax to specify 
> the analyzer to use at query time.
> Perhaps more generally these configuration items can stay on the fieldType, 
> but a syntax could allow them to be overridden per field at query time?
> Finally, another requirement I would add would be the ability to specify the 
> same field twice in qf, but configured to be queried two different ways. 
> Perhaps a syntax like qf=title:config1 title:config2? Where config1 and 
> config2 modify fieldType query flags? Like 
> fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms
> This sort of thing would in my opinion help both enhance the power of Solr, 
> but with a more consistent vision around how field-specific query settings 
> could be organized
> _






[jira] [Created] (SOLR-11698) Query-time fieldType query settings (ie analyzers, autoGeneratePhraseQueries, etc)

2017-11-28 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-11698:


 Summary: Query-time fieldType query settings (ie analyzers, 
autoGeneratePhraseQueries, etc)
 Key: SOLR-11698
 URL: https://issues.apache.org/jira/browse/SOLR-11698
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Doug Turnbull


This is an issue wrt to [this email 
chain](http://lucene.472066.n3.nabble.com/Multiple-Query-Time-Analyzers-in-Solr-td4364540.html)
 created to discuss the ability to change the query time analyzer in Solr, with 
input from [~solrtrey], [~dsmiley], [~steve_rowe], and [~mkhludnev]

Specifically, we ended up with the following
_
it seems like there's some consensus around

- Creating multiple named analyzers per field
- Referencing those analyzers by name at query time somehow

I would advocate for refactoring edismax (or making a new query parser) that 
would allow you to specify per-field query configuration. Then I would advocate 
refactoring some of the flags autoGeneratePhraseQueries, etc to this query-time 
config. Then we could follow suit using the same syntax to specify the analyzer 
to use at query time.

Perhaps more generally these configuration items can stay on the fieldType, but 
a syntax could allow them to be overridden per field at query time?

Finally, another requirement I would add would be the ability to specify the 
same field twice in qf, but configured to be queried two different ways. 
Perhaps a syntax like qf=title:config1 title:config2? Where config1 and 
config2 modify fieldType query flags? Like 
fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms

This sort of thing would in my opinion help both enhance the power of Solr, but 
with a more consistent vision around how field-specific query settings could be 
organized
_






Re: [GitHub] lucene-solr pull request #275: SOLR-11662: Configurable query when terms ove...

2017-11-27 Thread Doug Turnbull
Thanks for the correction, you're quite correct. And if we move forward
with more query time config, we can reuse the same syntax.

On Mon, Nov 27, 2017 at 10:27 PM dsmiley <g...@git.apache.org> wrote:

> Github user dsmiley commented on a diff in the pull request:
>
> https://github.com/apache/lucene-solr/pull/275#discussion_r153387514
>
> --- Diff: solr/core/src/java/org/apache/solr/schema/FieldType.java ---
> @@ -905,6 +905,7 @@ protected void checkSupportsDocValues() {
>protected static final String ENABLE_GRAPH_QUERIES =
> "enableGraphQueries";
>private static final String ARGS = "args";
>private static final String POSITION_INCREMENT_GAP =
> "positionIncrementGap";
> +  protected static final String SCORE_OVERLAPS = "scoreOverlaps";
> --- End diff --
>
> I need to correct you on one point: Solr has had a syntax for
> per-field query parameters for a long time.  The syntax is
> `f.fieldName.parameterName`  e.g. `f.title.hl.snippets`   SolrJ's
> SolrParams has convenience methods for this on the implementation side.
> Perhaps you overlooked this because most users only use it in the context
> of faceting parameters, even though it's certainly not unique to faceting
> (as in the example above for highlighting).  I'm not aware of any query
> parser that uses it yet but they certainly could.
>
> Anyway, I suppose even if we agree we'd like some query time
> customizability of this (and other settings), it would still be nice to
> establish a default fallback on the FieldType.
>
>
> ---
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
> --
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Re: Multiple Query-Time Analyzers in Solr

2017-11-27 Thread Doug Turnbull
Thanks Steve, Trey, David, and Mikhail

Lots of great ideas, it seems like there's some consensus around

- Creating multiple named analyzers per field
- Referencing those analyzers by name at query time somehow

I would advocate for refactoring edismax (or making a new query parser)
that would allow you to specify per-field query configuration. Then I would
advocate refactoring some of the flags autoGeneratePhraseQueries, etc to
this query-time config. Then we could follow suit using the same syntax to
specify the analyzer to use at query time.

Perhaps more generally these configuration items can stay on the fieldType,
but a syntax could allow them to be overridden per field at query time?

Finally, another requirement I would add would be the ability to specify
the same field twice in qf, but configured to be queried two different
ways. Perhaps a syntax like qf=title:config1 title:config2? Where config1
and config2 modify fieldType query flags? Like
fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms

This sort of thing would in my opinion help both enhance the power of Solr,
but with a more consistent vision around how field-specific query settings
could be organized

Best
-Doug

On Fri, Nov 24, 2017 at 3:25 PM Steve Rowe <sar...@gmail.com> wrote:

> Somewhat orthogonal here, but I’ve long thought that it would be useful to
> introduce named analyzers that could be referenced by name from potentially
> multiple field types.
>
> --
> Steve
> www.lucidworks.com
>
> > On Nov 24, 2017, at 10:17 AM, David Smiley <david.w.smi...@gmail.com>
> wrote:
> >
> > Doug,
> >
> > I think it would be wonderful if a FieldType had N analyzer chains
> instead of exactly 3 (index, query, multiTerm).  Each chain could simply
> have a name.  The query parser could be configured to pick a particular
> chain by name.
> >
> > I worked on a search project that had like a half dozen query analyzers,
> which were also machine generated in code on the custom FieldType.  The
> query parser, also custom, could then communicate with the FieldType to get
> the particular analyzer that was appropriate for the use.
> >
> > It's annoying (hard to maintain) to see repeated chains that are
> slightly different.  I've wondered if it would be more maintainable to have
> one chain, with some qualifier on each element to say to which named chains
> it applies to (if not all)?  I dunno; trade-offs, trade-offs.
> >
> > ~ David
> >

Re: Multiple Query-Time Analyzers in Solr

2017-11-23 Thread Doug Turnbull
An alternate solution could be to create a fieldType that was a
"FacadeTextField" that searches a real TextField field with a different
query time analyzer. IE it would not have a physical representation in the
index, but just provide a handle to a "field" that is searched with a
different query time analyzer.

For example, actor_nosyn is really a facade for searching "actor" with a
different analyzer


  


  




...




...
...


This would allow edismax and other query parsers to remain unchanged
searching, ie:

q=action movies&qf=actor actor_nosyn title text&defType=edismax



On Thu, Nov 23, 2017 at 10:50 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I wonder if there's been any thought by the community to refactoring
> fieldTypes to allow multiple query-time analyzers per indexed field?
> Currently, to get different query-time analysis behavior you have to
> duplicate a field. This is unfortunate duplication if, for example, I want
> to search a field with query time synonyms on/off. For higher scale search
> cases, allowing multiple query time analyzers against a single index field
> can be invaluable. It's one reason I created the Match Query Parser (
> https://github.com/o19s/match-query-parser) and a major feature of
> hon-lucene-synonyms (https://github.com/healthonnet/hon-lucene-synonyms )
>
> What I would propose is the ability to place multiple analyzers under a
> field type. For example:
>
> <fieldType name="text_general" class="solr.TextField">
> <analyzer type="query" name="with_synonyms">...</analyzer>
> <analyzer type="query" name="no_synonyms">...</analyzer>
> <analyzer type="index">...</analyzer>
> </fieldType>
>
> Notice how one query-time analyzer is "default" (and including only one
> would make it the default)
>
> This would require allowing query parsers pass the analyzer to use at
> query time. I would propose introduce a syntax for configuring query
> behavior per-field in edismax. Omitting this would continue to use the
> default behavior/analyzer.
>
> For example, one could query title and text as usual:
>
q=action movies&qf=actor title text&defType=edismax
>
> I would propose introducing a syntax whereby qf could refer to a kind of
> psuedo field, configurable with a syntax similar to per-field facet settings
>
> For example, below "actor_nosyn" and "actor_syn" actually search the same
> physical field, but are configured with different analyzers
>
> q=action movies&qf=actor_syn actor_nosyn^10 title
> text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms
>
> Indeed, I would propose extending this syntax to control some of the
> query-specific properties that currently are tied to the fieldType, such as
>
> q=action movies&qf=actor_syn actor_nosyn^10 title
> text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms&f.actor_syn.autoGeneratePhraseQueries=false
>
> I think this could be a pretty powerful syntax, but would require
> refactoring of the field type and edismax (and possibly other query
> parsers) quite a bit
>
> Any thoughts?
>
> Best
> -Doug
> --
> Consultant, OpenSource Connections. Contact info at
> http://o19s.com/about-us/doug-turnbull/; Free/Busy (
> http://bit.ly/dougs_cal)
>
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Multiple Query-Time Analyzers in Solr

2017-11-23 Thread Doug Turnbull
I wonder if there's been any thought by the community to refactoring
fieldTypes to allow multiple query-time analyzers per indexed field?
Currently, to get different query-time analysis behavior you have to
duplicate a field. This is unfortunate duplication if, for example, I want
to search a field with query time synonyms on/off. For higher scale search
cases, allowing multiple query time analyzers against a single index field
can be invaluable. It's one reason I created the Match Query Parser (
https://github.com/o19s/match-query-parser) and a major feature of
hon-lucene-synonyms (https://github.com/healthonnet/hon-lucene-synonyms )

What I would propose is the ability to place multiple analyzers under a
field type. For example:

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="query" name="with_synonyms">...</analyzer>
  <analyzer type="query" name="default">...</analyzer>
  <analyzer type="index">...</analyzer>
</fieldType>

Notice how one query-time analyzer is "default" (and including only one
would make it the default)

This would require allowing query parsers to pass the analyzer to use at query
time. I would propose introducing a syntax for configuring query behavior
per-field in edismax. Omitting this would continue to use the default
behavior/analyzer.

For example, one could query title and text as usual:

q=action movies&qf=actor title text&defType=edismax

I would propose introducing a syntax whereby qf could refer to a kind of
pseudo-field, configurable with a syntax similar to per-field facet settings.

For example, below "actor_nosyn" and "actor_syn" actually search the same
physical field, but are configured with different analyzers

q=action movies&qf=actor_syn actor_nosyn^10 title
text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms

Indeed, I would propose extending this syntax to control some of the
query-specific properties that currently are tied to the fieldType, such as

q=action movies&qf=actor_syn actor_nosyn^10 title
text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms&f.actor_syn.autoGeneratePhraseQueries=false

I think this could be a pretty powerful syntax, but it would require
refactoring the field type and edismax (and possibly other query parsers)
quite a bit.
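For illustration, here is a rough sketch (plain Python, not Solr code; the parameter names come from the examples above, and the grouping behavior is an assumption about how such facet-style `f.<pseudoField>.<prop>` params would be collected) of parsing the proposed per-pseudo-field settings out of a query string:

```python
from urllib.parse import parse_qsl

def pseudo_field_config(query_string):
    """Group f.<pseudoField>.<prop>=<value> params (facet-style) by pseudo-field."""
    config = {}
    for key, value in parse_qsl(query_string):
        # Only keys shaped like f.<pseudoField>.<prop> are per-field settings
        if key.startswith("f.") and key.count(".") >= 2:
            _, pseudo, prop = key.split(".", 2)
            config.setdefault(pseudo, {})[prop] = value
    return config

params = ("q=action+movies&qf=actor_syn+actor_nosyn%5E10+title+text&defType=edismax"
          "&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms"
          "&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms")
print(pseudo_field_config(params))
# {'actor_nosyn': {'field': 'actor', 'analyzer': 'without_synonyms'},
#  'actor_syn': {'field': 'actor', 'analyzer': 'with_synonyms'}}
```

The query parser could then look up each pseudo-field's `field` and `analyzer` before analyzing the user's query text against that physical field.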

Any thoughts?

Best
-Doug
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


[jira] [Comment Edited] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263377#comment-16263377
 ] 

Doug Turnbull edited comment on SOLR-11662 at 11/22/17 9:31 PM:


PR updated with the code at the Solr level; the patch can be viewed here: 
https://github.com/apache/lucene-solr/pull/275.patch


was (Author: softwaredoug):
PR updated, patch can be viewed here 
https://github.com/apache/lucene-solr/pull/275.patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)
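To make the modes above concrete, a small illustrative sketch (plain Python, not Solr code; the per-term scores are invented, and as_one_term is omitted because SynonymQuery scores the expansion as a single pseudo-term rather than combining per-term scores):

```python
def combine_overlap_scores(term_scores, mode):
    """Combine scores of terms occupying the same position, per scoreOverlaps mode."""
    if mode == "pick_best":          # dismax: the best single synonym wins
        return max(term_scores.values())
    if mode == "as_distinct_terms":  # boolean OR: scores stack
        return sum(term_scores.values())
    raise ValueError(f"unknown mode: {mode}")

# Invented scores for the expansion tabby => tabby, cat, animal in one document
scores = {"tabby": 3.0, "cat": 1.5, "animal": 0.5}
print(combine_overlap_scores(scores, "pick_best"))          # 3.0
print(combine_overlap_scores(scores, "as_distinct_terms"))  # 5.0
```

Note how as_distinct_terms rewards a document matching several of the expanded terms, while pick_best only credits the strongest one.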



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263377#comment-16263377
 ] 

Doug Turnbull commented on SOLR-11662:
--

PR updated, patch can be viewed here 
https://github.com/apache/lucene-solr/pull/275.patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262567#comment-16262567
 ] 

Doug Turnbull commented on SOLR-11662:
--

Great! And that would actually let me submit an ES patch in parallel... I'll 
update my PR/patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262468#comment-16262468
 ] 

Doug Turnbull commented on SOLR-11662:
--

Thanks Adrien! Yes, it could be moved to SolrQueryParser. This would narrow the 
scope to just Solr, however. I would like to see this capability in 
Elasticsearch as well. Though that could be handled differently.

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Updated] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11662:
-
Description: 
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)


  was:
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)



> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOv

[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261549#comment-16261549
 ] 

Doug Turnbull commented on SOLR-11662:
--

Associated pull request https://github.com/apache/lucene-solr/pull/275/files
And Patch 
https://patch-diff.githubusercontent.com/raw/apache/lucene-solr/pull/275.patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Updated] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11662:
-
Summary: Make overlapping query term scoring configurable per field type  
(was: More than SynonymQuery: Let overlapping query terms model 
hypernym/hyponym relationships)

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Updated] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11662:
-
Description: 
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)


  was:
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)



> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOv

[jira] [Created] (SOLR-11662) More than SynonymQuery: Let overlapping query terms model hypernym/hyponym relationships

2017-11-21 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-11662:


 Summary: More than SynonymQuery: Let overlapping query terms model 
hypernym/hyponym relationships
 Key: SOLR-11662
 URL: https://issues.apache.org/jira/browse/SOLR-11662
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Doug Turnbull
 Fix For: 7.2, master (8.0)


This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)
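A small numeric sketch of why the modes matter for hypernym/hyponym expansions (plain Python; the document frequencies are invented, and the simplified BM25-style idf and max-df blending are illustrative assumptions, not Lucene's exact formulas):

```python
import math

def idf(df, n_docs):
    # Simplified BM25-style inverse document frequency
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

N = 100_000
dfs = {"tabby": 50, "cat": 5_000, "animal": 20_000}  # invented frequencies

# as_one_term: every synonym shares one blended statistic (here blended as
# the max df, a simplification), so "tabby" and "animal" weigh the same
blended_idf = idf(max(dfs.values()), N)

# pick_best / as_distinct_terms: each term keeps its own idf, so the rare,
# specific hyponym ("tabby") outweighs the broad hypernym ("animal")
per_term_idf = {t: idf(df, N) for t, df in dfs.items()}

print(blended_idf)
print(per_term_idf)
```

This is the core of the proposal: when the expansion models specificity rather than true synonymy, keeping per-term statistics lets the hyponym score higher.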







Re: Looking for development docs.

2017-04-26 Thread Doug Turnbull
Something I found helpful was to go back to very early Lucene versions.
That lets you see the essential functionality in relatively
straightforward Java code. You can get a sense for how Lucene is
structured. Functionality has been built around this since. The Java has
been battle tested, refactored, and optimized. But those core bits were
really helpful for me to see what Lucene specifically did.

https://sourceforge.net/projects/lucene/

That plus Lucene in Action
On Wed, Apr 26, 2017 at 7:16 PM Erick Erickson 
wrote:

> Solr/Lucene is big. Really big. I'd think seriously about taking
> something you're interested in/know about, finding a JIRA that you'd
> like to work on and diving in. Plus there aren't very many
> architecture docs.
>
> Your characterization of the realms of responsibility is pretty accurate.
>
> Have you seen: https://wiki.apache.org/solr/HowToContribute?
>
> A somewhat painful but "safe" way to get your feet wet is to look at
> the coverage reports on jenkins and see what code is not tested in the
> junit tests and...write a test. At least I think the coverage reports
> are still there.
>
> Best,
> Erick
>
> On Wed, Apr 26, 2017 at 3:12 PM, David Lee 
> wrote:
> > I'd like to have a better understanding of how much of Solr is unique to
> it
> > versus directly extending Lucene.
> >
> > For example, I assume that sharding, replication, etc. is implemented in
> > Solr where-as indexing, querying, etc. would be implemented by Lucene.
> >
> > I'm hoping to learn enough to be able to contribute at some point.
> >
> > Thanks,
> >
> > David
> >
> >
> >


Re: Change Default Response Format (wt) to JSON in Solr 7.0?

2017-04-14 Thread Doug Turnbull
Sounds great. I agree!

I can imagine there might be really old client libraries/integrations that
assume XML without sending a wt, but I think it's ok to break those sorts
of things in a major release. And those folks can learn to send wt=xml

-Doug

On Fri, Apr 14, 2017 at 2:53 PM Trey Grainger  wrote:

> Just wanted to throw this out there for discussion. Solr's default query
> response format is still XML, despite the fact that Solr has supported the
> JSON response format for over a decade, developer mindshare has clearly
> shifted toward JSON over the years, and most modern/competing systems also
> use JSON format now by default.
>
> In fact, Solr's admin UI even explicitly adds wt=json to the request (by
> default in the UI) to override the default of wt=xml, so Solr's Admin UI
> effectively has a different default than the API.
>
> We have now introduced things like the JSON faceting API, and the new more
> modern /V2 apis assume JSON for the areas of Solr they cover, so clearly
> we're moving in the direction of JSON anyway.
>
> I'd like propose that we switch the default response writer to JSON
> (wt=json) instead of XML for Solr 7.0, as this seems to me like the right
> direction and a good time to make this change with the next major version.
>
> Before I create a JIRA and submit a patch, though, I wanted to check here
> make sure there were no strong objections to changing the default.
>
> -Trey Grainger
>


Re: Search Engine question

2017-03-21 Thread Doug Turnbull
Definitely start with Solr unless you have some specialized use case.
Lucene skills can come up in a Solr context (ie if you wanted to write
plugins)

I would also recommend:
- Solr in Action
- Lucene in Action (out of date, but many concepts still valid)
- Apache Solr Ref Guide (
https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
)
- Solr Start (http://www.solr-start.com/)
- Relevant Search (I wrote this book, email me directly for a discount code)

Slightly shameless plug: I give anyone a free hour of my time for
consulting, so hit me up and I'd be happy to walk you through some
basics/ideas on getting started
http://opensourceconnections.com/blog/2016/08/01/search-for-lunch/

Best
-Doug

On Tue, Mar 21, 2017 at 8:04 PM Bina N Shah  wrote:

> Good Afternoon,
>
>
>
> My name is Bina Shah and I work for University of New Mexico Hospitals,
> non-profit organization.
>
>
>
> We are considering ways to implement Search Engine for our static intranet
> pages. In the second phase, implement search engine for our dynamic web
> applications. I noticed on your web site, there are two different Search
> projects:  Apache Lucene Core and Apache Solr.  I need your guidance as to
> where to start, search engine demo video, and which would be the
> appropriate Search project?
>
>
>
> Thank you in advance for your time and looking forward to hearing from you.
>
>
>
> Thank you,
>
>
>
> Bina Shah
>
> Web Analyst
>
> UNM Hospitals
>
> bns...@salud.unm.edu
>
> (505) 925-4795
>
>
>
>
>


Re: Developer's Guide

2017-03-03 Thread Doug Turnbull
As an aside, I'm pretty sure that if anyone wanted to write a new edition of
Lucene in Action, and they're masochistic enough to write a book for a top-tier
tech book publisher, I'd be happy to introduce them to someone at
Manning :)

And Lucene in Action is a very good read; it will help you get the big ideas,
even if the examples are outdated

-Doug

On Fri, Mar 3, 2017 at 11:23 AM David Smiley 
wrote:

> Hi,
> There is no developer's guide.  There are Javadocs, and there's an
> outdated book (the concepts are still good; it's the details
> that have changed).
> ~ David
>
> On Fri, Mar 3, 2017 at 11:16 AM Nilesh Kamani 
> wrote:
>
> Could anybody please help me with this ?
>
> On Wed, Mar 1, 2017 at 9:22 AM, Nilesh Kamani 
> wrote:
>
> Hello All,
>
> Are there any Developer's Guide to understand various packages and classes
> and their role ?
> I am looking to modify boolean AND search to meet some specific criteria.
>
> Thanks,
> Nilesh Kamani
>
>
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


[jira] [Commented] (SOLR-9418) Probabilistic-Query-Parser RequestHandler

2016-10-20 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591921#comment-15591921
 ] 

Doug Turnbull commented on SOLR-9418:
-

Looking at your patch (I'm not a committer just curious about the patch). A few 
things jump out in a shallow reading that would probably need to change for 
this to be accepted:

- Field names and thresholds likely need to be configurable, as most folks 
won't necessarily have a field named exactly "title" or "content." 
- Can this be a qparser plugin instead of a request handler? It's likely I'd 
want to use it alongside other qparsers and SearchComponents (like highlighting 
or facets).
- Can you provide some documentation on how the thresholds work/can be 
configured?

> Probabilistic-Query-Parser RequestHandler
> -
>
> Key: SOLR-9418
> URL: https://issues.apache.org/jira/browse/SOLR-9418
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Akash Mehta
> Attachments: SOLR-9418.zip
>
>
> The main aim of this requestHandler is to get the best parsing for a given 
> query. This basically means recognizing different phrases within the query. 
> We need some kind of training data to generate these phrases. The way this 
> project works is:
> 1.)Generate all possible parsings for the given query
> 2.)For each possible parsing, a naive-bayes like score is calculated.
> 3.)The main scoring is done by going through all the documents in the 
> training set and finding the probability of a bunch of words occurring together 
> as a phrase as compared to them occurring randomly in the same document. Then 
> the score is normalized. Some higher importance is given to the title field 
> as compared to content field which is configurable.
> 4.)Finally after scoring each of the possible parsing, the one with the 
> highest score is returned.
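Step 1 of the quoted algorithm (generating all possible parsings) amounts to enumerating every segmentation of the query into contiguous phrases; a minimal sketch under that assumption (names illustrative, not from the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

public class QueryParsings {
    // Enumerate every way to split the query terms into contiguous phrases.
    // Each parsing would then get a naive-bayes-like score (not shown here)
    // and the highest-scoring parsing wins.
    static List<List<String>> parsings(List<String> terms) {
        List<List<String>> results = new ArrayList<>();
        if (terms.isEmpty()) {
            results.add(new ArrayList<>());
            return results;
        }
        for (int split = 1; split <= terms.size(); split++) {
            String phrase = String.join(" ", terms.subList(0, split));
            for (List<String> rest : parsings(terms.subList(split, terms.size()))) {
                List<String> parsing = new ArrayList<>();
                parsing.add(phrase);
                parsing.addAll(rest);
                results.add(parsing);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // "new york pizza" yields 2^(3-1) = 4 parsings:
        // [new, york, pizza], [new, york pizza], [new york, pizza], [new york pizza]
        System.out.println(parsings(List.of("new", "york", "pizza")).size()); // 4
    }
}
```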






[jira] [Commented] (LUCENE-7436) MinHashFilter has package-local constructor and constants

2016-09-06 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467891#comment-15467891
 ] 

Doug Turnbull commented on LUCENE-7436:
---

Fix is here https://github.com/apache/lucene-solr/pull/78

> MinHashFilter has package-local constructor and constants
> -
>
> Key: LUCENE-7436
> URL: https://issues.apache.org/jira/browse/LUCENE-7436
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.2
>Reporter: Doug Turnbull
>Priority: Minor
>
> Trying to use the MinHashFilter outside of Lucene/Solr. Was it intentional 
> that the constructor and useful defaults are package-private? Seems like an 
> oversight to me, correct me if I'm wrong.






[jira] [Created] (LUCENE-7436) MinHashFilter has package-local constructor and constants

2016-09-06 Thread Doug Turnbull (JIRA)
Doug Turnbull created LUCENE-7436:
-

 Summary: MinHashFilter has package-local constructor and constants
 Key: LUCENE-7436
 URL: https://issues.apache.org/jira/browse/LUCENE-7436
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 6.2
Reporter: Doug Turnbull
Priority: Minor


Trying to use the MinHashFilter outside of Lucene/Solr. Was it intentional that 
the constructor and useful defaults are package-private? Seems like an 
oversight to me, correct me if I'm wrong.






Re: Word stop list in examples (was Re: Default stop word list)

2016-09-04 Thread Doug Turnbull
I see it more as a performance tweak than a relevance thing. Matches on
stopwords introduce the potential for many more documents to be scored.

Large collections should usually have a high min-should-match, so queries will
more than likely contain at least one or two non-stopwords that dramatically
limit the docs that will be scored. And since large collections are where
people have stopword perf problems, this tends to obviate the performance
gains of removing stopwords.

On Sun, Sep 4, 2016 at 12:08 PM Erick Erickson 
wrote:

> Wouldn't most frequent term serve?
>
> On Sep 4, 2016 08:52, "Alexandre Rafalovitch"  wrote:
>
>> On 4 September 2016 at 22:23, Walter Underwood 
>> wrote:
>> > If you do want to use stopwords, I’d index without them, then look at
>> the
>> > words with the lowest IDF to make the list.
>>
>> That's an interesting approach. Is there an easy way to do that (in Solr?)
>>
>> Regards,
>>Alex.
>>
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>>


Re: Lucene or Apache Solr : Project Decision making

2016-08-30 Thread Doug Turnbull
Hi Archit, I would make a strong argument for using Solr unless you have
some exotic requirements.

- Solr has distributed indexing and search built in; building your own
distributed system is non-trivial, just ask Mark Miller :)
- Solr comes prebaked with an HTTP API for non search experts to interact
with.
- For hiring, it's more likely you'll find a Solr expert than a Lucene
expert
- Custom capabilities can be handled by Solr plugins that specialize bits
and pieces of Solr to your needs
- You can pretty easily proxy Solr for security, with anything from a dumb
nginx proxy to a tad bit of custom code

I might consider using just Lucene if the consumers of my library don't
realize there's "search" under the hood
- I really just want a Java library that does search-like operations under
the hood, but the consumers of my code don't care about search.
- I'm doing something data-sciency with Lucene, my problem doesn't resemble
search, and I want direct control (ie classification, etc).

(note Elasticsearch would have similar capabilities and pros/cons vs Solr,
but the Solr vs ES is a whole 'nother conversation and I don't want to
hijack your thread)

-Doug



On Tue, Aug 30, 2016 at 1:24 PM Alexandre Rafalovitch 
wrote:

> SolrCloud uses ZooKeeper. As to the rest, Solr is open source. It may be
> more efficient stripping out whatever you don't want than reinventing it on
> top of Lucene again.
>
> Regards,
> Alex
>
> On 31 Aug 2016 12:17 AM, "archit mehta"  wrote:
>
>> Hi,
>>
>> We need to take decision whether to go for lucene or solr. There are few
>> points which I would like to mention.
>>
>> 1. If we use lucene we do not have to worry about security as it is
>> already taken care of, but we need to build our own distributed indexer and
>> searcher; if we use solr then we don't have to worry about the distributed
>> indexer and searcher, but as it is another process we have to put some
>> security controls in place.
>>
>> In our case getting permission for solr is a bit difficult; lucene is
>> already in production (without the distribution stuff)
>>
>> 2. Does solr use kafka or zookeeper or other third party libraries? Can I
>> get a list from somewhere?
>> Server is heavily loaded, new process and running kafka/zookeeper is also
>> an overhead for us.
>> With current implementation we removed kafka and wrote some of our own
>> code.
>>
>> How easy or difficult is it to build a distributed indexer and searcher
>> with core lucene?
>>
>> Kindly share your views based on the point I have mentioned here. In case
>> any more clarification require write me back.
>>
>>
>> Regards,
>> Archit
>>
>>


Re: Proposal to Move Solr Ref Guide off Confluence

2016-08-18 Thread Doug Turnbull
Is there anyway to maintain inbound links to confluence pages with the new
system? I'm just thinking about all the user group questions, stackoverflow
Qs, and the like that link to cwiki pages.

Is it possible to setup the right redirects for cwiki pages into the new
system?

Doug
On Thu, Aug 18, 2016 at 7:30 PM Chris Hostetter 
wrote:

>
> : First, I'm not about to second-guess this. I wouldn't like to lose the
> : ability to download a full doc to search offline, but it looks like
> : this solution allows that since there is a PDF version after all.
>
> I also like being able to officially "release" the guide, and doing so via
> PDF will still be possible.
>
> But the other nice thing is that this will make it easy to
> maintain "branches" of the ref guide in git, and publish those with
> releases as well -- so you can edit the docs on master, and backport the
> docs to the branch_6x at the same you backport the feature, and we can
> publish HTML versions of the guide right along side the javadoc docs for
> each version of solr.
>
> : As you know, every time I try to edit the CWiki I come whimpering to
> : you or Hoss. Sounds like this solution will reduce the volume of my
> : whimpering, which is a good thing. I so loathe Confluence that I find
>
> Ideally yes -- a lot of the problems we have with confluence today stem
> from the "WYSI-kind-of-WYG" mentality of its editor, and the fact that it
> sometimes preserves html styling you can't see until the PDF is published
> (especially when you copy/paste).  Most of that pain should go away
> because the adoc files will be plain text.  (Any markup language has its
> share of "wait, how do I get formatting XYZ?" but being plain text
> files in git will make it a lot easier to spot mistakes in diffs -- as
> opposed to confluence with its "here's a historical diff that is also in
> rendered HTML, so good luck noticing that there is an extra span with a
> css class that affects the PDF but isn't mentioned in the web stylesheet"
>
> : I downloaded AsciidocFX and it looks quite usable. There may be better
> : tools out there but that was fast to find and I could work with it. I
> : see a Chrome extension, IntelliJ plugin etc. so it looks like there
> : are a variety of ways to go about all this.
>
> yeah -- just like java IDE/editor choices can be very personal,
> people will also be free to choose any tooling they want for editing
> asciidoc files -- which is another nice win over the web based confluence
> editor.  The trick will be having good automation in place to build the
> HTML & PDF output formats from the source documents, and give helpful
> feedback/errors about any weirdness that we can detect in scripts.  I plan
> on working to help cassandra with the "ongoing automation" when i get back
> from vacation in a few weeks.
>
> (at the moment, I'm spending my last few days before vacation trying to
> better automate the confluence->(clean)asciidoc conversion so
> cassandra can iterate faster on demos of the full guide)
>
> : > If reaction is positive, my next step will be to expand the demo
> : > online with a full copy of the Ref Guide instead of the current small
> : > set.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
>
>


[jira] [Commented] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413994#comment-15413994
 ] 

Doug Turnbull commented on SOLR-9395:
-

Hmm, that won't work -- never mind, as you'd do stats over a relevance score :-/ You 
probably need some way of passing up the exists value and/or declaring 
something as non-existent. I'll have to think on it some more

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Comment Edited] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413459#comment-15413459
 ] 

Doug Turnbull edited comment on SOLR-9395 at 8/9/16 12:41 PM:
--

Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 

{quote}
stats.field=\{!func\}query($someRangeFilter) 
{quote}

I hadn't tried that, but I wonder if it would work. I'll have to try it and 
report back...


was (Author: softwaredoug):
Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 

{quote}
stats.field={!func}query($someRangeFilter) 
{quote}

I hadn't tried that, but I wonder if it would work. I'll have to try it and 
report back...

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Comment Edited] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413459#comment-15413459
 ] 

Doug Turnbull edited comment on SOLR-9395 at 8/9/16 12:41 PM:
--

Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 

{quote}
stats.field={!func}query($someRangeFilter) 
{quote}

I hadn't tried that, but I wonder if it would work. I'll have to try it and 
report back...


was (Author: softwaredoug):
Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 
{{stats.field={!func}query($someRangeFilter)}}. I hadn't tried that, but I 
wonder if it would work. I'll have to try it and report back...

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Commented] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413459#comment-15413459
 ] 

Doug Turnbull commented on SOLR-9395:
-

Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 
{{stats.field={!func}query($someRangeFilter)}}. I hadn't tried that, but I 
wonder if it would work. I'll have to try it and report back...

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Created] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-08 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-9395:
---

 Summary: Add ceil/floor bounding to stats calculations
 Key: SOLR-9395
 URL: https://issues.apache.org/jira/browse/SOLR-9395
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: master (7.0)
Reporter: Doug Turnbull
 Fix For: master (7.0)


In the pull request to be attached we add optional ceil and floor parameters to 
a field being computed via the stats component. This bounds the stats 
calculations to ceil to floor inclusive.

For example, let's say you're searching over all the employees.

stats=true&stats.field=employee_age

But you want to focus on employees aged 18-60 for whatever reason. You can 
reissue this query as

stats=true&stats.field={!floor=18 ceil=60}employee_age

This limits the resulting stats calculations to 18-60 inclusive. This 
functionality also works on date fields (see test in PR).

Now one question might be, why not do this with a filter query? In many cases 
you don't necessarily want to filter these documents from the main search 
results. You just want to eliminate outliers from a specific stats calculation. 
For example, you search your employee database for "clerks." You still want to 
see all the clerks, even little 16 year old Timmy. But for this particular 
calculation you just want to focus on folks of traditional working age for 
whatever reason.

Some notes
- floor/ceil are only supported as local params.
- works for date and numeric values
- date math works!
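The proposed semantics can be sketched in plain Java (boundedMean is an illustrative name, not from the patch): values outside [floor, ceil] are excluded from the stats calculation while the documents themselves stay in the result set:

```java
import java.util.Arrays;

public class BoundedStats {
    // Compute a mean over only the values inside [floor, ceil], mirroring
    // the proposed {!floor=... ceil=...} bounding for stats.field.
    static double boundedMean(double[] values, double floor, double ceil) {
        return Arrays.stream(values)
                     .filter(v -> v >= floor && v <= ceil)
                     .average()
                     .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] employeeAges = {16, 25, 40, 55, 72};
        // Only 25, 40 and 55 fall inside [18, 60], so 16-year-old Timmy
        // and the 72-year-old are excluded from the calculation:
        System.out.println(boundedMean(employeeAges, 18, 60)); // 40.0
    }
}
```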







[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-28 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397636#comment-15397636
 ] 

Doug Turnbull commented on SOLR-9279:
-

+1

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
> Attachments: SOLR-9279.patch
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling
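The proposed semantics are easy to sketch outside Solr. Below is a toy Java model (not Solr code; the names `gt`, `lt`, and `iff` are invented stand-ins for the function queries, with `iff` used because `if` is reserved in Java) showing how the comparisons return 1/0 and how `if()` then treats any non-zero condition as true:

```java
// Toy model of the proposed comparison function queries (not Solr code).
// gt/lt return 1 when the comparison holds and 0 otherwise; if() treats
// any non-zero condition value as true.
class FunctionQueryModel {
    static double gt(double lhs, double rhs) { return lhs > rhs ? 1 : 0; }
    static double lt(double lhs, double rhs) { return lhs < rhs ? 1 : 0; }

    // "iff" stands in for Solr's if(); 'if' is a reserved word in Java.
    static double iff(double cond, double thenVal, double elseVal) {
        return cond != 0 ? thenVal : elseVal;
    }

    public static void main(String[] args) {
        double msOfDateField = 100;     // hypothetical ms(mydatefield) value
        double threshold = 315569259747.0;
        // Equivalent of if(lt(ms(mydatefield),315569259747),0.8,1)
        double boost = iff(lt(msOfDateField, threshold), 0.8, 1.0);
        System.out.println(boost);      // prints 0.8
    }
}
```

The model makes the readability win concrete: the comparison is a named function rather than an arithmetic trick with min/sub.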






Re: [jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-27 Thread Doug Turnbull
Great!

+1
On Wed, Jul 27, 2016 at 3:26 PM David Smiley (JIRA) <j...@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396205#comment-15396205
> ]
>
> David Smiley commented on SOLR-9279:
> 
>
> Sure -- trivial enough.  Unless there are further suggestions on this
> issue, I'll commit it with that change later this week.  I'll update Lucene
> & Solr's CHANGES.txt since both get something here.
>
> > Add greater than, less than, etc in Solr function queries
> > -
> >
> > Key: SOLR-9279
> > URL: https://issues.apache.org/jira/browse/SOLR-9279
> > Project: Solr
> >  Issue Type: New Feature
> >  Security Level: Public(Default Security Level. Issues are Public)
> >  Components: search
> >Reporter: Doug Turnbull
> > Fix For: master (7.0)
> >
> > Attachments: SOLR-9279.patch
> >
> >
> > If you use the "if" function query, you'll often expect to be able to
> use greater than/less than functions. For example, you might want to boost
> books written in the past 7 years. Unfortunately, there's no "greater than"
> function query that will return non-zero when the lhs > rhs. Instead to get
> this, you need to create really awkward function queries like I do here (
> http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/
> ):
> > if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> > The pull request attached to this Jira adds the following function
> queries
> > (https://github.com/apache/lucene-solr/pull/49)
> > -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> > -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> > -gte
> > -lte
> > -eq
> > So instead of
> > if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> > one could now write
> > if(lt(ms(mydatefield),315569259747),0.8,1)
> > (if mydatefield < 315569259747 then 0.8 else 1)
> > A bit more readable and less puzzling
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-27 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396118#comment-15396118
 ] 

Doug Turnbull commented on SOLR-9279:
-

Looks great [~dsmiley]! Definitely a big improvement. Appreciate your 
attention, I've learned a lot through this issue.

What do you think about adding an objectValue override as suggested by 
[~hossman]?

{code:java}
 @Override
  public Object objectVal(int doc) {
return exists(doc) ? boolVal(doc) : null;
  }
{code}
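For readers following along, here is a standalone sketch of why such an override is useful (the class names below are invented for illustration; this is not Lucene's actual FunctionValues API): returning null for docs without a value lets callers distinguish "missing" from a stored false.

```java
// Standalone sketch (invented names, not Lucene's FunctionValues) of the
// suggested objectVal behavior: null signals "no value for this doc",
// which is distinct from a stored value of false.
abstract class BoolDocValuesSketch {
    abstract boolean exists(int doc);   // does this doc have a value?
    abstract boolean boolVal(int doc);  // the value, assuming it exists

    Object objectVal(int doc) {
        return exists(doc) ? boolVal(doc) : null;
    }
}

// Toy implementation: only doc 0 carries a value (true).
class OneDocBoolValues extends BoolDocValuesSketch {
    boolean exists(int doc) { return doc == 0; }
    boolean boolVal(int doc) { return true; }
}
```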

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
> Attachments: SOLR-9279.patch
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-26 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394521#comment-15394521
 ] 

Doug Turnbull commented on SOLR-9279:
-

[~hossman] Thanks for your help! Great points. -- I think I addressed your 
comments other than the Object value one. Is there documentation on an object 
value source? I'm not sure what's expected here.

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Updated] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-05 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-9279:

Description: 
If you use the "if" function query, you'll often expect to be able to use 
greater than/less than functions. For example, you might want to boost books 
written in the past 7 years. Unfortunately, there's no "greater than" function 
query that will return non-zero when the lhs > rhs. Instead to get this, you 
need to create really awkward function queries like I do here 
(http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

The pull request attached to this Jira adds the following function queries
(https://github.com/apache/lucene-solr/pull/49)

-gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
-lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
-gte
-lte
-eq

So instead of 

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

one could now write

if(lt(ms(mydatefield),315569259747),0.8,1)

(if mydatefield < 315569259747 then 0.8 else 1)

A bit more readable and less puzzling


  was:
If you use the "if" function query, you'll often expect to be able to use 
greater than/less than functions. For example, you might want to boost books 
written in the past 7 years. Unfortunately, there's no "greater than" function 
query that will return non-zero when the lhs > rhs. Instead to get this, you 
need to create really awkward function queries like I do here 
(http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

The pull request to be attached to this Jira adds the following function queries

-gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
-lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
-gte
-lte
-eq

So instead of 

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

one could now write

if(lt(ms(mydatefield),315569259747),0.8,1)

(if mydatefield < 315569259747 then 0.8 else 1)

A bit more readable and less puzzling



> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-05 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363563#comment-15363563
 ] 

Doug Turnbull commented on SOLR-9279:
-

Associated pull request: https://github.com/apache/lucene-solr/pull/49

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request to be attached to this Jira adds the following function 
> queries
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Created] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-05 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-9279:
---

 Summary: Add greater than, less than, etc in Solr function queries
 Key: SOLR-9279
 URL: https://issues.apache.org/jira/browse/SOLR-9279
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: search
Reporter: Doug Turnbull
 Fix For: master (7.0)


If you use the "if" function query, you'll often expect to be able to use 
greater than/less than functions. For example, you might want to boost books 
written in the past 7 years. Unfortunately, there's no "greater than" function 
query that will return non-zero when the lhs > rhs. Instead to get this, you 
need to create really awkward function queries like I do here 
(http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

The pull request to be attached to this Jira adds the following function queries

-gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
-lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
-gte
-lte
-eq

So instead of 

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

one could now write

if(lt(ms(mydatefield),315569259747),0.8,1)

(if mydatefield < 315569259747 then 0.8 else 1)

A bit more readable and less puzzling







Lucene/Solr git mirror will soon turn off

2015-12-16 Thread Doug Turnbull
In defense of more history being immediately available--it is often far more
useful to poke around code history/run blame to figure out some code than
to take it at face value. Putting this in a secondary place like the
Apache SVN repo IMO reduces the readability of the code itself. This is
doubly true for new developers who won't know about Apache's SVN. And
Lucene can be quite intricate code. Further, in my own work poking around in
GitHub mirrors I frequently hit the current cutoff, which is one reason I
stopped using them for anything but casual investigation.

I'm not totally against a cutoff point, but I'd advocate for exhausting
other options first, such as trimming out unrelated projects, binaries, etc.

-Doug


On Wednesday, December 16, 2015, Shawn Heisey <apa...@elyograg.org
<javascript:_e(%7B%7D,'cvml','apa...@elyograg.org');>> wrote:

> On 12/16/2015 5:53 PM, Alexandre Rafalovitch wrote:
> > On 16 December 2015 at 00:44, Dawid Weiss <dawid.we...@gmail.com> wrote:
> >> 4) The size of JARs is really not an issue. The entire SVN repo I
> mirrored
> >> locally (including empty interim commits to cater for svn:mergeinfos)
> is 4G.
> >> If you strip the stuff like javadocs and side projects (Nutch, Tika,
> Mahout)
> >> then I bet the entire history can fit in 1G total. Of course stripping
> JARs
> >> is also doable.
> > I think this answered one of the issues. So, this is not something to
> focus on.
> >
> > The question I had (I am sure a very dumb one): WHY do we care about
> > history preserved perfectly in Git? Because that seems to be the real
> > bottleneck now. Does anybody still checks out an intermediate commit
> > in Solr 1.4 branch?
>
> I do not think we need every bit of history -- at least in the primary
> read/write repository.  I wonder how much of a size difference there
> would be between tossing all history before 5.0 and tossing all history
> before the ivy transition was completed.
>
> In the interests of reducing the size and download time of a clone
> operation, I definitely think we should trim history in the main repo to
> some arbitrary point, as long as the full history is available
> elsewhere.  It's my understanding that it will remain in svn.apache.org
> (possibly forever), and I think we could also create "historical"
> read-only git repos.
>
> Almost every time I am working on the code, I only care about the stable
> branch and trunk.  Sometimes I will check out an older 4.x tag so I can
> see the exact code referenced by a stacktrace in a user's error message,
> but when this is required, I am willing to go to an entirely different
> repository and chew up bandwidth/disk resources to obtain it, and I do
> not care whether it is git or svn.  As time marches on, fewer people
> will have reasons to look at the historical record.
>
> Thanks,
> Shawn
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Lucene/Solr git mirror will soon turn off

2015-12-15 Thread Doug Turnbull
I thought the general consensus at minimum was to investigate a git mirror
that stripped some artifacts out (jars etc) to lighten up the work of the
process. If at some point the project switched to git, such a mirror might
be a suitable git repo for the project with archived older versions in SVN.

I think probably what is lacking is a volunteer to figure it all out.

-Doug

On Tue, Dec 15, 2015 at 11:32 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Anyone willing to lead this discussion to some kind of better resolution?
> Did that whole back and forth help with any ideas on the best path forward?
> I know it's a complicated issue, git / svn, the light side, the dark side,
> but doesn't GitHub also depend on this mirroring? It's going to be super
> annoying when I can no longer pull from a relatively up to date git remote.
>
> Who has boiled down the correct path?
>
> - Mark
>
> On Wed, Dec 9, 2015 at 6:07 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
>> FYI.
>>
>> - All of Lucene's SVN, incremental deltas, uncompressed: 5.0G
>> - the above, tar.bz2: 1.2G
>>
>> Sadly, I didn't succeed at recreating a local SVN repo from those
>> incremental dumps. svnadmin load fails with a cryptic error related to
>> the fact that revision number of node-copy operations refer to
>> original SVN numbers and they're apparently renumbered on import.
>> svnadmin isn't smart enough to somehow keep a reference of those
>> original numbers and svndumpfilter can't work with incremental dump
>> files... A seemingly trivial task of splitting a repo on a clean
>> boundary seems incredibly hard with SVN...
>>
>> If anybody wishes to play with the dump files, here they are:
>> http://goo.gl/m6q3J8
>>
>> Dawid
>>
>> On Tue, Dec 8, 2015 at 10:49 PM, Upayavira <u...@odoko.co.uk> wrote:
>> > You can't avoid having the history in SVN. The ASF has one large repo,
>> and
>> > won't be deleting that repo, so the history will survive in perpetuity,
>> > regardless of what we do now.
>> >
>> > Upayavira
>> >
>> > On Tue, Dec 8, 2015, at 09:24 PM, Doug Turnbull wrote:
>> >
>> > It seems you'd want to preserve that history in a frozen/archived
>> Apache Svn
>> > repo for Lucene. Then make the new git repo slimmer before switching.
>> Folks
>> > that want very old versions or doing research can at least go through
>> the
>> > original SVN repo.
>> >
>> > On Tuesday, December 8, 2015, Dawid Weiss <dawid.we...@gmail.com>
>> wrote:
>> >
>> > One more thing, perhaps of importance, the raw Lucene repo contains
>> > all the history of projects that then turned top-level (Nutch,
>> > Mahout). These could also be dropped (or ignored) when converting to
>> > git. If we agree JARs are not relevant, why should projects not
>> > directly related to Lucene/ Solr be?
>> >
>> > Dawid
>> >
>> > On Tue, Dec 8, 2015 at 10:05 PM, Dawid Weiss <dawid.we...@gmail.com>
>> wrote:
>> >>> Don’t know how much we have of historic jars in our history.
>> >>
>> >> I actually do know. Or will know. In about ~10 hours. I wrote a script
>> >> that does the following:
>> >>
>> >> 1) git log all revisions touching
>> https://svn.apache.org/repos/asf/lucene
>> >> 2) grep revision numbers
>> >> 3) use svnrdump to get every single commit (revision) above, in
>> >> incremental mode.
>> >>
>> >> This will allow me to:
>> >>
>> >> 1) recreate only Lucene/ Solr SVN, locally.
>> >> 2) measure the size of SVN repo.
>> >> 3) measure the size of any conversion to git (even if it's one-by-one
>> >> checkout, then-sync with git).
>> >>
>> >> From what I see up until now size should not be an issue at all. Even
>> >> with all binary blobs so far the SVN incremental dumps measure ~3.7G
>> >> (and I'm about 75% done). There is one interesting super-large commit,
>> >> this one:
>> >>
>> >> svn log -r1240618 https://svn.apache.org/repos/asf/lucene
>> >>
>> 
>> >> r1240618 | gsingers | 2012-02-04 22:45:17 +0100 (Sat, 04 Feb 2012) | 1
>> >> line
>> >>
>> >> LUCENE-2748: bring in old Lucene docs
>> >>
>> >> This commit diff weights... wait for it... 1.3G! I didn't check what
>> >> it actually was

Re: Lucene/Solr git mirror will soon turn off

2015-12-08 Thread Doug Turnbull
It seems you'd want to preserve that history in a frozen/archived Apache
SVN repo for Lucene. Then make the new git repo slimmer before switching.
Folks that want very old versions or doing research can at least go through
the original SVN repo.

On Tuesday, December 8, 2015, Dawid Weiss <dawid.we...@gmail.com> wrote:

> One more thing, perhaps of importance, the raw Lucene repo contains
> all the history of projects that then turned top-level (Nutch,
> Mahout). These could also be dropped (or ignored) when converting to
> git. If we agree JARs are not relevant, why should projects not
> directly related to Lucene/ Solr be?
>
> Dawid
>
> On Tue, Dec 8, 2015 at 10:05 PM, Dawid Weiss <dawid.we...@gmail.com
> <javascript:;>> wrote:
> >> Don’t know how much we have of historic jars in our history.
> >
> > I actually do know. Or will know. In about ~10 hours. I wrote a script
> > that does the following:
> >
> > 1) git log all revisions touching
> https://svn.apache.org/repos/asf/lucene
> > 2) grep revision numbers
> > 3) use svnrdump to get every single commit (revision) above, in
> > incremental mode.
> >
> > This will allow me to:
> >
> > 1) recreate only Lucene/ Solr SVN, locally.
> > 2) measure the size of SVN repo.
> > 3) measure the size of any conversion to git (even if it's one-by-one
> > checkout, then-sync with git).
> >
> > From what I see up until now size should not be an issue at all. Even
> > with all binary blobs so far the SVN incremental dumps measure ~3.7G
> > (and I'm about 75% done). There is one interesting super-large commit,
> > this one:
> >
> > svn log -r1240618 https://svn.apache.org/repos/asf/lucene
> > 
> > r1240618 | gsingers | 2012-02-04 22:45:17 +0100 (Sat, 04 Feb 2012) | 1
> line
> >
> > LUCENE-2748: bring in old Lucene docs
> >
> > This commit diff weights... wait for it... 1.3G! I didn't check what
> > it actually was.
> >
> > Will keep you posted.
> >
> > D.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> For additional commands, e-mail: dev-h...@lucene.apache.org <javascript:;>
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Lucene/Solr git mirror will soon turn off

2015-12-06 Thread Doug Turnbull
I had not heard of git-lfs; looks promising:

https://git-lfs.github.com/

On Sunday, December 6, 2015, Jan Høydahl <jan@cominvent.com> wrote:

> If the size of historic jars is the problem here, would looking into
> git-lfs for *.jar be one workaround? I might also be totally off here :-)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> 6. des. 2015 kl. 00.46 skrev Scott Blum <dragonsi...@gmail.com
> <javascript:_e(%7B%7D,'cvml','dragonsi...@gmail.com');>>:
>
> If lucene was a new project being started today, is there any question
> about whether it would be managed in svn or git?  If not, this might be a
> good impetus for moving to a better world.
>
> On Sat, Dec 5, 2015 at 6:19 PM, Yonik Seeley <ysee...@gmail.com
> <javascript:_e(%7B%7D,'cvml','ysee...@gmail.com');>> wrote:
>
>> On Sat, Dec 5, 2015 at 5:53 PM, david.w.smi...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','david.w.smi...@gmail.com');>
>> <david.w.smi...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','david.w.smi...@gmail.com');>> wrote:
>> > I understand Gus; but we’d like to separate the question of wether we
>> should
>> > move from svn to git from fixing the git mirror.
>>
>> Except moving to git is one path to fixing the issue, so it's not
>> really separate.
>> Give the multiple problems that the svn-git bridge seems to have (both
>> memory leaks + history), perhaps the sooner we switch to git, the
>> better.
>>
>> -Yonik
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> <javascript:_e(%7B%7D,'cvml','dev-unsubscr...@lucene.apache.org');>
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> <javascript:_e(%7B%7D,'cvml','dev-h...@lucene.apache.org');>
>>
>>
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Lucene/Solr git mirror will soon turn off

2015-12-04 Thread Doug Turnbull
The only downside is GitHub is a convenient way to run blame, etc. It's
very convenient for sleuthing through code. (If only their search wasn't
abysmal in terms of relevancy, but I digress)

Is the more systemic problem the large binaries checked in in the past? Can we
do any surgery to svn or git to remove these? IIRC this is one reason we
avoided changing from svn to git to begin with. If removing some jars from
an old version of Lucene fixes it, perhaps this is a better long-term
solution. I suppose the issue is having someone with the right svn/git
skills and the time to pull this off.

Doug

On Friday, December 4, 2015, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi,
>
> This looks like a good idea to me. Maybe we just have a limited amount of
> history and branches in Git/Github, so people can work and create pull
> requests. Nobody wants to create pull request on a very old branch or
> against a revision years ago.
>
> Maybe Infra can mirror only the last 2 years of trunk and branch_5x?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de <javascript:;>
>
> > -Original Message-
> > From: Dyer, James [mailto:james.d...@ingramcontent.com <javascript:;>]
> > Sent: Friday, December 04, 2015 10:48 PM
> > To: dev@lucene.apache.org <javascript:;>
> > Cc: infrastruct...@apache.org <javascript:;>
> > Subject: RE: Lucene/Solr git mirror will soon turn off
> >
> > I know Infra has tried a number of things to resolve this, to no avail.
> But did
> > we try "git-svn --revision=" to only mirror "post-LUCENE-3930" (ivy,
> > r1307099)?  Or if that's not lean enough for the git-svn mirror to work,
> then
> > cut off when 4.x was branched or whenever.  The hope would be to give git
> > users enough of the past that it would be useful for new development but
> > then also we can retain the status quo with svn (which is the best path
> for a
> > 26-day timeframe).
> >
> > James Dyer
> > Ingram Content Group
> >
> >
> > -Original Message-
> > From: Michael McCandless [mailto:luc...@mikemccandless.com
> <javascript:;>]
> > Sent: Friday, December 04, 2015 2:58 PM
> > To: Lucene/Solr dev
> > Cc: infrastruct...@apache.org <javascript:;>
> > Subject: Lucene/Solr git mirror will soon turn off
> >
> > Hello devs,
> >
> > The infra team has notified us (Lucene/Solr) that in 26 days our
> > git-svn mirror will be turned off, because running it consumes too
> > many system resources, affecting other projects, apparently because of
> > a memory leak in git-svn.
> >
> > Does anyone know of a link to this git-svn issue?  Is it a known
> > issue?  If there's something simple we can do (remove old jars from
> > our svn history, remove old branches), maybe we can sidestep the issue
> > and infra will allow it to keep running?
> >
> > Or maybe someone in the Lucene/Solr dev community with prior
> > experience with git-svn could volunteer to play with it to see if
> > there's a viable solution, maybe with command-line options e.g. to
> > only mirror specific branches (trunk, 5.x)?
> >
> > Or maybe it's time for us to switch to git, but there are problems
> > there too, e.g. we are currently missing large parts of our svn
> > history from the mirror now and it's not clear whether that would be
> > fixed if we switched:
> > https://issues.apache.org/jira/browse/INFRA-10828  Also, because we
> > used to add JAR files to svn, the "git clone" would likely take
> > several GBs unless we remove those JARs from our history.
> >
> > Or if anyone has any other ideas, we should explore them, because
> > otherwise in 26 days there will be no more updates to the git mirror
> > of Lucene and Solr sources...
> >
> > Thanks,
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> <javascript:;>
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> <javascript:;>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> For additional commands, e-mail: dev-h...@lucene.apache.org <javascript:;>
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


[jira] [Commented] (SOLR-8201) Swap space info not showing in new UI (see screenshot)

2015-10-24 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972743#comment-14972743
 ] 

Doug Turnbull commented on SOLR-8201:
-

+1!

These little indicators in the admin UI can hint at problems before I have to 
use a more robust profiler.

> Swap space info not showing in new UI (see screenshot)
> --
>
> Key: SOLR-8201
> URL: https://issues.apache.org/jira/browse/SOLR-8201
> Project: Solr
>  Issue Type: Bug
>  Components: UI
>Reporter: Youssef Chaker
>Priority: Minor
> Attachments: swap space.png
>
>
> The old UI displays info about the swap space (even if nothing is allocated) 
> whereas the new UI does not (see screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7341) xjoin - join data from external sources

2015-10-15 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959677#comment-14959677
 ] 

Doug Turnbull commented on SOLR-7341:
-

I am really looking forward to this patch. It has a lot of potential for 
joining search with external ranking systems, such as recommenders or other 
systems that are more appropriate for different use cases.

> xjoin - join data from external sources
> ---
>
> Key: SOLR-7341
> URL: https://issues.apache.org/jira/browse/SOLR-7341
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 4.10.3
>Reporter: Tom Winch
>Priority: Minor
> Fix For: Trunk
>
> Attachments: SOLR-7341.patch, SOLR-7341.patch, SOLR-7341.patch, 
> SOLR-7341.patch, SOLR-7341.patch, SOLR-7341.patch, SOLR-7341.patch-trunk, 
> SOLR-7341.patch-trunk, SOLR-7341.patch-trunk
>
>
> h2. XJoin
> The "xjoin" SOLR contrib allows external results to be joined with SOLR 
> results in a query and the SOLR result set to be filtered by the results of 
> an external query. Values from the external results are made available in the 
> SOLR results and may also be used to boost the scores of corresponding 
> documents during the search. The contrib consists of the Java classes 
> XJoinSearchComponent, XJoinValueSourceParser and XJoinQParserPlugin (and 
> associated classes), which must be configured in solrconfig.xml, and the 
> interfaces XJoinResultsFactory and XJoinResults, which are implemented by the 
> user to provide the link between SOLR and the external results source. 
> External results and SOLR documents are matched via a single configurable 
> attribute (the "join field"). The contrib JAR solr-xjoin-4.10.3.jar contains 
> these classes and interfaces and should be included in SOLR's class path from 
> solrconfig.xml, as should a JAR containing the user implementations of the 
> previously mentioned interfaces. For example:
> {code:xml}
> <config>
>   ..
>   <!-- paths illustrative: the xjoin contrib JAR and the JAR containing
>        the user implementations of the xjoin interfaces -->
>   <lib path="/path/to/solr-xjoin-4.10.3.jar" />
>   ..
>   <lib path="/path/to/my-xjoin-implementations.jar" />
>   ..
> </config>
> {code}
> h2. Java classes and interfaces
> h3. XJoinResultsFactory
> The user implementation of this interface is responsible for connecting to an 
> external source to perform a query (or otherwise collect results). Parameters 
> with prefix "<component name>.external." are passed from the SOLR query URL 
> to parameterise the search. The interface has the following methods:
> * void init(NamedList args) - this is called during SOLR initialisation, and 
> passed parameters from the search component configuration (see below)
> * XJoinResults getResults(SolrParams params) - this is called during a SOLR 
> search to generate external results, and is passed parameters from the SOLR 
> query URL (as above)
> For example, the implementation might perform queries of an external source 
> based on the 'q' SOLR query URL parameter (in full, <component name>.external.q).
> h3. XJoinResults
> A user implementation of this interface is returned by the getResults() 
> method of the XJoinResultsFactory implementation. It has methods:
> * Object getResult(String joinId) - this should return a particular result 
> given the value of the join attribute
> * Iterable<String> getJoinIds() - this should return an ordered (ascending) 
> list of the join attribute values for all results of the external search
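
The two interfaces boil down to a small contract. A minimal Python sketch of that contract (invented names and toy data, not Solr's actual Java API) might look like this:

```python
# Toy stand-in for XJoinResults: an ordered list of join ids plus a
# lookup from join id to the external payload for that id.
class ExternalResults:
    def __init__(self, results):
        self._results = results  # join id -> payload (e.g. an external score)

    def get_join_ids(self):
        # ordered (ascending), as getJoinIds() requires
        return sorted(self._results)

    def get_result(self, join_id):
        # getResult(String joinId) equivalent
        return self._results.get(join_id)

def xjoin(solr_docs, external, join_field="id"):
    # keep docs whose join-field value appears in the external results,
    # attaching the external payload to each match
    ids = set(external.get_join_ids())
    return [
        dict(doc, external=external.get_result(doc[join_field]))
        for doc in solr_docs
        if doc[join_field] in ids
    ]

docs = [{"id": "1", "title": "a"}, {"id": "4", "title": "b"}]
ext = ExternalResults({"1": 0.9, "2": 0.5, "3": 0.1})
joined = xjoin(docs, ext)
print(joined)  # only doc "1" has a matching external result
```

The ordered join ids matter in the real component because they allow an efficient merge against the index's sorted term values rather than the per-doc lookup sketched here.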
> h3. XJoinSearchComponent
> This is the central Java class of the contrib. It is a SOLR search component, 
> configured in solrconfig.xml and included in one or more SOLR request 
> handlers. There is one XJoin search component per external source, and each 
> has two main responsibilities:
> * Before the SOLR search, it connects to the external source and retrieves 
> results, storing them in the SOLR request context
> * After the SOLR search, it matches SOLR documents in the result set and 
> external results via the join field, adding attributes from the external 
> results to documents in the SOLR results set
> It takes the following initialisation parameters:
> * factoryClass - this specifies the user-supplied class implementing 
> XJoinResultsFactory, used to generate external results
> * joinField - this specifies the attribute on which to join between SOLR 
> documents and external results
> * external - this parameter set is passed to configure the 
> XJoinResultsFactory implementation
> For example, in solrconfig.xml:
> {code:xml}
> <searchComponent name="xjoin" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
>   <str name="factoryClass">test.TestXJoinResultsFactory</str>
>   <str name="joinField">id</str>
>   <lst name="external">
>     <str name="values">1,2,3</str>
>   </lst>
> </searchComponent>
> {code}
> Here, the search 

Re: Mention security as a key feature on the web site "Features" page

2015-09-26 Thread Doug Turnbull
I'm glad some of these changes have made it in. And I admit ignorance of
the work done in this area. However...

My 2 cents would be that I'm still more comfortable locking down Solr
behind something that feels rather battle-tested like Nginx or another
proxy instead of letting Solr be in charge of security. I feel like this is
a better division of responsibilities, and I'm not sure you'd want to start
advertising Solr as super secure, locked down, and hardened.

-Doug

On Saturday, September 26, 2015, Jan Høydahl <jan@cominvent.com> wrote:

> Hi,
>
> Any comments on this suggestion?
>
> Jan
>
> > On 25 Aug 2015, at 10:25, Jan Høydahl <jan@cominvent.com> wrote:
> >
> > Idea: If we do not want to draw new icons, perhaps this could work:
> >
> > Use the “schemaless" icon (with a key) as the new security icon:
> >
> http://lucene.apache.org/solr/assets/images/Solr_Icons_a_real_data_schema.svg
> >
> > And for the schema-less feature, we can instead use the icon from the
> removed “External configuration"
> >
> http://lucene.apache.org/solr/assets/images/Solr_Icons_external_configuration.svg
> >
> >
> > Title: Security built right in
> > Subtitle: Secure Solr with Authentication, Role based Authorization and
> SSL. Pluggable of course!
> >
> >
> > See how it looks here:
> http://www.cominvent.com/solr/Apache%20Solr%20-%20Features.html
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> >> On 24 Aug 2015, at 21:25, Jan Høydahl <jan@cominvent.com> wrote:
> >>
> >> On the Solr web site
> http://lucene.staging.apache.org/solr/features.html we list key features.
> >> Now with 5.3 out the door, I think one of those icons should be about
> security.
> >>
> >> Suggest to remove one of the existing icons to make room for a new one.
> Candidates:
> >> - "External Configuration via XML” does perhaps not impress much
> anymore.
> >> - "Extensible Plugin Architecture” is almost a duplicate of "Powerful
> Extensions"
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>



Re: discountOverlaps option for QueryParser

2015-09-20 Thread Doug Turnbull
/document/bb99e435ba35f2b1
> >
> > What do you think about this? How difficult to implement this?
> > Would this be a Lucene or Solr issue?
> >
> > Thanks,
> > Ahmet
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >



Re: Moving to git?

2015-05-31 Thread Doug Turnbull
You just made my day with that CVS repo! :)

Though I don't really get a vote -- +1 to your plan Robert.

/polishes history degree
-Doug

On Sun, May 31, 2015 at 3:16 PM, Robert Muir rcm...@gmail.com wrote:

 I totally agree Doug. Losing the jars would have a cost: those old
 branches wouldn't work out of box if you wanted to run tests on
 them.

 But I am not sure how bad that cost really is. It might be zero. I
 haven't tried to run e.g. Lucene 2.x tests with a modern Java 7 or Java
 8, but I bet they probably do not work due to things like hashmap
 failures. And I think Solr before 4.0 will not even compile, because
 of things like wildcard import + base64 clashes.

 So if I had my preference, we'd import as much of the history as we can,
 and nuke the silly jars. And I'd like that sourceforge history there
 too if we can get it, but I don't know if it is really legal.

 The sourceforge CVS works, see IndexWriter:

 http://lucene.cvs.sourceforge.net/viewvc/lucene/lucene/com/lucene/index/IndexWriter.java?view=log


 On Sun, May 31, 2015 at 3:10 PM, Doug Turnbull
 dturnb...@opensourceconnections.com wrote:
  I have no dog in the svn vs git debate honestly.
 
  I want to say how important it is to keep healthy history. I recently
 went
  on a bit of a code archeology dig to figure out why something in
  Lucene was done the way it was. It was handy that the history went as far
  back as it did, but I had to switch around to different places to
 continue
  the history. For example, the abrupt shift that seems to be around when
  Solr/Lucene were put together had me digging for the last pure Lucene
 tag.
  It's over at lucene/java/branches NOT lucene/dev/tags with the other tags.
 
  Then when you get to the branch for lucene-101, the first commit is:
  2001: New repository initialized by cvs2svn.
 
  Unable to find a CVS repo, my hunt stopped (I'd love to hear if anyone has a
 CVS
  repo -- maybe from Jakarta?)
 
  So removing some jars isn't a big deal. But cutting off history and
  restarting at some arbitrary point can be annoying and make it harder to
 dig
  up more about why things are the way they are.
 
  /steps down from soapbox
  -Doug
 
 
 
  On Sunday, May 31, 2015, Dawid Weiss dawid.we...@cs.put.poznan.pl
 wrote:
 
  Yeah, but it misses the point -- history is history, if there were
  jars in it, you shouldn't just strip them, it'd be confusing.
 
  How was it back when Lucene was merging with Solr? Didn't it just
  initiate with a new clean repo? Maybe not all of the history is really
  needed -- if we limited ourselves to, say, all of the history that
  includes ivy then the size of the repo would drop significantly... but
  again, to me size doesn't really matter at all; one initial clone is
  no-cost. Go make yourself a cup of tea, come back and you're set.
 
  Dawid
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






Re: Moving to git?

2015-05-31 Thread Doug Turnbull
I have no dog in the svn vs git debate honestly.

I want to say how important it is to keep healthy history. I recently went
on a bit of a code archeology dig to figure out why something in
Lucene was done the way it was. It was handy that the history went as far
back as it did, but I had to switch around to different places to continue
the history. For example, the abrupt shift that seems to be around when
Solr/Lucene were put together had me digging for the last pure Lucene tag.
It's over at lucene/java/branches NOT lucene/dev/tags with the other tags.

Then when you get to the branch for lucene-101, the first commit is:
 2001: New repository initialized by cvs2svn.

Unable to find a CVS repo, my hunt stopped (I'd love to hear if anyone has a
CVS repo -- maybe from Jakarta?)

So removing some jars isn't a big deal. But cutting off history and
restarting at some arbitrary point can be annoying and make it harder to
dig up more about why things are the way they are.

/steps down from soapbox
-Doug



On Sunday, May 31, 2015, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 Yeah, but it misses the point -- history is history, if there were
 jars in it, you shouldn't just strip them, it'd be confusing.

 How was it back when Lucene was merging with Solr? Didn't it just
 initiate with a new clean repo? Maybe not all of the history is really
 needed -- if we limited ourselves to, say, all of the history that
 includes ivy then the size of the repo would drop significantly... but
 again, to me size doesn't really matter at all; one initial clone is
 no-cost. Go make yourself a cup of tea, come back and you're set.

 Dawid

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Where Search Meets Machine Learning

2015-05-04 Thread Doug Turnbull
 we tested this with datasets available
 at the UCI Machine Learning Repository
 http://archive.ics.uci.edu/ml/ but I have been using this approach for
 real-life response prediction/bidding problems in advertising and it's very
 powerful. Of course, this is not a panacea, as there are still some
 issues with the approach, especially on the operational side.  Let's keep
 the conversation going as I think we are on to something useful.

 -- Joaquin


 On Thu, Apr 30, 2015 at 6:26 AM, Doug Turnbull 
 dturnb...@opensourceconnections.com wrote:

 Hi Joaquin

 Very neat, thanks for sharing,

 Viewing search relevance as something akin to a classification problem is
 actually a driving narrative in Taming Search
 http://manning.com/turnbull. We generalize the relevance problem as
 one of measuring the similarity between features of content (locations of
 restaurants, price of a product, the words in the body of articles,
 expanded synonyms in articles, etc) and features of a query (the search
 terms, user usage history, any location, etc). What makes search
 interesting is that unlike other classification systems, search has built
 in similarity systems (largely TF*IDF).

 So we actually cut the other direction from your talk. It appears that
 you amend the search engine to change the underlying scoring to be based on
 machine learning constructs. In our book, we work the opposite way. We
 largely enable feature similarity classifications between document and
 query by massaging features into terms and use the built in TF*IDF or other
 relevant similarity approach.

 We feel this plays to the advantages of a search engine. Search engines
 already have some basic text analysis built in. They've also been heavily
 optimized for most forms of text-based similarity. If you can massage text
 such that your TF*IDF similarity reflects a rough proportion of text-based
 features important to your users, this tends to reflect their intuitive
 notions of relevance. A lot of this work involves feature selection, or what
 we term in the book feature modeling: what features should you introduce to
 your documents to generate good signals at ranking time?

 You can read more about our thoughts here
 http://java.dzone.com/articles/solr-and-elasticsearch.

 That all being said, what makes your stuff interesting is when you have
 enough supervised training data over good-enough features. This can be hard
 to do for a broad swatch of middle tier search applications, but
 increasingly useful as scale goes up. I'd be interested to hear your
 thoughts on this article
 http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
 I wrote about collecting click tracking and other relevance feedback data:

 Good stuff! Again, thanks for sharing,
 -Doug



 On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com
 wrote:

 Here is a presentation on the topic:

 http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

 Search can be viewed as a combination of a) A problem of constraint
 satisfaction, which is the process of finding a solution to a set of
 constraints (query) that impose conditions that the variables (fields) must
 satisfy with a resulting object (document) being a solution in the feasible
 region (result set), plus b) A scoring/ranking problem of assigning values
 to different alternatives, according to some convenient scale. This
 ultimately provides a mechanism to sort various alternatives in the result
 set in order of importance, value or preference. In particular scoring in
 search has evolved from being a document centric calculation (e.g. TF-IDF)
 proper from its information retrieval roots, to a function that is more
 context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
 takes user parameters for personalization) as well as other factors that
 depend on the domain and task at hand. However, most systems that 
 incorporate machine learning techniques to perform classification or
 generate scores for these specialized tasks do so as a post retrieval
 re-ranking function, outside of search! In this talk I show ways of
 incorporating advanced scoring functions, based on supervised learning and
 bid scaling models, into popular search engines such as Elastic Search and
 potentially SOLR. I'll provide practical examples of how to construct such
 ML Scoring plugins in search to generalize the application of a search
 engine as a model evaluator for supervised learning tasks. This will
 facilitate the building of systems that can do computational advertising,
 recommendations and specialized search systems, applicable to many domains.

 Code to support it (only elastic search for now):
 https://github.com/sdhu/elasticsearch-prediction

 -- J







 --
 *Doug Turnbull **| *Search Relevance Consultant | OpenSource
 Connections, LLC | 240.476.9983 | http

Re: Where Search Meets Machine Learning

2015-04-30 Thread Doug Turnbull
Hi Joaquin

Very neat, thanks for sharing,

Viewing search relevance as something akin to a classification problem is
actually a driving narrative in Taming Search http://manning.com/turnbull.
We generalize the relevance problem as one of measuring the similarity
between features of content (locations of restaurants, price of a product,
the words in the body of articles, expanded synonyms in articles, etc) and
features of a query (the search terms, user usage history, any location,
etc). What makes search interesting is that unlike other classification
systems, search has built in similarity systems (largely TF*IDF).

So we actually cut the other direction from your talk. It appears that you
amend the search engine to change the underlying scoring to be based on
machine learning constructs. In our book, we work the opposite way. We
largely enable feature similarity classifications between document and
query by massaging features into terms and use the built in TF*IDF or other
relevant similarity approach.

We feel this plays to the advantages of a search engine. Search engines
already have some basic text analysis built in. They've also been heavily
optimized for most forms of text-based similarity. If you can massage text
such that your TF*IDF similarity reflects a rough proportion of text-based
features important to your users, this tends to reflect their intuitive
notions of relevance. A lot of this work involves feature selection, or what
we term in the book feature modeling: what features should you introduce to
your documents to generate good signals at ranking time?
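
As a rough illustration of that idea (a toy sketch with invented data, not anything from the book or from Solr): encode each document's features as plain terms, and the engine's ordinary TF*IDF term matching scores feature similarity with no custom ranking code.

```python
import math
from collections import Counter

# Each restaurant's cuisine and location are "massaged" into tokens,
# so feature similarity becomes ordinary term matching.
docs = [
    ["italian", "downtown", "pasta", "wine"],
    ["thai", "downtown", "noodles"],
    ["italian", "suburb", "pizza"],
]

def idf(term):
    df = sum(term in d for d in docs)          # document frequency
    return math.log(len(docs) / (1 + df)) + 1  # smoothed IDF

def score(query_terms, doc):
    tf = Counter(doc)                          # term frequencies in this doc
    return sum(tf[t] * idf(t) for t in query_terms)

query = ["italian", "downtown"]                # query features, also as terms
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i]), reverse=True)
print(ranked)  # doc 0 matches both features, so it ranks first
```

A real engine would use a tuned similarity (BM25, length normalization, boosts), but the point stands: once features are terms, the built-in scorer does the classification-like work.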

You can read more about our thoughts here
http://java.dzone.com/articles/solr-and-elasticsearch.

That all being said, what makes your stuff interesting is when you have
enough supervised training data over good-enough features. This can be hard
to do for a broad swatch of middle tier search applications, but
increasingly useful as scale goes up. I'd be interested to hear your
thoughts on this article
http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
I wrote about collecting click tracking and other relevance feedback data:

Good stuff! Again, thanks for sharing,
-Doug



On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com
wrote:

 Here is a presentation on the topic:

 http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

 Search can be viewed as a combination of a) A problem of constraint
 satisfaction, which is the process of finding a solution to a set of
 constraints (query) that impose conditions that the variables (fields) must
 satisfy with a resulting object (document) being a solution in the feasible
 region (result set), plus b) A scoring/ranking problem of assigning values
 to different alternatives, according to some convenient scale. This
 ultimately provides a mechanism to sort various alternatives in the result
 set in order of importance, value or preference. In particular scoring in
 search has evolved from being a document centric calculation (e.g. TF-IDF)
 proper from its information retrieval roots, to a function that is more
 context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
 takes user parameters for personalization) as well as other factors that
 depend on the domain and task at hand. However, most systems that
 incorporate machine learning techniques to perform classification or
 generate scores for these specialized tasks do so as a post retrieval
 re-ranking function, outside of search! In this talk I show ways of
 incorporating advanced scoring functions, based on supervised learning and
 bid scaling models, into popular search engines such as Elastic Search and
 potentially SOLR. I'll provide practical examples of how to construct such
 ML Scoring plugins in search to generalize the application of a search
 engine as a model evaluator for supervised learning tasks. This will
 facilitate the building of systems that can do computational advertising,
 recommendations and specialized search systems, applicable to many domains.

 Code to support it (only elastic search for now):
 https://github.com/sdhu/elasticsearch-prediction

 -- J
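
The two-phase view described above can be compressed into a few lines (invented toy data; a hand-written linear score stands in for a trained model): constraint satisfaction selects the feasible set, then a pluggable scoring function ranks it.

```python
# Toy corpus of "documents" with typed fields.
restaurants = [
    {"name": "A", "cuisine": "thai", "dist_km": 1.0, "rating": 4.5},
    {"name": "B", "cuisine": "thai", "dist_km": 6.0, "rating": 4.9},
    {"name": "C", "cuisine": "french", "dist_km": 0.5, "rating": 4.8},
]

def feasible(doc):
    # the query as constraints: thai food within 5 km
    return doc["cuisine"] == "thai" and doc["dist_km"] <= 5.0

def score(doc):
    # stand-in for a learned scorer: rating minus a distance penalty
    return doc["rating"] - 0.2 * doc["dist_km"]

results = sorted((d for d in restaurants if feasible(d)), key=score, reverse=True)
print([d["name"] for d in results])  # only "A" satisfies both constraints
```

Swapping `score` for a model evaluator is exactly the plugin point the talk proposes: the retrieval phase stays the engine's job, the ranking phase becomes pluggable.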









[jira] [Commented] (SOLR-5800) Admin UI - Analysis form doesn't render results correctly when a CharFilter is used.

2014-03-11 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931319#comment-13931319
 ] 

Doug Turnbull commented on SOLR-5800:
-

Thanks for the patch Stefan. Will this be released in a Solr 4.7.1? This is a 
fairly major issue for folks that depend on the analysis UI.

 Admin UI - Analysis form doesn't render results correctly when a CharFilter 
 is used.
 

 Key: SOLR-5800
 URL: https://issues.apache.org/jira/browse/SOLR-5800
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 4.7
Reporter: Timothy Potter
Assignee: Stefan Matheis (steffkes)
Priority: Minor
 Fix For: 4.8, 5.0

 Attachments: SOLR-5800-sample.json, SOLR-5800.patch


 I have an example in Solr In Action that uses the
 PatternReplaceCharFilterFactory and now it doesn't work in 4.7.0.
 Specifically, the fieldType is:
 <fieldType name="text_microblog" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="([a-zA-Z])\1+"
                 replacement="$1$1"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             splitOnCaseChange="0"
             splitOnNumerics="0"
             stemEnglishPossessive="1"
             preserveOriginal="0"
             catenateWords="1"
             generateNumberParts="1"
             catenateNumbers="0"
             catenateAll="0"
             types="wdfftypes.txt"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="lang/stopwords_en.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.KStemFilterFactory"/>
   </analyzer>
 </fieldType>
 The PatternReplaceCharFilterFactory (PRCF) is used to collapse
 repeated letters in a term down to a max of 2, such as #Yummm would
 be #Yumm
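
The charFilter's pattern/replacement pair behaves the same under Python's `re` module (with `\1\1` in place of Java's `$1$1`), which makes the collapsing rule easy to check outside Solr:

```python
import re

# Any letter followed by one or more repeats collapses to exactly two;
# letters that already appear once or twice are left alone.
def collapse(text):
    return re.sub(r"([a-zA-Z])\1+", r"\1\1", text)

print(collapse("#Yummmm"))      # -> #Yumm
print(collapse("soooo happy"))  # -> soo happy
print(collapse("Grecco"))       # doubled letters stay doubled
```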
 When I run some text through this analyzer using the Analysis form,
 the output is as if the resulting text is unavailable to the
 tokenizer. In other words, the only results being displayed in the
 output on the form is for the PRCF
 This example stopped working in 4.7.0 and I've verified it worked
 correctly in 4.6.1.
 Initially, I thought this might be an issue with the actual analysis,
 but the analyzer actually works when indexing / querying. Then,
 looking at the JSON response in the Developer console with Chrome, I
 see the JSON that comes back includes output for all the components in
 my chain (see below) ... so looks like a UI rendering issue to me?
 {responseHeader:{status:0,QTime:24},analysis:{field_types:{text_microblog:{index:[org.apache.lucene.analysis.pattern.PatternReplaceCharFilter,#Yumm
 :) Drinking a latte at Caffe Grecco in SF's historic North Beach...
 Learning text analysis with #SolrInAction by @ManningBooks on my i-Pad
 foo5,org.apache.lucene.analysis.core.WhitespaceTokenizer,[{text:#Yumm,raw_bytes:[23
 59 75 6d 
 6d],start:0,end:6,position:1,positionHistory:[1],type:word},{text::),raw_bytes:[3a
 29],start:7,end:9,position:2,positionHistory:[2],type:word},{text:Drinking,raw_bytes:[44
 72 69 6e 6b 69 6e
 67],start:10,end:18,position:3,positionHistory:[3],type:word},{text:a,raw_bytes:[61],start:19,end:20,position:4,positionHistory:[4],type:word},{text:latte,raw_bytes:[6c
  ...
 the JSON returned to the browser has evidence that the full analysis chain 
 was applied, so this seems to just be a rendering issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4812) Edismax highlighting query doesn't work.

2013-09-19 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772151#comment-13772151
 ] 

Doug Turnbull commented on SOLR-4812:
-

+1 I've also been able to recreate this.

 Edismax highlighting query doesn't work.
 

 Key: SOLR-4812
 URL: https://issues.apache.org/jira/browse/SOLR-4812
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.2, 4.3
  Environment: When hl.q is an edismax query, highlighting will ignore 
  the query specified in hl.q
Reporter: Nguyen Manh Tien
Priority: Minor
 Fix For: 4.5, 5.0

 Attachments: SOLR-4812.patch


 When hl.q is an edismax query, highlighting will ignore the query specified in 
 hl.q
 edismax highlighting query: hl.q={!edismax qf=title v=Software}
 The getHighlightQuery function in edismax doesn't parse the highlight query, so 
 it always returns null and hl.q is ignored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-5256) Send multiple queries through highlighter

2013-09-19 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-5256:
---

 Summary: Send multiple queries through highlighter
 Key: SOLR-5256
 URL: https://issues.apache.org/jira/browse/SOLR-5256
 Project: Solr
  Issue Type: New Feature
  Components: highlighter
Affects Versions: 4.4
Reporter: Doug Turnbull
 Attachments: Solr-5256.patch

There have been several times when I wished I could specify multiple queries through 
the highlighter. For example, a search over books may have an option to filter 
by author. If I wanted to highlight both the primary search terms and the 
author match, I'd have to construct an hl.q that created the desired highlight 
query.

This is complicated by the fact that q might be dismax/edismax while the fq is 
likely going to be a lucene query. It might be rather complex to construct a 
single query that reflects the combination of dismax over many fields plus a 
specific lucene query.

What I would prefer to do is be able to specify additional queries (hl.addlq) 
to the highlighter. The highlighter then highlights the results of those 
queries as well. 

(Unfortunately, while this is useful, it's limited somewhat by this bug:
https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)
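
A sketch of the behavior the proposed hl.addlq parameter would enable (plain Python, nothing Solr-specific): highlight the union of terms from several queries rather than hand-building one combined hl.q.

```python
import re

# Mark every word that appears in ANY of the supplied queries.
def highlight(text, queries, pre="<em>", post="</em>"):
    terms = {t.lower() for q in queries for t in q.split()}
    def mark(m):
        word = m.group(0)
        return pre + word + post if word.lower() in terms else word
    return re.sub(r"\w+", mark, text)

out = highlight("Relevant Search by Doug", ["relevant search", "doug"])
print(out)  # <em>Relevant</em> <em>Search</em> by <em>Doug</em>
```

The real feature would of course run each additional query through the proper query parser (dismax, lucene, etc.) rather than splitting on whitespace; this only illustrates the "union of highlight queries" idea.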

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5256) Send multiple queries through highlighter

2013-09-19 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-5256:


Attachment: Solr-5256.patch

Patch to add hl.addlq

 Send multiple queries through highlighter
 -

 Key: SOLR-5256
 URL: https://issues.apache.org/jira/browse/SOLR-5256
 Project: Solr
  Issue Type: New Feature
  Components: highlighter
Affects Versions: 4.4
Reporter: Doug Turnbull
 Attachments: Solr-5256.patch


 There have been several times when I wished I could specify multiple queries 
 through the highlighter. For example, a search over books may have an option 
 to filter by author. If I wanted to highlight both the primary search terms 
 and the author match, I'd have to construct an hl.q that created the desired 
 highlight query.
 This is complicated by the fact that q might be dismax/edismax while the fq 
 is likely going to be a lucene query. It might be rather complex to construct 
 a single query that reflects the combination of dismax over many fields plus 
 a specific lucene query.
 What I would prefer to do is be able to specify additional queries (hl.addlq) 
 to the highlighter. The highlighter then highlights the results of those 
 queries as well. 
 (Unfortunately, while this is useful, it's limited somewhat by this bug:
 https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)




[jira] [Updated] (SOLR-5256) Send multiple queries through highlighter

2013-09-19 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-5256:


Description: 
There have been several times when I wished I could specify multiple queries through 
the highlighter. For example, a search over books may have an option to filter 
by author. If I wanted to highlight both the primary search terms and the 
author match, I'd have to construct an hl.q that created the desired highlight 
query.

This is complicated by the fact that q might be dismax/edismax while the fq is 
likely going to be a Lucene query. It might be rather complex to construct a 
single query that reflects the combination of dismax over many fields plus a 
specific Lucene query.

What I would prefer to do is be able to specify additional queries (hl.addlq) 
to the highlighter. The highlighter then highlights the results of those 
queries as well. 

(Unfortunately, while this is useful, it's limited somewhat by this bug:
https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)

  was:
There have been several times when I wished I could specify multiple queries through 
the highlighter. For example, a search over books may have an option to filter 
my author. If I wanted to highlight both the primary search terms and the 
author match, I'd have to construct an hl.q that created the desired highlight 
query.

This is complicated by the fact that q might be dismax/edismax while the fq is 
likely going to be a Lucene query. It might be rather complex to construct a 
single query that reflects the combination of dismax over many fields plus a 
specific Lucene query.

What I would prefer to do is be able to specify additional queries (hl.addlq) 
to the highlighter. The highlighter then highlights the results of those 
queries as well. 

(Unfortunately, while this is useful, it's limited somewhat by this bug:
https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)
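Under the proposed change, the extra query would instead travel as its own parameter. A sketch of what a request using hl.addlq might look like (hl.addlq comes from the attached patch and is not a parameter in released Solr; field names and query text are again illustrative):

```python
# With the proposed hl.addlq, the main query stays pure edismax and the
# additional lucene-syntax query is handed to the highlighter on its own.
# hl.addlq is the parameter added by the attached patch, not a stock param.
from urllib.parse import urlencode

params = [
    ("q", "moby dick"),
    ("defType", "edismax"),
    ("qf", "title^2 body"),
    ("fq", "author:melville"),
    ("hl", "true"),
    ("hl.fl", "title,body,author"),
    ("hl.addlq", "author:melville"),  # extra highlight query, passed as-is
]

query_string = urlencode(params)
```

No hand-built hl.q is needed: each query keeps its own syntax, which is the point of the feature request.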


