Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Trey Grainger
>
> Let’s instead find a new good name for the cluster type. Standalone kind
> of works
> for me, but I see it can be confused with single-node.

Yeah, I've typically referred to it as "standalone", but I don't think it's
descriptive enough. I can see why some people have been calling it
"master/slave" mode in lieu of a more descriptive alternative. I think a
new name (other than "standalone" or "legacy") would be superb.

> We have also discussed replacing SolrCloud (which is a terrible name) with
> something more descriptive.

> Today: SolrCloud vs Master/slave
> Alt A: SolrCloud vs Standalone
> Alt B: SolrCloud vs Legacy
> Alt C: Clustered vs Independent
> Alt D: Clustered vs Manual mode


+1. SolrCloud is even less descriptive and IMHO just sounds silly at this
point.

re: "Clustered" vs Independent/Manual. The thing I don't like about that is
that you typically have clusters in both modes. I think the key distinction
is whether Solr "manages" the cluster automatically for you or whether you
manage it manually yourself.

What do you think about:
Alt E: "Managed Clustering" vs. "Unmanaged Clustering" Mode
Alt F:  "Managed Clustering" vs. "Manual Clustering" Mode
?

I think I prefer option F.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Thu, Jun 18, 2020 at 5:59 PM Jan Høydahl  wrote:

> I support Mike Drob and Trey Grainger. We should re-use the leader/replica
> terminology from Cloud. Even if you hand-configure a master/slave cluster
> and orchestrate what doc goes to which node/shard, and hand-code your
> shards
> parameter, you will still have a cluster where you’d send updates to the
> leader of
> each shard and the replicas would replicate the index from the leader.
>
> Let’s instead find a new good name for the cluster type. Standalone kind
> of works
> for me, but I see it can be confused with single-node. We have also
> discussed
> replacing SolrCloud (which is a terrible name) with something more
> descriptive.
>
> Today: SolrCloud vs Master/slave
> Alt A: SolrCloud vs Standalone
> Alt B: SolrCloud vs Legacy
> Alt C: Clustered vs Independent
> Alt D: Clustered vs Manual mode
>
> Jan
>
> > On 18 Jun 2020, at 15:53, Mike Drob wrote:
> >
> > I personally think that using Solr cloud terminology for this would be
> fine
> > with leader/follower. The leader is the one that accepts updates,
> followers
> > cascade the updates somehow. The presence of ZK or election doesn’t
> really
> > change this detail.
> >
> > However, if folks feel that it’s confusing, then I can’t tell them that
> > they’re not confused. Especially when they’re working with others who
> have
> > less Solr experience than we do and are less familiar with the
> intricacies.
> >
> > Primary/Replica seems acceptable. Coordinator instead of Overseer seems
> > acceptable.
> >
> > Would love to see this in 9.0!
> >
> > Mike
> >
> > On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
> >  wrote:
> >
> >> While on the topic of renaming roles, I'd like to propose finding a
> better
> >> term than "overseer" which has historical slavery connotations as well.
> >> Director, perhaps?
> >>
> >>
> >> John Gallagher
> >>
> >> On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski 
> >> wrote:
> >>
> >>> +1 to rename master/slave, and +1 to choosing terminology distinct
> >>> from what's used for SolrCloud.  I could be happy with several of the
> >>> proposed options.  Since a good few have been proposed though, maybe
> >>> an eventual vote thread is the most organized way to aggregate the
> >>> opinions here.
> >>>
> >>> I'm less positive about the prospect of changing the name of our
> >>> primary git branch.  Most projects that contributors might come from,
> >>> most tutorials out there to learn git, most tools built on top of git
> >>> - the majority are going to assume "master" as the main branch.  I
> >>> appreciate the change that Github is trying to effect in changing the
> >>> default for new projects, but it'll be a long time before that
> >>> competes with the huge bulk of projects, documentation, etc. out there
> >>> using "master".  Our contributors are smart and I'm sure they'd figure
> >>> it out if we used "main" or something else instead, but having a
> >>> non-standard git setup would be one more "papercut" in understanding
> >>> how to contribute

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
@Shawn,

Ok, yeah, apologies, my semantics were wrong.

I was thinking that a TLog replica is a follower role only and becomes an
NRT replica if it gets elected leader. From a pure semantics standpoint,
though, I guess technically the TLog replica doesn't "become" an NRT
replica, but just "acts the same" as if it was an NRT replica when it gets
elected as leader. From the docs regarding TLog replicas: "This type of
replica maintains a transaction log but does not index document changes
locally... When this type of replica needs to update its index, it does so
by replicating the index from the leader... If it does become a leader, it
will behave the same as if it was a NRT type of replica."

The Tlog replicas are a bit of a red herring to the point I was making,
though, which is that Pull Replicas in SolrCloud mode and Slaves in
non-SolrCloud mode both just pull the index from the leader/master, as
opposed to updates being pushed the other way. As such, I don't see a
meaningful distinction between master/slave and leader/follower behavior in
non-SolrCloud mode vs. SolrCloud mode for the specific functionality we're
talking about renaming (Solr cores that pull indices from other Solr cores).

At any rate, this is not a hill I care to die on. My belief is that it's
better to have consistent terminology for what I see as essentially the
same functionality. I respect that others disagree and would rather
introduce new terminology to clearly distinguish between modes. Regardless
of the naming decided on, I'm in support of removing the master/slave
nomenclature.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 7:00 PM Shawn Heisey  wrote:

> On 6/17/2020 2:36 PM, Trey Grainger wrote:
> > 2) TLOG - which can only serve in the role of follower
>
> This is inaccurate.  TLOG can become leader.  If that happens, then it
> functions exactly like an NRT leader.
>
> I'm aware that saying the following is bikeshedding ... but I do think
> it would be as mistake to use any existing SolrCloud terminology for
> non-cloud deployments, including the word "replica".  The top contenders
> I have seen to replace master/slave in Solr are primary/secondary and
> publisher/subscriber.
>
> It has been interesting watching this discussion play out on multiple
> open source mailing lists.  On other projects, I have seen a VERY high
> level of resistance to these changes, which I find disturbing and
> surprising.
>
> Thanks,
> Shawn
>


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Sorry:
>
> but I maintain that leader vs. follower behavior is inconsistent here.


Sorry, that should have said "I maintain that leader vs. follower behavior
is consistent here."

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 6:03 PM Trey Grainger  wrote:

> Hi Walter,
>
> >In Solr Cloud, the leader knows about each follower and updates them.
> Respectfully, I think you're mixing the "TYPE" of replica with the role of
> the "leader" and "follower"
>
> In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the
> leader push updates those followers.
>
> When the TYPE of a follower is PULL, then it does not.  In Standalone
> mode, the type of a (currently) master would be NRT, and the type of the
> (currently) slaves is always PULL.
>
> As such, this behavior is consistent across both SolrCloud and Standalone
> mode. It is true that Standalone mode does not currently have support for
> two of the replica TYPES that SolrCloud mode does, but I maintain that
> leader vs. follower behavior is inconsistent here.
>
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
>
>
>
> On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
> wrote:
>
>> But they are not the same. In Solr Cloud, the leader knows about each
>> follower and updates them. In standalone, the master has no idea that
>> slaves exist until a replication request arrives.
>>
>> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
>> config load time.
>>
>> Looking ahead in my email inbox, publisher/subscriber is an excellent
>> choice.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> > On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
>> >
>> > I guess I don't see it as polysemous, but instead simplifying.
>> >
>> > In my proposal, the terms "leader" and "follower" would have the exact
>> same
>> > meaning in both SolrCloud and standalone mode. The only difference
>> would be
>> > that SolrCloud automatically manages the leaders and followers, whereas
>> in
>> > standalone mode you have to manage them manually (as is the case with
>> most
>> > things in SolrCloud vs. Standalone).
>> >
>> > My view is that having an entirely different set of terminology
>> describing
>> > the same thing is way more cognitive overhead than having consistent
>> > terminology.
>> >
>> > Trey Grainger
>> > Founder, Searchkernel
>> > https://searchkernel.com
>> >
>> > On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood > >
>> > wrote:
>> >
>> >> I strongly disagree with using the Solr Cloud leader/follower
>> terminology
>> >> for non-Cloud clusters. People in my company are confused enough
>> without
>> >> using polysemous terminology.
>> >>
>> >> “This node is the leader, but it means something different than the
>> leader
>> >> in this other cluster.” I’m dreading that conversation.
>> >>
>> >> I like “principal”. How about “clone” for the slave role? That suggests
>> >> that
>> >> it does not accept updates and that it is loosely-coupled, only
>> depending
>> >> on the state of the no-longer-called-master.
>> >>
>> >> Chegg has five production Solr Cloud clusters and one production
>> >> master/slave
>> >> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
>> in
>> >> production.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wun...@wunderwood.org
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>> On Jun 17, 2020, at 1:36 PM, Trey Grainger 
>> wrote:
>> >>>
>> >>> Proposal:
>> >>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
>> one
>> >>> or more REPLICAS. Each replica can have a ROLE of either:
>> >>> 1) A LEADER, which can process external updates for the shard
>> >>> 2) A FOLLOWER, which receives updates from another replica"
>> >>>
>> >>> (Note: I prefer "role" but if others think it's too overloaded due to
>> the
>> >>> overseer role, we could replace it with "mode" or something similar)
>> >>> --

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Hi Walter,

>In Solr Cloud, the leader knows about each follower and updates them.
Respectfully, I think you're mixing the "TYPE" of replica with the role of
the "leader" and "follower"

In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the leader
push updates to those followers.

When the TYPE of a follower is PULL, then it does not.  In Standalone mode,
the type of a (currently) master would be NRT, and the type of the
(currently) slaves is always PULL.

As such, this behavior is consistent across both SolrCloud and Standalone
mode. It is true that Standalone mode does not currently have support for
two of the replica TYPES that SolrCloud mode does, but I maintain that
leader vs. follower behavior is inconsistent here.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com



On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
wrote:

> But they are not the same. In Solr Cloud, the leader knows about each
> follower and updates them. In standalone, the master has no idea that
> slaves exist until a replication request arrives.
>
> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
> config load time.
>
> Looking ahead in my email inbox, publisher/subscriber is an excellent
> choice.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
> >
> > I guess I don't see it as polysemous, but instead simplifying.
> >
> > In my proposal, the terms "leader" and "follower" would have the exact
> same
> > meaning in both SolrCloud and standalone mode. The only difference would
> be
> > that SolrCloud automatically manages the leaders and followers, whereas
> in
> > standalone mode you have to manage them manually (as is the case with
> most
> > things in SolrCloud vs. Standalone).
> >
> > My view is that having an entirely different set of terminology
> describing
> > the same thing is way more cognitive overhead than having consistent
> > terminology.
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> > On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
> > wrote:
> >
> >> I strongly disagree with using the Solr Cloud leader/follower
> terminology
> >> for non-Cloud clusters. People in my company are confused enough without
> >> using polysemous terminology.
> >>
> >> “This node is the leader, but it means something different than the
> leader
> >> in this other cluster.” I’m dreading that conversation.
> >>
> >> I like “principal”. How about “clone” for the slave role? That suggests
> >> that
> >> it does not accept updates and that it is loosely-coupled, only
> depending
> >> on the state of the no-longer-called-master.
> >>
> >> Chegg has five production Solr Cloud clusters and one production
> >> master/slave
> >> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
> in
> >> production.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >>>
> >>> Proposal:
> >>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
> one
> >>> or more REPLICAS. Each replica can have a ROLE of either:
> >>> 1) A LEADER, which can process external updates for the shard
> >>> 2) A FOLLOWER, which receives updates from another replica"
> >>>
> >>> (Note: I prefer "role" but if others think it's too overloaded due to
> the
> >>> overseer role, we could replace it with "mode" or something similar)
> >>> ---
> >>>
> >>> To be explicit with the above definitions:
> >>> 1) In SolrCloud, the roles of leaders and followers can dynamically
> >> change
> >>> based upon the status of the cluster. In standalone mode, they can be
> >>> changed by manual intervention.
> >>> 2) A leader does not have to have any followers (i.e. only one active
> >>> replica)
> >>> 3) Each shard always has one leader.
> >>> 4) A follower can also pull updates from another follower instead of a
> >>> leader (traditionally known as a REPEATER). A repeater is still a
> >> follower,
> >>> but would not be considered a leader because it can't pr

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
I guess I don't see it as polysemous, but instead simplifying.

In my proposal, the terms "leader" and "follower" would have the exact same
meaning in both SolrCloud and standalone mode. The only difference would be
that SolrCloud automatically manages the leaders and followers, whereas in
standalone mode you have to manage them manually (as is the case with most
things in SolrCloud vs. Standalone).

My view is that having an entirely different set of terminology describing
the same thing is way more cognitive overhead than having consistent
terminology.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
wrote:

> I strongly disagree with using the Solr Cloud leader/follower terminology
> for non-Cloud clusters. People in my company are confused enough without
> using polysemous terminology.
>
> “This node is the leader, but it means something different than the leader
> in this other cluster.” I’m dreading that conversation.
>
> I like “principal”. How about “clone” for the slave role? That suggests
> that
> it does not accept updates and that it is loosely-coupled, only depending
> on the state of the no-longer-called-master.
>
> Chegg has five production Solr Cloud clusters and one production
> master/slave
> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
> production.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >
> > Proposal:
> > "A Solr COLLECTION is composed of one or more SHARDS, which each have one
> > or more REPLICAS. Each replica can have a ROLE of either:
> > 1) A LEADER, which can process external updates for the shard
> > 2) A FOLLOWER, which receives updates from another replica"
> >
> > (Note: I prefer "role" but if others think it's too overloaded due to the
> > overseer role, we could replace it with "mode" or something similar)
> > ---
> >
> > To be explicit with the above definitions:
> > 1) In SolrCloud, the roles of leaders and followers can dynamically
> change
> > based upon the status of the cluster. In standalone mode, they can be
> > changed by manual intervention.
> > 2) A leader does not have to have any followers (i.e. only one active
> > replica)
> > 3) Each shard always has one leader.
> > 4) A follower can also pull updates from another follower instead of a
> > leader (traditionally known as a REPEATER). A repeater is still a
> follower,
> > but would not be considered a leader because it can't process external
> > updates.
> > 5) A replica cannot be both a leader and a follower.
> >
> > In addition to the above roles, each replica can have a TYPE of one of:
> > 1) NRT - which can serve in the role of leader or follower
> > 2) TLOG - which can only serve in the role of follower
> > 3) PULL - which can only serve in the role of follower
> >
> > A replica's type may be changed automatically in the event that its role
> > changes.
> >
> > I think this terminology is consistent with the current Leader/Follower
> > usage while also being able to easily accommodate a rename of the
> historical
> > master/slave terminology without mental gymnastics or the introduction of
> > more cognitive load through new terminology. I think adopting the
> > Primary/Replica terminology will be incredibly confusing given the
> already
> > specific and well established meaning of "replica" within Solr.
> >
> > All the Best,
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> >
> >
> > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Moving a conversation that was happening on the PMC list to the public
> >> forum. Most of the following is just me recapping the conversation that
> has
> >> happened so far.
> >>
> >> Some members of the community have been discussing getting rid of the
> >> master/slave nomenclature from Solr.
> >>
> >> While this may require a non-trivial effort, a general consensus so far
> >> seems to be to start this process and switch over incrementally, if a
> >> single change ends up being too big.
> >>
> >> There have been a lot of suggestions around what the new nomenclature
> might
> >> look like, a few people don’t want to overlap the naming here with what
> >> already exists in Solr

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Proposal:
"A Solr COLLECTION is composed of one or more SHARDS, which each have one
or more REPLICAS. Each replica can have a ROLE of either:
1) A LEADER, which can process external updates for the shard
2) A FOLLOWER, which receives updates from another replica"

(Note: I prefer "role" but if others think it's too overloaded due to the
overseer role, we could replace it with "mode" or something similar)
---

To be explicit with the above definitions:
1) In SolrCloud, the roles of leaders and followers can dynamically change
based upon the status of the cluster. In standalone mode, they can be
changed by manual intervention.
2) A leader does not have to have any followers (i.e. only one active
replica)
3) Each shard always has one leader.
4) A follower can also pull updates from another follower instead of a
leader (traditionally known as a REPEATER). A repeater is still a follower,
but would not be considered a leader because it can't process external
updates.
5) A replica cannot be both a leader and a follower.

In addition to the above roles, each replica can have a TYPE of one of:
1) NRT - which can serve in the role of leader or follower
2) TLOG - which can only serve in the role of follower
3) PULL - which can only serve in the role of follower

A replica's type may be changed automatically in the event that its role
changes.

I think this terminology is consistent with the current Leader/Follower
usage while also being able to easily accommodate a rename of the historical
master/slave terminology without mental gymnastics or the introduction of
more cognitive load through new terminology. I think adopting the
Primary/Replica terminology will be incredibly confusing given the already
specific and well established meaning of "replica" within Solr.

All the Best,

Trey Grainger
Founder, Searchkernel
https://searchkernel.com



On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta  wrote:

> Hi everyone,
>
> Moving a conversation that was happening on the PMC list to the public
> forum. Most of the following is just me recapping the conversation that has
> happened so far.
>
> Some members of the community have been discussing getting rid of the
> master/slave nomenclature from Solr.
>
> While this may require a non-trivial effort, a general consensus so far
> seems to be to start this process and switch over incrementally, if a
> single change ends up being too big.
>
> There have been a lot of suggestions around what the new nomenclature might
> look like, a few people don’t want to overlap the naming here with what
> already exists in SolrCloud i.e. leader/follower.
>
> Primary/Replica was an option that was suggested based on what other
> vendors are moving towards based on Wikipedia:
> https://en.wikipedia.org/wiki/Master/slave_(technology)
> , however there were concerns around the use of “replica” as that denotes a
> very specific concept in SolrCloud. Current terminology clearly
> differentiates the use of the traditional replication model from SolrCloud
> and reusing the names would make it difficult for that to happen.
>
> There were similar concerns around using Leader/follower.
>
> Let’s continue this conversation here while making sure that we converge
> without much bike-shedding.
>
> -Anshum
>


[PSA] Activate 2019 Call for Speakers ends May 8

2019-05-04 Thread Trey Grainger
Hi everyone,

I wanted to do a quick PSA for anyone who may have missed the announcement
last month to let you know the call for speakers is currently open
through *Wednesday,
May 8th*, for Activate 2019 (the Search and AI Conference), focused on the
Apache Solr ecosystem and the intersection of Search and AI:
https://lucidworks.com/2019/04/02/activate-2019-call-for-speakers/

The Activate Conference will be held September 9-12 in Washington, D.C.

The conference, rebranded last year from "Lucene/Solr Revolution", is
expected to grow considerably this year, and I'd like to encourage all of
you working on advancements in the Lucene/Solr project or working on
solving interesting problems in this space to consider submitting a talk if
you haven't already. There are tracks dedicated to Solr Development,
AI-powered Search, Search Development at Scale, and numerous other related
topics - including tracks for key use cases like digital commerce - that I
expect most on this list will find appealing.

If you're interested in presenting (your conference registration fee will
be covered if accepted), please submit a talk here:
https://activate-conf.com/speakers/

Just wanted to make sure everyone in the development and user community
here was aware of the conference and didn't miss the opportunity to submit
a talk by Wednesday if interested.

All the best,

Trey Grainger
Chief Algorithms Officer @ Lucidworks
https://www.linkedin.com/in/treygrainger/


Re: IRA or IRA the Person

2019-04-01 Thread Trey Grainger
Hi Brett,

There are a couple of angles you can take here. If you are only concerned
about this specific term or a small number of other known terms like "IRA"
and want to spot fix it, you can use something like the query elevation
component in Solr (
https://lucene.apache.org/solr/guide/7_7/the-query-elevation-component.html)
to explicitly include or exclude documents.
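
For reference, a minimal sketch of what such an elevate.xml entry could look
like (the document ids below are made up for illustration and would need to
match your uniqueKey values):

   <elevate>
     <query text="ira">
       <!-- "article-ira-overview" is a hypothetical id: pin the IRA article -->
       <doc id="article-ira-overview"/>
       <!-- "advisor-ira-black" is a hypothetical id: drop the advisor page for this query -->
       <doc id="advisor-ira-black" exclude="true"/>
     </query>
   </elevate>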

Otherwise, if you are looking for a more data-driven approach to solving
this, you can leverage the aggregate click-streams for your users across
all of the searches on your platform to boost documents higher that are
more popular for any given search. We do this in our commercial product
(Lucidworks Fusion) through our Signals Boosting feature, but you could
implement something similar yourself with some work, as the general
architecture is fairly well-documented here:
https://doc.lucidworks.com/fusion-ai/4.2/user-guide/signals/index.html
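
If you build a lighter-weight version of that yourself, one common pattern is
to aggregate clicks per document offline, index the count into a numeric
field, and apply a multiplicative function boost at query time. A rough sketch
(the click_count field is hypothetical, and this is a global popularity boost
rather than true per-query signal boosting):

   # click_count is a hypothetical field populated from your own click logs
   q=ira
   &defType=edismax
   &qf=title^5 body
   &boost=log(sum(click_count,1))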

If you do not have long-lived content OR you do not have sufficient
signals history, you could alternatively use something like Solr's Semantic
Knowledge Graph to automatically find term vectors that are the most
related to your terms within your content. In that case, if the "individual
retirement account" meaning is more common across your documents, you'd
probably end up with terms more related to that meaning, which could be used
to apply data-driven boosts on your query toward that concept (instead of the person, in
this case).

I gave a presentation at Activate ("the Search & AI Conference") last year
on some of the more data-driven approaches to parsing and understanding the
meaning of terms within queries, that included things like disambiguation
(similar to what you're doing here) and some additional approaches
leveraging a combination of query log mining, the semantic knowledge graph,
and the Solr Text Tagger. If you start handling these use cases in a more
systematic and data-driven way, you might want to check out some of the
techniques I mentioned there: Video:
https://www.youtube.com/watch?v=4fMZnunTRF8 | Slides:
https://www.slideshare.net/treygrainger/how-to-build-a-semantic-search-system


All the best,

Trey Grainger
Chief Algorithms Officer @ Lucidworks


On Mon, Apr 1, 2019 at 11:45 AM Moyer, Brett  wrote:

> Hello,
>
> Looking for ideas on how to determine intent and drive results to
> a person result or an article result. We are a financial institution and we
> have IRA's Individual Retirement Accounts and we have a page that talks
> about an Advisor, IRA Black.
>
> Our users are in a bad habit of only using single terms for
> search. A very common search term is "ira". The PERSON page ranks higher
> than the article on IRA's. With essentially no information from the user,
> what are some way we can detect and rank differently? Thanks!
>
> Brett Moyer
> *
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender
> immediately and then delete it.
>
> TIAA
> *
>


Re: Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Trey Grainger
You can also just "replace" the registered xml query parser with another
parser. I imagine you're doing this for security reasons, which means you
just want the actual xml query parser to not be executable through a query.
Try adding the following line to your solrconfig.xml:


This way, the xml query parser is loaded in as a version of the eDismax
query parser instead, and any queries that are trying to reference the xml
query parser through local params will instead hit the eDismax query parser
and use its parsing logic instead.

All the best,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action <http://solrinaction.com/>
http://www.treygrainger.com

-

On Thu, Oct 12, 2017 at 6:56 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/12/2017 3:18 PM, Manikandan Sivanesan wrote:
>
>> I'm looking for a way to disable the query parser XmlQParserPlugin
>> (org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
>> Following the instructions mentioned here
>> <https://wiki.apache.org/solr/SolrConfigXml#Enable.2Fdisable_components>
>> to
>> disable a query parser.
>>
>> This is the part that I added to solrconfig.
>> <queryParser name="xmlparser" class="org.apache.solr.search.XmlQParserPlugin" enable="${enable.xmlparser:false}"/>
>>
>> I have uploaded it to zk and reloaded the collection. But I still see the
>> XmlQParserPlugin loaded in
>> in the Plugin/Stats => QUERYPARSER section of Solr Admin Console.
>>
>
> Through experimentation, I was able to figure out that the configuration
> of query parsers DOES support the "enable" attribute.  Initially I thought
> it might not.
>
> With this invalid configuration (the class is missing a character), Solr
> will start correctly:
>
> 
>
> But if I change the enable attribute to "true" instead of "false", Solr
> will NOT successfully load the core with that config, because it contains a
> class that cannot be found.
>
> The actual problem you're running into is that almost every query parser
> implementation that Solr has is hard-coded and explicitly loaded by code in
> QParserPlugin.  One of those parsers is the XML parser that you want to
> disable.
>
> I think it would be a good idea to go through the list of hard-coded
> parsers in the QParserPlugin class and make it a MUCH smaller list.  Some
> of the parsers, especially the XML parser, probably should require explicit
> configuration rather than being included by default.
>
> Thanks,
> Shawn
>
>


Re: Semantic Knowledge Graph

2017-10-09 Thread Trey Grainger
Hi David, that's my fault. I need to do a final proofread through them
before they get posted (and may have to push one quick code change, as
well). I'll try to get that done within the next few days.

All the best,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action <http://solrinaction.com>
http://www.treygrainger.com


On Mon, Oct 9, 2017 at 10:14 AM, David Hastings <
hastings.recurs...@gmail.com> wrote:

> Hey All, slides form the 2017 lucene revolution were put up recently, but
> unfortunately, the one I have the most interest in, the semantic knowledge
> graph, have not been put up:
>
> https://lucenesolrrevolution2017.sched.com/event/BAwX/the-
> apache-solr-semantic-knowledge-graph?iframe=no=100%=yes=no
>
>
> dont suppose any one knows where i may be able to find them, or point me in
> a direction to get more information about this tool.
>
> Thanks - dave
>


Re: "on deck" searcher vs warming searcher

2016-12-09 Thread Trey Grainger
Shawn and Joel both answered the question with seemingly opposite answers,
but Joel's should be right. On Deck, as an idiom, means "getting ready to
go next". I think it has it's history in military / naval terminology (a
plane being "on deck" of an aircraft carrier was the next one to take off),
and was later used heavily in baseball (the "on deck" batter was the one
warming up to go next) and probably elsewhere.

I've always understood the "on deck" searcher(s) as being the same as the
warming searcher(s). So you have the "active" searcher and then the warming
or on deck searchers.
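
For anyone hunting for the setting being discussed, it's configured in
solrconfig.xml (within the <query> section in the sample configs); a sketch
with illustrative values:

   <query>
     <!-- number of searchers allowed to be warming concurrently; exceeding
          this causes the "exceeded limit of maxWarmingSearchers" error -->
     <maxWarmingSearchers>2</maxWarmingSearchers>
     <!-- if true, requests may be served by a searcher that hasn't finished warming -->
     <useColdSearcher>false</useColdSearcher>
   </query>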

-Trey


On Fri, Dec 9, 2016 at 11:54 AM, Erick Erickson 
wrote:

> Jihwan:
>
> Correct. Do note that there are two distinct warnings here:
> 1> "Error opening new searcher. exceeded limit of maxWarmingSearchers"
> 2> "PERFORMANCE WARNING: Overlapping onDeckSearchers=..."
>
> in <1>, the new searcher is _not_ opened.
> in <2>, the new searcher _is_ opened.
>
> In practice, getting either warning is an indication of
> mis-configuration. Consider a very large filterCache with large
> autowarm values. Every new searcher will then allocate space for the
> filterCache so having <1> is there to prevent runaway situations that
> lead to OOM errors.
>
> <2> is just letting you know that you should look at your usage of
> commit so you can avoid <1>.
>
> Best,
> Erick
>
> On Fri, Dec 9, 2016 at 8:44 AM, Jihwan Kim  wrote:
> > why is there a setting (maxWarmingSearchers) that even lets you have more
> > than one:
> > Isn't it also for a case of (frequent) update? For example, one update is
> > committed.  During the warming up  for this commit, another update is
> > made.  In this case the new commit also go through another warming.  If
> the
> > value is 1, the second warming will fail.  More number of concurrent
> > warming-up requires larger memory usage.
> >
> >
> > On Fri, Dec 9, 2016 at 9:14 AM, Erick Erickson 
> > wrote:
> >
> >> bq: because shouldn't there only be one active
> >> searcher at a time?
> >>
> >> Kind of. This is a total nit, but there can be multiple
> >> searchers serving queries briefly (one hopes at least).
> >> S1 is serving some query when S2 becomes
> >> active and starts getting new queries. Until the last
> >> query S1 is serving is complete, they both are active.
> >>
> >> bq: why is there a setting
> >> (maxWarmingSearchers) that even lets
> >> you have more than one
> >>
> >> The contract is that when you commit (assuming
> >> you're opening a new searcher), then all docs
> >> indexed up to that point are visible. Therefore you
> >> _must_ open a new searcher even if one is currently
> >> warming or that contract would be violated. Since
> >> warming can take minutes, not opening a new
> >> searcher if one was currently warming could cause
> >> quite a gap.
> >>
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Dec 9, 2016 at 7:30 AM, Brent  wrote:
> >> > Hmmm, conflicting answers. Given the infamous "PERFORMANCE WARNING:
> >> > Overlapping onDeckSearchers" log message, it seems like the "they're
> the
> >> > same" answer is probably correct, because shouldn't there only be one
> >> active
> >> > searcher at a time?
> >> >
> >> > Although it makes me curious, if there's a warning about having
> multiple
> >> > (overlapping) warming searchers, why is there a setting
> >> > (maxWarmingSearchers) that even lets you have more than one, or at
> least,
> >> > why ever set it to anything other than 1?
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://lucene.472066.n3.
> >> nabble.com/on-deck-searcher-vs-warming-searcher-tp4309021p4309080.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>


Re: Related Search

2016-10-26 Thread Trey Grainger
Yeah, the approach listed by Grant and Markus is a common approach. I've
worked on systems that mined query logs like this, and it's a good approach
if you have sufficient query logs to pull it off.

There are a lot of linguistic nuances you'll encounter along the way,
including how you disambiguate homonyms and their related terms, identify
synonyms/acronyms as having the same underlying meaning, how you parse and
handle unknown phrases, removing noise present in the query logs, and even
how you weight the strength of the relationship between related queries. I gave
a presentation on this topic at Lucene/Solr Revolution in 2015 if you're
interested in learning more about how to build such a system (
http://www.treygrainger.com/posts/presentations/leveraging-lucene-solr-as-a-knowledge-graph-and-intent-engine/
).

Another approach (also referenced in the above presentation), for those
with more of a cold-start problem with query logs, is to mine related terms
and phrases out of the underlying content in the search engine (inverted
index) itself. The Semantic Knowledge Graph that was recently open sourced
by CareerBuilder and contributed back to Solr (disclaimer: I worked on it,
and it's available as both a Solr plugin and a patch, but it's not ready to be
committed into Solr yet.) enables such a capability. See
https://issues.apache.org/jira/browse/SOLR-9480 for the most current patch.

It is a request handler that can take in any query and discover the most
related other terms to that entire query from the inverted index, sorted by
strength of relationship to that query (it can also traverse from those
terms across fields/relationships to other terms, but that's probably
overkill for the basic related searches use case). Think of it as a way to
run a query and find the most relevant other keywords, as opposed to
finding the most relevant documents.

Using this, you can then either return the related keywords as your related
searches, or you can modify your query to include them and power a
conceptual/semantic search instead of the pure text-based search you
started with. It's effectively a (better) way to implement More Like This,
where instead of taking a document and using tf-idf to extract out the
globally-interesting terms from the document (like MLT), you can instead
use a query to find contextually-relevant keywords across many documents,
score them based upon their similarity to the original query, and then turn
around and use the top most semantically-relevant terms as your related
search(es).

I don't have near-term plans to expose the semantic knowledge graph as a
search component (it's a request handler right now), but once it's finished
that could certainly be done. Just wanted to mention it as another approach
to solve this specific problem.

-Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action



On Wed, Oct 26, 2016 at 1:59 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Indeed, we have similar processes running of which one generates a
> 'related query collection' which just contains a (normalized) query and its
> related queries. I would not know how this is even possible without
> continuously processing query and click logs.
>
> M.
>
>
> -Original message-
> > From:Grant Ingersoll <gsing...@apache.org>
> > Sent: Tuesday 25th October 2016 23:51
> > To: solr-user@lucene.apache.org
> > Subject: Re: Related Search
> >
> > Hi Rick,
> >
> > I typically do this stuff just by searching a different collection that I
> > create offline by analyzing query logs and then indexing them and
> searching.
> >
> > On Mon, Oct 24, 2016 at 8:32 PM Rick Leir <rl...@leirtech.com> wrote:
> >
> > > Hi all,
> > >
> > > There is an issue 'Create a Related Search Component' which has been
> > > open for some years now.
> > >
> > > It has a priority: major.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-2080
> > >
> > >
> > > I discovered it linked from Lucidwork's very useful blog on ecommerce:
> > >
> > >
> > > https://lucidworks.com/blog/2011/01/25/implementing-the-
> ecommerce-checklist-with-apache-solr-and-lucidworks/
> > >
> > >
> > > Did people find a better way to accomplish Related Search? Perhaps MLT
> > > http://wiki.apache.org/solr/MoreLikeThis ?
> > >
> > > cheers -- Rick
> > >
> > >
> > >
> >
>


Re: Hackday next month

2016-09-21 Thread Trey Grainger
I know a bunch of folks who would likely attend the hackday (including
committers) will have some other meetings on Wednesday before the
conference, so I think that Tuesday is actually a pretty good time to have
this.

My 2 cents,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action

On Wed, Sep 21, 2016 at 1:20 PM, Anshum Gupta <ans...@anshumgupta.net>
wrote:

> This is good but is there a way to instead do this on Wednesday?
> Considering that the conference starts on Thursday, perhaps it makes sense
>  to do it just a day before ? Not sure about others but it certainly would
> work much better for me.
>
> -Anshum
>
> On Wed, Sep 21, 2016 at 2:18 PM Charlie Hull <char...@flax.co.uk> wrote:
>
> > Hi all,
> >
> > If you're coming to Lucene Revolution next month in Boston, we're
> > running a Lucene-focused hackday (Lucene, Solr, Elasticsearch)
> > kindly hosted by BA Insight. There will be Lucene committers there, it's
> > free to attend and we also need ideas on what to do! Come and join us.
> >
> > http://www.meetup.com/New-England-Search-Technologies-
> NEST-Group/events/233492535/
> >
> > Cheers
> >
> > Charlie
> >
> > --
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web: www.flax.co.uk
> >
>


Re: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

2016-06-21 Thread Trey Grainger
Congrats Doug and John! Writing a book like this is a very long, arduous
process (as several folks on this list can attest to). Writing a great book
like this is considerably more challenging.

I read through this entire book a few months ago before they put the final
touches on it, and (for anyone on the mailing list who is contemplating
buying it), it is a REALLY great book that will teach you the ins and outs
of how search relevancy works under the covers and how you can manipulate
and improve it. It's very well-written, and definitely worth the read.

Congrats again, guys.

Trey Grainger
Co-author, Solr in Action
SVP of Engineering @ Lucidworks

On Tue, Jun 21, 2016 at 2:12 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Not much more to add than my post here! This book is targeted towards
> Lucene-based search (Elasticsearch and Solr) relevance.
>
> Announcement with discount code:
> http://opensourceconnections.com/blog/2016/06/21/relevant-search-published/
>
> Related hacker news thread:
> https://news.ycombinator.com/item?id=11946636
>
> Thanks to everyone in the Solr community that was helpful to my efforts.
> Specifically Trey Grainger, Eric Pugh (for keeping me employed), Charlie
> Hull and the Flax team, Alex Rafalovitch, Timothy Potter, Yonik Seeley,
> Grant Ingersoll (for basically teaching me Solr back in the day), Drew
> Farris (for encouraging my early blogging), everyone at OSC, and many
> others I'm probably forgetting!
>
> Best
> -Doug
>


Re: Lucene Revolution ?

2015-10-18 Thread Trey Grainger
Lucene/Solr Revolution just keeps getting better every year, and this year
was clearly the best year yet!

I saw two major themes that I'd say about 2/3 of the talks were
focused on:
  1) Search Relevancy
  2) Analytics

I'd definitely say that there's a rapidly emerging landscape of
presentations covering the cutting-edge of search relevancy. Michael
Nilsson and Diego Ceccarelli from Bloomberg gave a presentation on a
Learning to Rank (aka "Machine-Learned Ranking") Solr plugin they are
developing and hoping to open source soon, which I took particular interest
in, as I've got a bit of background there and am working toward developing
something similar over the next few months. In other words, I'm sitting on
the edge of my chair waiting on them to open source it to hopefully save my
team months of similar work : )

Fiona Condon from Etsy also gave a great talk on relevancy from a different
perspective - preventing keyword stuffing/seo gaming/monopoly of their
search results and ensuring uniqueness and fairness in search results in a
system where those contributing the content are all incentivized to game
the system to achieve maximum exposure.

There were also several other relevancy talks I missed, including one from
Simon Hughes from Dice.com on leveraging Latent Semantic Indexing and
Word2Vec to add conceptual search into Solr. This is a topic I remember
being talked about by folks like John Berryman back as early as 2013, but
it looks like Dice released some open source code that can be easily tied
into Solr, which is really exciting to see.  There were many other
presentations on emerging relevancy strategies (sorry if I left your name
off), but I'll have to wait to review the videos with everyone else once
they are posted.

My talk (which Alexandre mentioned earlier) was also on relevancy,
specifically describing building a knowledge graph and intent engine within
Solr that can be used to intelligently parse entities and understand their
relationships dynamically from queries and documents using nothing but the
search index and query logs. (Slides here:
http://www.treygrainger.com/posts/presentations/leveraging-lucene-solr-as-a-knowledge-graph-and-intent-engine/
)

In addition to the many relevancy topics, there was another thread within
the presentations (more committer-driven) around analytics. Specifically,
Tim Potter from LucidWorks (my co-author on Solr in Action) gave a great
presentation on using Spark with Solr, Joel Bernstein and Erick Erickson
gave talks on the recent streaming analytics and parallel computing work
that's being added to Solr, and Yonik Seeley presented on the new JSON
faceting API and the enhanced analytical capabilities therein. Once again,
several other talks on faceting and analytics, but there was quite a strong
committer focus on that topic.

Definitely worth checking out the slides and videos when they are posted -
lots of really good material all around.


Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder



On Sun, Oct 18, 2015 at 7:54 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Here's a bit from my colleague Eric Pugh summarizing Grant's keynote.
> Admittedly he's also focussing a lot on our firm's relevance
> capabilities/products (the keynote was on relevance) so extensive shameless
> plug warning included with this link :)
>
>
> http://opensourceconnections.com/blog/2015/10/15/bad-behaviors-in-tuning-search-results/
>
> On Sunday, October 18, 2015, Susheel Kumar <susheel2...@gmail.com> wrote:
>
> > I couldn't also make it.  Would love to hear more who make it.
> >
> > Thanks,
> > Susheel
> >
> > On Sun, Oct 18, 2015 at 10:53 AM, Jack Krupansky <
> jack.krupan...@gmail.com
> > <javascript:;>>
> > wrote:
> >
> > > Sorry I missed out this year. I thought it was next month and hadn't
> seen
> > > any reminders. Just last Tuesday I finally got around to googling the
> > > conference and was shocked to read that it was the next day. Oh well.
> > > Personally I'm less interested in the formal sessions than the informal
> > > networking.
> > >
> > > In any case, keep those user reports flowing. I'm sure there are plenty
> > of
> > > people who didn't make it to the conference.
> > >
> > > -- Jack Krupansky
> > >
> > > On Sun, Oct 18, 2015 at 8:52 AM, Erik Hatcher <erik.hatc...@gmail.com
> > <javascript:;>>
> > > wrote:
> > >
> > > > The Revolution was not televised (though heavily tweeted, and videos
> of
> > > > sessions to follow eventually).  A great time was had by all.  Much
> > > > learning!  Much collaboration. Awesome event if I may say so myself.
> >

Re: catchall fields or multiple fields

2015-10-12 Thread Trey Grainger
Elisabeth,

Yes, it will almost always be more efficient to search within a catch-all
field than to search across multiple fields. Think of it this way: when you
search on a single field, you are doing a single keyword search against the
index per term. When you search across multiple fields, you are executing
the search for that term multiple times (once for each field) against the
index, and then doing the necessary intersections/unions/etc. of the
document sets.
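
For reference, the catch-all approach is typically wired up with copyField
directives in the schema; a minimal sketch with illustrative field names:

   <field name="catchall" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <copyField source="title" dest="catchall"/>
   <copyField source="description" dest="catchall"/>
   <copyField source="body" dest="catchall"/>

Queries then target the single catchall field instead of the source fields.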

As you continue to add more and more fields to search across, the search
continues to grow slower. If you're only searching a few fields then it
will probably not be noticeably slower, but the more and more you add, the
slower your response times will become. This slowdown may be measured in
milliseconds, in which case you may not care, but it will be slower.

The idf point you mentioned can be both a pro and a con depending upon the
use case. For example, if you are searching news content that has a
"french_text" field and an "english_text" field, it would be suboptimal if
for the search "Barack Obama" you got only French documents at the top
because the US president's name is much more commonly found in English
documents. When you're searching fields with different types of content,
however, you might find examples where you'd actually want idf differences
maintained and documents differentiated based upon underlying field.

One particularly nice thing about the multi-field approach is that it is
very easy to apply different boosts to the fields and to dynamically change
the boosts. You can similarly do this with payloads within a catch-all
field. You could even assign each term a payload corresponding to which
field the content came from, and then dynamically change the boosts
associated with those payloads at query time (caveat - custom code
required). See this blog post for an end-to-end payload scoring example,
https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/.
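
To illustrate the multi-field side of that point, the per-field boosts are
just edismax qf weights, which can be adjusted per request without reindexing;
a sketch with illustrative field names and weights:

   # field names and weights below are illustrative
   q=senior java developer
   &defType=edismax
   &qf=title^10 skills^5 description^2 body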


Sharing my personal experience: at CareerBuilder, we use the catch-all
field with payloads (one per underlying field) that we can dynamically
change the weight of at query time. We found that for most of our corpus
sizes (ranging between 2 and 100 million full text jobs or resumes), it
is more efficient to search between 1 and 3 fields than to do the
multi-field search with payload scoring, but once we get to the 4th field
the extra cost associated with the payload scoring was overtaken by the
additional time required to search each additional field.   These numbers
(3 vs 4 fields, etc.) are all anecdotal, of course, as it is dependent upon
a lot of environmental and corpus factors unique to our use case.

The main point of this approach, however, is that there is no additional
cost per-field beyond the upfront cost to add and score payloads, so we
have been able to easily represent over a hundred of these payload-based
"virtual fields" with different weights within a catch-all field (all with
a fixed query-time cost).

*In summary*: yes, you should expect a performance decline as you add more
and more fields to your query if you are searching across multiple fields.
You can overcome this by using a single catch-all field if you are okay
losing IDF per-field (you'll still have it globally across all fields). If
you want to use a catch-all field, but still want to boost content based
upon the field it originated within, you can accomplish this with payloads.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder


On Mon, Oct 12, 2015 at 9:12 AM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi,
>
> Catch-all field: No need to worry about how to aggregate scores coming
> from different fields.
> But you cannot utilize different analysers for different fields.
>
> Multiple-fields: You can play with edismax's parameters on-the-fly,
> without having to re-index.
> It is flexible that you can include/exclude fields from search.
>
> Ahmet
>
>
>
> On Monday, October 12, 2015 3:39 PM, elisabeth benoit <
> elisaelisael...@gmail.com> wrote:
> Hello,
>
> We're using solr 4.10 and storing all data in a catchall field. It seems to
> me that one good reason for using a catchall field is when using scoring
> with idf (with idf, a word might not have same score in all fields). We got
> rid of idf and are now considering using multiple fields. I remember
> reading somewhere that using a catchall field might speed up searching
> time. I was wondering if some of you have any opinion (or experience)
> related to this subject.
>
> Best regards,
> Elisabeth
>


Re: are there any SolrCloud supervisors?

2015-10-12 Thread Trey Grainger
I'd be very interested in taking a look if you post the code.

Trey Grainger
Co-Author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Fri, Oct 2, 2015 at 3:09 PM, r b <chopf...@gmail.com> wrote:

> I've been working on something that just monitors ZooKeeper to add and
> remove nodes from collections. the use case being I put SolrCloud in
> an autoscaling group on EC2 and as instances go up and down, I need
> them added to the collection. It's something I've built for work and
> could clean up to share on GitHub if there is much interest.
>
> I asked in the IRC about a SolrCloud supervisor utility but wanted to
> extend that question to this list. are there any more "full featured"
> supervisors out there?
>
>
> -renning
>


Re: JSON Facet & Analytics API in Solr 5.1

2015-04-17 Thread Trey Grainger
Agreed, I also prefer the second way. I find it more readable, less verbose
while communicating the same information, less confusing to mentally parse
(is 'terms' the name of my facet, or the type of my facet?...), and less
prone to syntactically valid, but logically invalid inputs.  Let's break
those topics down.

*1) Less verbose while communicating the same information:*
The flatter structure is particularly useful when you have nested facets to
reduce unnecessary verbosity / extra levels. Let's contrast the two
approaches with just 2 levels of subfacets:

** Current Format **
top_genres: {
  terms: {
    field: genre,
    limit: 5,
    facet: {
      top_authors: {
        terms: {
          field: author,
          limit: 4,
          facet: {
            top_books: {
              terms: {
                field: title,
                limit: 5
              }
            }
          }
        }
      }
    }
  }
}

** Flat Format **
top_genres: {
  type: terms,
  field: genre,
  limit: 5,
  facet: {
    top_authors: {
      type: terms,
      field: author,
      limit: 4,
      facet: {
        top_books: {
          type: terms,
          field: title,
          limit: 5
        }
      }
    }
  }
}

The flat format is clearly shorter and more succinct, while communicating
the same information. What value do the extra levels add?
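
For context, either shape gets submitted the same way, through the json.facet
parameter. A usage sketch with the current format, assuming a collection named
"books" containing the genre/author/title fields from the examples above:

   # "books" is an assumed collection name for illustration
   curl http://localhost:8983/solr/books/query -d 'q=*:*&rows=0&
   json.facet={
     top_genres:{
       terms:{
         field: genre,
         limit: 5
       }
     }
   }'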


*2) Less confusing to mentally parse*
I also find the flatter structure less confusing, as I'm consistently
having to take a mental pause with the current format to verify whether
terms is the name of my facet or the type of my facet and have to count
the curly braces to figure this out.  Not that I would name my facets like
this, but to give an extreme example of why that extra mental calculation
is necessary due to the name of an attribute in the structure being able to
represent both a facet name and facet type:

terms: {
  terms: {
    field: genre,
    limit: 5,
    facet: {
      terms: {
        terms: {
          field: author,
          limit: 4
        }
      }
    }
  }
}

In this example, the first terms is a facet name, the second terms is a
facet type, the third is a facet name, etc. Even if you don't name your
facets like this, it still requires parsing someone else's query mentally
to ensure that's not what was done.

*3) Less prone to syntactically valid, but logically invalid inputs*
Also, given this first format (where the type is indicated by one of
several possible attributes: terms, range, etc.), what happens if I pass in
multiple of the valid JSON attributes... the flatter structure prevents
this from being possible (which is a good thing!):

top_authors : {
  terms : {
    field : author,
    limit : 5
  },
  range : {
    field : price,
    start : 0,
    end : 100,
    gap : 20
  }
}

I don't think the response format can currently handle this without adding
in extra levels to make it look like the input side, so this is an
exception case even though it seems syntactically valid.

So in conclusion, I'd give a strong vote to the flatter structure. Can
someone enumerate the benefits of the current format over the flatter
structure (I'm probably dense and just failing to see them currently)?

Thanks,

-Trey


On Fri, Apr 17, 2015 at 2:28 PM, Jean-Sebastien Vachon 
jean-sebastien.vac...@wantedanalytics.com wrote:

 I prefer the second way. I find it more readable and shorter.

 Thanks for making Solr even better ;)

 
 From: Yonik Seeley ysee...@gmail.com
 Sent: Friday, April 17, 2015 12:20 PM
 To: solr-user@lucene.apache.org
 Subject: Re: JSON Facet & Analytics API in Solr 5.1

 Does anyone have any thoughts on the current general structure of JSON
 facets?
 The current general form of a facet command is:

 facet_name : { facet_type : facet_args }

 For example:

 top_authors : { terms : {
   field : author,
   limit : 5,
 }}

 One alternative I considered in the past is having the type in the args:

 top_authors : {
   type : terms,
   field : author,
   limit : 5
 }

 It's a flatter structure... probably better in some ways, but worse in
 other ways.
 Thoughts / preferences?

 -Yonik


 On Tue, Apr 14, 2015 at 4:30 PM, Yonik Seeley ysee...@gmail.com wrote:
  Folks, there's a new JSON Facet API in the just released Solr 5.1
  (actually, a new facet module under the covers too).
 
  It's marked as experimental so we have time to change the API based on
  your feedback.  So let us know what you like, what you would change,
  what's missing, or any other ideas you may have!
 
  I've just started the documentation for the reference guide (on our
  confluence wiki), so 

Re: Basic Multilingual search capability

2015-02-23 Thread Trey Grainger
Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those languages' analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
http://solrinaction.com and soon to be contributed back to Solr via
SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.
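
As a rough sketch of the first strategy (assuming language-specific field
types such as text_english and text_french are defined in the schema, and
that the indexing client or an update processor routes each document's text
into the right field), you would end up with something like:

<field name="content_english" type="text_english" indexed="true" stored="true"/>
<field name="content_french" type="text_french" indexed="true" stored="true"/>

q=hello&defType=edismax&qf=content_english content_french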

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to assess if one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.
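
For reference, the simple language-insensitive setup mentioned above might
look something like this (a sketch only; the ICU factories live in the
analysis-extras contrib, so the ICU jars need to be on the classpath):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>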

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org
wrote:

 It isn’t just complicated, it can be impossible.

 Do you have content in Chinese or Japanese? Those languages (and some
 others) do not separate words with spaces. You cannot even do word search
 without a language-specific, dictionary-based parser.

 German is space separated, except many noun compounds are not
 space-separated.

 Do you have Finnish content? Entire prepositional phrases turn into word
 endings.

 Do you have Arabic content? That is even harder.

 If all your content is in space-separated languages that are not heavily
 inflected, you can kind of do OK with a language-insensitive approach. But
 it hits the wall pretty fast.

 One thing that does work pretty well is trademarked names (LaserJet, Coke,
 etc). Those are spelled the same in all languages and usually not inflected.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

 On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi Alex,
 
  There is no specific language list.
  For example: the documents that needs to be indexed are emails or any
 messages for a global customer base. The messages back and forth could be
 in any language or mix of languages.
 
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide basic search capability for any language. Ex: When the document
 contains hello or здравствуйте, the analyzer creates tokens and provides
 exact match search results.
 
  Now it would be great if it had capability to tokenize email addresses
 (ex:he...@aol.com- i think standardTokenizer already does this),
 filenames (здравствуйте.pdf), but maybe we can use filters to accomplish
 that.
 
  Thanks,
  Rishi.
 
  -Original Message-
  From: Alexandre Rafalovitch arafa...@gmail.com
  To: solr-user solr-user@lucene.apache.org
  Sent: Mon, Feb 23, 2015 5:49 pm
  Subject: Re: Basic Multilingual search capability
 
 
  Which languages are you expecting to deal with? Multilingual support
  is a complex issue. Even if you think you don't need much, it is
  usually a lot more complex than expected, especially around relevancy.
 
  Regards,
Alex.
  
  Sign up for my Solr resources

What's the most efficient way to sort by number of terms matched?

2014-11-05 Thread Trey Grainger
Just curious if there are some suggestions here. The use case is fairly
simple:

Given a query like python OR solr OR hadoop, I want to sort results first by
the number of keywords matched, and then by relevancy.

I can think of ways to do this, but not efficiently. For example, I could
do:
q=python OR solr OR hadoop
  &p1=python
  &p2=solr
  &p3=hadoop
  &sort=sum(if(query($p1,0),1,0),if(query($p2,0),1,0),if(query($p3,0),1,0))
desc, score desc

Other than the obvious downside that this requires me to pre-parse the
user's query, it's also somewhat inefficient to run the query function once
for each term in the original query since it is re-executing multiple
queries and looping through every document in the index during scoring.

Ideally, I would be able to do something like the below that could just
pull the count of unique matched terms from the main query (q parameter)
execution:
q=python OR solr OR hadoop&sort=uniquematchedterms() desc,score desc

I don't think anything like this exists, but would love some suggestions if
anyone else has solved this before.

Thanks,

-Trey


Re: How to implement multilingual word components fields schema?

2014-09-08 Thread Trey Grainger
Hi Ilia,

When writing *Solr in Action*, I implemented a feature which can do what
you're asking (allow multiple, dynamic analyzers to be used in a single
text field). This would allow you to use the same field and dynamically
change the analyzers (for example, you could do language-identification on
documents and only stem to the identified languages). It also supports more
than one Analyzer per field (i.e. if you have single documents or queries
containing multiple languages).

This seems to be a feature request which comes up regularly, so I just
submitted a new feature request on JIRA to add this feature to Solr and
track the progress:
https://issues.apache.org/jira/browse/SOLR-6492

I included a comment showing how to use the functionality currently
described in *Solr in Action*, but I plan to make it easier to use over the
next 2 months before calling it done. I'm going to be talking about
multilingual search in November at Lucene/Solr Revolution, so I'd ideally
like to finish before then so I can demonstrate it there.

Thanks,

-Trey Grainger
Director of Engineering, Search & Analytics @ CareerBuilder


On Mon, Sep 8, 2014 at 3:31 PM, Jorge Luis Betancourt Gonzalez 
jlbetanco...@uci.cu wrote:

 In one of the talks by Trey Grainger (author of Solr in Action) he touches
 on how CareerBuilder is dealing with multilingual content with payloads; it's
 a little more work but I think it would pay off.

 On Sep 8, 2014, at 7:58 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  You also need to take a stance as to whether you wish to auto-detect the
 language at query time vs. have a UI selection of language vs. attempt to
 perform the same query for each available language and then determine
 which has the best relevancy. The latter two options are very sensitive
 to short queries. Keep in mind that auto-detection for indexing full
 documents is a different problem than auto-detection for very short queries.
 
  -- Jack Krupansky
 
  -Original Message- From: Ilia Sretenskii
  Sent: Sunday, September 7, 2014 10:33 PM
  To: solr-user@lucene.apache.org
  Subject: Re: How to implement multilingual word components fields schema?
 
  Thank you for the replies, guys!
 
  Using field-per-language approach for multilingual content is the last
  thing I would try since my actual task is to implement a search
  functionality which would provide roughly the same capabilities for
  every known world language.
  The closest references are those popular web search engines, they seem to
  serve worldwide users with their different languages and even
  cross-language queries as well.
  Thus, a field-per-language approach would be a sure waste of storage
  resources due to the high number of duplicates, since there are over 200
  known languages.
  I really would like to keep a single field for cross-language searchable
 text
  content, without splitting it into specific language fields or specific
  language cores.
 
  So my current choice will be to stay with just the ICUTokenizer and
  ICUFoldingFilter as they are without any language specific
  stemmers/lemmatizers yet at all.
 
  Probably I will put the most popular languages' stop word filters and
  stemmers into the same searchable text field to give it a try and see
  if it works correctly in a stack.
  Does stacking language-specific filters work correctly in one
 field?
 
  Further development will most likely involve some advanced custom
 analyzers
  like the SimplePolyGlotStemmingTokenFilter to utilize the ICU generated
  ScriptAttribute.
  http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
 
 https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
 
  So I would like to know more about those academic papers on this issue
 of
  how best to deal with mixed language/mixed script queries and documents.
  Tom, could you please share them?




Re: facet.field counts when q includes field

2014-04-27 Thread Trey Grainger
So my question basically is: which restrictions are applied to the docset
from which (field) facets are computed?

Facets are generated based upon values found within the documents matching
your q= parameter and also all of your fq= parameters. Basically, if
you do an intersection of the docsets from all q= and fq= parameters
then you end up with the docset the facet calculations are based upon.

When you say if I add type=book, *no* documents match, but I get facet
counts: { chapter=4 }, I'm not exactly sure what you mean. If you are
adding q=toto&type=book&facet=true&facet.field=type then the problem is
that the type=book parameter doesn't do anything... it is not a valid
Solr parameter for filtering here. In this case, all 4 of your documents
matching the q=toto query are still being returned, which is why the
facet count for chapters is 4.

If instead you specify q=toto&fq=type:book&facet=true&facet.field=type
then this will filter down to ONLY the documents with a type of book. Since
it looks like in your data there are no documents which are both a type of
book and also match the q=toto query, you should get 0 documents and thus
the counts of all your facet values will be zero.

As you mentioned, it is possible to utilize tags and excludes to change the
behavior described above, but hopefully this answers your question about
the default behavior.
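
For completeness, a tag/exclude version of this example (a sketch only) would
look something like:

q=toto&fq={!tag=typeTag}type:book&facet=true&facet.field={!ex=typeTag}type

With the filter tagged and then excluded on the facet.field, the counts for
the type facet are computed as if that fq were not applied.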

Thanks,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Sun, Apr 27, 2014 at 4:51 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 I'm trying to understand the facet counts I'm getting back from Solr when
 the main query includes a term that restricts on a field that is being
 faceted.  After reading the docs on the wiki (both wikis) I'm confused.

 In my little test dataset, if I facet on type and use q=*:*, I get facet
 counts for type: [ chapter=5, book=1 ]

 With q=toto, only four of the chapters match, so I get facet counts for
 type: { chapter=4 } .

 Now if I add type=book, *no* documents match, but I get facet counts: {
 chapter=4 }.

 It's as if the type term from the query is being ignored when the facets
 are computed.  This is actually what we want, in general, but the
 documentation doesn't reflect it and I'd like to understand better the
 mechanism so I can tell what I can rely on.

 I see that there is the possibility of tagging and excluding filters (fq)
 so they don't affect the facet counting, but there's no mention on the wiki
 of any sort of term exclusion from the main query.  I poked around in the
 source a bit, but wasn't able to find an answer quickly, so I thought I'd
 ask here.

 So my question basically is: which restrictions are applied to the docset
 from which (field) facets are computed?

 -Mike





Re: facet.field counts when q includes field

2014-04-27 Thread Trey Grainger
No problem, Mike. Glad you got it sorted out.

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Sun, Apr 27, 2014 at 7:23 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 4/27/14 7:02 PM, Michael Sokolov wrote:

 On 4/27/2014 6:30 PM, Trey Grainger wrote:

 So my question basically is: which restrictions are applied to the docset

 from which (field) facets are computed?

 Facets are generated based upon values found within the documents
 matching
 your q= parameter and also all of your fq= parameters. Basically, if
 you do an intersection of the docsets from all q= and fq= parameters
 then you end up with the docset the facet calculations are based upon.

 When you say if I add type=book, *no* documents match, but I get facet
 counts: { chapter=4 }, I'm not exactly sure what you mean. If you are
  adding q=toto&type=book&facet=true&facet.field=type then the problem
 is
 that the type=book parameter doesn't do anything... it is not a valid
 Solr parameter for filtering here. In this case, all 4 of your documents
 matching the q=toto query are still being returned, which is why the
 facet count for chapters is 4.

 In fact my query looks like:

 q=fulltext_t%3A%28toto%29+AND+dc_type_s%3A%28book%29+%2Bdirectory_b%3Afalse
 &start=0&rows=20&fl=uri%2Ctimestamp%2Cdirectory_b%2Csize_i%2Cmeta_ss%2Cmime_type_ss
 &facet.field=dc_type_s

 or without url encoding:

  q=fulltext_t:(toto) AND dc_type_s:(book) (directory_b:false)
 facet.field=dc_type_s

 default operator is AND

  ... so I don't think that the query is broken like you described?

 -Mike

 OK the problem wasn't with the query, but while I tried to write out a
 clearer explanation, I found it -- an issue in a unit test too boring to
 describe.  Facets do seem to work like you said, and how they're
 documented, and as I assumed they did :)

 Thanks, and sorry for the noise.

 -Mike



Re: multiple analyzers for one field

2014-04-10 Thread Trey Grainger
Hi Michael,

It IS possible to utilize multiple Analyzers within a single field, but
it's not a built in capability of Solr right now. I wrote something I
called a MultiTextField which provides this capability, and you can see
the code here:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

The general idea is that you can pass in a prefix for each piece of your
content and then use that prefix to dynamically select one or more
Analyzers for each piece of content. So, for example, you could pass in
something like this when indexing your document (for a multiValued field):
<field name="someMultiTextField">en|some text</field>
<field name="someMultiTextField">es|some more text</field>
<field name="someMultiTextField">de,fr|some other text</field>

Then, the MultiTextField will parse the prefixes and dynamically grab an
Analyzer based upon the prefix. In this case, the first input will be
processed using an English Analyzer, the second input will use a spanish
analyzer, and the third input will use both a German and French analyzer,
as defined when the field is defined in the schema.xml:

<fieldType name="multiText"
           class="sia.ch14.MultiTextField" sortMissingLast="true"
           defaultFieldType="text_general"
           fieldMappings="en:text_english,
                          es:text_spanish,
                          fr:text_french,
                          de:text_german"/>

<field name="someMultiTextField" type="multiText" indexed="true"
       multiValued="true" />
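
At query time the same prefix convention applies to each term; a minimal
sketch against the field defined above (assuming you prefix every query term
just as you prefixed the indexed values) might look like:

q=someMultiTextField:(en|some OR en|text)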


If you want to automagically map separate fields into one of these dynamic
analyzer (MultiText) fields with prefixes, you could either pass the text
in multiple times from the client to the same field (with different
Analyzer prefixes each time like shown above), OR you could write an Update
Request Processor that does this for you. I don't think it is possible to
just have the copyField add in prefixes automatically for you, though
someone please correct me if I'm wrong.

If you implement an Update Request Processor, then inside it you would
simply grab the text from each of the relevant fields (i.e. author and
title fields) and then add that field's value to the named MultiText field
with the appropriate Analyzer prefix based upon each field. I made an
example Update Request Processor (see the previous github link and look for
MultiTextFieldLanguageIdentifierUpdateProcessor) that you could look at as
an example of how to supply different analyzer prefixes to different values
within a multiValued field, though you would obviously want to throw away
all the language detection stuff since it doesn't match your specific use
case.
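
For reference, wiring a processor like that into the update chain would look
roughly like the following in solrconfig.xml (the first factory class name
here is hypothetical; the Log and Run processors are the standard tail of a
chain):

<updateRequestProcessorChain name="multiTextPrefix">
  <processor class="com.example.PrependLanguagePrefixUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">multiTextPrefix</str>
  </lst>
</requestHandler>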

All that being said, this solution may end up being overly complicated for
your use case, so your idea of creating a custom analyzer to just handle
your example might be much less complicated. At any rate, that's the
specific answer to your specific question about whether it is possible to
utilize multiple Analyzers within a field based upon multiple inputs.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Thu, Apr 10, 2014 at 9:05 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 The lack of response to this question makes me think that either there is
 no good answer, or maybe the question was too obtuse.  So I'll give it one
 more go with some more detail ...

 My main goal is to implement autocompletion with a mix of words and short
 phrases, where the words are drawn from the text of largish documents, and
 the phrases are author names and document titles.

 I think the best way to accomplish this is to concoct a single field that
 contains data from these other source fields (as usual with copyField),
 but with some of the fields treated as keywords (ie with their values
 inserted as single tokens), and others tokenized.  I believe this would be
 possible at the Lucene level by calling Document.addField () with multiple
 fields having the same name: some marked as TOKENIZED and others not.  I
 think the tokenized fields would have to share the same analyzer, but
 that's OK for my case.

 I can't see how this could be made to happen in Solr without a lot of
 custom coding though. It seems as if the conversion from Solr fields to
 Lucene fields is not an easy thing to influence.  If anyone has an idea how
 to achieve the subgoal, or perhaps a different way of getting at the main
 goal, I'd love to hear about it.

 So far my only other idea is to write some kind of custom analyzer that
 treats short texts as keywords and tokenizes longer ones, which is probably
 what I'll look at if nothing else comes up.

 Thanks

 Mike



 On 4/9/2014 4:16 PM, Michael Sokolov wrote:

 I think I would like to do something like copyfield from a bunch of
 fields into a single field, but with different analysis for each source,
 and I'm pretty sure that's not a thing. Is there some alternate way to
 accomplish my goal?

 Which is to have a suggester that suggests

[ANN] Solr in Action book release (Solr 4.7)

2014-03-27 Thread Trey Grainger
I'm excited to announce the final print release of *Solr in Action*, the
newest Solr book by Manning publications covering through Solr 4.7 (the
current version). The book is available for immediate purchase in print and
ebook formats, and the *outline*, some *free chapters* as well as the *full
source code are also available* at http://solrinaction.com.

I would love it if you would check the book out, and I would also
appreciate your feedback on it, especially if you find the book to be a
useful guide as you are working with Solr! Timothy Potter and I (Trey
Grainger) worked tirelessly on the book for nearly 2 years to bring you a
thorough (664 pg.) and fantastic example-driven guide to the best Solr has
to offer.

*Solr in Action* is intentionally designed to be a learning guide as
opposed to a reference manual. It builds from an initial introduction to
Solr all the way to advanced topics such as implementing a predictive
search experience, writing your own Solr plugins for function queries and
multilingual text analysis, using Solr for big data analytics, and even
building your own Solr-based recommendation engine. The book uses fun
real-world examples, including analyzing the text of tweets, searching and
faceting on restaurants, grouping similar items in an ecommerce
application, highlighting interesting keywords in UFO sighting reports, and
even building a personalized job search experience.

For a more detailed write-up about the book and its contents, you can also
visit the Solr homepage at
https://lucene.apache.org/solr/books.html#solr-in-action. Thanks in advance
for checking it out, and I really hope many of you find the book to be
personally useful!

All the best,

Trey Grainger
Co-author, *Solr in Action*
Director of Engineering, Search & Analytics @ CareerBuilder


Re: [ANN] Solr in Action book release (Solr 4.7)

2014-03-27 Thread Trey Grainger
Hi Philippe,

Yes if you've purchased the eBook then the PDF is available now and the
other formats (ePub and Kindle) are supposed to be available for download
on April 8th.
It's also worth mentioning that the eBook formats are all available for
free with the purchase of the print book.

Best regards,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Thu, Mar 27, 2014 at 12:04 PM, Philippe Soares soa...@genomequest.com
wrote:

 Thanks Trey !
 I just tried to download my copy from my manning account, and this final
 version appears only in PDF format.
 Any idea about when they'll release the other formats ?


Re: Multiple Languages in Same Core

2014-03-27 Thread Trey Grainger
In addition to the two approaches Liu Bo mentioned (separate core per
language and separate field per language), it is also possible to put
multiple languages in a single field. This saves you the overhead of
multiple cores and of having to search across multiple fields at query
time. The idea here is that you can run multiple analyzers (i.e. one for
German, one for English, one for Chinese, etc.) and stack the outputted
TokenStreams for each of these within a single field. It is also possible
to swap out the languages you want to use on a case-by-case basis (i.e.
per-document, per field, or even per word) if you really need to for
advanced use cases.

All three of these methods, including code examples and the pros and cons
of each are discussed in the Multilingual Search chapter of Solr in Action,
which Alexandre referenced. If you don't have the book, you can also just
download and run the code examples for free, though they may be harder to
follow without the context from the book.

Thanks,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder





On Wed, Mar 26, 2014 at 4:34 AM, Liu Bo diabl...@gmail.com wrote:

 Hi Jeremy

 There're a lot of multi language discussions, two main approaches
  1. like yours, a language is one core
  2. all in one core, different language has it's own field.

 We have multi-language support in a single core; each multilingual field
 has its own suffix, such as name_en_US. We customized the query handler to hide
 the query details from the client.
 The main reason we want to do this is NRT indexing and search;
 take product for example:

 A product has price and quantity, which are common fields used for filtering
 and sorting, while name and description are multi-language fields.
 If we split a product across different cores, an update to a common field
 may end up requiring an update in all of the multi-language cores.

 As to scalability, we don't change Solr cores/collections when a new
 language is added, but we probably need to update our customized index process
 and run a full re-index.

 This approach suits our requirement for now, but you may have your own
 concerns.

 We have a similar suggest filtering problem to yours: we want to return
 suggest results filtered by store. I can't find a way to build the dictionary
 with a query in my version of Solr (4.6).

 What I do is run a query on an N-gram analyzed field with filter queries
 on the store_id field. The suggest is actually a query. It may not perform as
 well as the suggester but it can do the trick.

 You can try building an additional N-gram field for suggestions only and
 searching on it with an fq on your Locale field.

 All the best

 Liu Bo




 On 25 March 2014 09:15, Alexandre Rafalovitch arafa...@gmail.com wrote:

  Solr In Action has a significant discussion on the multi-lingual
  approach. They also have some code samples out there. Might be worth a
  look
 
  Regards,
 Alex.
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all
  at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
  book)
 
 
  On Tue, Mar 25, 2014 at 4:43 AM, Jeremy Thomerson
  jer...@thomersonfamily.com wrote:
   I recently deployed Solr to back the site search feature of a site I
 work
   on. The site itself is available in hundreds of languages. With the
  initial
   release of site search we have enabled the feature for ten of those
   languages. This is distributed across eight cores, with two Chinese
   languages plus Korean combined into one CJK core and each of the other
   seven languages in their own individual cores. The reason for splitting
   these into separate cores was so that we could have the same field
 names
   across all cores but have different configuration for analyzers, etc,
 per
   core.
  
   Now I have some questions on this approach.
  
   1) Scalability: Considering I need to scale this to many dozens more
   languages, perhaps hundreds more, is there a better way so that I don't
  end
   up needing dozens or hundreds of cores? My initial plan was that many
   languages that didn't have special support within Solr would simply get
   lumped into a single default core that has some default analyzers
 that
   are applicable to the majority of languages.
  
   1b) Related to this: is there a practical limit to the number of cores
  that
   can be run on one instance of Lucene?
  
   2) Auto Suggest: In phase two I intend to add auto-suggestions as a
 user
   types a query. In reviewing how this is implemented and how the
  suggestion
   dictionary is built I have concerns. If I have more than one language
 in
  a
   single core (and I keep the same field name for suggestions on all
   languages within a core) then it seems that I could get suggestions
 from
   another language returned with a suggest query. Is there a way to
 build a
   separate dictionary

Re: analyzer with multiple stem-filters for more languages

2014-03-14 Thread Trey Grainger
I wouldn't recommend putting multiple stemmers in the same Analyzer. Like
Jack said, the second stemmer could take the results of the first stemmer
and stem the stem, wreaking all kinds of havoc on the resulting terms.

Since the Stemmers replace the original word, running two of them in
sequence will mean the second stemmer never sees the original input in
cases where the first stemmer modified it. Also, many languages require
multiple different CharFilters and TokenFilters (some for accent
normalization, some for stopwords and/or synonyms, some for stemming,
etc.), so it will get VERY complicated trying to safely coordinate when
each token filter runs... probably impossible for many language
combinations.

What you CAN do, however, is define multiple language-specific Analyzers
and then invoke both Analyzers separately within your field, stacking the
resulting tokens from each Analyzer's outputted token stream according to
their position increments. Think of it as having sub-fields within a field,
where each sub-field has its own dedicated Analyzer.

Shameless plug: We cover how to do this (and provide the sample code) in
the Multilingual Search chapter of *Solr in Action*
(http://solrinaction.com), the new book from Manning Publications that is
about to be released within the next few days. The source code is all
publicly available, though, if you want to get an idea of how this works:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

Of course, if you want to take a simpler route, you can always just copy
your text to two separate fields (one per language) and then search across
them at query time using the eDisMax query parser. There are pros and cons
to both approaches.

All the best,

-Trey Grainger




On Fri, Mar 14, 2014 at 8:00 PM, Jack Krupansky j...@basetechnology.com wrote:

 You would have to carefully analyze the source code and tables of these
 two stemmers to determine if one might incorrectly stem words in the other
 language. Technically, that could be fine for indexing, but it might give
 users some unexpected results for queries. There might also be cases where
 the second stemmer would stem a term that was already stemmed by the first
 stemmer.

 You could avoid the latter issue by using the duplicate token technique.
 For a single stemmer this is generally:

 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.KeywordRepeatFilterFactory"/>
 <filter class="solr.PorterStemFilterFactory"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

 For two (or more) languages:

 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.KeywordRepeatFilterFactory"/>
 <filter class="solr.PorterStemFilterFactory"/>
 <filter class="solr.SnowballPorterFilterFactory" language="German2" />
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

 This would produce the stemmed term for both languages, or either
 language, or neither, as the case may be.

 -- Jack Krupansky

 -Original Message- From: Croci Francesco Luigi (ID SWS)
 Sent: Friday, March 14, 2014 8:17 AM
 To: solr-user@lucene.apache.org
 Subject: analyzer with multiple stem-filters for more languages


 Is it possible to define an analyzer with more than one stem filter for
 multiple languages?

 Something like this:

 <analyzer type="index">
    ...
   <filter class="solr.PorterStemFilterFactory"/>  (default for English)
   <filter class="solr.SnowballPorterFilterFactory" language="German2" />
 </analyzer>

 Greetings
 Francesco



Re: Facet pivot and distributed search

2014-02-07 Thread Trey Grainger
FYI, the last distributed pivot facet patch functionally works, but there
are some sub-optimal data structures being used and some unnecessary
duplicate processing of values. As a result, we found that for certain
worst-case scenarios (i.e. data is not randomly distributed across Solr
cores and requires significant refinement) pivot facets with multiple
levels could take over a minute to aggregate and process results. This was
using a dataset of several hundred million documents and dozens of pivot
facets across 120 Solr cores distributed over 20 servers, so it is a more
extreme use-case than most will encounter.

Nevertheless, we've refactored the code and data structures and brought the
processing time from over a minute down to less than a second using the
above configuration. We plan to post the patch within the next week.


On Fri, Feb 7, 2014 at 3:08 AM, Geert Van Huychem ge...@iframeworx.be wrote:

 Thx!

 Geert Van Huychem
 IT Consultant
 iFrameWorx BVBA

 Mobile: +32 497 27 69 03
 E-mail: ge...@iframeworx.be
 Site: http://www.iframeworx.be
 LinkedIn: http://www.linkedin.com/in/geertvanhuychem


 On Fri, Feb 7, 2014 at 8:55 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  Yes this is a open issue.
 
  https://issues.apache.org/jira/browse/SOLR-2894
 
  On Fri, Feb 7, 2014 at 1:13 PM, Geert Van Huychem ge...@iframeworx.be
  wrote:
   Hi
  
   I'm using Solr 4.5 in a multi-core environment.
  
   I've setup
   - one core per documenttype: text, rss, tweet and external documents.
   - one distrib core which basically distributes the query to the 4 cores
   mentioned hereabove.
  
   Facet pivot works on each core individually, but when I send the exact
  same
   query to the distrib core, I get no results.
  
   Anyone? Bug? Open issue?
  
   Best
  
   Geert Van Huychem
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 



Re: Single multilingual field analyzed based on other field values

2013-12-19 Thread Trey Grainger
Hi Dave,

Sorry for the delayed reply.  Did you end up trying the (scary) caching
idea?

Yeah, there's no reasonable way today to access data from other fields from
the document in the analyzers.  Creating an update request processor which
pulls the data prior to the field-by-field analysis and injects it (in some
format) into the field that needs the data pulled from other fields is how
to do this today.

In my examples, I only inserted a prefix prior to the entire field (i.e.
en,es|hables espanol is what she asks), but if you need something more
complicated to identify specific sections of the field to use different
analyzers then you could pull that off, as well.  For example:
<field name="multilingual_field">[langs=en]hello world
[langs=en,es]hables espanol is what she asks.[autodetectOtherLangs=true fallbackLangs=en]some unknown language text
for identification</field>

Then, you would just have the analyzer for the field parse the content,
pass each chunk of text into the appropriate analyzer, and then modify the
term positions and offsets as necessary.  My example in chapter 14 of Solr
in Action assumed you would be using the same languages throughout the
whole field, but it would just require a little bit of pre-parsing work to
direct the use of specific analyers only for specific parts of the content.

Frankly, I'm not sure pulling the data from another field (particularly if
you want different sections processed with different languages) is going to
be much simpler than putting it all into the field to be analyzed to begin
with (or better yet having an update request processor do it for you -
including the detection of language boundaries - inside of Solr so the
customer doesn't have to worry about it).

-Trey


On Tue, Oct 29, 2013 at 12:18 PM, davetroiano dtroi...@basistech.com wrote:

 Hi Trey,

 I was reading v9 of the Solr in Action MEAP but browsing your github repo,
 so I think I'm looking at the latest stuff.

 Agreed that the thread caching idea is dangerous.  Perhaps it would work
 now, but it could easily break in a later version of Solr.

 I didn't mention another reason why I'd like to analyze based on other
 field
 values, which is that I'd like the ability to run analyzers on sub-sections
 of the MultiTextField.  e.g., given a multilingual document, run my
 text_english analyzer on the first half of a document and my text_french
 analyzer on the second half.  Of course, I could extend the prepend
 approach
 to take start and end offsets (e.g., <field
 name="myField">[en_0_1000,fr_1001_2500|]blah, blah, ...</field>), but if it
 were possible I'd rather grab that data from another field and simplify the
 tokenizer (in terms of the string manipulation and having to adjust
 position
 offsets to ignore the prepended data... though you've already done the
 tricky part).

 Based on what I'm seeing on the message boards and JIRA (e.g., SOLR-1536 /
 SOLR-1327 not being fixed), it seems like there isn't a clean way to run
 analyzers dynamically based on data in other field(s).  If I end up trying
 the caching idea, I'll report my findings here.

 Thanks,
 Dave






Re: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-12-12 Thread Trey Grainger
Hmm... haven't run into the case where null was returned in a multi-valued
scenario yet... I probably just haven't tested that case.  I likely need to
add a null check there - thanks for pointing it out.

-Trey


On Fri, Nov 29, 2013 at 6:10 AM, Müller, Stephan 
muel...@ponton-consulting.de wrote:

 Hello Trey, thank you for this example.

 We've solved it by omitting the multivalued field and passing the distinct
 string fields instead. Still, I'll go ahead with proposing a patch so the language
 processor is able to concatenate multiple values by default. I think it's a
 reasonable feature (and I can't remember ever having contributed a patch to
 an open source project).
 My thoughts on the patch implementation are quite the same as yours:
 iterating on getValues(). I'll have this discussed on the dev list and
 probably in JIRA.


 One thing: How do you guard against a possible NPE in line 129
  (final Object inputValue : inputField.getValues()) {

 SolrInputField.getValues() will return NULL if the associated value was
 null. It does not create an empty Collection.
 That, btw, seems to be a minor bug in the javadoc, not stating that this
 method returns null.


 Regards,
 Stephan - srm

 [...]

  The langsToPrepend variable above will contain a set of languages,
 where
  detectLanguage was called separately for each value in the multivalued
  field.  If you just want to concatenate all the values and detect
  languages once (as opposed to only using the first value in the
  multivalued field, like it does today), just concatenate each of the
 input
  values in the first loop and call detectLanguage once at the end.
 
  I wrote code that does this for an example in the Solr in Action book.
   The particular example was detecting languages for each value in a
  multivalued field and then pre-pending the language to the text for the
  multivalued field (so the analyzer would know which stemmer to use, as
  they were being dynamically substituted in based upon the language).  The
  code is available here if you are interested:
  https://github.com/treygrainger/solr-in-
 
 action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifier
  UpdateProcessor.java
 
  Good luck!
 
  -Trey
 
 
 
 
  On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan  Mueller@ponton-
  consulting.de wrote:
 
I suspect that it is an oversight for a use case that was not
  considered.
I mean, it should probably either ignore or convert non text/string
values.
   Ok, I'll see that I provide a patch against trunk. It actually ignores
   non string values, but is unable to check the remaining values of a
   multivalued field.
  
Hmmm... are you using JSON input? I mean, how are the types being
 set?
Solr XML doesn't have a way to set the value types.
   
   No. It's a field with multivalued=true. That results in a
   SolrInputField where value (which is defined to be Object) actually
  holds a List.
   This list is populated with Integer, String, Date, you name it.
   I'm talking about the actual Java-Datatypes. The values in the list
   are probably set by this 3rdparty Textbodyprocessor thingy.
  
    Now the Language processor just asks for field.getValue().
    This is delegated to the SolrInputField, which in turn calls
    firstValue(). Interestingly enough, it already is able to handle a
   Collection as its value.
    But if the value is a collection, it just returns the first element.
  
You could workaround it with an update processor that copied the
field
   and
massaged the multiple values into what you really want the language
detection to see. You could even implement that processor as a
JavaScript script with the stateless script update processor.
   
   Our workaround would be to not feed the multivalued field but only the
   String fields (which are also included in the multivalued field)
  
  
   Filing a Bug/Feature request and providing the patch will take some
   time as I haven't setup a fully working trunk in my IDEA installation.
   But I'm eager to do it :)
  
   Regards,
   Stephan
  
  
-- Jack Krupansky
   
-Original Message-
From: Müller, Stephan
Sent: Wednesday, November 27, 2013 5:02 AM
To: solr-user@lucene.apache.org
Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
multivalued fields
   
Hello,
   
this is a repost. This message was originally posted on the 'general'
   list
but it was suggested, that the 'user' list might be a better place
to
   ask.
   
 Original Message 
Hi,
   
we are passing a multivalued field to the
LanguageIdentifierUpdateProcessor.
This multivalued field contains arbitrary types (Integer, String,
  Date).
   
Now, the
LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
doc, String[] fields), which btw does not use the parameter fields,
is unable to parse all fields of the/a multivalued field. The call
Object content = 

Re: Function query matching

2013-12-02 Thread Trey Grainger
We're working on the same problem with the scale(query(...)) combination,
so I'd like to share a bit more information
that may be useful.

*On the scale function:*
Even though the scale query has to calculate the scores for all documents,
it is actually doing this work twice for each ValueSource (once to
calculate the min and max values, and then again when actually scoring the
documents), which is inefficient.

To solve the problem, we're in the process of putting a cache inside the
scale function to remember the values for each document when they are
initially computed (to find the min and max) so that the second pass can
just use the previously computed values for each document.  Our theory is
that most of the extra time due to the scale function is really just the
result of doing duplicate work.

No promises this won't be overly costly in terms of memory utilization, but
we'll see what we get in terms of speed improvements and will share the
code if it works out well.  Alternate implementation suggestions (or
criticism of a cache like this) are also welcomed.


*On the NoOp product function: scale(prod(1, query(...))):*
We do the same thing, which ultimately is just an unnecessary waste of a
loop through all documents to do an extra multiplication step.  I just
debugged the code and uncovered the problem.  There is a Map (called
context) that is passed through to each value source to store intermediate
state, and both the query and scale functions are passing the ValueSource
for the query function in as the KEY to this Map (as opposed to using some
composite key that makes sense in the current context).  Essentially, these
lines are overwriting each other:

Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
 //this.source refers to the QueryValueSource, and the scaleInfo refers to
a ScaleInfo object
Inside QueryValueSource: context.put(this, w); //this refers to the same
QueryValueSource from above, and the w refers to a Weight object

As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
the context Map, it unexpectedly pulls the Weight object out instead and
thus the invalid cast exception occurs.  The NoOp multiplication works
because it puts a different ValueSource between the query and the
ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
(in QueryValueSource).

This should be an easy fix.  I'll create a JIRA ticket to use better key
names in these functions and push up a patch.  This will eliminate the need
for the extra NoOp function.

-Trey


On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan peterlkee...@gmail.com wrote:

 I'm pursuing this possible PostFilter solution. I can see how to collect
 all the hits and recompute the scores in a PostFilter, after all the hits
 have been collected (for scaling). Now, I can't see how to get the custom
 doc/score values back into the main query's HitQueue. Any advice?

 Thanks,
 Peter


 On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Instead of using a function query, could I use the edismax query (plus
  some low cost filters not shown in the example) and implement the
  scale/sum/product computation in a PostFilter? Is the query's maxScore
  available there?
 
  Thanks,
  Peter
 
 
  On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
 
  Although the 'scale' is a big part of it, here's a closer breakdown. Here
  are 4 queries with increasing functions, and their response times (caching
  turned off in solrconfig):
 
  100 msec:
  select?q={!edismax v='news' qf='title^2 body'}
 
  135 msec:
  select?qq={!edismax v='news' qf='title^2
  body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}
 
  200 msec:
  select?qq={!edismax v='news' qf='title^2
 body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query
  v=$qq}
 
  320 msec:
   select?qq={!edismax v='news' qf='title^2
 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
  v=$qq}
 
  Btw, that no-op product is necessary, else you get this exception:
 
  org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
 org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo
 
  thanks,
 
  peter
 
 
 
  On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter 
  hossman_luc...@fucit.org wrote:
 
 
  : So, this query does just what I want, but it's typically 3 times
 slower
  : than the edismax query  without the functions:
 
  that's because the scale() function is inherently slow (it has to
  compute the min & max value for every document in order to know how to
  scale them)
 
  what you are seeing is the price you have to pay to get that query
 with a
  normalized 0-1 value.
 
  (you might be able to save a little bit of time by eliminating that
  no-Op multiply by 1: product(query($qq),1) ... but i doubt you'll
 even
  notice much of a change 

Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-11-28 Thread Trey Grainger
Yeah, the documentation is definitely wrong - it definitely doesn't
concatenate the values in a multivalued field, it only uses the first one
like you mentioned.

If you want to detect the language of each of the values in the
multi-valued field (as opposed to specifying multiple separate string
values), however, this is easy enough to accomplish by modifying the code
in the language detect update processor to loop through each of the values:

LinkedHashSet<String> langsToPrepend = new LinkedHashSet<String>();
for (final Object inputValue : inputField.getValues()) {
  Object outputValue = inputValue;
  List<DetectedLanguage> fieldValueLangs = null;
  if (inputValue instanceof String) {
    fieldValueLangs = this.detectLanguage(inputValue.toString());
  }

  for (DetectedLanguage lang : fieldValueLangs) {
    langsToPrepend.add(lang.getLangCode());
  }
}

The langsToPrepend variable above will contain a set of languages,
where detectLanguage was called separately for each value in the
multivalued field.  If you just want to concatenate all the values and
detect languages once (as opposed to only using the first value in the
multivalued field, like it does today), just concatenate each of the
input values in the first loop and call detectLanguage once at the
end.

I wrote code that does this for an example in the Solr in Action book.
 The particular example was detecting languages for each value in a
multivalued field and then pre-pending the language to the text for
the multivalued field (so the analyzer would know which stemmer to
use, as they were being dynamically substituted in based upon the
language).  The code is available here if you are interested:
https://github.com/treygrainger/solr-in-action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifierUpdateProcessor.java

Good luck!

-Trey




On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan 
muel...@ponton-consulting.de wrote:

  I suspect that it is an oversight for a use case that was not considered.
  I mean, it should probably either ignore or convert non text/string
  values.
 Ok, I'll see that I provide a patch against trunk. It actually
 ignores non string values, but is unable to check the remaining values
 of a multivalued field.

  Hmmm... are you using JSON input? I mean, how are the types being set?
  Solr XML doesn't have a way to set the value types.
 
 No. It's a field with multivalued=true. That results in a SolrInputField
 where value (which is defined to be Object) actually holds a List.
 This list is populated with Integer, String, Date, you name it.
 I'm talking about the actual Java-Datatypes. The values in the list are
 probably set by this 3rdparty Textbodyprocessor thingy.

 Now the Language processor just asks for field.getValue().
 This is delegated to the SolrInputField, which in turn calls firstValue().
 Interestingly enough, it already is able to handle a Collection as its value.
 But if the value is a collection, it just returns the first element.

  You could workaround it with an update processor that copied the field
 and
  massaged the multiple values into what you really want the language
  detection to see. You could even implement that processor as a JavaScript
  script with the stateless script update processor.
 
 Our workaround would be to not feed the multivalued field but only the
 String fields (which are also included in the multivalued field)


 Filing a Bug/Feature request and providing the patch will take some time
 as I haven't setup a fully working trunk in my IDEA installation.
 But I'm eager to do it :)

 Regards,
 Stephan


  -- Jack Krupansky
 
  -Original Message-
  From: Müller, Stephan
  Sent: Wednesday, November 27, 2013 5:02 AM
  To: solr-user@lucene.apache.org
  Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
  multivalued fields
 
  Hello,
 
  this is a repost. This message was originally posted on the 'general'
 list
  but it was suggested, that the 'user' list might be a better place to
 ask.
 
   Original Message 
  Hi,
 
  we are passing a multivalued field to the
  LanguageIdentifierUpdateProcessor.
  This multivalued field contains arbitrary types (Integer, String, Date).
 
  Now, the LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
  doc, String[] fields), which btw does not use the parameter fields, is
  unable to parse all fields of the/a multivalued field. The call Object
  content = doc.getFieldValue(fieldName); does not care what type the
 field
  is and just delegates to SolrInputDocument which in turn calls
  getFirstValue.
 
  So, two issues:
  First - if the first value of the multivalued field is not of type
 String,
  the field is ignored completely.
 
  Second - the concat method does not concat all values of a multivalued
  field.
 
  While http://www.mail-archive.com/solr-
  u...@lucene.apache.org/msg90530.html
  states: The feature is designed to detect exactly one 

Re: Single multilingual field analyzed based on other field values

2013-10-28 Thread Trey Grainger
Hi David,

What version of the Solr in Action MEAP are you looking at (current version
is 12, and version 13 is coming out later this week, and prior versions had
significant bugs in the code you are referencing)?  I added an update
processor in the most recent version that can do language identification
and prepend the language codes for you (even removing them from the stored
version of the field and only including them on the indexed version for
text analysis).

You could easily modify this update processor to read the value from the
language field and use it as the basis of the pre-pended languages.

Otherwise, if you want to do language detection instead of passing in the
language manually, MultiTextField in chapter 14 of Solr in Action and the
corresponding MultiTextFieldLanguageIdentifierUpdateProcessor should handle
all of the language detection and pre-pending automatically for you (and
also append the identified language to a separate field).

If it were easy/possible to have access to the rest of the fields in the
document from within a field's Analyzer then I would have certainly opted
for that approach instead of the whole pre-pending languages to content
option.  If it is too cumbersome, you could probably rewrite the
MultiTextField to pull the language from the field name instead of the
content (i.e. <field name="myField|en,fr">blah, blah</field> instead of
<field name="myField">en,fr|blah, blah</field> as currently designed).
 This would make specifying the language much easier (especially at query
time since you only have to specify the languages once instead of on each
term), and you could have Solr still search the same underlying field for
all languages.  Same general idea, though.

In terms of your ThreadLocal cache idea... that sounds really scary to me.
 The Analyzers' TokenStreamComponents are cached in a ThreadLocal context
depending upon the internal ReusePolicy, and I'm skeptical that you'll
be able to pull this off cleanly.  It would really be hacking around the
Lucene API's even if you were able to pull it off.

-Trey


On Mon, Oct 28, 2013 at 5:15 PM, Jack Krupansky j...@basetechnology.com wrote:

 Consider an update processor - it can operate on any field and has access
 to all fields.

 You could have one update processor to combine all the fields to process,
 into a temporary, dummy field. Then run a language detection update
 processor on the combined field. Then process the results and place in the
 desired field. And finally remove any temporary fields.

 -- Jack Krupansky
 -Original Message- From: David Anthony Troiano
 Sent: Monday, October 28, 2013 4:47 PM
 To: solr-user@lucene.apache.org
 Subject: Single multilingual field analyzed based on other field values


 Hello,

 First some background...

 I am indexing a multilingual document set where documents themselves can
 contain multiple languages.  The language(s) within my documents are known
 ahead of time.  I have tried separate fields per language, and due to the
 poor query performance I'm seeing with that approach (many languages /
 fields), I'm trying to create a single multilingual field.

 One approach to this problem is given in Section 14.6.4
 (https://docs.google.com/a/basistech.com/file/d/0B3NlE_uL0pqwR0hGV0M1QXBmZm8/edit)
 of the new Solr In Action book.  The approach is to take the document
 content field and prepend it with the list of contained languages followed by
 a special delimiter.
 sub field types, and the new type's tokenizer then runs all of the sub
 field type analyzers over the field and merges results, adjusts offsets for
 the prepended data, etc.

 Due to the tokenizer complexity incurred, I'd like to pursue a more
 flexible approach, which is to run the various language-specific analyzers
 not based on prepended codes, but instead based on other field values
 (i.e., a language field).

 I don't see a straightforward way to do this, mostly because a field
 analyzer doesn't have access to the rest of the document.  On the flip
 side, an UpdateRequestProcessor would have access to the document but
 doesn't really give a path to wind up where I want to be (single field with
 different analyzers run dynamically).

 Finally, my question: is it possible to thread cache document language(s)
 during UpdateRequestProcessor execution (where we have access to the full
 document), so that the analyzer can then read from the cache to determine
 which analyzer(s) to run?  More specifically, if a document is run through
 its URP chain on thread T, will its analyzer(s) also run on thread T and
 will no other documents be run through the URP on that thread in the
 interim?

 Thanks,
 Dave



Re: Getting a query parameter in a TokenFilter

2013-09-22 Thread Trey Grainger
Hi Isaac,

In the process of writing Solr in Action (http://solrinaction.com), I have
built the solution to SOLR-5053 for the multilingual search chapter (I
didn't realize this ticket existed at the time).  The solution was
something I called a MultiTextField.  Essentially, the field lets you
map a list of defined prefixes to field types and dynamically substitute
in one or more field types based upon the incoming content.

For example:

#schema.xml#
<fieldType name="multiText"
    class="sia.ch14.MultiTextField" sortMissingLast="true"
    defaultFieldType="text_general"
    fieldMappings="en:text_english,
                   es:text_spanish,
                   fr:text_french"/>

<fieldType name="text_english" ... />
<fieldType name="text_spanish" ... />
<fieldType name="text_french" ... />

<field name="content" type="multiText" indexed="true" ... />

#document#
<add><doc>
  <field name="id">1</field>
  <field name="content">en,es|the schools, la escuela</field>
</doc></add>

#Outputted Token Stream#:
[Position 1]   [Position 2]   [Position 3]   [Position 4]
the            school         la             escuela
               schools                       escuel

#query on two languages#
q=en,es|la OR en,es|escuela

Essentially, this MultiText field type lets you dynamically combine one or
more Analyzers (from a defined field type) and stack the tokens based upon
term positions within each independent Analyzer.  The use case here was
supporting multiple languages within a single field.

To answer your original question... at query time, this implementation
requires that you pass the prefix before EACH term in the query, not just
the first term (you can see this in the q= I demonstrated above).  If you
have a Token Filter you have developed, you could probably accomplish
what you are trying to do the same way.

You could write a custom QParserPlugin that would do this for you I think.
 Alternatively, it may be possible to create a similar implementation that
makes use of a dynamic field name (i.e. "content|en,fr" as the field
name), which would pull the prefix from the field name and apply it to all
tokens instead of requiring/allowing each token to specify its own prefix.
 I haven't done this in my implementation, but I could see where it might
be more user-friendly for many Solr users.
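
For anyone wanting to experiment with the QParserPlugin route, a rough sketch
follows (the class name and the "langs" parameter are hypothetical, and this
assumes a Solr 4.x API where QParser.parse() throws SyntaxError):

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

// Hypothetical parser plugin: reads a "langs" parameter (e.g. langs=en,es),
// prepends it to every whitespace-separated term, and delegates the rewritten
// string to the standard lucene parser so the MultiTextField sees a prefix on
// each token.
public class MultiTextQParserPlugin extends QParserPlugin {

  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        String langs = params.get("langs", "");
        String rewritten = qstr;
        if (!langs.isEmpty()) {
          StringBuilder sb = new StringBuilder();
          for (String term : qstr.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(langs).append('|').append(term);
          }
          rewritten = sb.toString();
        }
        // Delegate the rewritten query string to the standard lucene parser
        return QParser.getParser(rewritten, "lucene", req).getQuery();
      }
    };
  }
}

Registered under a name like "multitext" via a <queryParser> entry in
solrconfig.xml, a request such as q=la escuela&langs=en,es&defType=multitext
would be rewritten to en,es|la en,es|escuela before parsing.  (Operators like
OR would obviously need smarter handling than a plain whitespace split.)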

I'm just finishing up the multilingual search chapter and code now and
will be happy to post it to SOLR-5053 once I finish in the next few days if
this would be helpful to you.

-Trey


On Sat, Sep 21, 2013 at 4:15 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Thought about that again,
 We can do this work as a search component, manipulating the query string.
 The cons are the double QParser work, and the double tokenization work.

 Another approach which might solve this issue easily is Dynamic query
 analyze chain: https://issues.apache.org/jira/browse/SOLR-5053

 What would you do?


 On Tue, Sep 17, 2013 at 10:31 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:

  Hi everyone,
 
  We developed a TokenFilter.
   It should act differently, depending on a parameter supplied in the
   query (for the query chain only, not the index one, of course).
   We found no way to pass that parameter into the TokenFilter flow. I guess
   the root cause is that TokenFilter is a pure Lucene object.
 
  As a last resort, we tried to pass the parameter as the first term in the
  query text (q=...), and save it as a member of the TokenFilter instance.
 
  Although it is ugly, it might work fine.
   But, the problem is that it is not guaranteed that all the terms of a
   particular query will be analyzed by the same instance of a TokenFilter.
   In this case, some terms will be analyzed without the required information
   of that parameter. We can produce such a race very easily.
 
  How should I overcome this issue?
  Do anyone have a better resolution?
 



Re: Need help understanding the use cases behind core auto-discovery

2013-09-21 Thread Trey Grainger
While on this topic...

Is it still true in Solr 4.5 (RC) that it is not possible to have a shared
config directory?  In general, I like the new core.properties mechanism
better as it removes the unnecessary centralized configuration of cores in
solr.xml, but I have an infrastructure where I have thousands of Solr Cores
with the same configs on a single server, and as best I could tell with
Solr 4.4 the only way to support this in core.properties was to copy and
paste or create symbolic links for the whole conf/ folder for every core
(i.e. thousands of identical copies of all config files in my case).

In the old solr.xml format, we could set the instanceDir to have all cores
reference the same folder, but in core.properties there doesn't seem to be
anything like this.  I tried just referencing solrconfig.xml in another
directory, but because everything is now relative to the conf/ directory
under the folder containing core.properties, none of the referenced files
were in the right place.

Is there any better guidance on migrating to core autodiscovery with the
need for a shared config directory (non-SolrCloud mode)?  This looked
promising, but it sounds dead from Erick's JIRA comment:
https://issues.apache.org/jira/browse/SOLR-4478

Thanks,

-Trey


On Sat, Sep 21, 2013 at 2:25 PM, Erick Erickson erickerick...@gmail.com wrote:

 Also consider where SolrCloud is going. Trying to correctly maintain
 all the solr.xml files yourself on all the nodes would have
 been...interesting. On all the machines in your 200 node cluster.
 With 17 different collections. With nodes coming and going. With
 splitting shards. With.

 Collections are almost guaranteed to be distributed unevenly (e.g. a
 big collection might have 20 shards and a small collection 3 in the
 same cluster). So each node used to require solr.xml to be unique as
 far as everything in the cores tag. But everything  _not_ in the
 cores tags is common. Say you wanted to change the
 shardHandlerFactory (or any other setting we put in solr.xml that
 wouldn't have gone into the old cores tag). In the old-style way of
 doing things, since each solr.xml file on each node has potentially a
 different set of cores, you'd have to edit each and every one of them.

 The older way of doing this is fine as long as each solr.xml on each
 machine is self-consistent. So auto-discovery essentially automates
 that self-consistency.

 It also makes it possible to have Zookeeper manage your solr.xml and
 auto-distribute it to new nodes (or update existing) which would have
 taken a lot of effort to get right without auto-discovery. So changing
 the shardHandlerFactory consists of changing the solr.xml file and
 pushing it to ZooKeeper (don't quite remember the right JIRA, but you
 can do this now).

 I suppose it's like all other refactorings. Solr.xml had its origin
 in the single-core days, then when multi-cores came into being it was
 expanded to include that information, but eventually became, as Yonik
 says, unnecessary central configuration which started becoming a
 limitation.

 FWIW,
 Erick

 On Fri, Sep 20, 2013 at 9:45 AM, Timothy Potter thelabd...@gmail.com
 wrote:
  Exactly the insight I was looking for! Thanks Yonik ;-)
 
 
  On Fri, Sep 20, 2013 at 10:37 AM, Yonik Seeley yo...@lucidworks.com
 wrote:
 
  On Fri, Sep 20, 2013 at 11:56 AM, Timothy Potter thelabd...@gmail.com
  wrote:
    Trying to add some information about core.properties and auto-discovery
    in Solr in Action and am at a loss for what to tell the reader is the
    purpose of this feature.
 
  IMO, it was more a removal of unnecessary central configuration.
  You previously had to list the core in solr.xml, and now you don't.
  Cores should be fully self-describing so that it should be easy to
  move them in the future just by moving the core directory (although
  that may not yet work...)
 
  -Yonik
  http://lucidworks.com
 
    Can anyone point me to any background information about core
    auto-discovery? I'm not interested in the technical implementation details.
    Mainly I'm trying to understand the motivation behind having this feature
    as it seems unnecessary with the Core Admin API. Best I can tell is it
    removes a manual step of firing off a call to the Core Admin API or loading
    a core from the Admin UI. If that's it and I'm overthinking it, then cool
    but was expecting more of an ah-ha moment with this feature ;-)
  
   Any insights you can share are appreciated.
  
   Thanks.
   Tim
 



Re: [ANNOUNCE] Solr wiki editing change

2013-03-30 Thread Trey Grainger
Please add TreyGrainger to the contributors group.  Thanks!

-Trey


On Sun, Mar 24, 2013 at 11:18 PM, Steve Rowe sar...@gmail.com wrote:

 The wiki at http://wiki.apache.org/solr/ has come under attack by
 spammers more frequently of late, so the PMC has decided to lock it down in
 an attempt to reduce the work involved in tracking and removing spam.

 From now on, only people who appear on
 http://wiki.apache.org/solr/ContributorsGroup will be able to
 create/modify/delete wiki pages.

 Please request either on the solr-user@lucene.apache.org or on
 d...@lucene.apache.org to have your wiki username added to the
 ContributorsGroup page - this is a one-time step.

 Steve
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Can I invert the inverted index?

2011-07-05 Thread Trey Grainger
Gabriele,

I created a patch that does this about a year ago.  See
https://issues.apache.org/jira/browse/SOLR-1837.  It was written for Solr
1.4 and is based upon the Document Reconstructor in Luke.  The patch adds a
link to the main solr admin page to a docinspector page which will
reconstruct the document given a uniqueid (required).  Keep in mind that
you're only looking at what's in the index for non-stored fields, not the
original text.

If you have any issues using this on the most recent release, let me know
and I'd be happy to create a new patch for Solr 3.3.  One of these days I'll
remove the JSP dependency and this may eventually make it into trunk.

Thanks,

-Trey Grainger
Search Technology Development Team Lead, Careerbuilder.com
Site Architect, Celiaccess.com


On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout
gabri...@mysimpatico.com wrote:

 Hello,

 With an inverted index the term is the key, and the documents are the
 values. Is it still however possible that given a document id I get the
 terms indexed for that document?

 --
 Regards,
 K. Gabriele

 --- unchanged since 20/9/10 ---
 P.S. If the subject contains [LON] or the addressee acknowledges the
 receipt within 48 hours then I don't resend the email.
 subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
 time(x) < Now + 48h) ⇒ ¬resend(I, this).

 If an email is sent by a sender that is not a trusted contact or the email
 does not contain a valid code then the email is not received. A valid code
 starts with a hyphen and ends with X.
 ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
 L(-[a-z]+[0-9]X)).



Re: Indexes in ramdisk don't show performance improvement?

2011-06-02 Thread Trey Grainger
Linux will cache the open index files in RAM (in the filesystem cache)
after their first read which makes the ram disk generally useless.
Unless you're processing other files on the box with a size greater
than your total unused RAM (and thus need to micro-manage what stays
in RAM), I wouldn't recommend using a ramdisk - it's just more to
manage.  If you reboot the box and run a few searches, those first few
will likely be slower until all the index files are cached in memory.
After that point, the performance should be comparable because all
files are read out of RAM from that point forward.

If solr caches are enabled and your queries are repetitive then that
could also be contributing to the speed of repetitive queries.  Note
that the above advice assumes your total unused ram (not allocated to
the JVM or any other processes) is greater than the size of your
lucene index files, which should be a safe assumption considering
you're trying to put the whole index in a ramdisk.

-Trey


On Thu, Jun 2, 2011 at 7:15 PM, Erick Erickson erickerick...@gmail.com wrote:
 What I expect is happening is that the Solr caches are effectively making the
 two tests identical, using memory to hold the vital parts of the code in both
 cases (after disk warming on the instance using the local disk). I suspect if
 you measured the first few queries (assuming no auto-warming) you'd see the
 local disk version be slower.

 Were you running these tests for curiosity or is running from /dev/shm
 something you're considering for production?

 Best
 Erick

 On Thu, Jun 2, 2011 at 5:47 PM, Parker Johnson parker_john...@gap.com wrote:

 Hey everyone.

 Been doing some load testing over the past few days. I've been throwing a
 good bit of load at an instance of solr and have been measuring response
 time.  We're running a variety of different keyword searches to keep
 solr's cache on its toes.

 I'm running two identical load testing scenarios: one with indexes
 residing in /dev/shm and the other from local disk.  The indexes are about
 4.5GB in size.

 On both tests the response times are the same.  I wasn't expecting that.
 I do see the java heap size grow when indexes are served from disk (which
 is expected).  When the indexes are served out of /dev/shm, the java heap
 stays small.

 So in general is this consistent behavior?  I don't really see the
 advantage of serving indexes from /dev/shm.  When the indexes are being
 served out of ramdisk, is the linux kernel or the memory mapper doing
 something tricky behind the scenes to use ramdisk in lieu of the java heap?

 For what it is worth, we are running x86_64 RHEL 5.4 on a 12-core 2.27GHz Xeon
 system with 48GB ram.

 Thoughts?

 -Park






Re: Apache Spam Filter Blocking Messages

2011-04-21 Thread Trey Grainger
Good to know; I'll go change those settings, then.  Thanks for the feedback.

-Trey


On Thu, Apr 21, 2011 at 4:42 AM, Em mailformailingli...@yahoo.de wrote:

 This really helps on the mailing lists.
 If you send your mails with Thunderbird, be sure to check that you enforce
 plain-text emails. If not, it will often send HTML mails.

 Regards,
 Em


 Marvin Humphrey wrote:
 
  On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote:
  (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
                              
  Note the HTML_MESSAGE in the list of things SpamAssassin didn't like.
 
  Apparently I sound like spam when I write perfectly good English and
  include
  some xml and a link to a jira ticket in my e-mail (I tried a couple
  different variations).  Anyone know a way around this filter, or should I
  just respond to those involved in the e-mail chain directly and avoid the
  mailing list?
 
  Send plain text email instead of HTML.  That solves the problem 99% of the
  time.
 
  Marvin Humphrey
 


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: old searchers not closing after optimize or replication

2011-04-21 Thread Trey Grainger
Hey Bernd,

Check out https://issues.apache.org/jira/browse/SOLR-2469.  There is a
pretty bad bug in Solr 3.1 which occurs if you have <str
name="replicateAfter">startup</str> set in your replication
configuration in solrconfig.xml.  See the thread between Yonik and
myself from a few days ago titled "Solr 3.1: Old Index Files Not
Removed on Optimize".

You can disable startup replication and perform an optimize to see if
this fixes your problem of old index files being left behind (though
you may have some old index files left behind from before this change
that you still need to clean-up).  Yonik has already pushed up a patch
into the 3x branch and trunk for this issue.  I can confirm that
applying the patch (or just removing startup replication) resolved the
issue for us.

Do you think this is your issue?

Thanks,

-Trey



On Thu, Apr 21, 2011 at 2:27 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 Hi Erik,

 <deletionPolicy class="solr.SolrDeletionPolicy">
   <str name="maxCommitsToKeep">1</str>
   <str name="maxOptimizedCommitsToKeep">0</str>
 </deletionPolicy>

 Due to 44 minutes optimization time we do an optimization once a day
 during the night.

 I will try with an smaler index on my development system.

 Best regards,
 Bernd


 Am 20.04.2011 17:50, schrieb Erick Erickson:

 It looks OK, but still doesn't explain keeping the old files around. What
 does your <deletionPolicy> in your solrconfig.xml look like? It's
 possible that you're seeing Solr attempt to keep around several
 optimized copies of the index, but that still doesn't explain why
 restarting Solr removes them unless the deletionPolicy gets invoked
 at some point and your index files are aging out (I don't know the
 internals of deletion well enough to say).

 About optimization. It's become less important with recent code. Once
 upon a time, it made a substantial difference in search speed. More
 recently, it has very little impact on search speed, and is used
 much more sparingly. Its greatest benefit is reclaiming unused resources
 left over from deleted documents. So you might want to avoid the pain
 of optimizing (44 minutes!) and only optimize rarely or if you have
 deleted a lot of documents.

 It might be worthwhile to try (with a smaller index!) a bunch of optimize
 cycles and see if the <deletionPolicy> idea has any merit. I'd expect
 your index to reach a maximum size and stay there once the number of saved
 copies of the index was reached...

 But otherwise I'm puzzled...

 Erick

 On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de  wrote:

 Hi Erik,

 Am 20.04.2011 15:42, schrieb Erick Erickson:

 Hmmm, this isn't right. You've pretty much eliminated the obvious
 things. What does lsof show? I'm assuming it shows the files are
 being held open by your Solr instance, but it's worth checking.

 Just commited new content 3 times and finally optimized.
 Again having old index files left.

 Then checked on my master, only the newest version of index files are
 listed with lsof. No file handles to the old index files but the
 old index files remain in data/index/.
 That's strange.

 This time replication worked fine and cleaned up old index on slaves.


 I'm not getting the same behavior, admittedly on a Windows box.
 The only other thing I can think of is that you have a query that's
 somehow never ending, but that's grasping at straws.

 Do your log files show anything interesting?

 Let's see:
 - it has the old generation (generation=12) and its files
 - and recognizes that there have been several commits (generation=18)

 20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start

 commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=2


  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm,
 _3xm.fdx, segment
 s_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]


  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm,
 _3xo.tis, _3xp.pr
 x, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx,
 _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx,
 _3xn.fdt, _3x
 p.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii,
 _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis,
 _3xo.fdt, _3xp.fr
 q, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii,
 _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx,
 _3xs.tis, _3x
 m.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm,
 _3xr.fdt]
 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
 INFO: newest commit = 1302159868447


 - after 44 minutes of optimizing (over 140GB and 27.8 million docs) it gets
  the SolrDeletionPolicy onCommit and 

Apache Spam Filter Blocking Messages

2011-04-20 Thread Trey Grainger
Hey (solr-user) Mailing list admin's,

I've tried replying to a thread multiple times tonight, and keep getting a
bounce-back with this response:
Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other server
returned was: 552 552 spam score (5.1) exceeded threshold
(FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
(state 18).

Apparently I sound like spam when I write perfectly good English and include
some xml and a link to a jira ticket in my e-mail (I tried a couple
different variations).  Anyone know a way around this filter, or should I
just respond to those involved in the e-mail chain directly and avoid the
mailing list?

Thanks,

-Trey


Solr 3.1: Old Index Files Not Removed on Optimize?

2011-04-15 Thread Trey Grainger
I was just hoping someone might be able to point me in the right direction
here.  We just upgraded from Solr 1.4 to Solr 3.1 this past week and we're
having issues running out of disk space on our Master servers.  Our Master
has dozens of cores.  We have a script that kicks off once per day to do a
rolling optimize.  The script optimizes a single core, waits 5 minutes to
give the server some breathing room to catch up on indexing in a non-i/o
intensive state, and then moves onto the next core (repeating until done).

The problem we are facing is that under Solr 1.4, the old index files were
deleted very quickly after each optimize, but under Solr 3.1, the old index
files hang around for hours... in many cases they don't disappear until we
restart Solr completely.  This is leading to us running out of disk space,
as each core's index doubles in size during the optimize process and stays
that way until the next solr restart.

I was just wondering if anyone could point me to some specific changes or
settings which may be leading to the difference between solr versions (or
any other environmental issues you may know about).  I see several tickets
in Jira about similar issues, but they mostly appear to have been resolved
in the past.

Has anyone else seen this behavior under Solr 3.1, or do you think we may be
missing some kind of new configuration setting?

For reference, we are running on 64bit RedHat Linux.  This is what I have
right now: [From SolrConfig.xml]:
<reopenReaders>true</reopenReaders>

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="replicateAfter">startup</str>
  </lst>
</requestHandler>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10</maxDocs>
    <maxTime>30</maxTime>
  </autoCommit>
</updateHandler>

<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="keepOptimizedOnly">false</str>
  <str name="maxCommitsToKeep">1</str>
</deletionPolicy>


Thanks in advance,

-Trey


Re: Solr 3.1: Old Index Files Not Removed on Optimize?

2011-04-15 Thread Trey Grainger
Thank you, Yonik!

I see the Jira issue you created and am guessing it's due to this issue.
 We're going to remove replicateAfter="startup" in the meantime to see if
that helps (assuming this is the issue the jira ticket described).

I appreciate you taking a look at this.

Thanks

-Trey


On Fri, Apr 15, 2011 at 2:58 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 I can reproduce this with the example server w/ your deletionPolicy
 and replicationHandler configs.
 I'll dig further to see what's behind this behavior.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

 On Fri, Apr 15, 2011 at 1:14 PM, Trey Grainger solrt...@gmail.com wrote:
  I was just hoping someone might be able to point me in the right direction
  here.  We just upgraded from Solr 1.4 to Solr 3.1 this past week and we're
  having issues running out of disk space on our Master servers.  Our Master
  has dozens of cores.  We have a script that kicks off once per day to do a
  rolling optimize.  The script optimizes a single core, waits 5 minutes to
  give the server some breathing room to catch up on indexing in a non-i/o
  intensive state, and then moves onto the next core (repeating until done).

  The problem we are facing is that under Solr 1.4, the old index files were
  deleted very quickly after each optimize, but under Solr 3.1, the old index
  files hang around for hours... in many cases they don't disappear until we
  restart Solr completely.  This is leading to us running out of disk space,
  as each core's index doubles in size during the optimize process and stays
  that way until the next solr restart.

  I was just wondering if anyone could point me to some specific changes or
  settings which may be leading to the difference between solr versions (or
  any other environmental issues you may know about).  I see several tickets
  in Jira about similar issues, but they mostly appear to have been resolved
  in the past.

  Has anyone else seen this behavior under Solr 3.1, or do you think we may be
  missing some kind of new configuration setting?
 
  For reference, we are running on 64bit RedHat Linux.  This is what I have
  right now: [From SolrConfig.xml]:
  <reopenReaders>true</reopenReaders>

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">optimize</str>
      <str name="replicateAfter">startup</str>
    </lst>
  </requestHandler>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10</maxDocs>
      <maxTime>30</maxTime>
    </autoCommit>
  </updateHandler>

  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="keepOptimizedOnly">false</str>
    <str name="maxCommitsToKeep">1</str>
  </deletionPolicy>
 
 
  Thanks in advance,
 
  -Trey
 



Re: Luke browser does not show non-String Solr fields?

2010-05-31 Thread Trey Grainger
I submitted a patch a few months back for a Solr Document Inspector which
allows one to see the indexed values for any document in a Solr index (
https://issues.apache.org/jira/browse/SOLR-1837). This is more or less a
port of Luke's DocumentReconstructor into Solr, but the tool additionally
has access to all the solr schema/field type information for display
purposes (i.e. Trie Fields are human-readable).

This won't help you search for values in an index or inspect anything at a
macro level (i.e. term counts across the index), but there are other tools
in Solr for that.  Given a UniqueID, however, you can view all the indexed
values for each field in that particular document.  You can always do a
search within Solr for the values you are looking for and then use this tool
to view the indexed values for any documents which match.

This may or may not help you (I can't tell what problem you are trying to
solve), but I thought it would be worth mentioning as one tool in your
toolbox.

-Trey







Re: resetting stats

2010-03-31 Thread Trey Grainger
: reloading the core just to reset the stats definitely seems like throwing
: out the baby with the bathwater.

Agreed about throwing out the baby with the bath water - if stats need to be
reset, though, then that's the only way today.  A reset stats button would
be a nice way to prevent having to do this.

: Huh? ... how would having an extra core (with no data) help you with
: getting aggregate stats from your request handlers?

Say I have 3 cores named core0, core1, and core2, where only core1 and core2
have documents and caches.  If all my searches hit core0, and core0 shards
out to core1 and core2, then the stats from core0 would be accurate for
errors, timeouts, totalTime, avgTimePerRequest, avgRequestsPerSecond, etc.
Obviously this is based upon the following two assumptions: 1) The request
handlers you are using/monitoring are distributed aware, and 2) you are
using distributed search and all your queries are going to an aggregating
core.

I'm not suggesting that anyone needs a setup like this, just pointing out
that this type of setup somewhat avoids throwing the baby out with the bath
water by not putting a baby in the bath water that is going to be thrown out
(core0).


On Wed, Mar 31, 2010 at 6:40 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : You can reload the core on which you want to reset the stats - this lets you
 : keep the engine up and running without requiring you restart Solr.  If you

 reloading the core just to reset the stats definitely seems like throwing
 out the baby with the bathwater.

 : have a separate core for aggregating (i.e. a core that contains no data and
 : has no caches) then the overhead for reloading that core is negligible and
 : the time to reload is essentially zero.

 Huh? ... how would having an extra core (with no data) help you with
 getting aggregate stats from your request handlers?  If you want to know
 the avgTImePerRequest from handlerA, that numberisn't going to be useful
 if it comes from a core that isn't what your users are querying
 against

 :  : Is there a way to reset the stats counters? For example in the Query handler
 :  : avgTimePerRequest is not much use after a while as it is an avg since the
 :  : server started.


 -Hoss




Re: resetting stats

2010-03-30 Thread Trey Grainger
You can reload the core on which you want to reset the stats - this lets you
keep the engine up and running without requiring you to restart Solr.  If you
have a separate core for aggregating (i.e. a core that contains no data and
has no caches) then the overhead for reloading that core is negligible and
the time to reload is essentially zero.
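
For reference, the reload can be triggered either through the Core Admin HTTP
API (e.g. /solr/admin/cores?action=RELOAD&core=core0) or via SolrJ; here is a
minimal sketch, assuming the CoreAdminRequest helper your SolrJ version
provides and placeholder URL/core names:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

// Hypothetical example: reload a core to reset its request handler stats
// without restarting the whole Solr instance.
public class ReloadCoreExample {
  public static void main(String[] args) throws Exception {
    // Point at the top-level Solr URL, not at an individual core
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    CoreAdminRequest.reloadCore("core0", server);
  }
}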

The primary disadvantage of the core reloading approach is that your warmed
caches are dropped (if you are using caches on that core), but as long as
you have good warmup queries you should be okay, provided the reload isn't
constant.

-Trey

On Tue, Mar 30, 2010 at 8:10 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Is there a way to reset the stats counters? For example in the Query handler
 : avgTimePerRequest is not much use after a while as it is an avg since the
 : server started.

 not at the moment ... but it would probably be fairly straightforward to
 add as a new option if you want to file a Jira issue (and maybe take a
 crack at a patch)



 -Hoss