Denormalization and data retrieval

2016-04-18 Thread Bastien Latard - MDPI AG

Hi,

What's the correct way to create index(es) using denormalization?

1. something like that?
   
   
   
   

OR even:
   


2. OR a different index for each SQL table?
 -> if yes, how can I then retrieve all the needed data (i.e. the 
intersection)? ... JOIN / streaming expressions?


I have more than 68 million articles, which are all linked to 1 
journal and 1 publisher... And I have 8 different services requesting the 
data (so I cannot really provide a specific use case; I'd like a 
more general answer).


But in general, would it be better/faster to query:
- a single denormalized index with all the data in one place (a larger 
index, because of the duplicated data)

- several indexes (smaller indexes, but needing a Solr "join")

I got good tips about using 'Streaming expressions' & 'Parallel SQL 
interface', but I first want to know the best way to store the data.


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/
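For readers comparing the two layouts Bastien describes, a rough sketch of the difference (field names are hypothetical, not from MDPI's actual schema):

```python
# Option 1: one denormalized index -- journal/publisher fields are
# copied onto every article document (bigger index, no joins needed).
denormalized_article = {
    "id": "article_1",
    "title": "Some article",
    "journal_name": "Journal A",      # duplicated across all of the
    "publisher_name": "Publisher X",  # journal's sibling articles
}

# Option 2: separate indexes/collections per entity (smaller indexes,
# but a query touching both sides needs a join, e.g. a streaming
# expression or {!join}).
article = {"id": "article_1", "title": "Some article", "journal_id": "j1"}
journal = {"id": "j1", "name": "Journal A", "publisher_id": "p1"}
publisher = {"id": "p1", "name": "Publisher X"}
```

With 68M articles, option 1 trades disk and reindex cost for simple, fast single-index queries; option 2 keeps writes small but pushes the intersection work to query time.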



Re: add field requires collection reload

2016-04-18 Thread Hendrik Haddorp
Thanks, I knew I had seen a bug like this somewhere but could not find
it yesterday.

In yesterday's test run I actually had only one node and still got this
problem. So I'll keep the collection reload until switching to 6.1 then.

On 19/04/16 01:51, Erick Erickson wrote:
> The key here is you say "sometimes". It takes a while for the reload
> operation to propagate to _all_ the replicas that make up your
> collection. My bet is that by immediately indexing after changing the
> data, your updates are getting to a core that hasn't reloaded yet.
>
> That said, https://issues.apache.org/jira/browse/SOLR-8662 addresses
> this very issue I believe, but it's in 6.1
>
> Best,
> Erick
>
> On Mon, Apr 18, 2016 at 1:34 PM, Hendrik Haddorp
>  wrote:
>> Hi,
>>
>> I'm using SolrCloud 6.0 with a managed schema. When I add fields using
>> SolrJ and immediately afterwards try to index data I sometimes get an
>> error telling me that a field that I just added does not exist. If I do
>> an explicit collection reload after the schema modification things seem
>> to work. Is that working as designed?
>>
>> According to https://cwiki.apache.org/confluence/display/solr/Schema+API
>> a core reload will happen automatically when using the schema API: "When
>> modifying the schema with the API, a core reload will automatically
>> occur in order for the changes to be available immediately for documents
>> indexed thereafter."
>>
>> regards,
>> Hendrik
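Until the SOLR-8662 fix is available, one workaround besides the explicit reload is to poll until the new field is visible before indexing. A minimal sketch of that pattern -- `field_exists` is a hypothetical callable standing in for whatever schema check your client performs (e.g. fetching /schema/fields on each replica):

```python
import time

def wait_for_field(field_exists, field, timeout=30.0, interval=0.5):
    """Poll `field_exists(field)` until it returns True or we time out.

    In practice `field_exists` would query the schema on every replica;
    here it is just a callable so the retry pattern is visible.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if field_exists(field):
            return True
        time.sleep(interval)
    return False

# Example with a stub that "propagates" after two failed checks:
calls = {"n": 0}
def stub_exists(field):
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for_field(stub_exists, "new_field", timeout=5.0, interval=0.01)
```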



Re: Overall large size in Solr across collections

2016-04-18 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for your explanation.

I have set my segment size to 20GB under the TieredMergePolicy

<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
<int name="maxMergedSegmentMB">20480</int>

Does it mean that segment merging will occur more often, as it will
need to keep merging during indexing till it reaches 20GB?

I do have 192GB of RAM on my server which Solr is running on.

Regards,
Edwin


On 18 April 2016 at 21:35, Shawn Heisey  wrote:

> On 4/18/2016 4:22 AM, Zheng Lin Edwin Yeo wrote:
> > I have many collections in Solr, but with only 1 shard. I found that the
> > index size across all the collections has passed the 1TB mark. Currently
> > the query speed is still normal, but the indexing speed seems to
> > become slower.
> >
> > Will it affect the performance if I continue to increase the index size
> but
> > stick to 1 shard?
>
> I have noticed overall *bulk* indexing speed slows down as the index
> gets bigger, but I suspect that a big part of the reason this happens is
> segment merging involves more *large* segments, tying up I/O resources.
>
> The amount of time required to index a small number of documents should
> not be affected much by index size, but something that is likely to take
> longer with a large index is the *commit* operation -- especially if
> Solr's caches are configured to autowarm.
>
> Running the index on SSD, or on a RAID10 volume with a lot of regular
> disks, can greatly speed up indexing.  The parity-based RAID levels
> (primarily 5 and 6) have a fairly severe write penalty, so I do not
> recommend them for Solr, unless indexing happens infrequently.
>
> Installing plenty of memory is very helpful for *query* speed, but it
> can also *indirectly* speed up indexing.  If the disk is not busy when
> queries are happening, there's more I/O bandwidth available for writes.
>
> Thanks,
> Shawn
>
>


Re: Not seeing the tokenized values when using solr.PathHierarchyTokenizerFactory

2016-04-18 Thread Mark Robinson
Thanks much, Erick!
Got it.

Best,
Mark.

On Mon, Apr 18, 2016 at 7:53 PM, Erick Erickson 
wrote:

> Assuming that you're talking about the docs returned in the result
> sets, these are the _stored_ fields, not the analyzed field. Stored
> fields are a verbatim copy of the original input.
>
> Best,
> Erick
>
> On Mon, Apr 18, 2016 at 12:51 PM, Mark Robinson 
> wrote:
> > Hi,
> >
> > I was using the solr.PathHierarchyTokenizerFactory for a field say
> fieldB.
> > An input data like A/B/C when I check using the ANALYSIS facility in the
> > admin UI, is tokenized as A, A/B, A/B/C in fieldB.
> > A/B/C in my system is a "string" value in a fieldA which is both
> > indexed=stored=true. I copyField fieldA to fieldB which has the above
> > solr.PathHierarchyTokenizerFactory.
> >
> > fieldB also has indexed=stored=true as well as multiValued=true.
> >
> > Even then, when results are displayed, fieldB shows only the original
> > A/B/C, i.e. the same as what is in fieldA.
> > But ANALYSIS as mentioned above shows all the different hierarchies for
> > fieldB. Also a query like fieldA:"A/B" yields no results but
> > fieldB:"A/B" gives results as fieldB has all the hierarchies in it.
> >
> > But then why can't I see all the different hierarchies in my result for
> > fieldB as I clearly see when I check through the ANALYSIS in admin UI?
> >
> > Could someone please help me understand this behavior.
> >
> > Thanks!
> > Mark.
>


Re: Querying of multiple string value

2016-04-18 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Regarding the terms query parser, is it possible to search for records that
are not in the list?

In the normal OR parameters, I can do something like
http://localhost:8983/solr/collection1/highlight?q=!id:collection1_0001
 OR
!id:collection1_0002

For this query, all the records will be returned, except for ID 0001 and
0002.

However, I can't find a way that works for this in the terms query parser.

Regards,
Edwin



On 14 April 2016 at 11:20, Zheng Lin Edwin Yeo  wrote:

> Hi Shawn,
>
> Thanks for the reply. It works.
>
> Regards,
> Edwin
>
>
> On 14 April 2016 at 01:40, Shawn Heisey  wrote:
>
>> On 4/13/2016 9:25 AM, Zheng Lin Edwin Yeo wrote:
>> > Would like to find out, is there any way to do a multiple value query
>> of a
>> > field that is of type String, besides using the OR parameters?
>> >
>> > Currently, I am using the OR parameters like
>> > http://localhost:8983/solr/collection1/highlight?q=id:collection1_0001
>> OR
>> > id:collection1_0002
>> >
>> > But this will get longer and longer if I were to have many records to
>> > retrieve based on their ID. The fieldType is string, so it is not
>> possible
>> > to do things like sorting, more than or less than.
>> >
>> > I'm using Solr 5.4.0
>>
>> The terms query parser was added in 4.10, and would do what you need:
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
>>
>> Thanks,
>> Shawn
>>
>>
>
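For reference, the two query styles being compared can be sketched as parameter-building code. The exclusion form at the end is an assumption worth verifying on your Solr version, since local params are normally only recognized at the start of a parameter value, so the negation is wrapped in a nested `_query_` clause:

```python
from urllib.parse import urlencode

ids = ["collection1_0001", "collection1_0002"]

# Plain OR form from the mail (grows with the number of IDs):
or_fq = " OR ".join(f"id:{i}" for i in ids)

# Terms query parser form (positive match, compact for long ID lists):
terms_fq = "{!terms f=id}" + ",".join(ids)

# One way to *exclude* the list instead: negate a nested terms query
# against a match-all. (Sketch only -- check this against your version.)
not_terms_fq = '*:* -_query_:"{!terms f=id}' + ",".join(ids) + '"'

params = urlencode({"q": "*:*", "fq": not_terms_fq})
```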


Re: block join rollups

2016-04-18 Thread Nick Vasilyev
Hi Yonik,

Well, no one has replied to this yet, so I thought I'd chime in with some of
the use cases that I am working with. Please note that I am lagging a bit
behind the last few releases, so I haven't had time to experiment with Solr
5.3+, I am sure that some of this is included in there already and I am
very excited to play around with the new streaming API, json facets and SQL
interface when I have a bit more time.

I am indexing click stream data into Solr. Each set of records represents a
user's unique visit to our website. They all share a common session id, as
well as several session attributes, such as IP and user attributes if they
log in. Each record represents an individual action, such as a search,
product view or a visit to a particular page, all attributes and data
elements of each request are stored with each record, additionally, session
attributes get copied down to each event item. The current goal of this
system is to provide less tech savvy users with easy access to this data in
a way they can explore it and drill down on particular elements; we are
using Banana for this.

Currently, I have to copy a lot of session fields to each event so I can
filter on them, for example, show all searches for users associated with
organization X. This is super redundant and I am really looking for a
better way. It would be great if I could make parent document fields appear
as if they are a part of child documents.

Additionally, I am counting various events for each session during
processing. For example, I count the number of searches, product views, add
to carts, etc... This information is also indexed in each record. This
allows me to pull up specific events (like product views) where the number
of searches in a given session is greater than X. However, again, indexing
this information for each event creates a lot of redundancy.

Finally, a slightly different use case involves running functions on items
in a group (even if they aren't a part of the result set) and returning
that as a part of the document. Almost like a dynamically generated
document, based on aggregations from child documents. This is currently
somewhat available, but I can't include it in sort. For example, I am
grouping items on a field, I want to get the minimum value of a field per
group and sort the result (of groups) on that calculated value.

I am not sure if this helps you at all, but wanted to share some of my pain
points, hope it helps.

On Sun, Apr 17, 2016 at 6:50 PM, Yonik Seeley  wrote:

> Hey folks, we're at the point of figuring out the API for block join
> child rollups for the JSON Facet API.
> We already have simple block join faceting:
> http://yonik.com/solr-nested-objects/
> So now we need an API to carry over more information from children to
> parents (say rolling up average rating of all the reviews to the
> corresponding parent book objects).
>
> I've gathered some of my notes/thoughts on the API here:
> https://issues.apache.org/jira/browse/SOLR-8998
>
> Feedback welcome, and we can discuss here in this thread rather than
> cluttering the JIRA.
>
> -Yonik
>
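The copy-parent-fields-to-children pain Nick describes is exactly what Solr's nested (block-join) documents target. A sketch of how such a session block might be shaped -- field names are hypothetical, though `_childDocuments_` and the `{!child}` parser are the documented block-join mechanisms:

```python
# A parent "session" document indexed together with its child events,
# so session attributes live only on the parent.
session = {
    "id": "session_42",
    "type": "session",
    "ip": "203.0.113.7",
    "org": "X",
    # child docs indexed in the same block as the parent:
    "_childDocuments_": [
        {"id": "session_42_e1", "type": "event", "action": "search"},
        {"id": "session_42_e2", "type": "event", "action": "product_view"},
    ],
}

# Instead of copying `org` onto every event, a block-join child query
# can select events whose *parent* session matches, e.g.:
q = '{!child of="type:session"}org:X'
```

This avoids the redundancy, at the cost of having to index each session and its events as one block.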


Re: Cannot use Phrase Queries in eDisMax and filtering

2016-04-18 Thread Doug Turnbull
Also, you mentioned your field was a string? This means the field must match
*exactly* to be considered a phrase match. Have you considered changing the
field to a text field type with a tokenizer and doing phrase matching -- it
might work more like you'd expect.

Thanks
-Doug

On Mon, Apr 18, 2016 at 7:59 PM Erick Erickson 
wrote:

> bq: I cannot find either the condition on the field analyzer to be able to
> use
> pf, pf2 and pf3.
>
> These don't apply to field analysis at all. What they translate into
> is a series of
> phrase queries against different sets of fields. So, you may have
> pf=fieldA^5 fieldB
> pf2=fieldA^3 fieldC
>
> Now a query like (without quotes) "big dog" would be
> translated into something like
> ...
> fieldA:"big dog"^5 fieldB:"big dog" fieldA:"big dog"^3 fieldC:"big dog"
>
> Having multiple pf fields allows you to query with different slop values,
> different boosts etc. on the same or different fields.
>
> Best,
> Erick
>
>
> On Mon, Apr 18, 2016 at 12:25 PM, Antoine LE FLOC'H 
> wrote:
> > Hello,
> >
> > I don't have Solr source code handy but is
> > pf3=1&
> > pf2=1&
> > valid ? What would that do ? use the df or qf fields ?
> >
> > This
> >
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
> > says that the value of pf2 is a multivalued list of fields? There are
> > not many examples about this in this link.
> >
> > I cannot find either the condition on the field analyzer to be able to
> use
> > pf, pf2 and pf3.
> >
> > Feedback would be appreciated, thanks.
> >
> > Antoine.
> >
> >
> >
> >
> >
> > On Mon, Nov 3, 2014 at 8:29 PM, Ramzi Alqrainy  >
> > wrote:
> >
> >> I tried to produce your case in my machine with below queries, but
> >> everything
> >> worked fine with me. I just want to ask you a question what is the field
> >> type of "tag" field ?
> >>
> >> q=bmw&
> >> fl=score,*&
> >> wt=json&
> >> fq=city_id:59&
> >> qt=/query&
> >> defType=edismax&
> >> pf=title^15%20discription^5&
> >> pf3=1&
> >> pf2=1&
> >> ps=1&
> >> qroup=true&
> >> group.field=member_id&
> >> group.limit=10&
> >> sort=score desc&
> >> group.ngroups=true
> >>
> >>
> >>
> >>
> >> q=bmw&
> >> fl=score,*&
> >> wt=json&
> >> fq=city_id:59&
> >> qt=/query&
> >> defType=edismax&
> >> pf=title^15%20discription^5&
> >> pf3=1&
> >> pf2=1&
> >> ps=1&
> >> qroup=true&
> >> group.field=member_id&
> >> group.limit=10&
> >> group.ngroups=true&
> >> sort=score desc&
> >> fq=category_id:1777
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Cannot-use-Phrase-Queries-in-eDisMax-and-filtering-tp4167302p4167338.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>


Re: what is opening realtime Searcher

2016-04-18 Thread Doug Turnbull
Erick can correct me. I think "searcher" here might just sound a bit
misleading. Real-time get is really about fetching by id, not issuing
searches per se. Only after a soft or hard commit does a document truly
become searchable.

On Mon, Apr 18, 2016 at 8:02 PM Erick Erickson 
wrote:

> This is about real-time get. The idea is this. Suppose
> you have a doc doc1 already in your index at time T1
> and update it at time T2 and your soft commit happens
> at time T3.
>
> If a search happens between time T1 and T2
> but the fetch happens between T2 and T3, you get
> back the updated document, not the doc that was in
> the index. So the realtime get is outside the
> soft and hard commit issues.
>
> It's a pretty lightweight operation, no caches are invalidated
> or warmed etc.
>
> Best,
> Erick
>
> On Mon, Apr 18, 2016 at 9:59 AM, Jaroslaw Rozanski
>  wrote:
> > Hi,
> >
> >  What exactly triggers opening a new "realtime" searcher?
> >
> > 2016-04-18_16:28:02.33289 INFO  (qtp1038620625-13) [c:col1 s:shard1
> r:core_node3 x:col1_shard1_replica3] o.a.s.s.SolrIndexSearcher Opening
> Searcher@752e986f[col1_shard1_replica3] realtime
> >
> > I am seeing above being triggered when adding documents to index. The
> > frequency (from few milliseconds to few seconds) does not correlate with
> > maxTime of either autoCommit or autoSoftCommit (which are fixed to tens
> > of seconds).
> >
> > Client never sends commit message explicitly (and there is
> > IgnoreCommitOptimizeUpdateProcessorFactory in processor chain).
> >
> > Re: Solr 5.5.0
> >
> >
> >
> > Thanks,
> > Jarek
> >
>


Re: Solr Support for BM25F

2016-04-18 Thread Doug Turnbull
It's worth adding that Lucene's BlendedTermQuery, (used in Elasticsearch's
cross_field search), attempts to blend field's document frequency together.
So I wonder what BlendedTermQuery plus BM25 similarity per-field would do?
It might be close to true BM25F aside from the length issue.

(You'd have to write a QParserPlugin and build the BlendedTermQuery
yourself, AFAIK there's not a direct Solr interface to it yet.)

Best
-Doug

On Mon, Apr 18, 2016 at 4:52 PM Tom Burton-West  wrote:

> Hi David,
>
> It may not matter for your use case, but just in case you really are
> interested in the "real BM25F", there is a difference between configuring K1
> and B for different fields in Solr and a "real" BM25F implementation.  This
> has to do with Solr's model of fields being mini-documents (i.e. each field
> has its own length, idf and tf)   See the discussion in
> https://issues.apache.org/jira/browse/LUCENE-2959, particularly these
> comments by Robert Muir:
>
> "Actually as far as BM25f, this one presents a few challenges (some already
> discussed on LUCENE-2091 <
> https://issues.apache.org/jira/browse/LUCENE-2091>
> ).
>
> To summarize:
>
>- for any field, Lucene has a per-field terms dictionary that contains
>that term's docFreq. To compute BM25f's IDF method would be challenging,
>because it wants a docFreq "across all the fields". (its not clear to
> me at
>a glance either from the original paper, if this should be across only
> the
>fields in the query, across all the fields in the document, and if a
>"static" schema is implied in this scoring system (in lucene document 1
> can
>have 3 fields and document 2 can have 40 different ones, even with
>different properties).
>- the same issue applies to length normalization, lucene has a "field
>length" but really no concept of document length."
>
> Tom
>
> On Thu, Apr 14, 2016 at 12:41 PM, David Cawley 
> wrote:
>
> > Hello,
> > I am developing an enterprise search engine for a project and I was
> hoping
> > to implement BM25F ranking algorithm to configure the tuning parameters
> on
> > a per field basis. I understand BM25 similarity is now supported in Solr
> > but I was hoping to be able to configure k1 and b for different fields
> such
> > as title, description, anchor etc, as they are structured documents.
> > I am fairly new to Solr so any help would be appreciated. If this is
> > possible or any steps as to how I can go about implementing this it would
> > be greatly appreciated.
> >
> > Regards,
> >
> > David
> >
> > Current Solr Version 5.4.1
> >
>
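The distinction Tom quotes can be made concrete with a toy scoring sketch (not Lucene's actual code): per-field BM25 saturates and weights each field independently and sums the per-field scores, while BM25F combines weighted term frequencies *first*, then saturates once and applies a single cross-field idf -- so a term repeated across fields cannot double-dip the saturation:

```python
import math

def bm25_tf_part(tf, k1=1.2, b=0.75, field_len=1.0, avg_len=1.0):
    # Classic BM25 term-frequency saturation with length normalization.
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * field_len / avg_len))

# Per-field BM25 (roughly what per-field k1/b in Solr gives you):
# each field has its own idf and its own saturation; scores are added.
def per_field_score(tfs, idfs):
    return sum(idf * bm25_tf_part(tf) for tf, idf in zip(tfs, idfs))

# BM25F: merge weighted tfs across fields, saturate once, and use one
# document-wide idf -- the part Lucene's per-field model cannot express.
def bm25f_score(tfs, weights, idf):
    combined = sum(w * tf for tf, w in zip(tfs, weights))
    return idf * bm25_tf_part(combined)
```

With equal weights and idf, a term appearing once in each of two fields scores 2.0 under the per-field sum but less under BM25F, because the combined tf of 2 is already past the steepest part of the saturation curve.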


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread John Bickerstaff
Thanks, Erick, for the confirmation.
On Apr 18, 2016 5:48 PM, "Erick Erickson"  wrote:

> In short, I'm afraid I have to agree with your IT guy.
>
> I like SolrCloud, it's way cool. But in your situation I really
> can't say it's compelling.
>
> The places SolrCloud shines: automatically routing docs to shards..
> You're not sharing.
>
> Automatically electing a new leader (analogous to master) ... You
> don't care since the pain of reindexing is so little.
>
> Not losing data when a leader/master goes down during indexing... You
> don't care since you can reindex quickly and you're indexing so
> rarely.
>
> In fact, I'd also optimize the index, something I rarely recommend.
>
> Even the argument that you get to use all your nodes for searching
> doesn't really pertain since you can index on a node, then just copy
> the index to all your nodes, you could get by without even configuring
> master/slave. Or just, as you say, index to all your Solr nodes
> simultaneously.
>
> About the only downside is that you've got to create your Solr nodes
> independently, making sure the proper configurations are on each one
> etc, but even if those changed 2-3 times a year it's hardly onerous.
>
> You _are_ getting all the latest and greatest indexing and search
> improvements, all the SolrCloud stuff is built on top of exactly the
> Solr you'd get without using SolrCloud.
>
> And finally, there is certainly a learning curve to SolrCloud,
> particularly in this case the care and feeding of Zookeeper.
>
> The instant you need to have shards, the argument changes quite
> dramatically. The argument changes some under significant indexing
> loads. The argument totally changes if you need low latency. It
> doesn't sound like your situation is sensitive to any of these
> though
>
> Best,
> Erick
>
> On Apr 18, 2016 10:41 AM, "John Bickerstaff" 
> wrote:
> >
> > Nice - thanks Daniel.
> >
> > On Mon, Apr 18, 2016 at 11:38 AM, Davis, Daniel (NIH/NLM) [C] <
> > daniel.da...@nih.gov> wrote:
> >
> > > One thing I like about SolrCloud is that I don't have to configure
> > > Master/Slave replication in each "core" the same way to get them to
> > > replicate.
> > >
> > > The other thing I like about SolrCloud, which is largely theoretical at
> > > this point, is that I don't need to test changes to a collection's
> > > configuration by bringing up a whole new solr on a whole new server -
> > > SolrCloud already virtualizes this, and so I can make up a random
> > > collection name that doesn't conflict, and create the thing, and smoke
> test
> > > with it.   I know that standard practice is to bring up all new nodes,
> but
> > > I don't see why this is needed.
> > >
> > > -Original Message-
> > > From: John Bickerstaff [mailto:j...@johnbickerstaff.com]
> > > Sent: Monday, April 18, 2016 1:23 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Verifying - SOLR Cloud replaces load balancer?
> > >
> > > So - my IT guy makes the case that we don't really need Zookeeper /
> Solr
> > > Cloud...
> > >
> > > He may be right - we're serving static data (changes to the collection
> > > occur only 2 or 3 times a year and are minor)
> > >
> > > We probably could have 3 or 4 Solr nodes running in non-Cloud mode --
> each
> > > configured the same way, behind a load balancer and do fine.
> > >
> > > I've got a Kafka server set up with the solr docs as topics.  It takes
> > > about 10 minutes to reload a "blank" Solr Server from the Kafka
> topic...
> > > If I target 3-4 SOLR servers from my microservice instead of one, it
> > > wouldn't take much longer than 10 minutes to concurrently reload all 3
> or 4
> > > Solr servers from scratch...
> > >
> > > I'm biased in terms of using the most recent functionality, but I'm
> aware
> > > that bias is not necessarily based on facts and want to do my due
> > > diligence...
> > >
> > > Aside from the obvious benefits of spreading work across nodes (which
> may
> > > not be a big deal in our application and which my IT guy proposes is
> more
> > > transparently handled with a load balancer he understands) are there
> any
> > > other considerations that would drive a choice for Solr Cloud
> (zookeeper
> > > etc)?
> > >
> > >
> > >
> > > On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans 
> > > wrote:
> > >
> > > > On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
> > > >  wrote:
> > > > > Thanks all - very helpful.
> > > > >
> > > > > @Shawn - your reply implies that even if I'm hitting the URL for a
> > > > > single endpoint via HTTP - the "balancing" will still occur across
> > > > > the Solr
> > > > Cloud
> > > > > (I understand the caveat about that single endpoint being a
> > > > > potential
> > > > point
> > > > > of failure).  I just want to verify that I'm interpreting your
> > > > > response correctly...
> > > > >
> > > > > (I have been asked to provide IT with a comprehensive list of
> > > > > options
> > > > prior
> > > > > to a design discussion - which is why I'm trying to get clear ab

Re: what is opening realtime Searcher

2016-04-18 Thread Erick Erickson
This is about real-time get. The idea is this. Suppose
you have a doc doc1 already in your index at time T1
and update it at time T2 and your soft commit happens
at time T3.

If a search happens between time T1 and T2
but the fetch happens between T2 and T3, you get
back the updated document, not the doc that was in
the index. So the realtime get is outside the
soft and hard commit issues.

It's a pretty lightweight operation, no caches are invalidated
or warmed etc.

Best,
Erick

On Mon, Apr 18, 2016 at 9:59 AM, Jaroslaw Rozanski
 wrote:
> Hi,
>
>  What exactly triggers opening a new "realtime" searcher?
>
> 2016-04-18_16:28:02.33289 INFO  (qtp1038620625-13) [c:col1 s:shard1 
> r:core_node3 x:col1_shard1_replica3] o.a.s.s.SolrIndexSearcher Opening 
> Searcher@752e986f[col1_shard1_replica3] realtime
>
> I am seeing above being triggered when adding documents to index. The
> frequency (from few milliseconds to few seconds) does not correlate with
> maxTime of either autoCommit or autoSoftCommit (which are fixed to tens
> of seconds).
>
> Client never sends commit message explicitly (and there is
> IgnoreCommitOptimizeUpdateProcessorFactory in processor chain).
>
> Re: Solr 5.5.0
>
>
>
> Thanks,
> Jarek
>
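Erick's timeline can be modeled in a few lines: search only sees what the last commit exposed, while realtime get consults the update log first. This is a conceptual sketch of the semantics, not Solr's implementation:

```python
class TinyCore:
    def __init__(self):
        self.index = {}   # what the last (soft) commit exposed to search
        self.tlog = {}    # uncommitted updates (update log)

    def update(self, doc_id, doc):
        self.tlog[doc_id] = doc

    def commit(self):
        self.index.update(self.tlog)
        self.tlog.clear()

    def search(self, doc_id):           # normal search path
        return self.index.get(doc_id)

    def realtime_get(self, doc_id):     # /get: the tlog wins
        return self.tlog.get(doc_id, self.index.get(doc_id))

core = TinyCore()
core.update("doc1", {"v": 1}); core.commit()   # T1: doc1 searchable
core.update("doc1", {"v": 2})                  # T2: updated, no commit yet
assert core.search("doc1") == {"v": 1}         # search still sees old doc
assert core.realtime_get("doc1") == {"v": 2}   # RTG sees the update
```

This also shows why the operation is lightweight: serving the fetch requires no new searcher caches, just a lookup in the log before the index.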


Re: Cannot use Phrase Queries in eDisMax and filtering

2016-04-18 Thread Erick Erickson
bq: I cannot find either the condition on the field analyzer to be able to use
pf, pf2 and pf3.

These don't apply to field analysis at all. What they translate into
is a series of
phrase queries against different sets of fields. So, you may have
pf=fieldA^5 fieldB
pf2=fieldA^3 fieldC

Now a query like (without quotes) "big dog" would be
translated into something like
...
fieldA:"big dog"^5 fieldB:"big dog" fieldA:"big dog"^3 fieldC:"big dog"

Having multiple pf fields allows you to query with different slop values,
different boosts etc. on the same or different fields.

Best,
Erick


On Mon, Apr 18, 2016 at 12:25 PM, Antoine LE FLOC'H  wrote:
> Hello,
>
> I don't have Solr source code handy but is
> pf3=1&
> pf2=1&
> valid ? What would that do ? use the df or qf fields ?
>
> This
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
> says that the value of pf2 is a multivalued list of fields? There are not
> many examples about this in this link.
>
> I cannot find either the condition on the field analyzer to be able to use
> pf, pf2 and pf3.
>
> Feedback would be appreciated, thanks.
>
> Antoine.
>
>
>
>
>
> On Mon, Nov 3, 2014 at 8:29 PM, Ramzi Alqrainy 
> wrote:
>
>> I tried to produce your case in my machine with below queries, but
>> everything
>> worked fine with me. I just want to ask you a question what is the field
>> type of "tag" field ?
>>
>> q=bmw&
>> fl=score,*&
>> wt=json&
>> fq=city_id:59&
>> qt=/query&
>> defType=edismax&
>> pf=title^15%20discription^5&
>> pf3=1&
>> pf2=1&
>> ps=1&
>> qroup=true&
>> group.field=member_id&
>> group.limit=10&
>> sort=score desc&
>> group.ngroups=true
>>
>>
>>
>>
>> q=bmw&
>> fl=score,*&
>> wt=json&
>> fq=city_id:59&
>> qt=/query&
>> defType=edismax&
>> pf=title^15%20discription^5&
>> pf3=1&
>> pf2=1&
>> ps=1&
>> qroup=true&
>> group.field=member_id&
>> group.limit=10&
>> group.ngroups=true&
>> sort=score desc&
>> fq=category_id:1777
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Cannot-use-Phrase-Queries-in-eDisMax-and-filtering-tp4167302p4167338.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
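The expansion Erick describes can be mimicked mechanically: each pf-style spec becomes a boosted phrase clause against the user's query. A sketch of the idea only (not edismax's actual code, and ignoring the word-pair/word-triple shingling pf2/pf3 apply to queries longer than two terms):

```python
def expand_pf(query, pf):
    """Turn a pf spec like 'fieldA^5 fieldB' plus a user query into
    boosted phrase clauses, mirroring the example in the thread."""
    clauses = []
    for spec in pf.split():
        field, _, boost = spec.partition("^")
        clause = f'{field}:"{query}"'
        if boost:
            clause += f"^{boost}"
        clauses.append(clause)
    return " ".join(clauses)

# pf=fieldA^5 fieldB and pf2=fieldA^3 fieldC for the query "big dog":
expanded = (expand_pf("big dog", "fieldA^5 fieldB") + " "
            + expand_pf("big dog", "fieldA^3 fieldC"))
# -> fieldA:"big dog"^5 fieldB:"big dog" fieldA:"big dog"^3 fieldC:"big dog"
```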


Re: Not seeing the tokenized values when using solr.PathHierarchyTokenizerFactory

2016-04-18 Thread Erick Erickson
Assuming that you're talking about the docs returned in the result
sets, these are the _stored_ fields, not the analyzed field. Stored
fields are a verbatim copy of the original input.

Best,
Erick

On Mon, Apr 18, 2016 at 12:51 PM, Mark Robinson  wrote:
> Hi,
>
> I was using the solr.PathHierarchyTokenizerFactory for a field say fieldB.
> An input data like A/B/C when I check using the ANALYSIS facility in the
> admin UI, is tokenized as A, A/B, A/B/C in fieldB.
> A/B/C in my system is a "string" value in a fieldA which is both
> indexed=stored=true. I copyField fieldA to fieldB which has the above
> solr.PathHierarchyTokenizerFactory.
>
> fieldB also has indexed=stored=true as well as multiValued=true.
>
> Even then, when results are displayed, fieldB shows only the original A/B/C,
> i.e. the same as what is in fieldA.
> But ANALYSIS as mentioned above shows all the different hierarchies for
> fieldB. Also a query like fieldA:"A/B" yields no results but
> fieldB:"A/B" gives results as fieldB has all the hierarchies in it.
>
> But then why can't I see all the different hierarchies in my result for
> fieldB as I clearly see when I check through the ANALYSIS in admin UI?
>
> Could someone please help me understand this behavior.
>
> Thanks!
> Mark.
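The analysis behavior Mark observed is easy to reproduce: PathHierarchyTokenizer emits one token per path prefix, but those tokens only become *indexed* terms -- the *stored* value returned in results stays the verbatim input, just as Erick says. A sketch of the tokenization:

```python
def path_hierarchy_tokens(value, delimiter="/"):
    """Mimic solr.PathHierarchyTokenizerFactory: 'A/B/C' ->
    ['A', 'A/B', 'A/B/C']. These become indexed terms (searchable),
    while the stored value shown in results stays the original input."""
    parts = value.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

stored_value = "A/B/C"                         # what results display
indexed_terms = path_hierarchy_tokens(stored_value)
assert indexed_terms == ["A", "A/B", "A/B/C"]  # what fieldB:"A/B" matches
```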


Re: add field requires collection reload

2016-04-18 Thread Erick Erickson
The key here is you say "sometimes". It takes a while for the reload
operation to propagate to _all_ the replicas that make up your
collection. My bet is that by immediately indexing after changing the
data, your updates are getting to a core that hasn't reloaded yet.

That said, https://issues.apache.org/jira/browse/SOLR-8662 addresses
this very issue I believe, but it's in 6.1

Best,
Erick

On Mon, Apr 18, 2016 at 1:34 PM, Hendrik Haddorp
 wrote:
> Hi,
>
> I'm using SolrCloud 6.0 with a managed schema. When I add fields using
> SolrJ and immediately afterwards try to index data I sometimes get an
> error telling me that a field that I just added does not exist. If I do
> an explicit collection reload after the schema modification things seem
> to work. Is that working as designed?
>
> According to https://cwiki.apache.org/confluence/display/solr/Schema+API
> a core reload will happen automatically when using the schema API: "When
> modifying the schema with the API, a core reload will automatically
> occur in order for the changes to be available immediately for documents
> indexed thereafter."
>
> regards,
> Hendrik


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Erick Erickson
In short, I'm afraid I have to agree with your IT guy.

I like SolrCloud, it's way cool. But in your situation I really
can't say it's compelling.

The places SolrCloud shines: automatically routing docs to shards...
You're not sharding.

Automatically electing a new leader (analogous to master) ... You
don't care since the pain of reindexing is so little.

Not losing data when a leader/master goes down during indexing... You
don't care since you can reindex quickly and you're indexing so
rarely.

In fact, I'd also optimize the index, something I rarely recommend.

Even the argument that you get to use all your nodes for searching
doesn't really pertain since you can index on a node, then just copy
the index to all your nodes, you could get by without even configuring
master/slave. Or just, as you say, index to all your Solr nodes
simultaneously.

About the only downside is that you've got to create your Solr nodes
independently, making sure the proper configurations are on each one
etc, but even if those changed 2-3 times a year it's hardly onerous.

You _are_ getting all the latest and greatest indexing and search
improvements, all the SolrCloud stuff is built on top of exactly the
Solr you'd get without using SolrCloud.

And finally, there is certainly a learning curve to SolrCloud,
particularly in this case the care and feeding of Zookeeper.

The instant you need to have shards, the argument changes quite
dramatically. The argument changes some under significant indexing
loads. The argument totally changes if you need low latency. It
doesn't sound like your situation is sensitive to any of these
though

Best,
Erick

On Apr 18, 2016 10:41 AM, "John Bickerstaff"  wrote:
>
> Nice - thanks Daniel.
>
> On Mon, Apr 18, 2016 at 11:38 AM, Davis, Daniel (NIH/NLM) [C] <
> daniel.da...@nih.gov> wrote:
>
> > One thing I like about SolrCloud is that I don't have to configure
> > Master/Slave replication in each "core" the same way to get them to
> > replicate.
> >
> > The other thing I like about SolrCloud, which is largely theoretical at
> > this point, is that I don't need to test changes to a collection's
> > configuration by bringing up a whole new solr on a whole new server -
> > SolrCloud already virtualizes this, and so I can make up a random
> > collection name that doesn't conflict, and create the thing, and smoke test
> > with it.   I know that standard practice is to bring up all new nodes, but
> > I don't see why this is needed.
> >
> > -Original Message-
> > From: John Bickerstaff [mailto:j...@johnbickerstaff.com]
> > Sent: Monday, April 18, 2016 1:23 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Verifying - SOLR Cloud replaces load balancer?
> >
> > So - my IT guy makes the case that we don't really need Zookeeper / Solr
> > Cloud...
> >
> > He may be right - we're serving static data (changes to the collection
> > occur only 2 or 3 times a year and are minor)
> >
> > We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
> > configured the same way, behind a load balancer and do fine.
> >
> > I've got a Kafka server set up with the solr docs as topics.  It takes
> > about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
> > If I target 3-4 SOLR servers from my microservice instead of one, it
> > wouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
> > Solr servers from scratch...
> >
> > I'm biased in terms of using the most recent functionality, but I'm aware
> > that bias is not necessarily based on facts and want to do my due
> > diligence...
> >
> > Aside from the obvious benefits of spreading work across nodes (which may
> > not be a big deal in our application and which my IT guy proposes is more
> > transparently handled with a load balancer he understands) are there any
> > other considerations that would drive a choice for Solr Cloud (zookeeper
> > etc)?
> >
> >
> >
> > On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans 
> > wrote:
> >
> > > On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
> > >  wrote:
> > > > Thanks all - very helpful.
> > > >
> > > > @Shawn - your reply implies that even if I'm hitting the URL for a
> > > > single endpoint via HTTP - the "balancing" will still occur across
> > > > the Solr
> > > Cloud
> > > > (I understand the caveat about that single endpoint being a
> > > > potential
> > > point
> > > > of failure).  I just want to verify that I'm interpreting your
> > > > response correctly...
> > > >
> > > > (I have been asked to provide IT with a comprehensive list of
> > > > options
> > > prior
> > > > to a design discussion - which is why I'm trying to get clear about
> > > > the various options)
> > > >
> > > > In a nutshell, I think I understand the following:
> > > >
> > > > a. Even if hitting a single URL, the Solr Cloud will "balance"
> > > > across all available nodes for searching
> > > >   Caveat: That single URL represents a potential single
> > > > point of failure and this should be taken into account

Re: Solr Support for BM25F

2016-04-18 Thread Tom Burton-West
Hi David,

It may not matter for your use case  but just in case you really are
interested in the "real BM25F" there is a difference between configuring K1
and B for different fields in Solr and a "real" BM25F implementation.  This
has to do with Solr's model of fields being mini-documents (i.e. each field
has its own length, idf and tf)   See the discussion in
https://issues.apache.org/jira/browse/LUCENE-2959, particularly these
comments by Robert Muir:

"Actually as far as BM25f, this one presents a few challenges (some already
discussed on LUCENE-2091).

To summarize:

   - for any field, Lucene has a per-field terms dictionary that contains
   that term's docFreq. To compute BM25f's IDF method would be challenging,
   because it wants a docFreq "across all the fields". (its not clear to me at
   a glance either from the original paper, if this should be across only the
   fields in the query, across all the fields in the document, and if a
   "static" schema is implied in this scoring system (in lucene document 1 can
   have 3 fields and document 2 can have 40 different ones, even with
   different properties).
   - the same issue applies to length normalization, lucene has a "field
   length" but really no concept of document length."

Tom

On Thu, Apr 14, 2016 at 12:41 PM, David Cawley 
wrote:

> Hello,
> I am developing an enterprise search engine for a project and I was hoping
> to implement BM25F ranking algorithm to configure the tuning parameters on
> a per field basis. I understand BM25 similarity is now supported in Solr
> but I was hoping to be able to configure k1 and b for different fields such
> as title, description, anchor etc, as they are structured documents.
> I am fairly new to Solr, so any help would be appreciated. If this is
> possible, any steps as to how I can go about implementing it would be
> greatly appreciated.
>
> Regards,
>
> David
>
> Current Solr Version 5.4.1
>
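On the practical side of David's question: since the 5.x line, Solr's schema supports a per-fieldType similarity once the global similarity is solr.SchemaSimilarityFactory, which gives per-field k1/b tuning (though, per Tom's caveat, this is not true BM25F). A hedged sketch — the field type, analyzer, and parameter values below are illustrative, not from this thread:

```xml
<!-- schema.xml: a global SchemaSimilarityFactory enables per-fieldType similarity -->
<similarity class="solr.SchemaSimilarityFactory"/>

<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- BM25 tuned for a short field: a low b weakens length normalization -->
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.3</float>
  </similarity>
</fieldType>
```

A separate fieldType (e.g. for description or anchor text) can carry its own k1/b values in the same way.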


add field requires collection reload

2016-04-18 Thread Hendrik Haddorp
Hi,

I'm using SolrCloud 6.0 with a managed schema. When I add fields using
SolrJ and immediately afterwards try to index data I sometimes get an
error telling me that a field that I just added does not exist. If I do
an explicit collection reload after the schema modification things seem
to work. Is that working as designed?

According to https://cwiki.apache.org/confluence/display/solr/Schema+API
a core reload will happen automatically when using the schema API: "When
modifying the schema with the API, a core reload will automatically
occur in order for the changes to be available immediately for documents
indexed thereafter."

regards,
Hendrik
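The workaround described above can be sketched as follows (collection and field names are illustrative). On pre-6.1 clusters, an explicit RELOAD after the schema change avoids indexing against a replica that has not picked the change up yet:

```shell
# Add the field through the Schema API (managed schema)
curl -X POST -H 'Content-type:application/json' \
  'http://localhost:8983/solr/mycollection/schema' --data-binary '{
    "add-field": {"name": "my_new_field", "type": "string", "stored": true}
  }'

# Explicitly reload the whole collection before indexing (SOLR-8662 workaround)
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'
```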


Not seeing the tokenized values when using solr.PathHierarchyTokenizerFactory

2016-04-18 Thread Mark Robinson
Hi,

I was using the solr.PathHierarchyTokenizerFactory for a field say fieldB.
An input data like A/B/C when I check using the ANALYSIS facility in the
admin UI, is tokenized as A, A/B, A/B/C in fieldB.
A/B/C in my system is a "string" value in a fieldA which is both
indexed=stored=true. I copyField fieldA to fieldB which has the above
solr.PathHierarchyTokenizerFactory.

fieldB also has indexed=stored=true as well as multiValued=true.

Even then, when results are displayed, fieldB shows only the original A/B/C,
i.e. the same as what is in fieldA.
But ANALYSIS as mentioned above shows all the different hierarchies for
fieldB. Also, a query like fieldA:"A/B" yields no results, but
fieldB:"A/B" gives results, as fieldB has all the hierarchies in it.

But then why can't I see all the different hierarchies in my result for
fieldB as I clearly see when I check through the ANALYSIS in admin UI?

Could someone please help me understand this behavior?

Thanks!
Mark.
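This behavior is expected: stored values are always returned verbatim, and the analysis chain (including PathHierarchyTokenizerFactory) only rewrites the *indexed* terms — which is why fieldB:"A/B" matches while the displayed value stays "A/B/C". A sketch of the setup under discussion (the type name is illustrative):

```xml
<field name="fieldA" type="string" indexed="true" stored="true"/>
<field name="fieldB" type="text_path" indexed="true" stored="true" multiValued="true"/>
<copyField source="fieldA" dest="fieldB"/>

<fieldType name="text_path" class="solr.TextField">
  <analyzer>
    <!-- "A/B/C" is indexed as the terms A, A/B, and A/B/C;
         the stored value returned in results remains "A/B/C" -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
</fieldType>
```

If the individual hierarchy levels need to appear in results, they have to be written into a separate stored multiValued field at index time (or derived client-side); analysis output is never stored.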


Re: Cannot use Phrase Queries in eDisMax and filtering

2016-04-18 Thread Antoine LE FLOC'H
Hello,

I don't have Solr source code handy but is
pf3=1&
pf2=1&
valid? What would that do? Use the df or qf fields?

This
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
says that the value of pf2 is a multivalued list of fields, but there are
not many examples of this in that link.

I also cannot find the conditions a field's analyzer must meet to be able
to use pf, pf2 and pf3.

Feedback would be appreciated, thanks.

Antoine.





On Mon, Nov 3, 2014 at 8:29 PM, Ramzi Alqrainy 
wrote:

> I tried to produce your case in my machine with below queries, but
> everything
> worked fine with me. I just want to ask you a question what is the field
> type of "tag" field ?
>
> q=bmw&
> fl=score,*&
> wt=json&
> fq=city_id:59&
> qt=/query&
> defType=edismax&
> pf=title^15%20discription^5&
> pf3=1&
> pf2=1&
> ps=1&
> qroup=true&
> group.field=member_id&
> group.limit=10&
> sort=score desc&
> group.ngroups=true
>
>
>
>
> q=bmw&
> fl=score,*&
> wt=json&
> fq=city_id:59&
> qt=/query&
> defType=edismax&
> pf=title^15%20discription^5&
> pf3=1&
> pf2=1&
> ps=1&
> qroup=true&
> group.field=member_id&
> group.limit=10&
> group.ngroups=true&
> sort=score desc&
> fq=category_id:1777
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Cannot-use-Phrase-Queries-in-eDisMax-and-filtering-tp4167302p4167338.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
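For reference, the Extended DisMax documentation describes pf2 and pf3 as field lists (with optional per-field boosts and, since Solr 4.0, per-field slop), not booleans; a value like pf2=1 would presumably be read as a field named "1". A hedged sketch of the intended usage — field names and boosts are illustrative:

```text
defType=edismax&
q=bmw&
qf=title^2 description&
pf=title^15 description^5&    boost documents matching the full query as a phrase
pf2=title^8 description^3&    boost documents matching word bigrams
pf3=title^5&                  boost documents matching word trigrams
ps=1                          phrase slop applied to the pf fields
```

The pf/pf2/pf3 fields should be tokenized text fields indexed with positions, since phrase queries are built against them.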


Re: Getting duplicate output while doing auto-suggestion based on multiple fields using copyField in Solr 5.5

2016-04-18 Thread Tejas Bhanushali
HI Team,

I tried to run the same example as suggested by Chris Hostetter and found
it works fine for a single field, but my requirement is that it should
suggest based on multiple fields, i.e. not only on the "cat" field but
also on a few other fields like 'name', 'manu', etc. Also, due to
FuzzyLookupFactory it gives suggestions based only on the starting
characters, not when users type a middle word of a phrase, e.g. I tried
searching for 'stuff2' of the 'electronics and stuff2' cat.

curl 'http://localhost:8983/solr/techproducts/suggest?wt=json&indent=true&suggest.dictionary=mySuggester&suggest=true&suggest.q=stuff'
{
  "responseHeader":{
"status":0,
"QTime":2},
  "suggest":{"mySuggester":{
  "stuff":{
"numFound":0,
"suggestions":[]

On Fri, Apr 15, 2016 at 11:18 PM, Tejas Bhanushali <
contact.tejasbhanush...@gmail.com> wrote:

> Hi Team,
>
> I'm getting duplicate results when I do auto-suggestion based on
> multiple fields by using copyField. I have the below table configuration:
>
> Segment -- has multiple categories -- each has multiple sub-categories -- each has
> multiple products.
>
> Suggestions are given based on
> segment name, category name, sub-category name and product name.
>
> below is output .
>
> ---
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 1375
>   },
>   "command": "build",
>   "suggest": {
> "mySuggester": {
>   "Fruit": {
> "numFound": 10,
> "suggestions": [
>   {
> "term": "Fruits & Vegetables",
> "weight": 1000,
> "payload": ""
>   },
>   {
> "term": "Fruits & Vegetables",
> "weight": 1000,
> "payload": ""
>   },
>   {
> "term": "Fruits & Vegetables",
> "weight": 980,
> "payload": ""
>   },
>   {
> "term": "Fruits & Vegetables",
> "weight": 980,
> "payload": ""
>   },
>   {
> "term": "Fruits & Vegetables",
> "weight": 800,
> "payload": ""
>   },
>   {
> "term": "Fruits & Vegetables",
> "weight": 588,
> "payload": ""
>   },
>   {
> "term": "Cut Fruits",
> "weight": 456,
> "payload": ""
>   },
>   {
> "term": "Fruits",
> "weight": 456,
> "payload": ""
>   },
>   {
> "term": "Fruits & Vegetables",
> "weight": 456,
> "payload": ""
>   },
>   {
> "term": "Fruits",
> "weight": 456,
> "payload": ""
>   }
> ]
>   }
> }
>   }
> }
>
> --
> Thanks & Regards,
>
> Tejas Bhanushali
>



-- 
Thanks & Regards,

Tejas Bhanushali
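Two things in the output above are worth separating. The repeated "Fruits & Vegetables" entries are most likely distinct documents (or distinct copyField sources) each contributing the same suggestion text, since the suggester does not deduplicate. And FuzzyLookupFactory only matches from the start of the suggestion, so mid-phrase input such as 'stuff2' finds nothing; AnalyzingInfixLookupFactory is the lookup usually used for that. A hedged solrconfig.xml sketch — field names here are illustrative:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <!-- Infix lookup matches the typed text anywhere in the suggestion -->
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest_text</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
```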


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread John Bickerstaff
Nice - thanks Daniel.

On Mon, Apr 18, 2016 at 11:38 AM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> One thing I like about SolrCloud is that I don't have to configure
> Master/Slave replication in each "core" the same way to get them to
> replicate.
>
> The other thing I like about SolrCloud, which is largely theoretical at
> this point, is that I don't need to test changes to a collection's
> configuration by bringing up a whole new solr on a whole new server -
> SolrCloud already virtualizes this, and so I can make up a random
> collection name that doesn't conflict, and create the thing, and smoke test
> with it.   I know that standard practice is to bring up all new nodes, but
> I don't see why this is needed.
>
> -----Original Message-----
> From: John Bickerstaff [mailto:j...@johnbickerstaff.com]
> Sent: Monday, April 18, 2016 1:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Verifying - SOLR Cloud replaces load balancer?
>
> So - my IT guy makes the case that we don't really need Zookeeper / Solr
> Cloud...
>
> He may be right - we're serving static data (changes to the collection
> occur only 2 or 3 times a year and are minor)
>
> We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
> configured the same way, behind a load balancer and do fine.
>
> I've got a Kafka server set up with the solr docs as topics.  It takes
> about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
> If I target 3-4 SOLR servers from my microservice instead of one, it
> wouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
> Solr servers from scratch...
>
> I'm biased in terms of using the most recent functionality, but I'm aware
> that bias is not necessarily based on facts and want to do my due
> diligence...
>
> Aside from the obvious benefits of spreading work across nodes (which may
> not be a big deal in our application and which my IT guy proposes is more
> transparently handled with a load balancer he understands) are there any
> other considerations that would drive a choice for Solr Cloud (zookeeper
> etc)?
>
>
>
> On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans 
> wrote:
>
> > On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
> >  wrote:
> > > Thanks all - very helpful.
> > >
> > > @Shawn - your reply implies that even if I'm hitting the URL for a
> > > single endpoint via HTTP - the "balancing" will still occur across
> > > the Solr
> > Cloud
> > > (I understand the caveat about that single endpoint being a
> > > potential
> > point
> > > of failure).  I just want to verify that I'm interpreting your
> > > response correctly...
> > >
> > > (I have been asked to provide IT with a comprehensive list of
> > > options
> > prior
> > > to a design discussion - which is why I'm trying to get clear about
> > > the various options)
> > >
> > > In a nutshell, I think I understand the following:
> > >
> > > a. Even if hitting a single URL, the Solr Cloud will "balance"
> > > across all available nodes for searching
> > >   Caveat: That single URL represents a potential single
> > > point of failure and this should be taken into account
> > >
> > > b. SolrJ's CloudSolrClient API provides the ability to distribute
> > > load -- based on Zookeeper's "knowledge" of all available Solr
> instances.
> > >   Note: This is more robust than "a" due to the fact that it
> > > eliminates the "single point of failure"
> > >
> > > c.  Use of a load balancer hitting all known Solr instances will be
> > > fine
> > -
> > > although the search requests may not run on the Solr instance the
> > > load balancer targeted - due to "a" above.
> > >
> > > Corrections or refinements welcomed...
> >
> > With option a), although queries will be distributed across the
> > cluster, all queries will be going through that single node. Not only
> > is that a single point of failure, but you risk saturating the
> > inter-node network traffic, possibly resulting in lower QPS and higher
> > latency on your queries.
> >
> > With option b), as well as SolrJ, recent versions of pysolr have a
> > ZK-aware SolrCloud client that behaves in a similar way.
> >
> > With option c), you can use the preferLocalShards so that shards that
> > are local to the queried node are used in preference to distributed
> > shards. Depending on your shard/cluster topology, this can increase
> > performance if you are returning large amounts of data - many or large
> > fields or many documents.
> >
> > Cheers
> >
> > Tom
> >
>


RE: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Davis, Daniel (NIH/NLM) [C]
One thing I like about SolrCloud is that I don't have to configure Master/Slave 
replication in each "core" the same way to get them to replicate.

The other thing I like about SolrCloud, which is largely theoretical at this 
point, is that I don't need to test changes to a collection's configuration by 
bringing up a whole new solr on a whole new server - SolrCloud already 
virtualizes this, and so I can make up a random collection name that doesn't 
conflict, and create the thing, and smoke test with it.   I know that standard 
practice is to bring up all new nodes, but I don't see why this is needed.

-----Original Message-----
From: John Bickerstaff [mailto:j...@johnbickerstaff.com] 
Sent: Monday, April 18, 2016 1:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Verifying - SOLR Cloud replaces load balancer?

So - my IT guy makes the case that we don't really need Zookeeper / Solr 
Cloud...

He may be right - we're serving static data (changes to the collection occur 
only 2 or 3 times a year and are minor)

We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each 
configured the same way, behind a load balancer and do fine.

I've got a Kafka server set up with the solr docs as topics.  It takes about 10 
minutes to reload a "blank" Solr Server from the Kafka topic...
If I target 3-4 SOLR servers from my microservice instead of one, it wouldn't 
take much longer than 10 minutes to concurrently reload all 3 or 4 Solr servers 
from scratch...

I'm biased in terms of using the most recent functionality, but I'm aware that 
bias is not necessarily based on facts and want to do my due diligence...

Aside from the obvious benefits of spreading work across nodes (which may not 
be a big deal in our application and which my IT guy proposes is more 
transparently handled with a load balancer he understands) are there any other 
considerations that would drive a choice for Solr Cloud (zookeeper etc)?



On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans  wrote:

> On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff 
>  wrote:
> > Thanks all - very helpful.
> >
> > @Shawn - your reply implies that even if I'm hitting the URL for a 
> > single endpoint via HTTP - the "balancing" will still occur across 
> > the Solr
> Cloud
> > (I understand the caveat about that single endpoint being a 
> > potential
> point
> > of failure).  I just want to verify that I'm interpreting your 
> > response correctly...
> >
> > (I have been asked to provide IT with a comprehensive list of 
> > options
> prior
> > to a design discussion - which is why I'm trying to get clear about 
> > the various options)
> >
> > In a nutshell, I think I understand the following:
> >
> > a. Even if hitting a single URL, the Solr Cloud will "balance" 
> > across all available nodes for searching
> >   Caveat: That single URL represents a potential single 
> > point of failure and this should be taken into account
> >
> > b. SolrJ's CloudSolrClient API provides the ability to distribute 
> > load -- based on Zookeeper's "knowledge" of all available Solr instances.
> >   Note: This is more robust than "a" due to the fact that it 
> > eliminates the "single point of failure"
> >
> > c.  Use of a load balancer hitting all known Solr instances will be 
> > fine
> -
> > although the search requests may not run on the Solr instance the 
> > load balancer targeted - due to "a" above.
> >
> > Corrections or refinements welcomed...
>
> With option a), although queries will be distributed across the 
> cluster, all queries will be going through that single node. Not only 
> is that a single point of failure, but you risk saturating the 
> inter-node network traffic, possibly resulting in lower QPS and higher 
> latency on your queries.
>
> With option b), as well as SolrJ, recent versions of pysolr have a 
> ZK-aware SolrCloud client that behaves in a similar way.
>
> With option c), you can use the preferLocalShards so that shards that 
> are local to the queried node are used in preference to distributed 
> shards. Depending on your shard/cluster topology, this can increase 
> performance if you are returning large amounts of data - many or large 
> fields or many documents.
>
> Cheers
>
> Tom
>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread John Bickerstaff
So - my IT guy makes the case that we don't really need Zookeeper / Solr
Cloud...

He may be right - we're serving static data (changes to the collection
occur only 2 or 3 times a year and are minor)

We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
configured the same way, behind a load balancer and do fine.

I've got a Kafka server set up with the solr docs as topics.  It takes
about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
If I target 3-4 SOLR servers from my microservice instead of one, it
wouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
Solr servers from scratch...

I'm biased in terms of using the most recent functionality, but I'm aware
that bias is not necessarily based on facts and want to do my due
diligence...

Aside from the obvious benefits of spreading work across nodes (which may
not be a big deal in our application and which my IT guy proposes is more
transparently handled with a load balancer he understands) are there any
other considerations that would drive a choice for Solr Cloud (zookeeper
etc)?



On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans  wrote:

> On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
>  wrote:
> > Thanks all - very helpful.
> >
> > @Shawn - your reply implies that even if I'm hitting the URL for a single
> > endpoint via HTTP - the "balancing" will still occur across the Solr
> Cloud
> > (I understand the caveat about that single endpoint being a potential
> point
> > of failure).  I just want to verify that I'm interpreting your response
> > correctly...
> >
> > (I have been asked to provide IT with a comprehensive list of options
> prior
> > to a design discussion - which is why I'm trying to get clear about the
> > various options)
> >
> > In a nutshell, I think I understand the following:
> >
> > a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> > available nodes for searching
> >   Caveat: That single URL represents a potential single point of
> > failure and this should be taken into account
> >
> > b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> > based on Zookeeper's "knowledge" of all available Solr instances.
> >   Note: This is more robust than "a" due to the fact that it
> > eliminates the "single point of failure"
> >
> > c.  Use of a load balancer hitting all known Solr instances will be fine
> -
> > although the search requests may not run on the Solr instance the load
> > balancer targeted - due to "a" above.
> >
> > Corrections or refinements welcomed...
>
> With option a), although queries will be distributed across the
> cluster, all queries will be going through that single node. Not only
> is that a single point of failure, but you risk saturating the
> inter-node network traffic, possibly resulting in lower QPS and higher
> latency on your queries.
>
> With option b), as well as SolrJ, recent versions of pysolr have a
> ZK-aware SolrCloud client that behaves in a similar way.
>
> With option c), you can use the preferLocalShards so that shards that
> are local to the queried node are used in preference to distributed
> shards. Depending on your shard/cluster topology, this can increase
> performance if you are returning large amounts of data - many or large
> fields or many documents.
>
> Cheers
>
> Tom
>


what is opening realtime Searcher

2016-04-18 Thread Jaroslaw Rozanski
Hi,

What exactly triggers opening a new "realtime" searcher?

2016-04-18_16:28:02.33289 INFO  (qtp1038620625-13) [c:col1 s:shard1 
r:core_node3 x:col1_shard1_replica3] o.a.s.s.SolrIndexSearcher Opening 
Searcher@752e986f[col1_shard1_replica3] realtime

I am seeing above being triggered when adding documents to index. The
frequency (from few milliseconds to few seconds) does not correlate with
maxTime of either autoCommit or autoSoftCommit (which are fixed to tens
of seconds).
 
Client never sends commit message explicitly (and there is
IgnoreCommitOptimizeUpdateProcessorFactory in processor chain).
 
Thanks,
Jarek
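One reading of the log line above (an interpretation, not confirmed in this thread): the "realtime" searcher is not the full searcher that autoCommit/autoSoftCommit open. It is a lightweight searcher opened on demand — e.g. to serve real-time get and atomic updates against the update log — so its frequency tracks update traffic rather than the commit maxTime settings. For comparison, a typical commit configuration looks like:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>
  <autoCommit>
    <maxTime>60000</maxTime>          <!-- hard commit: flush to disk -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>30000</maxTime>          <!-- soft commit: opens a new full searcher -->
  </autoSoftCommit>
</updateHandler>
```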
 


Re: Index BackUp using JDK 8 & Restore using JDK 7. Does this work?

2016-04-18 Thread Manohar Sripada
Thanks Shawn! :-)

On Mon, Apr 18, 2016 at 6:42 PM, Shawn Heisey  wrote:

> On 4/18/2016 12:49 AM, Manohar Sripada wrote:
> > We are using Solr 5.2.1 and JDK 7. We do create a static index in one
> > cluster (solr cluster 1) and ship that index to another cluster (Solr
> > cluster 2).  Solr Cluster 2 is the one where queries will be fired.
> >
> > Due to some unavoidable reasons, we want to upgrade Solr Cluster 1 to JDK
> > 8. But, we can't upgrade Solr cluster 2 to JDK 8 in near future. Does the
> > backed up index from Solr cluster 1 which uses JDK 8 works when restored
> in
> > Solr cluster 2 which uses JDK 7?
>
> The Lucene index format (Solr is a Lucene application) is the same
> regardless of Java version or hardware platform.  Some programs (rrdtool
> being the prominent example) have a different file format on 32-bit CPUs
> compared to 64-bit CPUs ... but Lucene/Solr is not one of those programs.
>
> Some info you might already know: Solr 4.8.x through 5.5.x require Java
> 7.  Solr 6.0.0 requires Java 8.
>
> Thanks,
> Shawn
>
>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Tom Evans
On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
 wrote:
> Thanks all - very helpful.
>
> @Shawn - your reply implies that even if I'm hitting the URL for a single
> endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
> (I understand the caveat about that single endpoint being a potential point
> of failure).  I just want to verify that I'm interpreting your response
> correctly...
>
> (I have been asked to provide IT with a comprehensive list of options prior
> to a design discussion - which is why I'm trying to get clear about the
> various options)
>
> In a nutshell, I think I understand the following:
>
> a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> available nodes for searching
>   Caveat: That single URL represents a potential single point of
> failure and this should be taken into account
>
> b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> based on Zookeeper's "knowledge" of all available Solr instances.
>   Note: This is more robust than "a" due to the fact that it
> eliminates the "single point of failure"
>
> c.  Use of a load balancer hitting all known Solr instances will be fine -
> although the search requests may not run on the Solr instance the load
> balancer targeted - due to "a" above.
>
> Corrections or refinements welcomed...

With option a), although queries will be distributed across the
cluster, all queries will be going through that single node. Not only
is that a single point of failure, but you risk saturating the
inter-node network traffic, possibly resulting in lower QPS and higher
latency on your queries.

With option b), as well as SolrJ, recent versions of pysolr have a
ZK-aware SolrCloud client that behaves in a similar way.

With option c), you can use the preferLocalShards so that shards that
are local to the queried node are used in preference to distributed
shards. Depending on your shard/cluster topology, this can increase
performance if you are returning large amounts of data - many or large
fields or many documents.

Cheers

Tom


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread John Bickerstaff
Excellent - thanks!

On Mon, Apr 18, 2016 at 9:16 AM, Erick Erickson 
wrote:

> Your summary pretty much nails it.
>
> For (b) note that CloudSolrClient uses an internal software load
> balancer to distribute queries, FWIW.
>
>
>
> On Mon, Apr 18, 2016 at 7:52 AM, John Bickerstaff
>  wrote:
> > Thanks all - very helpful.
> >
> > @Shawn - your reply implies that even if I'm hitting the URL for a single
> > endpoint via HTTP - the "balancing" will still occur across the Solr
> Cloud
> > (I understand the caveat about that single endpoint being a potential
> point
> > of failure).  I just want to verify that I'm interpreting your response
> > correctly...
> >
> > (I have been asked to provide IT with a comprehensive list of options
> prior
> > to a design discussion - which is why I'm trying to get clear about the
> > various options)
> >
> > In a nutshell, I think I understand the following:
> >
> > a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> > available nodes for searching
> >   Caveat: That single URL represents a potential single point of
> > failure and this should be taken into account
> >
> > b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> > based on Zookeeper's "knowledge" of all available Solr instances.
> >   Note: This is more robust than "a" due to the fact that it
> > eliminates the "single point of failure"
> >
> > c.  Use of a load balancer hitting all known Solr instances will be fine
> -
> > although the search requests may not run on the Solr instance the load
> > balancer targeted - due to "a" above.
> >
> > Corrections or refinements welcomed...
> >
> > On Mon, Apr 18, 2016 at 7:21 AM, Shawn Heisey 
> wrote:
> >
> >> On 4/17/2016 10:35 PM, John Bickerstaff wrote:
> >> > My prior use of SOLR in production was pre SOLR cloud.  We put a
> >> > round-robin  load balancer in front of replicas for searching.
> >> >
> >> > Do I understand correctly that a load balancer is unnecessary with
> SOLR
> >> > Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless
> of
> >> > which replica's URL is getting hit?
> >>
> >> Your understanding is correct -- queries sent to a single SolrCloud node
> >> will be balanced across the cloud, although the node you are sending the
> >> queries to might represent a single point of failure.
> >>
> >> If your program is written in Java, you can use CloudSolrClient in SolrJ
> >> -- this client talks to the zookeeper ensemble and dynamically adjusts
> >> to the addition and removal of Solr nodes in the cloud.  All
> >> notifications from the cloud to the client about servers going up or
> >> down are nearly instantaneous -- the client does not need to poll for
> >> status.
> >>
> >> For other programming languages, if your client code is not capable of
> >> failing over to a second node when the primary goes down, then you would
> >> still need a load balancer.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Wildcard query behavior.

2016-04-18 Thread Erick Erickson
Here's a blog on the subject:
https://lucidworks.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

bq: When validator is changed to validate, both at query time and index time,
then should not validator*/validator return the same results at-least?

This is one of those problems that's easy to state, but hard to solve. And
there are so many variations that any attempt to solve it will _always_
have lots of surprises. Simple example (and remember that the
stemming is usually algorithmic). "validator" probably stems to "validat".
However, "validato" (note the 'o') may not stem
the same way at all, so searching for "validato*" wouldn't produce the
expected response.

Best,
Erick

On Mon, Apr 18, 2016 at 6:23 AM, Shawn Heisey  wrote:
> On 4/18/2016 1:18 AM, Modassar Ather wrote:
>> When I search for f:validator I get 80K+ documents whereas if I search for
>> f:validator* I get only around 150 results.
>>
>> When I checked on analysis page I see that validator is changed to
>> validate. Per my understanding in both the above cases it should at-least
>> give the exact same result of around 80K+ documents.
>
> What Reth was trying to tell you, but did not state clearly, is that
> when you use wildcards, your query is NOT analyzed -- none of your
> filters, including the stemmer, are used.
>
> Thanks,
> Shawn
>
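Erick's point can be sketched in a few lines of toy Python. The stemmer below is a crude stand-in for illustration only (not Lucene's actual algorithm): analyzed queries pass through the same stemming the index did, while wildcard patterns are matched literally against the already-stemmed indexed terms.

```python
import fnmatch

def toy_stem(term):
    """Crude suffix-stripping stemmer, for illustration only."""
    if term.endswith("ator"):
        return term[:-2]          # "validator" -> "validat"
    if term.endswith("ate"):
        return term[:-1]          # "validate"  -> "validat"
    return term

# Index time: terms pass through the analysis chain, so the stemmed
# form is what actually lands in the index.
indexed_terms = sorted({toy_stem(w) for w in ["validator", "validate"]})

# Analyzed query: "validator" is stemmed the same way, so it matches.
analyzed_hit = toy_stem("validator") in indexed_terms

# Wildcard query: the pattern is NOT analyzed, so "validator*" is matched
# literally against the indexed (stemmed) terms and finds nothing.
wildcard_hits = fnmatch.filter(indexed_terms, "validator*")
trimmed_hits = fnmatch.filter(indexed_terms, "validat*")

print(indexed_terms, analyzed_hit, wildcard_hits, trimmed_hits)
# -> ['validat'] True [] ['validat']
```

A common workaround is to run wildcard queries against a separate, unstemmed copyField.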


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Erick Erickson
Your summary pretty much nails it.

For (b) note that CloudSolrClient uses an internal software load
balancer to distribute queries, FWIW.



On Mon, Apr 18, 2016 at 7:52 AM, John Bickerstaff
 wrote:
> Thanks all - very helpful.
>
> @Shawn - your reply implies that even if I'm hitting the URL for a single
> endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
> (I understand the caveat about that single endpoint being a potential point
> of failure).  I just want to verify that I'm interpreting your response
> correctly...
>
> (I have been asked to provide IT with a comprehensive list of options prior
> to a design discussion - which is why I'm trying to get clear about the
> various options)
>
> In a nutshell, I think I understand the following:
>
> a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> available nodes for searching
>   Caveat: That single URL represents a potential single point of
> failure and this should be taken into account
>
> b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> based on Zookeeper's "knowledge" of all available Solr instances.
>   Note: This is more robust than "a" due to the fact that it
> eliminates the "single point of failure"
>
> c.  Use of a load balancer hitting all known Solr instances will be fine -
> although the search requests may not run on the Solr instance the load
> balancer targeted - due to "a" above.
>
> Corrections or refinements welcomed...
>
> On Mon, Apr 18, 2016 at 7:21 AM, Shawn Heisey  wrote:
>
>> On 4/17/2016 10:35 PM, John Bickerstaff wrote:
>> > My prior use of SOLR in production was pre SOLR cloud.  We put a
>> > round-robin  load balancer in front of replicas for searching.
>> >
>> > Do I understand correctly that a load balancer is unnecessary with SOLR
>> > Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless of
>> > which replica's URL is getting hit?
>>
>> Your understanding is correct -- queries sent to a single SolrCloud node
>> will be balanced across the cloud, although the node you are sending the
>> queries to might represent a single point of failure.
>>
>> If your program is written in Java, you can use CloudSolrClient in SolrJ
>> -- this client talks to the zookeeper ensemble and dynamically adjusts
>> to the addition and removal of Solr nodes in the cloud.  All
>> notifications from the cloud to the client about servers going up or
>> down are nearly instantaneous -- the client does not need to poll for
>> status.
>>
>> For other programming languages, if your client code is not capable of
>> failing over to a second node when the primary goes down, then you would
>> still need a load balancer.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread John Bickerstaff
Thanks all - very helpful.

@Shawn - your reply implies that even if I'm hitting the URL for a single
endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
(I understand the caveat about that single endpoint being a potential point
of failure).  I just want to verify that I'm interpreting your response
correctly...

(I have been asked to provide IT with a comprehensive list of options prior
to a design discussion - which is why I'm trying to get clear about the
various options)

In a nutshell, I think I understand the following:

a. Even if hitting a single URL, the Solr Cloud will "balance" across all
available nodes for searching
  Caveat: That single URL represents a potential single point of
failure and this should be taken into account

b. SolrJ's CloudSolrClient API provides the ability to distribute load --
based on Zookeeper's "knowledge" of all available Solr instances.
  Note: This is more robust than "a" due to the fact that it
eliminates the "single point of failure"

c.  Use of a load balancer hitting all known Solr instances will be fine -
although the search requests may not run on the Solr instance the load
balancer targeted - due to "a" above.

Corrections or refinements welcomed...

On Mon, Apr 18, 2016 at 7:21 AM, Shawn Heisey  wrote:

> On 4/17/2016 10:35 PM, John Bickerstaff wrote:
> > My prior use of SOLR in production was pre SOLR cloud.  We put a
> > round-robin  load balancer in front of replicas for searching.
> >
> > Do I understand correctly that a load balancer is unnecessary with SOLR
> > Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless of
> > which replica's URL is getting hit?
>
> Your understanding is correct -- queries sent to a single SolrCloud node
> will be balanced across the cloud, although the node you are sending the
> queries to might represent a single point of failure.
>
> If your program is written in Java, you can use CloudSolrClient in SolrJ
> -- this client talks to the zookeeper ensemble and dynamically adjusts
> to the addition and removal of Solr nodes in the cloud.  All
> notifications from the cloud to the client about servers going up or
> down are nearly instantaneous -- the client does not need to poll for
> status.
>
> For other programming languages, if your client code is not capable of
> failing over to a second node when the primary goes down, then you would
> still need a load balancer.
>
> Thanks,
> Shawn
>
>


want to subscribe

2016-04-18 Thread SRINI SOLR



Re: Adding a new shard

2016-04-18 Thread Jay Potharaju
Thanks for the explanation, Erick! I will try out your recommendation.


On Sun, Apr 17, 2016 at 3:34 PM, Erick Erickson 
wrote:

> bq: So in order for me to move the shards to their own instances, I will
> have to take downtime and move the newly created shards & replicas to
> their own instances.
>
> No, this is not true.
>
> The easiest way to move things around is use the collections API
> ADDREPLICA command after splitting.
>
> Let's call this particular shard S1 on machine M1, and the results of
> the SPLITSHARD command S1.1 and S1.2 Further, let's say that your goal
> is to move _one_ of the subshards from machine M1 to M2
>
> So the sequence is:
>
> 1> issue SPLITSHARD and wait for it to complete. This requires no
> downtime and after the split the old shard becomes inactive and the
> two new subshards are servicing all requests. I'd probably stop
> indexing during this operation just to be on the safe side, although
> that's not necessary. So now you have both S1.1 and S1.2 running on M1
>
> 2> Use the ADDREPLICA command to add a replica of S1.2 to M2. Again,
> no downtime required. Wait until the new replica is "active", at which
> point it's fully operational. So now we have S1.1 and S1.2 running on
> M1 and S1.2.1 running on M2.
>
> 3> Use the DELETEREPLICA command to remove S1.2 from M1. Now you have
> S1.1 running on M1 and S1.2.1 running on M2. No downtime during any of
> this.
>
> 4> You should be able to delete S1 now from M1 just to tidy up.
>
> 5> Repeat for the other shards.
>
> Best,
> Erick
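For reference, the three Collections API calls in Erick's sequence are plain HTTP requests. A Python sketch that only builds the request URLs (host, collection, shard, and replica names are hypothetical - check the SPLITSHARD/ADDREPLICA/DELETEREPLICA parameters for your Solr version before executing anything):

```python
from urllib.parse import urlencode

BASE = "http://solr1:8983/solr/admin/collections"   # hypothetical host

def collections_api(action, **params):
    """Build a Collections API request URL (nothing is executed here)."""
    return BASE + "?" + urlencode(dict(action=action, **params))

# 1> split shard1; Solr names the subshards shard1_0 and shard1_1
split = collections_api("SPLITSHARD", collection="mycoll", shard="shard1")

# 2> add a replica of one subshard on the new machine
add = collections_api("ADDREPLICA", collection="mycoll",
                      shard="shard1_1", node="solr2:8983_solr")

# 3> drop the original copy of that subshard from the old machine
#    (replica is the core node name, e.g. from CLUSTERSTATUS; hypothetical here)
delete = collections_api("DELETEREPLICA", collection="mycoll",
                         shard="shard1_1", replica="core_node3")

print(split)
print(add)
print(delete)
```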
>
>
> On Sun, Apr 17, 2016 at 3:09 PM, Jay Potharaju 
> wrote:
> > Erick, thanks for the reply. In my current prod setup I anticipate the
> > number of documents to grow almost 5 times by the end of the year, and am
> > therefore planning how to scale when required. We have a high query
> > volume and a growing dataset, which is why we would like to scale by
> > sharding & replication.
> >
> > In my dev sandbox, I have 2 replicas & 2 shards created using compositeId
> > as my routing option. If I split the shard, it will create 2 new shards
> on
> > each the solr instances including replicas and my request will start
> going
> > to the new shards.
> > So in order for me to move the shards to their own instances, I will have
> > to take downtime and move the newly created shards & replicas to their
> > own instances. Is that a correct interpretation of how shard splitting
> > would work?
> >
> > I was hoping that Solr would automagically split the existing shard &
> > create replicas on the new instances rather than the existing nodes. That
> > is why I said the current shard splitting will not work for me.
> > Thanks
> >
> > On Sat, Apr 16, 2016 at 8:08 PM, Erick Erickson  >
> > wrote:
> >
> >> Why don't you think splitting the shards will do what you need?
> >> Admittedly it will have to be applied to each shard and will
> >> double the number of shards you have, that's the current
> >> limitation. At the end, though, you will have 4 shards when
> >> you used to have 2 and you can move them around to whatever
> >> hardware you can scrape up.
> >>
> >> This assumes you're using the default compositeId routing
> >> scheme and not implicit routing. If you are using compositeId
> >> there is no provision to add another shard.
> >>
> >> As far as SOLR-5025 is concerned, nobody's working on that
> >> that I know of.
> >>
> >> I have to ask though whether you've tuned your existing
> >> machines. How many docs are on each? Why do you think
> >> you need more shards? Query speed? OOMs? Java heaps
> >> getting too big?
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Apr 15, 2016 at 10:50 PM, Jay Potharaju 
> >> wrote:
> >> > I found ticket https://issues.apache.org/jira/browse/SOLR-5025 which
> >> talks
> >> > about sharding in solrcloud. Are there any plans to address this
> issue in
> >> > near future?
> >> > Can any of the users on the forum comment how they are handling this
> >> > scenario in production?
> >> > Thanks
> >> >
> >> > On Fri, Apr 15, 2016 at 4:28 PM, Jay Potharaju  >
> >> > wrote:
> >> >
> >> >> Hi,
> >> >> I have an existing collection which has 2 shards, one on each node in
> >> the
> >> >> cloud. Now I want to split the existing collection into 3 shards
> >> because of
> >> >> increase in volume of data. And create this new shard  on a new node
> in
> >> the
> >> >> solrCloud.
> >> >>
> >> >>  I read about splitting a shard & creating a shard, but not sure it
> will
> >> >> work.
> >> >>
> >> >> Any suggestions how are others handling this scenario in production.
> >> >> --
> >> >> Thanks
> >> >> Jay
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks
> >> > Jay Potharaju
> >>
> >
> >
> >
> > --
> > Thanks
> > Jay Potharaju
>



-- 
Thanks
Jay Potharaju


Re: normal solr query vs facet query performance

2016-04-18 Thread Shawn Heisey
On 4/18/2016 5:06 AM, Mugeesh Husain wrote:
> 1.)solr normal query(q=*:*) vs facet query(facet.query="abc") ?
> 2.)solr normal query(q=*:*) vs facet
> search(facet=true&facet.field=column_name) ?
> 3.)solr filter query(q=Column:some value) vs facet query(facet.query="abc")
> ?
> 4.)solr normal query(q=*:*) vs filter query(q=column:some value) ?

This is a question that is nearly impossible to answer without your
actual index, and even then only you can answer it.  You need to *try*
these queries and see what happens.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Note that there is a performance bug with the *:* (MatchAllDocsQuery)
query on 5.x versions, which is only solved in 5.5.0 and later.  This
query runs quite a bit slower than it should.

https://issues.apache.org/jira/browse/SOLR-8251

Thanks,
Shawn
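For clarity before benchmarking, the four variants above differ only in request parameters (and note that restricting results by a field value is normally done with fq, which is cached in the filterCache, rather than q). A sketch of the corresponding parameter sets, with hypothetical field names:

```python
from urllib.parse import urlencode

# The four requests being compared, expressed as Solr query parameters.
# Field and value names ("column_name", "abc") are hypothetical.
variants = {
    "match_all":    {"q": "*:*"},
    "filter_query": {"q": "*:*", "fq": "column_name:abc"},   # fq hits the filterCache
    "facet_field":  {"q": "*:*", "facet": "true", "facet.field": "column_name"},
    "facet_query":  {"q": "*:*", "facet": "true", "facet.query": "column_name:abc"},
}

for name, params in variants.items():
    print(name, "->", "/select?" + urlencode(params))
```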



Re: Overall large size in Solr across collections

2016-04-18 Thread Shawn Heisey
On 4/18/2016 4:22 AM, Zheng Lin Edwin Yeo wrote:
> I have many collections in Solr, but with only 1 shard. I found that the
> index size across all the collections has passed the 1TB mark. Currently
> the query speed is still normal, but the indexing speed seems to be become
> slower.
>
> Will it affect the performance if I continue to increase the index size but
> stick to 1 shard?

I have noticed overall *bulk* indexing speed slows down as the index
gets bigger, but I suspect that a big part of the reason this happens is
segment merging involves more *large* segments, tying up I/O resources.

The amount of time required to index a small number of documents should
not be affected much by index size, but something that is likely to take
longer with a large index is the *commit* operation -- especially if
Solr's caches are configured to autowarm.

Running the index on SSD, or on a RAID10 volume with a lot of regular
disks, can greatly speed up indexing.  The parity-based RAID levels
(primarily 5 and 6) have a fairly severe write penalty, so I do not
recommend them for Solr, unless indexing happens infrequently.

Installing plenty of memory is very helpful for *query* speed, but it
can also *indirectly* speed up indexing.  If the disk is not busy when
queries are happening, there's more I/O bandwidth available for writes.

Thanks,
Shawn



Re: Wildcard query behavior.

2016-04-18 Thread Shawn Heisey
On 4/18/2016 1:18 AM, Modassar Ather wrote:
> When I search for f:validator I get 80K+ documents whereas if I search for
> f:validator* I get only around 150 results.
>
> When I checked on analysis page I see that validator is changed to
> validate. Per my understanding, in both the above cases it should at least
> give the exact same result of around 80K+ documents.

What Reth was trying to tell you, but did not state clearly, is that
when you use wildcards, your query is NOT analyzed -- none of your
filters, including the stemmer, are used.

Thanks,
Shawn



Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Shawn Heisey
On 4/17/2016 10:35 PM, John Bickerstaff wrote:
> My prior use of SOLR in production was pre SOLR cloud.  We put a
> round-robin  load balancer in front of replicas for searching.
>
> Do I understand correctly that a load balancer is unnecessary with SOLR
> Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless of
> which replica's URL is getting hit?

Your understanding is correct -- queries sent to a single SolrCloud node
will be balanced across the cloud, although the node you are sending the
queries to might represent a single point of failure.

If your program is written in Java, you can use CloudSolrClient in SolrJ
-- this client talks to the zookeeper ensemble and dynamically adjusts
to the addition and removal of Solr nodes in the cloud.  All
notifications from the cloud to the client about servers going up or
down are nearly instantaneous -- the client does not need to poll for
status.

For other programming languages, if your client code is not capable of
failing over to a second node when the primary goes down, then you would
still need a load balancer.

Thanks,
Shawn



Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Jack Krupansky
SolrJ does indeed provide load balancing via CloudSolrClient which
uses LBHttpSolrClient:
https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html
https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/LBHttpSolrClient.html

There is the separate issue of how many application clients you may have
and whether a load balancer would be in front of them.

-- Jack Krupansky

On Mon, Apr 18, 2016 at 4:34 AM, Jaroslaw Rozanski 
wrote:

> Hi,
>
> How are you executing searches?
>
> I am asking because if you search using a Solr client, for example SolrJ -
> i.e. create an instance of CloudSolrClient, and not directly via the HTTP
> endpoint - it will provide load-balancing (last time I checked it picks
> a random non-stale node).
>
>
> Thanks,
> Jarek
>
> On Mon, 18 Apr 2016, at 05:58, John Bickerstaff wrote:
> > Thanks, so on the matter of indexing -- while I could isolate a cloud
> > replica from queries by not including it in the load balancer's list...
> >
> > ... I cannot isolate any of the replicas from an indexing perspective by
> > a
> > similar strategy because the SOLR leader decides who does indexing?  Or
> > do
> > all "nodes" index the same incoming document independently?
> >
> > Now that I know I still need a load balancer, I guess I'm trying to find
> > a
> > way to keep indexing load off servers that are busy serving search
> > results...  Possibly by having one or two servers just handle indexing...
> >
> > Perhaps I'm looking in the wrong direction though -- and should just spin
> > up more replicas to handle more indexing load?
> > On Apr 17, 2016 10:46 PM, "Walter Underwood" 
> > wrote:
> >
> > No, Zookeeper is used for managing the locations of replicas and the
> > leader
> > for indexing. Queries should still be distributed with a load balancer.
> >
> > Queries do NOT go through Zookeeper.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Apr 17, 2016, at 9:35 PM, John Bickerstaff <
> j...@johnbickerstaff.com>
> > wrote:
> > >
> > > My prior use of SOLR in production was pre SOLR cloud.  We put a
> > > round-robin  load balancer in front of replicas for searching.
> > >
> > > Do I understand correctly that a load balancer is unnecessary with SOLR
> > > Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless
> of
> > > which replica's URL is getting hit?
> > >
> > > Are there any caveats?
> > >
> > > Thanks,
>


Re: Index BackUp using JDK 8 & Restore using JDK 7. Does this work?

2016-04-18 Thread Shawn Heisey
On 4/18/2016 12:49 AM, Manohar Sripada wrote:
> We are using Solr 5.2.1 and JDK 7. We do create a static index in one
> cluster (solr cluster 1) and ship that index to another cluster (Solr
> cluster 2).  Solr Cluster 2 is the one where queries will be fired.
>
> Due to some unavoidable reasons, we want to upgrade Solr Cluster 1 to JDK
> 8. But, we can't upgrade Solr cluster 2 to JDK 8 in near future. Does the
> backed up index from Solr cluster 1 which uses JDK 8 works when restored in
> Solr cluster 2 which uses JDK 7?

The Lucene index format (Solr is a Lucene application) is the same
regardless of Java version or hardware platform.  Some programs (rrdtool
being the prominent example) have a different file format on 32-bit CPUs
compared to 64-bit CPUs ... but Lucene/Solr is not one of those programs.

Some info you might already know: Solr 4.8.x through 5.5.x require Java
7.  Solr 6.0.0 requires Java 8.

Thanks,
Shawn



[ANNOUNCEMENT] Luke 6.0.0 released

2016-04-18 Thread Dmitry Kan
Download the release zip here:

https://github.com/DmitryKey/luke/releases/tag/luke-6.0.0

Major upgrade to new Lucene 6.0.0 API.

#55 
Enjoy!


-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


normal solr query vs facet query performance

2016-04-18 Thread Mugeesh Husain
Hello,

I am looking at which query will be faster in terms of performance:

1.)solr normal query(q=*:*) vs facet query(facet.query="abc") ?
2.)solr normal query(q=*:*) vs facet
search(facet=true&facet.field=column_name) ?
3.)solr filter query(q=Column:some value) vs facet query(facet.query="abc")
?
4.)solr normal query(q=*:*) vs filter query(q=column:some value) ?



Also, please point me to some good tutorials on the above.


Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/normal-solr-query-vs-facet-query-performance-tp4270907.html
Sent from the Solr - User mailing list archive at Nabble.com.


Overall large size in Solr across collections

2016-04-18 Thread Zheng Lin Edwin Yeo
Hi,

I have many collections in Solr, but with only 1 shard. I found that the
index size across all the collections has passed the 1TB mark. Currently
the query speed is still normal, but the indexing speed seems to become
slower.

Will it affect the performance if I continue to increase the index size but
stick to 1 shard?

I'm using Solr 5.4.0

Regards,
Edwin


Re: Wildcard query behavior.

2016-04-18 Thread Modassar Ather
Thanks Reth for your response.

When validator is changed to validate, both at query time and index time,
then should not validator*/validator return the same results, at least?

E.g. 5 documents contain validator. At index time validator got changed to
validate.
Now when validator* is searched it will also change to validate and should
match all 5 documents. In this case I am not sure how the wildcard
internally is handled meaning what the query will transform to.

Please help me understand the internals of wildcard with stemming or point
me to some documents as I could not find any details on it.

Best,
Modassar

On Mon, Apr 18, 2016 at 1:04 PM, Reth RM  wrote:

> If you search for f:validat*, then I believe you will get the same number of
> results. Please check.
>
> f:validator* is searching for records that have the prefix "validator",
> whereas the field with a stemmer which stems "validator" to "validate" (if
> this stemming was applied at index time as well as query time) is looking
> for records that have "validate" or "validator", so for obvious reasons,
> numFound might have been different.
>
>
>
> On Mon, Apr 18, 2016 at 12:48 PM, Modassar Ather 
> wrote:
>
> > Hi,
> >
> > Please help me understand following.
> >
> > I have analysis chain which uses KStemFilterFactory for a field. Solr
> > version is 5.4.0
> >
> > When I search for f:validator I get 80K+ documents whereas if I search
> for
> > f:validator* I get only around 150 results.
> >
> > When I checked on analysis page I see that validator is changed to
> > validate. Per my understanding, in both the above cases it should at least
> > give the exact same result of around 80K+ documents.
> >
> > I understand in some cases wildcards can result in sub-optimal results
> for
> > stemmed content. Please correct me if I am wrong.
> >
> > Thanks,
> > Modassar
> >
>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Jaroslaw Rozanski
Hi,

How are you executing searches? 

I am asking because if you search using a Solr client, for example SolrJ -
i.e. create an instance of CloudSolrClient, and not directly via the HTTP
endpoint - it will provide load-balancing (last time I checked it picks
a random non-stale node).


Thanks,
Jarek

On Mon, 18 Apr 2016, at 05:58, John Bickerstaff wrote:
> Thanks, so on the matter of indexing -- while I could isolate a cloud
> replica from queries by not including it in the load balancer's list...
> 
> ... I cannot isolate any of the replicas from an indexing perspective by
> a
> similar strategy because the SOLR leader decides who does indexing?  Or
> do
> all "nodes" index the same incoming document independently?
> 
> Now that I know I still need a load balancer, I guess I'm trying to find
> a
> way to keep indexing load off servers that are busy serving search
> results...  Possibly by having one or two servers just handle indexing...
> 
> Perhaps I'm looking in the wrong direction though -- and should just spin
> up more replicas to handle more indexing load?
> On Apr 17, 2016 10:46 PM, "Walter Underwood" 
> wrote:
> 
> No, Zookeeper is used for managing the locations of replicas and the
> leader
> for indexing. Queries should still be distributed with a load balancer.
> 
> Queries do NOT go through Zookeeper.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Apr 17, 2016, at 9:35 PM, John Bickerstaff 
> wrote:
> >
> > My prior use of SOLR in production was pre SOLR cloud.  We put a
> > round-robin  load balancer in front of replicas for searching.
> >
> > Do I understand correctly that a load balancer is unnecessary with SOLR
> > Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless of
> > which replica's URL is getting hit?
> >
> > Are there any caveats?
> >
> > Thanks,


Re: Wildcard query behavior.

2016-04-18 Thread Reth RM
If you search for f:validat*, then I believe you will get the same number of
results. Please check.

f:validator* is searching for records that have the prefix "validator",
whereas the field with a stemmer which stems "validator" to "validate" (if
this stemming was applied at index time as well as query time) is looking
for records that have "validate" or "validator", so for obvious reasons,
numFound might have been different.



On Mon, Apr 18, 2016 at 12:48 PM, Modassar Ather 
wrote:

> Hi,
>
> Please help me understand following.
>
> I have analysis chain which uses KStemFilterFactory for a field. Solr
> version is 5.4.0
>
> When I search for f:validator I get 80K+ documents whereas if I search for
> f:validator* I get only around 150 results.
>
> When I checked on analysis page I see that validator is changed to
> > validate. Per my understanding, in both the above cases it should at least
> give the exact same result of around 80K+ documents.
>
> I understand in some cases wildcards can result in sub-optimal results for
> stemmed content. Please correct me if I am wrong.
>
> Thanks,
> Modassar
>


Wildcard query behavior.

2016-04-18 Thread Modassar Ather
Hi,

Please help me understand following.

I have analysis chain which uses KStemFilterFactory for a field. Solr
version is 5.4.0

When I search for f:validator I get 80K+ documents whereas if I search for
f:validator* I get only around 150 results.

When I checked on analysis page I see that validator is changed to
validate. Per my understanding, in both the above cases it should at least
give the exact same result of around 80K+ documents.

I understand in some cases wildcards can result in sub-optimal results for
stemmed content. Please correct me if I am wrong.

Thanks,
Modassar