Re: KeywordTokenizer and multiValued field

2017-04-12 Thread Erick Erickson
So I have a field named "key" that uses KeywordTokenizer and has
multiValued="true" set. A doc like

  val one
  yet another value
  third


My field will have exactly three indexed tokens

val one
yet another value
third
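
A minimal sketch of the setup (the markup here is illustrative, not the original poster's exact schema):

  <!-- schema: keyword-tokenized, multiValued field -->
  <fieldType name="string_keyword" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="key" type="string_keyword" indexed="true" stored="true" multiValued="true"/>

  <!-- the document: each value becomes exactly one token -->
  <doc>
    <field name="key">val one</field>
    <field name="key">yet another value</field>
    <field name="key">third</field>
  </doc>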

Best,
Erick

On Wed, Apr 12, 2017 at 2:38 PM, Ahmet Arslan  wrote:
> I don't understand the first option: what is each value? KeywordTokenizer 
> emits a single token, analogous to the string type.
>
>
>
> On Wednesday, April 12, 2017, 7:45:52 PM GMT+3, Walter Underwood 
>  wrote:
> Does the KeywordTokenizer make each value into a unitary string or does it 
> take the whole list of values and make that a single string?
>
> I really hope it is the former. I can’t find this in the docs (including 
> JavaDocs).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)


Re: Enable Gzip compression Solr 6.0

2017-04-12 Thread Mahmoud Almokadem
Thanks Rick,

I am already running Solr on my infrastructure, behind a web application.

The web application acts as a proxy in front of Solr, so I thought I could 
compress the content on the Solr end, but I have set it up on the proxy for now.
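
For reference, a minimal sketch of the proxy side with nginx (directive values and the upstream address are illustrative):

  # compress Solr responses on their way back to clients
  gzip on;
  gzip_types application/json application/xml text/xml text/plain;
  gzip_min_length 1024;

  location /solr/ {
      proxy_pass http://127.0.0.1:8983/solr/;
  }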

Thanks again,
Mahmoud 


> On Apr 12, 2017, at 4:31 PM, Rick Leir  wrote:
> 
> Hi Mahmoud
> I assume you are running Solr 'behind' a web application, so Solr is not 
> directly on the net.
> 
> The gzip compression is an Apache thing, and relates to your web application. 
> 
> Connections to Solr are within your infrastructure, so you might not want to 
> gzip them. But maybe your setup​ is different?
> 
> Older versions of Solr used Tomcat which supported gzip. Newer versions use 
> Zookeeper and Jetty and you prolly will find a way.
> Cheers -- Rick
> 
>> On April 12, 2017 8:48:45 AM EDT, Mahmoud Almokadem  
>> wrote:
>> Hello,
>> 
>> How can I enable Gzip compression for Solr 6.0 to save bandwidth
>> between
>> the server and clients?
>> 
>> Thanks,
>> Mahmoud
> 
> -- 
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Erick Erickson
You're missing the point of my comment. Since they already are
docValues, you can use the /export functionality to get the results
back as a _stream_ and avoid all of the overhead of the aggregator
node doing a merge sort and all of that.

You'll have to do this from SolrJ, but see CloudSolrStream. You can
see examples of its usage in StreamingTest.java.

this should
1> complete much, much faster. The design goal is 400K rows/second but YMMV
2> use vastly less memory on your Solr instances.
3> only require _one_ query
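
A rough SolrJ sketch of that loop, written against a 6.x SolrJ (the ZooKeeper address, collection, and field names are placeholders; StreamingTest.java remains the authoritative reference):

  import java.io.IOException;
  import org.apache.solr.client.solrj.io.SolrClientCache;
  import org.apache.solr.client.solrj.io.Tuple;
  import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
  import org.apache.solr.client.solrj.io.stream.StreamContext;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class ExportExample {
    public static void main(String[] args) throws IOException {
      // every field in fl/sort must have docValues for the /export handler
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("q", "*:*");
      params.set("qt", "/export");
      params.set("fl", "id,field_a,field_b");
      params.set("sort", "id asc,field_a asc,field_b asc");

      SolrClientCache cache = new SolrClientCache();
      CloudSolrStream stream =
          new CloudSolrStream("zkhost1:2181,zkhost2:2181/solr", "collection1", params);
      StreamContext context = new StreamContext();
      context.setSolrClientCache(cache);
      stream.setStreamContext(context);
      try {
        stream.open();
        while (true) {
          Tuple tuple = stream.read();   // one document's docValues at a time
          if (tuple.EOF) {
            break;
          }
          System.out.println(tuple.getString("id"));
        }
      } finally {
        stream.close();
        cache.close();
      }
    }
  }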

Best,
Erick

On Wed, Apr 12, 2017 at 7:36 PM, Shawn Heisey  wrote:
> On 4/12/2017 5:19 PM, Chetas Joshi wrote:
>> I am getting back 100K results per page.
>> The fields have docValues enabled and I am getting sorted results based on 
>> "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
>>
>> I have a solr Cloud of 80 nodes. There will be one shard that will get top 
>> 100K docs from each shard and apply merge sort. So, the max memory usage of 
>> any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap memory 
>> usage shoot up from 8 GB to 17 GB?
>
> From what I understand, Java overhead for a String object is 56 bytes
> above the actual byte size of the string itself.  And each character in
> the string will be two bytes -- Java uses UTF-16 for character
> representation internally.  If I'm right about these numbers, it means
> that each of those id values will take 120 bytes -- and that doesn't
> include the size of the actual response (xml, json, etc).
>
> I don't know what the overhead for a long is, but you can be sure that
> it's going to take more than eight bytes total memory usage for each one.
>
> Then there is overhead for all the Lucene memory structures required to
> execute the query and gather results, plus Solr memory structures to
> keep track of everything.  I have absolutely no idea how much memory
> Lucene and Solr use to accomplish a query, but it's not going to be
> small when you have 200 million documents per shard.
>
> Speaking of Solr memory requirements, under normal query circumstances
> the aggregating node is going to receive at least 100K results from
> *every* shard in the collection, which it will condense down to the
> final result with 100K entries.  The behavior during a cursor-based
> request may be more memory-efficient than what I have described, but I
> am unsure whether that is the case.
>
> If the cursor behavior is not more efficient, then each entry in those
> results will contain the uniqueKey value and the score.  That's going to
> be many megabytes for every shard.  If there are 80 shards, it would
> probably be over a gigabyte for one request.
>
> Thanks,
> Shawn
>


Re: unexpected docvalues type NONE

2017-04-12 Thread Prashant Saraswat
Hi Shawn,

The listing_lastmodified field was not changed. I only added a new field. I
have removed the field, but I still get the error.

Thanks
Prashant

On Wed, Apr 12, 2017 at 11:20 PM, Shawn Heisey  wrote:

> On 4/12/2017 8:04 PM, Prashant Saraswat wrote:
> > I'm using Solr 6.4.0. The schema was created on 6.4.0 and I indexed
> several
> > hundred thousand documents and everything was fine.
> >
> > Now I added one field to the schema:
> >
> >  > stored="true" required="false"/>
> >
> > I suddenly start getting this error for certain queries ( not all
> queries and even for queries that have nothing to do with this field ). See
> full exception below.
> >
> > Am I supposed to reindex the entire dataset when anything changes in the
> > schema as long as even one field is using docvalues?
>
> The error suggests that the definition for listing_lastmodified was
> altered at some point without reindexing.  I'm guessing that the field
> did NOT have docValues originally, then docValues was enabled.
>
> When the docValues setting for a field is changed, deleting the index
> and rebuilding it is required.  In 6.4.0, simple field types like the
> TrieDoubleField used for tdouble have docValues enabled by default.
>
> Adding a field should never require a reindex, unless you want to
> populate that field in documents that have already been indexed.
>
> Thanks,
> Shawn
>
>


-- 



Re: unexpected docvalues type NONE

2017-04-12 Thread Shawn Heisey
On 4/12/2017 8:04 PM, Prashant Saraswat wrote:
> I'm using Solr 6.4.0. The schema was created on 6.4.0 and I indexed several
> hundred thousand documents and everything was fine.
>
> Now I added one field to the schema:
>
>  stored="true" required="false"/>
>
> I suddenly start getting this error for certain queries ( not all queries and 
> even for queries that have nothing to do with this field ). See full 
> exception below.
>
> Am I supposed to reindex the entire dataset when anything changes in the
> schema as long as even one field is using docvalues?

The error suggests that the definition for listing_lastmodified was
altered at some point without reindexing.  I'm guessing that the field
did NOT have docValues originally, then docValues was enabled.

When the docValues setting for a field is changed, deleting the index
and rebuilding it is required.  In 6.4.0, simple field types like the
TrieDoubleField used for tdouble have docValues enabled by default.
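
As a hypothetical illustration of the kind of change that produces this error (the actual schema history may differ):

  <!-- segments were written while the field had no docValues -->
  <field name="listing_lastmodified" type="tdouble" indexed="true" stored="true" docValues="false"/>

  <!-- later definition picks up the docValues="true" default that tdouble now gets -->
  <field name="listing_lastmodified" type="tdouble" indexed="true" stored="true"/>

Queries touching the old segments then fail with "unexpected docvalues type NONE" until the index is rebuilt from scratch.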

Adding a field should never require a reindex, unless you want to
populate that field in documents that have already been indexed.

Thanks,
Shawn



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Koji Sekiguchi

Hi Walter,

May I ask a tangential question? I'm curious the following line you wrote:

> Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those 
> do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results 
> is just not as good as vector-space engines. So, probabilistic engines are mostly dead.


Can you elaborate on this?

I thought Okapi BM25, which is the default Similarity in Solr, is based on the probabilistic 
model. Did you mean that Lucene/Solr is still based on the vector space model, with 
BM25Similarity built on top of it, so that BM25Similarity is not a pure probabilistic 
scoring system? Or that Okapi BM25 is not originally probabilistic?

As for me, I prefer the vector space idea to the probabilistic one for information retrieval, 
and I stick with ClassicSimilarity for my projects.
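
For anyone following along, staying on the classic TF-IDF scoring is a one-line, top-level schema setting; a sketch:

  <similarity class="solr.ClassicSimilarityFactory"/>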

Thanks,

Koji


On 2017/04/13 4:08, Walter Underwood wrote:

Fine. It can’t be done. If it was easy, Solr/Lucene would already have the 
feature, right?

Solr is a vector-space engine. Some early engines (Verity VDK) were 
probabilistic engines. Those do give an absolute estimate of the relevance of 
each hit. Unfortunately, the relevance of results is just not as good as 
vector-space engines. So, probabilistic engines are mostly dead.

But, “you don’t want to do it” is very good advice. Instead of trying to reduce 
bad hits, work on increasing good hits. It is really hard, sometimes not 
possible, to optimize both. Increasing the good hits makes your customers 
happy. Reducing the bad hits makes your UX team happy.

Here is a process. Start collecting the clicks on the search results page (SRP) 
with each query. Look at queries that have below average clickthrough. See if 
those can be combined into categories, then address each category.

Some categories that I have used:

* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all 
valid. Use synonyms or shingles (and maybe the word delimiter filter) to match 
these.

* Misspellings. These should be about 10% of queries. Use fuzzy matching. I 
recommend the patch in SOLR-629.

* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. 
People search for “kids movies”, but your movie genre is “Children and Family”. 
Use synonyms.

* Missing content. People can’t find anything about beach parking because there 
isn’t a page about that. Instead, there are scraps of info about beach parking 
in multiple other pages. Fix the content.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Apr 12, 2017, at 11:44 AM, David Kramer  wrote:

The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be 
done” or “here’s how to do it” than “you don’t want to do it”.   I’m still left 
not knowing if it’s even possible. The one concrete answer of using frange 
doesn’t help as referencing score in either the q or the fq produces an 
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:

Can't the filter be used in cases when you're paginating in
sharded-scenario ?
So if you do limit=10, offset=10, each shard will return 20 docs ?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return 10 docs ? (they will still score all docs, but merging will be
faster)

Makes sense ?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti 

Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Shawn Heisey
On 4/12/2017 5:19 PM, Chetas Joshi wrote:
> I am getting back 100K results per page.
> The fields have docValues enabled and I am getting sorted results based on 
> "id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).
>
> I have a solr Cloud of 80 nodes. There will be one shard that will get top 
> 100K docs from each shard and apply merge sort. So, the max memory usage of 
> any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap memory usage 
> shoot up from 8 GB to 17 GB?

From what I understand, Java overhead for a String object is 56 bytes
above the actual byte size of the string itself.  And each character in
the string will be two bytes -- Java uses UTF-16 for character
representation internally.  If I'm right about these numbers, it means
that each of those id values will take 120 bytes -- and that doesn't
include the size of the actual response (xml, json, etc).
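
Spelling the arithmetic out (assuming a 32-character id, per the earlier message):

  per id:                   56 bytes overhead + (2 bytes/char * 32 chars) = 120 bytes
  at the aggregating node:  120 bytes * 100,000 rows * 80 shards = ~960 MB

and that is before the long values, scores, and per-entry container overhead are counted.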

I don't know what the overhead for a long is, but you can be sure that
it's going to take more than eight bytes total memory usage for each one.

Then there is overhead for all the Lucene memory structures required to
execute the query and gather results, plus Solr memory structures to
keep track of everything.  I have absolutely no idea how much memory
Lucene and Solr use to accomplish a query, but it's not going to be
small when you have 200 million documents per shard.

Speaking of Solr memory requirements, under normal query circumstances
the aggregating node is going to receive at least 100K results from
*every* shard in the collection, which it will condense down to the
final result with 100K entries.  The behavior during a cursor-based
request may be more memory-efficient than what I have described, but I
am unsure whether that is the case.

If the cursor behavior is not more efficient, then each entry in those
results will contain the uniqueKey value and the score.  That's going to
be many megabytes for every shard.  If there are 80 shards, it would
probably be over a gigabyte for one request.

Thanks,
Shawn



unexpected docvalues type NONE

2017-04-12 Thread Prashant Saraswat
Hi,

I'm using Solr 6.4.0. The schema was created on 6.4.0 and I indexed several
hundred thousand documents and everything was fine.

Now I added one field to the schema:



I suddenly start getting this error for certain queries ( not all queries
and even for queries that have nothing to do with this field ). See full
exception below.

Am I supposed to reindex the entire dataset when anything changes in the
schema as long as even one field is using docvalues?

Thanks
Prashant




java.lang.IllegalStateException: unexpected docvalues type NONE for field
'listing_lastmodified' (expected one of [BINARY, NUMERIC, SORTED,
SORTED_NUMERIC, SORTED_SET]). Re-index with correct docvalues type.
at org.apache.lucene.index.DocValues.checkField(DocValues.java:212)
at
org.apache.lucene.index.DocValues.getDocsWithField(DocValues.java:324)
at
org.apache.solr.search.SolrIndexSearcher.decorateDocValueFields(SolrIndexSearcher.java:783)
at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:136)
at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:52)
at
org.apache.solr.response.BinaryResponseWriter$Resolver.writeResultsBody(BinaryResponseWriter.java:124)
at
org.apache.solr.response.BinaryResponseWriter$Resolver.writeResults(BinaryResponseWriter.java:143)
at
org.apache.solr.response.BinaryResponseWriter$Resolver.resolve(BinaryResponseWriter.java:87)
at
org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:234)

-- 



Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Chetas Joshi
I am getting back 100K results per page.
The fields have docValues enabled and I am getting sorted results based on
"id" and 2 more fields (String: 32 Bytes and Long: 8 Bytes).

I have a solr Cloud of 80 nodes. There will be one shard that will get top
100K docs from each shard and apply merge sort. So, the max memory usage of
any shard could be 40 bytes * 100K * 80 = 320 MB. Why would heap memory
usage shoot up from 8 GB to 17 GB?

Thanks!

On Wed, Apr 12, 2017 at 1:32 PM, Erick Erickson 
wrote:

> Oh my. Returning 100K rows per request is usually poor practice.
> One hopes these are very tiny docs.
>
> But this may well be an "XY" problem. What kinds of information
> are you returning in your docs and could they all be docValues
> types? In which case you would be waaay far ahead by using
> the various Streaming options.
>
> Best,
> Erick
>
> On Wed, Apr 12, 2017 at 12:59 PM, Chetas Joshi 
> wrote:
> > I am running a query that returns 10 MM docs in total and the number of
> > rows per page is 100K.
> >
> > On Wed, Apr 12, 2017 at 12:53 PM, Mikhail Khludnev 
> wrote:
> >
> >> And what is the rows parameter?
> >>
> >> On Apr 12, 2017, at 21:32, "Chetas Joshi" <chetas.jo...@gmail.com> wrote:
> >>
> >> > Thanks for your response Shawn and Wunder.
> >> >
> >> > Hi Shawn,
> >> >
> >> > Here is the system config:
> >> >
> >> > Total system memory = 512 GB
> >> > each server handles two 500 MB cores
> >> > Number of solr docs per 500 MB core = 200 MM
> >> >
> >> > The average heap usage is around 4-6 GB. When the read starts using
> the
> >> > Cursor approach, the heap usage starts increasing with the base of the
> >> > sawtooth at 8 GB and then shooting up to 17 GB. Even after the full
> GC,
> >> the
> >> > heap usage remains around 15 GB and then it comes down to 8 GB.
> >> >
> >> > With 100K docs, the requirement will be in MBs so it is strange it is
> >> > jumping from 8 GB to 17 GB while preparing the sorted response.
> >> >
> >> > Thanks!
> >> >
> >> >
> >> >
> >> > On Tue, Apr 11, 2017 at 8:48 PM, Walter Underwood <
> wun...@wunderwood.org
> >> >
> >> > wrote:
> >> >
> >> > > JVM version? We’re running v8 update 121 with the G1 collector and
> it
> >> is
> >> > > working really well. We also have an 8GB heap.
> >> > >
> >> > > Graph your heap usage. You’ll see a sawtooth shape, where it grows,
> >> then
> >> > > there is a major GC. The maximum of the base of the sawtooth is the
> >> > working
> >> > > set of heap that your Solr installation needs. Set the heap to that
> >> > value,
> >> > > plus a gigabyte or so. We run with a 2GB eden (new space) because so
> >> much
> >> > > of Solr’s allocations have a lifetime of one request. So, the base
> of
> >> the
> >> > > sawtooth, plus a gigabyte breathing room, plus two more for eden.
> That
> >> > > should work.
> >> > >
> >> > > I don’t set all the ratios and stuff. When we were running CMS, I set a
> >> size
> >> > > for the heap and a size for the new space. Done. With G1, I don’t
> even
> >> > get
> >> > > that fussy.
> >> > >
> >> > > wunder
> >> > > Walter Underwood
> >> > > wun...@wunderwood.org
> >> > > http://observer.wunderwood.org/  (my blog)
> >> > >
> >> > >
> >> > > > On Apr 11, 2017, at 8:22 PM, Shawn Heisey 
> >> wrote:
> >> > > >
> >> > > > On 4/11/2017 2:56 PM, Chetas Joshi wrote:
> >> > > >> I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Solr
> >> > collection
> >> > > >> with number of shards = 80 and replication Factor=2
> >> > > >>
> >> > > >> Solr JVM heap size = 20 GB
> >> > > >> solr.hdfs.blockcache.enabled = true
> >> > > >> solr.hdfs.blockcache.direct.memory.allocation = true
> >> > > >> MaxDirectMemorySize = 25 GB
> >> > > >>
> >> > > >> I am querying a solr collection with index size = 500 MB per
> core.
> >> > > >
> >> > > > I see that you and I have traded messages before on the list.
> >> > > >
> >> > > > How much total system memory is there per server?  How many of
> these
> >> > > > 500MB cores are on each server?  How many docs are in a 500MB
> core?
> >> > The
> >> > > > answers to these questions may affect the other advice that I give
> >> you.
> >> > > >
> >> > > >> The off-heap (25 GB) is huge so that it can load the entire
> index.
> >> > > >
> >> > > > I still know very little about how HDFS handles caching and
> memory.
> >> > You
> >> > > > want to be sure that as much data as possible from your indexes is
> >> > > > sitting in local memory on the server.
> >> > > >
> >> > > >> Using cursor approach (number of rows = 100K), I read 2 fields
> >> (Total
> >> > 40
> >> > > >> bytes per solr doc) from the Solr docs that satisfy the query.
> The
> >> > docs
> >> > > are sorted by "id" and then by those 2 fields.
> >> > > >>
> >> > > >> I am not able to understand why the heap memory is getting full
> and
> >> > Full
> >> > > >> GCs are consecutively running with long GC pauses (> 30
> seconds). I
> >> am
> 

RE: Japanese character is garbled when using TikaEntityProcessor

2017-04-12 Thread Noriyuki TAKEI
Thanks!! I appreciate your quick reply.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Japanese-character-is-garbled-when-using-TikaEntityProcessor-tp4329217p4329657.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering results by minimum relevancy score

2017-04-12 Thread David Kramer
Thank you!  That worked.


From: Ahmet Arslan 
Date: Wednesday, April 12, 2017 at 3:15 PM
To: "solr-user@lucene.apache.org" , David Kramer 

Subject: Re: Filtering results by minimum relevancy score

Hi,

I cannot find it. However it should be something like

q=hello&fq={!frange l=0.5}query($q)

Ahmet

On Wednesday, April 12, 2017, 10:07:54 PM GMT+3, Ahmet Arslan 
 wrote:
Hi David,
A function query named "query" returns the score for the given subquery.
Combined with frange query parser this is possible. I tried it in the past.I am 
searching the original post. I think it was Yonik's post.
https://cwiki.apache.org/confluence/display/solr/Function+Queries


Ahmet


On Wednesday, April 12, 2017, 9:45:17 PM GMT+3, David Kramer 
 wrote:
The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be 
done” or “here’s how to do it” than “you don’t want to do it”.  I’m still left 
not knowing if it’s even possible. The one concrete answer of using frange 
doesn’t help as referencing score in either the q or the fq produces an 
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:

Can't the filter be used in cases when you're paginating in
sharded-scenario ?
So if you do limit=10, offset=10, each shard will return 20 docs ?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return 10 docs ? (they will still score all docs, but merging will be
faster)

Makes sense ?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti  wrote:

> Can i ask what is the final requirement here ?
> What are you trying to do ?
>  - just display less results ?
> you can easily do at search client time, cutting after a certain amount
> - make search faster returning less results ?
> This is not going to work, as you need to score all of them as Erick
> explained.
>
> Function query ( as Mikhail specified) will run on a per document basis (
> if
> I am correct), so if your idea was to speed up the things, this is not
> going
> to work.
>
> It makes much more sense to refine your system to improve relevancy if 
your
> concern is to have more relevant docs.
> If your concern is just to not show that many pages, you can limit that
> client side.
>
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Filtering-results-by-minimum-relevancy-score-
> tp4329180p4329295.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



Re: KeywordTokenizer and multiValued field

2017-04-12 Thread Ahmet Arslan
I don't understand the first option: what is each value? KeywordTokenizer 
emits a single token, analogous to the string type.



On Wednesday, April 12, 2017, 7:45:52 PM GMT+3, Walter Underwood 
 wrote:
Does the KeywordTokenizer make each value into a unitary string or does it take 
the whole list of values and make that a single string?

I really hope it is the former. I can’t find this in the docs (including 
JavaDocs).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Erick Erickson
Oh my. Returning 100K rows per request is usually poor practice.
One hopes these are very tiny docs.

But this may well be an "XY" problem. What kinds of information
are you returning in your docs and could they all be docValues
types? In which case you would be waaay far ahead by using
the various Streaming options.

Best,
Erick

On Wed, Apr 12, 2017 at 12:59 PM, Chetas Joshi  wrote:
> I am running a query that returns 10 MM docs in total and the number of
> rows per page is 100K.
>
> On Wed, Apr 12, 2017 at 12:53 PM, Mikhail Khludnev  wrote:
>
>> And what is the rows parameter?
>>
>> On Apr 12, 2017, at 21:32, "Chetas Joshi"  wrote:
>>
>> > Thanks for your response Shawn and Wunder.
>> >
>> > Hi Shawn,
>> >
>> > Here is the system config:
>> >
>> > Total system memory = 512 GB
>> > each server handles two 500 MB cores
>> > Number of solr docs per 500 MB core = 200 MM
>> >
>> > The average heap usage is around 4-6 GB. When the read starts using the
>> > Cursor approach, the heap usage starts increasing with the base of the
>> > sawtooth at 8 GB and then shooting up to 17 GB. Even after the full GC,
>> the
>> > heap usage remains around 15 GB and then it comes down to 8 GB.
>> >
>> > With 100K docs, the requirement will be in MBs so it is strange it is
>> > jumping from 8 GB to 17 GB while preparing the sorted response.
>> >
>> > Thanks!
>> >
>> >
>> >
>> > On Tue, Apr 11, 2017 at 8:48 PM, Walter Underwood > >
>> > wrote:
>> >
>> > > JVM version? We’re running v8 update 121 with the G1 collector and it
>> is
>> > > working really well. We also have an 8GB heap.
>> > >
>> > > Graph your heap usage. You’ll see a sawtooth shape, where it grows,
>> then
>> > > there is a major GC. The maximum of the base of the sawtooth is the
>> > working
>> > > set of heap that your Solr installation needs. Set the heap to that
>> > value,
>> > > plus a gigabyte or so. We run with a 2GB eden (new space) because so
>> much
>> > > of Solr’s allocations have a lifetime of one request. So, the base of
>> the
>> > > sawtooth, plus a gigabyte breathing room, plus two more for eden. That
>> > > should work.
>> > >
>> > > I don’t set all the ratios and stuff. When we were running CMS, I set a
>> size
>> > > for the heap and a size for the new space. Done. With G1, I don’t even
>> > get
>> > > that fussy.
>> > >
>> > > wunder
>> > > Walter Underwood
>> > > wun...@wunderwood.org
>> > > http://observer.wunderwood.org/  (my blog)
>> > >
>> > >
>> > > > On Apr 11, 2017, at 8:22 PM, Shawn Heisey 
>> wrote:
>> > > >
>> > > > On 4/11/2017 2:56 PM, Chetas Joshi wrote:
> >> > > >> I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Solr
> >> > collection
> >> > > >> with number of shards = 80 and replication Factor=2
> >> > > >>
> >> > > >> Solr JVM heap size = 20 GB
>> > > >> solr.hdfs.blockcache.enabled = true
>> > > >> solr.hdfs.blockcache.direct.memory.allocation = true
>> > > >> MaxDirectMemorySize = 25 GB
>> > > >>
>> > > >> I am querying a solr collection with index size = 500 MB per core.
>> > > >
>> > > > I see that you and I have traded messages before on the list.
>> > > >
>> > > > How much total system memory is there per server?  How many of these
>> > > > 500MB cores are on each server?  How many docs are in a 500MB core?
>> > The
>> > > > answers to these questions may affect the other advice that I give
>> you.
>> > > >
>> > > >> The off-heap (25 GB) is huge so that it can load the entire index.
>> > > >
>> > > > I still know very little about how HDFS handles caching and memory.
>> > You
>> > > > want to be sure that as much data as possible from your indexes is
>> > > > sitting in local memory on the server.
>> > > >
>> > > >> Using cursor approach (number of rows = 100K), I read 2 fields
>> (Total
>> > 40
>> > > >> bytes per solr doc) from the Solr docs that satisfy the query. The
>> > docs
>> > > are sorted by "id" and then by those 2 fields.
>> > > >>
>> > > >> I am not able to understand why the heap memory is getting full and
>> > Full
>> > > >> GCs are consecutively running with long GC pauses (> 30 seconds). I
>> am
>> > > >> using CMS GC.
>> > > >
>> > > > A 20GB heap is quite large.  Do you actually need it to be that
>> large?
>> > > > If you graph JVM heap usage over a long period of time, what are the
>> > low
>> > > > points in the graph?
>> > > >
>> > > > A result containing 100K docs is going to be pretty large, even with
>> a
>> > > > limited number of fields.  It is likely to be several megabytes.  It
>> > > > will need to be entirely built in the heap memory before it is sent
>> to
>> > > > the client -- both as Lucene data structures (which will probably be
>> > > > much larger than the actual response due to Java overhead) and as the
>> > > > actual response format.  Then it will be garbage as soon as the
>> > response
>> > > > is done.  Repeat this enough times, and you're going to go 

RE: Solr 6.2 - Creating cores via replication from master?

2017-04-12 Thread Pouliot, Scott
Yeah...I need to get SOLR Cloud up and running.  For some reason, I have yet to 
succeed with it using an external Zookeeper.  Ugghhh

Thanks for the confirmation!

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, April 12, 2017 4:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.2 - Creating cores via replication from master?

On 4/12/2017 2:05 PM, Pouliot, Scott wrote:
> Is it possible to create a core on a master SOLR server and have it 
> automatically replicated to a new slave core?  We're running SOLR 6.2 at the 
> moment, and manually creating the core on the master, and then the slave.  
> Once we feed the master we're good to go. My manager approached me with a 
> change to our setup, and in order to facilitate it...I need to somehow get 
> the core replicated automatically from master to slave at creation 
> time...without manually calling create core on the slave.
>
> Is this even possible?  I know that the master knows absolutely nothing about 
> its slaves out of the box...and I have yet to find any documentation that 
> tells me otherwise, but figured I'd hit up you experts out here before I 
> called this a wash.

No, that is not possible.

This is one of the big advantages of SolrCloud over the old master-slave 
replication.  If you create a new collection and tell it that you want a 
replicationFactor of 3, then 3 copies of that collection will exist on 
different machines in the cloud.  There are no masters and no slaves -- one of 
those replicas will be elected as the leader.

Thanks,
Shawn



Re: Solr 6.2 - Creating cores via replication from master?

2017-04-12 Thread Shawn Heisey
On 4/12/2017 2:05 PM, Pouliot, Scott wrote:
> Is it possible to create a core on a master SOLR server and have it 
> automatically replicated to a new slave core?  We're running SOLR 6.2 at the 
> moment, and manually creating the core on the master, and then the slave.  
> Once we feed the master we're good to go. My manager approached me with a 
> change to our setup, and in order to facilitate it...I need to somehow get 
> the core replicated automatically from master to slave at creation 
> time...without manually calling create core on the slave.
>
> Is this even possible?  I know that the master knows absolutely nothing about 
> its slaves out of the box...and I have yet to find any documentation that 
> tells me otherwise, but figured I'd hit up you experts out here before I 
> called this a wash.

No, that is not possible.

This is one of the big advantages of SolrCloud over the old master-slave
replication.  If you create a new collection and tell it that you want a
replicationFactor of 3, then 3 copies of that collection will exist on
different machines in the cloud.  There are no masters and no slaves --
one of those replicas will be elected as the leader.
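
For illustration, creating such a collection via the Collections API looks like this (names are placeholders):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=3&collection.configName=myconfig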

Thanks,
Shawn



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Doug Turnbull
David I think it can be done, but a score has no real *meaning* to your
domain other than the one you engineer into it. There's no 1-100 scale that
guarantees at 100 that your users will love the results.

Solr isn't really a turn key solution. It requires you to understand more
deeply what relevance means in your domain and how to use the features of
the engine to achieve the right use experience.

What's a relevant result? What does Relevant mean for your users? What user
experience are you creating?

Is this a news search where you need to filter out old articles? Or ones
that aren't trustworthy? Or articles where the body doesn't match enough
user keywords? Or restaurants outside a certain radius as not usable?


I've been in similar situation and usually getting rid of "low quality"
results involves creative uses of filters to remove obvious low-value
cases. You can create an fq for example that limits the results to only
include articles where at least 2 keywords match the body field. Or express
some minimum proximity, popularity, or recency requirement.
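
A hedged sketch of that style of filtering with edismax (field names and cutoffs are invented):

  q=<user query>&defType=edismax&qf=body&mm=2     (require at least 2 of the query terms to match the body)
  &fq=published_date:[NOW-1YEAR TO NOW]           (recency requirement)
  &fq=popularity:[10 TO *]                        (popularity floor)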

I think you're going to meet frustration until you can pin down your users
and/or your stakeholders on what they want. This is always the hard prob
btw;)


On Wed, Apr 12, 2017 at 11:45 AM David Kramer 
wrote:

> The idea is to not return poorly matching results, not to limit the number
> of results returned.  One query may have hundreds of excellent matches and
> another query may have 7. So cutting off by the number of results is
> trivial but not useful.
>
> Again, we are not doing this for performance reasons. We’re doing this
> because we don’t want to show products that are not very relevant to the
> search terms specified by the user for UX reasons.
>
> I had hoped that the responses would have been more focused on “it can’t
> be done” or “here’s how to do it” than “you don’t want to do it”.   I’m
> still left not knowing if it’s even possible. The one concrete answer of
> using frange doesn’t help as referencing score in either the q or the fq
> produces an “undefined field” error.
>
> Thanks.
>
> On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:
>
> Can't the filter be used in cases when you're paginating in
> sharded-scenario ?
> So if you do limit=10, offset=10, each shard will return 20 docs ?
> While if you do limit=10, _score<=last_page.min_score, then each shard
> will
> return 10 docs ? (they will still score all docs, but merging will be
> faster)
>
> Makes sense ?
>
> On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <
> a.benede...@sease.io
> > wrote:
>
> > Can i ask what is the final requirement here ?
> > What are you trying to do ?
> >  - just display less results ?
> > you can easily do at search client time, cutting after a certain
> amount
> > - make search faster returning less results ?
> > This is not going to work, as you need to score all of them as Erick
> > explained.
> >
> > Function query ( as Mikhail specified) will run on a per document
> basis (
> > if
> > I am correct), so if your idea was to speed up the things, this is
> not
> > going
> > to work.
> >
> > It makes much more sense to refine your system to improve relevancy
> if your
> > concern is to have more relevant docs.
> > If your concern is just to not show that many pages, you can limit
> that
> > client side.
> >
> >
> >
> >
> >
> >
> > -
> > ---
> > Alessandro Benedetti
> > Search Consultant, R&D Software Engineer, Director
> > Sease Ltd. - www.sease.io
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/Filtering-results-by-minimum-relevancy-score-
> > tp4329180p4329295.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>


Solr 6.2 - Creating cores via replication from master?

2017-04-12 Thread Pouliot, Scott
Is it possible to create a core on a master SOLR server and have it 
automatically replicated to a new slave core?  We're running SOLR 6.2 at the 
moment, and manually creating the core on the master, and then the slave.  Once 
we feed the master we're good to go. My manager approached me with a change to 
our setup, and in order to facilitate it...I need to somehow get the core 
replicated automatically from master to slave at creation time...without 
manually calling create core on the slave.

Is this even possible?  I know that the master knows absolutely nothing about 
its slaves out of the box...and I have yet to find any documentation that 
tells me otherwise, but figured I'd hit up you experts out here before I called 
this a wash.


Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Chetas Joshi
I am running a query that returns 10 MM docs in total and the number of
rows per page is 100K.

On Wed, Apr 12, 2017 at 12:53 PM, Mikhail Khludnev  wrote:

> And what is the rows parameter?
>
> On Apr 12, 2017, at 21:32, "Chetas Joshi"  wrote:
>
> > Thanks for your response Shawn and Wunder.
> >
> > Hi Shawn,
> >
> > Here is the system config:
> >
> > Total system memory = 512 GB
> > each server handles two 500 MB cores
> > Number of solr docs per 500 MB core = 200 MM
> >
> > The average heap usage is around 4-6 GB. When the read starts using the
> > Cursor approach, the heap usage starts increasing with the base of the
> > sawtooth at 8 GB and then shooting up to 17 GB. Even after the full GC,
> the
> > heap usage remains around 15 GB and then it comes down to 8 GB.
> >
> > With 100K docs, the requirement will be in MBs so it is strange it is
> > jumping from 8 GB to 17 GB while preparing the sorted response.
> >
> > Thanks!
> >
> >
> >
> > On Tue, Apr 11, 2017 at 8:48 PM, Walter Underwood  >
> > wrote:
> >
> > > JVM version? We’re running v8 update 121 with the G1 collector and it
> is
> > > working really well. We also have an 8GB heap.
> > >
> > > Graph your heap usage. You’ll see a sawtooth shape, where it grows,
> then
> > > there is a major GC. The maximum of the base of the sawtooth is the
> > working
> > > set of heap that your Solr installation needs. Set the heap to that
> > value,
> > > plus a gigabyte or so. We run with a 2GB eden (new space) because so
> much
> > > of Solr’s allocations have a lifetime of one request. So, the base of
> the
> > > sawtooth, plus a gigabyte breathing room, plus two more for eden. That
> > > should work.
> > >
> > > I don’t set all the ratios and stuff. When we were running CMS, I set a
> size
> > > for the heap and a size for the new space. Done. With G1, I don’t even
> > get
> > > that fussy.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > >
> > > > On Apr 11, 2017, at 8:22 PM, Shawn Heisey 
> wrote:
> > > >
> > > > On 4/11/2017 2:56 PM, Chetas Joshi wrote:
> > > >> I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Solr
> > collection
> > > >> with number of shards = 80 and replication Factor=2
> > > >>
> > > >> Solr JVM heap size = 20 GB
> > > >> solr.hdfs.blockcache.enabled = true
> > > >> solr.hdfs.blockcache.direct.memory.allocation = true
> > > >> MaxDirectMemorySize = 25 GB
> > > >>
> > > >> I am querying a solr collection with index size = 500 MB per core.
> > > >
> > > > I see that you and I have traded messages before on the list.
> > > >
> > > > How much total system memory is there per server?  How many of these
> > > > 500MB cores are on each server?  How many docs are in a 500MB core?
> > The
> > > > answers to these questions may affect the other advice that I give
> you.
> > > >
> > > >> The off-heap (25 GB) is huge so that it can load the entire index.
> > > >
> > > > I still know very little about how HDFS handles caching and memory.
> > You
> > > > want to be sure that as much data as possible from your indexes is
> > > > sitting in local memory on the server.
> > > >
> > > >> Using cursor approach (number of rows = 100K), I read 2 fields
> (Total
> > 40
> > > >> bytes per solr doc) from the Solr docs that satisfy the query. The
> > docs
> > > are sorted by "id" and then by those 2 fields.
> > > >>
> > > >> I am not able to understand why the heap memory is getting full and
> > Full
> > > >> GCs are consecutively running with long GC pauses (> 30 seconds). I
> am
> > > >> using CMS GC.
> > > >
> > > > A 20GB heap is quite large.  Do you actually need it to be that
> large?
> > > > If you graph JVM heap usage over a long period of time, what are the
> > low
> > > > points in the graph?
> > > >
> > > > A result containing 100K docs is going to be pretty large, even with
> a
> > > > limited number of fields.  It is likely to be several megabytes.  It
> > > > will need to be entirely built in the heap memory before it is sent
> to
> > > > the client -- both as Lucene data structures (which will probably be
> > > > much larger than the actual response due to Java overhead) and as the
> > > > actual response format.  Then it will be garbage as soon as the
> > response
> > > > is done.  Repeat this enough times, and you're going to go through
> even
> > > > a 20GB heap pretty fast, and need a full GC.  Full GCs on a 20GB heap
> > > > are slow.
> > > >
> > > > You could try switching to G1, as long as you realize that you're
> going
> > > > against advice from Lucene experts but honestly, I do not expect
> > > > this to really help, because you would probably still need full GCs
> due
> > > > to the rate that garbage is being created.  If you do try it, I would
> > > > strongly recommend the latest Java 8, either Oracle or OpenJDK.
> Here's
> > > > my wiki page where 

Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Mikhail Khludnev
And what is the rows parameter?

On Apr 12, 2017, at 21:32, "Chetas Joshi"  wrote:

> Thanks for your response Shawn and Wunder.
>
> Hi Shawn,
>
> Here is the system config:
>
> Total system memory = 512 GB
> each server handles two 500 MB cores
> Number of solr docs per 500 MB core = 200 MM
>
> The average heap usage is around 4-6 GB. When the read starts using the
> Cursor approach, the heap usage starts increasing with the base of the
> sawtooth at 8 GB and then shooting up to 17 GB. Even after the full GC, the
> heap usage remains around 15 GB and then it comes down to 8 GB.
>
> With 100K docs, the requirement will be in MBs so it is strange it is
> jumping from 8 GB to 17 GB while preparing the sorted response.
>
> Thanks!
>
>
>
> On Tue, Apr 11, 2017 at 8:48 PM, Walter Underwood 
> wrote:
>
> > JVM version? We’re running v8 update 121 with the G1 collector and it is
> > working really well. We also have an 8GB heap.
> >
> > Graph your heap usage. You’ll see a sawtooth shape, where it grows, then
> > there is a major GC. The maximum of the base of the sawtooth is the
> working
> > set of heap that your Solr installation needs. Set the heap to that
> value,
> > plus a gigabyte or so. We run with a 2GB eden (new space) because so much
> > of Solr’s allocations have a lifetime of one request. So, the base of the
> > sawtooth, plus a gigabyte breathing room, plus two more for eden. That
> > should work.
> >
> > I don’t set all the ratios and stuff. When we were running CMS, I set a size
> > for the heap and a size for the new space. Done. With G1, I don’t even
> get
> > that fussy.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Apr 11, 2017, at 8:22 PM, Shawn Heisey  wrote:
> > >
> > > On 4/11/2017 2:56 PM, Chetas Joshi wrote:
> > >> I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Solr
> collection
> > >> with number of shards = 80 and replication Factor=2
> > >>
> > >> Solr JVM heap size = 20 GB
> > >> solr.hdfs.blockcache.enabled = true
> > >> solr.hdfs.blockcache.direct.memory.allocation = true
> > >> MaxDirectMemorySize = 25 GB
> > >>
> > >> I am querying a solr collection with index size = 500 MB per core.
> > >
> > > I see that you and I have traded messages before on the list.
> > >
> > > How much total system memory is there per server?  How many of these
> > > 500MB cores are on each server?  How many docs are in a 500MB core?
> The
> > > answers to these questions may affect the other advice that I give you.
> > >
> > >> The off-heap (25 GB) is huge so that it can load the entire index.
> > >
> > > I still know very little about how HDFS handles caching and memory.
> You
> > > want to be sure that as much data as possible from your indexes is
> > > sitting in local memory on the server.
> > >
> > >> Using cursor approach (number of rows = 100K), I read 2 fields (Total
> 40
> > >> bytes per solr doc) from the Solr docs that satisfy the query. The
> docs
> > are sorted by "id" and then by those 2 fields.
> > >>
> > >> I am not able to understand why the heap memory is getting full and
> Full
> > >> GCs are consecutively running with long GC pauses (> 30 seconds). I am
> > >> using CMS GC.
> > >
> > > A 20GB heap is quite large.  Do you actually need it to be that large?
> > > If you graph JVM heap usage over a long period of time, what are the
> low
> > > points in the graph?
> > >
> > > A result containing 100K docs is going to be pretty large, even with a
> > > limited number of fields.  It is likely to be several megabytes.  It
> > > will need to be entirely built in the heap memory before it is sent to
> > > the client -- both as Lucene data structures (which will probably be
> > > much larger than the actual response due to Java overhead) and as the
> > > actual response format.  Then it will be garbage as soon as the
> response
> > > is done.  Repeat this enough times, and you're going to go through even
> > > a 20GB heap pretty fast, and need a full GC.  Full GCs on a 20GB heap
> > > are slow.
> > >
> > > You could try switching to G1, as long as you realize that you're going
> > > against advice from Lucene experts but honestly, I do not expect
> > > this to really help, because you would probably still need full GCs due
> > > to the rate that garbage is being created.  If you do try it, I would
> > > strongly recommend the latest Java 8, either Oracle or OpenJDK.  Here's
> > > my wiki page where I discuss this:
> > >
> > > https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_
> > First.29_Collector
> > >
> > > Reducing the heap size (which may not be possible -- need to know the
> > > answer to the question about memory graphing) and reducing the number
> of
> > > rows per query are the only quick solutions I can think of.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
> >
>


Re: Filtering results by minimum relevancy score

2017-04-12 Thread Ahmet Arslan
Hi,
I cannot find it. However it should be something like 
q=hello&fq={!frange l=0.5}query($q)
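
Spelled out as a full request (collection and query are placeholders; the trick is that frange reads the score through the query() function instead of referencing a "score" field directly):

  http://localhost:8983/solr/collection1/select?q=hello&fq={!frange l=0.5}query($q)&fl=id,score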

Ahmet
On Wednesday, April 12, 2017, 10:07:54 PM GMT+3, Ahmet Arslan 
 wrote:
Hi David,
A function query named "query" returns the score for the given subquery. 
Combined with frange query parser this is possible. I tried it in the past.I am 
searching the original post. I think it was Yonik's post.
https://cwiki.apache.org/confluence/display/solr/Function+Queries


Ahmet


On Wednesday, April 12, 2017, 9:45:17 PM GMT+3, David Kramer 
 wrote:
The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be 
done” or “here’s how to do it” than “you don’t want to do it”.  I’m still left 
not knowing if it’s even possible. The one concrete answer of using frange 
doesn’t help as referencing score in either the q or the fq produces an 
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:

    Can't the filter be used in cases when you're paginating in
    sharded-scenario ?
    So if you do limit=10, offset=10, each shard will return 20 docs ?
    While if you do limit=10, _score<=last_page.min_score, then each shard will
    return 10 docs ? (they will still score all docs, but merging will be
    faster)
    
    Makes sense ?
    
    On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti  wrote:
    
    > Can i ask what is the final requirement here ?
    > What are you trying to do ?
    >  - just display less results ?
    > you can easily do at search client time, cutting after a certain amount
    > - make search faster returning less results ?
    > This is not going to work, as you need to score all of them as Erick
    > explained.
    >
    > Function query ( as Mikhail specified) will run on a per document basis (
    > if
    > I am correct), so if your idea was to speed up the things, this is not
    > going
    > to work.
    >
    > It makes much more sense to refine your system to improve relevancy if 
your
    > concern is to have more relevant docs.
    > If your concern is just to not show that many pages, you can limit that
    > client side.
    >
    >
    >
    >
    >
    >
    > -
    > ---
    > Alessandro Benedetti
    > Search Consultant, R&D Software Engineer, Director
    > Sease Ltd. - www.sease.io
    > --
    > View this message in context: http://lucene.472066.n3.
    > nabble.com/Filtering-results-by-minimum-relevancy-score-
    > tp4329180p4329295.html
    > Sent from the Solr - User mailing list archive at Nabble.com.
    >
    

RE: Solr 6.4 - Transient core loading is extremely slow with HDFS and S3

2017-04-12 Thread Cahill, Trey
Hi Amarnath, 

From this log snippet:
"
2017-04-12 17:53:44.900 INFO
 (searcherExecutor-12-thread-1-processing-x:amar1) [   x:amar1]
o.a.s.c.SolrCore [amar1] Registered new searcher Searcher@3f61e7f2[amar1]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_16(6.4.2):c97790)
Uninverting(_17(6.4.2):C236640) Uninverting(_b(6.4.2):C51852)
Uninverting(_d(6.4.2):C4) Uninverting(_f(6.4.2):C1)
Uninverting(_o(6.4.2):C33360) Uninverting(_r(6.4.2):C40358)
Uninverting(_y(6.4.2):C6) Uninverting(_14(6.4.2):C1) 
Uninverting(_15(6.4.2):C1)))}
2017-04-12 17:56:22.799 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrCores Opening transient core amar1
2017-04-12 17:56:22.837 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.S.Request [amar1]  webapp=/solr path=/select params={q=*:*=0}
hits=59 status=0 *QTime=243787*
"

It does look like reading the data into Solr from S3 is slow.

Running Solr on an EC2 instance in the same AWS region as your S3 bucket should 
help.  While you’re in AWS, using VPC endpoints should also help with 
performance. From your logs, it looks like you're running from your laptop.

It looks like you’re using s3a, which is a good start.  On a side note,  Hadoop 
2.8 has recently been released 
(https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.8+Release), which 
includes some work for s3a. Not promising any performance improvements if you 
use s3a with Hadoop 2.8, but it's probably the best way to access S3 right now.

Finally,  remember that S3 is a service; if the S3 service is slow (for example 
due to a heavy stream of requests), then your operations with S3 will also be 
slow.  

Hope this helps and good luck, 

Trey

-Original Message-
From: Amarnath palavalli [mailto:pamarn...@gmail.com] 
Sent: Wednesday, April 12, 2017 2:09 PM
To: solr-user@lucene.apache.org
Subject: Solr 6.4 - Transient core loading is extremely slow with HDFS and S3

Hello,

I am using S3 as the primary store for data directory of core. To achieve this, 
I have the following in Solrconfig.xml:


  s3a://amar-hdfs/solr
  /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop
  true
  4
  true
  16384
  true
  true
  16
  192
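
For reference, the standard HdfsDirectoryFactory block from the Solr Reference Guide has this shape; the parameter names below are the documented ones and may not match the configuration quoted above exactly:

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">s3a://amar-hdfs/solr</str>
    <str name="solr.hdfs.confdir">/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop</str>
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">4</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
  </directoryFactory>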

When I access the core 'amar1' it is taking like 245 seconds to load the core 
of total size about 85 MB. Here is the complete solr.log for core
loading:

2017-04-12 17:52:19.079 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrResourceLoader [amar1] Added 57 libs to classloader, from
paths: [/Users/apalavalli/solr/solr-deployment/contrib/clustering/lib,
/Users/apalavalli/solr/solr-deployment/contrib/extraction/lib,
/Users/apalavalli/solr/solr-deployment/contrib/langid/lib,
/Users/apalavalli/solr/solr-deployment/contrib/velocity/lib,
/Users/apalavalli/solr/solr-deployment/dist,
/Users/apalavalli/solr/solr-deployment/dist/lib2]
2017-04-12 17:52:19.109 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrConfig Using Lucene MatchVersion: 6.4.2
2017-04-12 17:52:19.155 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.IndexSchema [amar1] Schema name=log-saas
2017-04-12 17:52:19.217 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.IndexSchema Loaded schema log-saas/1.6 with uniqueid field id
2017-04-12 17:52:19.217 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.CoreContainer Creating SolrCore 'amar1' using configuration from 
configset
/Users/apalavalli/solr/solr-deployment/server/solr/configsets/base-config-s3
2017-04-12 17:52:19.223 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory solr.hdfs.home=s3a://amar-hdfs/solr
2017-04-12 17:52:19.223 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Solr Kerberos Authentication disabled
2017-04-12 17:52:19.234 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrCore [[amar1] ] Opening new SolrCore at 
[/Users/apalavalli/solr/solr-deployment/server/solr/configsets/base-config-s3],
dataDir=[s3a://amar-hdfs/solr/amar1/data/]
2017-04-12 17:52:19.234 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.JmxMonitoredMap JMX monitoring is enabled. Adding Solr mbeans to JMX 
Server: com.sun.jmx.mbeanserver.JmxMBeanServer@5745ca0e
2017-04-12 17:52:19.236 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory creating directory factory for path 
s3a://amar-hdfs/solr/amar1/data/snapshot_metadata
2017-04-12 17:52:19.274 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Number of slabs of block cache [4] with direct 
memory allocation set to [true]
2017-04-12 17:52:19.274 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Block cache target memory usage, slab size of 
[134217728] will allocate [4] slabs and use ~[536870912] bytes
2017-04-12 17:52:19.274 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Creating new global HDFS BlockCache
2017-04-12 17:52:19.888 WARN  (qtp1654589030-18) [   x:amar1]
o.a.h.u.NativeCodeLoader Unable to load native-hadoop library for your 
platform... using builtin-java classes where 

Re: Filtering results by minimum relevancy score

2017-04-12 Thread Walter Underwood
Fine. It can’t be done. If it was easy, Solr/Lucene would already have the 
feature, right?

Solr is a vector-space engine. Some early engines (Verity VDK) were 
probabilistic engines. Those do give an absolute estimate of the relevance of 
each hit. Unfortunately, the relevance of results is just not as good as 
vector-space engines. So, probabilistic engines are mostly dead.

But, “you don’t want to do it” is very good advice. Instead of trying to reduce 
bad hits, work on increasing good hits. It is really hard, sometimes not 
possible, to optimize both. Increasing the good hits makes your customers 
happy. Reducing the bad hits makes your UX team happy.

Here is a process. Start collecting the clicks on the search results page (SRP) 
with each query. Look at queries that have below average clickthrough. See if 
those can be combined into categories, then address each category.

Some categories that I have used:

* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all 
valid. Use synonyms or shingles (and maybe the word delimiter filter) to match 
these.

* Misspellings. These should be about 10% of queries. Use fuzzy matching. I 
recommend the patch in SOLR-629.

* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. 
People search for “kids movies”, but your movie genre is “Children and Family”. 
Use synonyms.

* Missing content. People can’t find anything about beach parking because there 
isn’t a page about that. Instead, there are scraps of info about beach parking 
in multiple other pages. Fix the content.
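
For the one-word-or-two and alternate-vocabulary categories above, a minimal sketch (synonym entries and analyzer placement are illustrative):

  # synonyms.txt
  babysitter, baby-sitter, baby sitter
  notebook, laptop

  <!-- in the field type's query-time analyzer -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>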

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 12, 2017, at 11:44 AM, David Kramer  wrote:
> 
> The idea is to not return poorly matching results, not to limit the number of 
> results returned.  One query may have hundreds of excellent matches and 
> another query may have 7. So cutting off by the number of results is trivial 
> but not useful.
> 
> Again, we are not doing this for performance reasons. We’re doing this 
> because we don’t want to show products that are not very relevant to the 
> search terms specified by the user for UX reasons.
> 
> I had hoped that the responses would have been more focused on “it can’t be 
> done” or “here’s how to do it” than “you don’t want to do it”.   I’m still 
> left not knowing if it’s even possible. The one concrete answer of using 
> frange doesn’t help as referencing score in either the q or the fq produces 
> an “undefined field” error.
> 
> Thanks.
> 
> On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:
> 
>Can't the filter be used in cases when you're paginating in
>sharded-scenario ?
>So if you do limit=10, offset=10, each shard will return 20 docs ?
>While if you do limit=10, _score<=last_page.min_score, then each shard will
>return 10 docs ? (they will still score all docs, but merging will be
>faster)
> 
>Makes sense ?
> 
>On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti 
> > wrote:
> 
>> Can i ask what is the final requirement here ?
>> What are you trying to do ?
>> - just display less results ?
>> you can easily do at search client time, cutting after a certain amount
>> - make search faster returning less results ?
>> This is not going to work, as you need to score all of them as Erick
>> explained.
>> 
>> Function query ( as Mikhail specified) will run on a per document basis (
>> if
>> I am correct), so if your idea was to speed up the things, this is not
>> going
>> to work.
>> 
>> It makes much more sense to refine your system to improve relevancy if your
>> concern is to have more relevant docs.
>> If your concern is just to not show that many pages, you can limit that
>> client side.
>> 
>> 
>> 
>> 
>> 
>> 
>> -
>> ---
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> View this message in context: http://lucene.472066.n3.
>> nabble.com/Filtering-results-by-minimum-relevancy-score-
>> tp4329180p4329295.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 
> 
> 



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Ahmet Arslan
Hi David,
A function query named "query" returns the score for the given subquery. 
Combined with the frange query parser, this is possible; I tried it in the past. I am 
still searching for the original post. I think it was Yonik's post.
https://cwiki.apache.org/confluence/display/solr/Function+Queries
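
Roughly, it looks like this (the query term and the 0.2 cutoff are arbitrary
examples). Note that score itself is not addressable as a field, which is where
the undefined-field error comes from, but query($q) re-evaluates the main query
as a function and frange filters on its value:

    q=ipod
    fq={!frange l=0.2 cache=false}query($q)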


Ahmet


On Wednesday, April 12, 2017, 9:45:17 PM GMT+3, David Kramer 
 wrote:
The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be 
done” or “here’s how to do it” than “you don’t want to do it”.  I’m still left 
not knowing if it’s even possible. The one concrete answer of using frange 
doesn’t help as referencing score in either the q or the fq produces an 
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:

    Can't the filter be used in cases when you're paginating in
    sharded-scenario ?
    So if you do limit=10, offset=10, each shard will return 20 docs ?
    While if you do limit=10, _score<=last_page.min_score, then each shard will
    return 10 docs ? (they will still score all docs, but merging will be
    faster)
    
    Makes sense ?
    
    On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti  wrote:
    
    > Can i ask what is the final requirement here ?
    > What are you trying to do ?
    >  - just display less results ?
    > you can easily do at search client time, cutting after a certain amount
    > - make search faster returning less results ?
    > This is not going to work, as you need to score all of them as Erick
    > explained.
    >
    > Function query ( as Mikhail specified) will run on a per document basis (
    > if
    > I am correct), so if your idea was to speed up the things, this is not
    > going
    > to work.
    >
    > It makes much more sense to refine your system to improve relevancy if 
your
    > concern is to have more relevant docs.
    > If your concern is just to not show that many pages, you can limit that
    > client side.
    >
    >
    >
    >
    >
    >
    > -
    > ---
    > Alessandro Benedetti
    > Search Consultant, R&D Software Engineer, Director
    > Sease Ltd. - www.sease.io
    > --
    > View this message in context: http://lucene.472066.n3.
    > nabble.com/Filtering-results-by-minimum-relevancy-score-
    > tp4329180p4329295.html
    > Sent from the Solr - User mailing list archive at Nabble.com.
    >
    


Re: Filtering results by minimum relevancy score

2017-04-12 Thread David Kramer
The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be 
done” or “here’s how to do it” than “you don’t want to do it”.   I’m still left 
not knowing if it’s even possible. The one concrete answer of using frange 
doesn’t help as referencing score in either the q or the fq produces an 
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha"  wrote:

Can't the filter be used in cases when you're paginating in
sharded-scenario ?
So if you do limit=10, offset=10, each shard will return 20 docs ?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return 10 docs ? (they will still score all docs, but merging will be
faster)

Makes sense ?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti  wrote:

> Can i ask what is the final requirement here ?
> What are you trying to do ?
>  - just display less results ?
> you can easily do at search client time, cutting after a certain amount
> - make search faster returning less results ?
> This is not going to work, as you need to score all of them as Erick
> explained.
>
> Function query ( as Mikhail specified) will run on a per document basis (
> if
> I am correct), so if your idea was to speed up the things, this is not
> going
> to work.
>
> It makes much more sense to refine your system to improve relevancy if 
your
> concern is to have more relevant docs.
> If your concern is just to not show that many pages, you can limit that
> client side.
>
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Filtering-results-by-minimum-relevancy-score-
> tp4329180p4329295.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>




Re: Long GC pauses while reading Solr docs using Cursor approach

2017-04-12 Thread Chetas Joshi
Thanks for your response Shawn and Wunder.

Hi Shawn,

Here is the system config:

Total system memory = 512 GB
each server handles two 500 MB cores
Number of solr docs per 500 MB core = 200 MM

The average heap usage is around 4-6 GB. When the read starts using the
Cursor approach, the heap usage starts increasing with the base of the
sawtooth at 8 GB and then shooting up to 17 GB. Even after the full GC, the
heap usage remains around 15 GB and then it comes down to 8 GB.

With 100K docs, the requirement will be in MBs so it is strange it is
jumping from 8 GB to 17 GB while preparing the sorted response.

Thanks!



On Tue, Apr 11, 2017 at 8:48 PM, Walter Underwood 
wrote:

> JVM version? We’re running v8 update 121 with the G1 collector and it is
> working really well. We also have an 8GB heap.
>
> Graph your heap usage. You’ll see a sawtooth shape, where it grows, then
> there is a major GC. The maximum of the base of the sawtooth is the working
> set of heap that your Solr installation needs. Set the heap to that value,
> plus a gigabyte or so. We run with a 2GB eden (new space) because so much
> of Solr’s allocations have a lifetime of one request. So, the base of the
> sawtooth, plus a gigabyte breathing room, plus two more for eden. That
> should work.
>
> I don’t set all the ratios and stuff. When we were running CMS, I set a size
> for the heap and a size for the new space. Done. With G1, I don’t even get
> that fussy.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 11, 2017, at 8:22 PM, Shawn Heisey  wrote:
> >
> > On 4/11/2017 2:56 PM, Chetas Joshi wrote:
> >> I am using Solr (5.5.0) on HDFS. SolrCloud of 80 nodes. Solr collection
> >> with number of shards = 80 and replication Factor=2
> >>
> >> Solr JVM heap size = 20 GB
> >> solr.hdfs.blockcache.enabled = true
> >> solr.hdfs.blockcache.direct.memory.allocation = true
> >> MaxDirectMemorySize = 25 GB
> >>
> >> I am querying a solr collection with index size = 500 MB per core.
> >
> > I see that you and I have traded messages before on the list.
> >
> > How much total system memory is there per server?  How many of these
> > 500MB cores are on each server?  How many docs are in a 500MB core?  The
> > answers to these questions may affect the other advice that I give you.
> >
> >> The off-heap (25 GB) is huge so that it can load the entire index.
> >
> > I still know very little about how HDFS handles caching and memory.  You
> > want to be sure that as much data as possible from your indexes is
> > sitting in local memory on the server.
> >
> >> Using cursor approach (number of rows = 100K), I read 2 fields (Total 40
> >> bytes per solr doc) from the Solr docs that satisfy the query. The docs
> are sorted by "id" and then by those 2 fields.
> >>
> >> I am not able to understand why the heap memory is getting full and Full
> >> GCs are consecutively running with long GC pauses (> 30 seconds). I am
> >> using CMS GC.
> >
> > A 20GB heap is quite large.  Do you actually need it to be that large?
> > If you graph JVM heap usage over a long period of time, what are the low
> > points in the graph?
> >
> > A result containing 100K docs is going to be pretty large, even with a
> > limited number of fields.  It is likely to be several megabytes.  It
> > will need to be entirely built in the heap memory before it is sent to
> > the client -- both as Lucene data structures (which will probably be
> > much larger than the actual response due to Java overhead) and as the
> > actual response format.  Then it will be garbage as soon as the response
> > is done.  Repeat this enough times, and you're going to go through even
> > a 20GB heap pretty fast, and need a full GC.  Full GCs on a 20GB heap
> > are slow.
> >
> > You could try switching to G1, as long as you realize that you're going
> > against advice from Lucene experts but honestly, I do not expect
> > this to really help, because you would probably still need full GCs due
> > to the rate that garbage is being created.  If you do try it, I would
> > strongly recommend the latest Java 8, either Oracle or OpenJDK.  Here's
> > my wiki page where I discuss this:
> >
> > https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_
> First.29_Collector
> >
> > Reducing the heap size (which may not be possible -- need to know the
> > answer to the question about memory graphing) and reducing the number of
> > rows per query are the only quick solutions I can think of.
> >
> > Thanks,
> > Shawn
> >
>
>


Solr 6.4 - Transient core loading is extremely slow with HDFS and S3

2017-04-12 Thread Amarnath palavalli
Hello,

I am using S3 as the primary store for data directory of core. To achieve
this, I have the following in Solrconfig.xml:


  **
*  s3a://amar-hdfs/solr*
*  /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop*
*  true*
*  4*
*  true*
*  16384*
*  true*
*  true*
*  16*
*  192*
*  *

When I access the core 'amar1' it is taking like 245 seconds to load the
core of total size about 85 MB. Here is the complete solr.log for core
loading:

2017-04-12 17:52:19.079 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrResourceLoader [amar1] Added 57 libs to classloader, from
paths: [/Users/apalavalli/solr/solr-deployment/contrib/clustering/lib,
/Users/apalavalli/solr/solr-deployment/contrib/extraction/lib,
/Users/apalavalli/solr/solr-deployment/contrib/langid/lib,
/Users/apalavalli/solr/solr-deployment/contrib/velocity/lib,
/Users/apalavalli/solr/solr-deployment/dist,
/Users/apalavalli/solr/solr-deployment/dist/lib2]
2017-04-12 17:52:19.109 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrConfig Using Lucene MatchVersion: 6.4.2
2017-04-12 17:52:19.155 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.IndexSchema [amar1] Schema name=log-saas
2017-04-12 17:52:19.217 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.IndexSchema Loaded schema log-saas/1.6 with uniqueid field id
2017-04-12 17:52:19.217 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.CoreContainer Creating SolrCore 'amar1' using configuration from
configset
/Users/apalavalli/solr/solr-deployment/server/solr/configsets/base-config-s3
2017-04-12 17:52:19.223 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory solr.hdfs.home=s3a://amar-hdfs/solr
2017-04-12 17:52:19.223 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Solr Kerberos Authentication disabled
2017-04-12 17:52:19.234 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.SolrCore [[amar1] ] Opening new SolrCore at
[/Users/apalavalli/solr/solr-deployment/server/solr/configsets/base-config-s3],
dataDir=[s3a://amar-hdfs/solr/amar1/data/]
2017-04-12 17:52:19.234 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.JmxMonitoredMap JMX monitoring is enabled. Adding Solr mbeans to
JMX Server: com.sun.jmx.mbeanserver.JmxMBeanServer@5745ca0e
2017-04-12 17:52:19.236 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory creating directory factory for path
s3a://amar-hdfs/solr/amar1/data/snapshot_metadata
2017-04-12 17:52:19.274 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Number of slabs of block cache [4] with direct
memory allocation set to [true]
2017-04-12 17:52:19.274 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Block cache target memory usage, slab size of
[134217728] will allocate [4] slabs and use ~[536870912] bytes
2017-04-12 17:52:19.274 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Creating new global HDFS BlockCache
2017-04-12 17:52:19.888 WARN  (qtp1654589030-18) [   x:amar1]
o.a.h.u.NativeCodeLoader Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
2017-04-12 17:52:20.759 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.b.BlockDirectory Block cache on write is disabled
2017-04-12 17:52:21.074 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory creating directory factory for path
s3a://amar-hdfs/solr/amar1/data
2017-04-12 17:52:21.659 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory creating directory factory for path
s3a://amar-hdfs/solr/amar1/data/index
2017-04-12 17:52:21.670 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Number of slabs of block cache [4] with direct
memory allocation set to [true]
2017-04-12 17:52:21.671 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.c.HdfsDirectoryFactory Block cache target memory usage, slab size of
[134217728] will allocate [4] slabs and use ~[536870912] bytes
2017-04-12 17:52:21.947 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.b.BlockDirectory Block cache on write is disabled
2017-04-12 17:52:22.058 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
2017-04-12 17:52:22.112 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.u.UpdateHandler Using UpdateLog implementation:
org.apache.solr.update.UpdateLog
2017-04-12 17:52:22.112 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH
numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
2017-04-12 17:52:22.128 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 1ms;
2017-04-12 17:52:22.128 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.u.CommitTracker Soft AutoCommit: if uncommited for 5000ms;
2017-04-12 17:53:44.573 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.s.SolrIndexSearcher Opening [Searcher@3f61e7f2[amar1] main]
2017-04-12 17:53:44.575 INFO  (qtp1654589030-18) [   x:amar1]
o.a.s.r.ManagedResourceStorage File-based storage initialized to use dir:

Re: What does the replication factor parameter in collections api do?

2017-04-12 Thread Erick Erickson
really <3>. replicationFactor is used to set up your collection
initially; since you have to be able to change your topology afterwards,
it's ignored thereafter.

Once your replica is added, it's automatically made use of by the collection.
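
For reference, both steps are plain Collections API calls, along these lines
(collection, shard and node names are placeholders):

    /admin/collections?action=MODIFYCOLLECTION&collection=mycoll&maxShardsPerNode=4
    /admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=host1:8983_solr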

On Wed, Apr 12, 2017 at 9:30 AM, Johannes Knaus  wrote:
> Hi,
>
> I am still quite new to Solr. I have the following setup:
> A SolrCloud setup with
> 38 nodes,
> maxShardsPerNode=2,
> implicit routing with routing field,
> and replication factor=2.
>
> Now, I want to add replica. This works fine by first increasing the 
> maxShardsPerNode to a higher number and then add replicas.
> So far, so good. I can confirm changes of the maxShardsPerNode parameter and 
> added replicas in the Admin UI.
> However, the Solr Admin UI still is showing me a replication factor of 2.
> I am a little confused about what the replicationfactor parameter actually 
> does in my case:
>
> 1) What does that mean? Does Solr make use of all replicas I have or only of 
> two?
> 2) Do I need to increase the replication factor value as well to really have 
> more replicas available and usable? If this is true, do I need to 
> restart/reload the collection newly upload configs to Zookeeper or anything 
> alike?
> 3) Or is replicationfactor just a parameter that is needed for the first 
> start of SolrCloud and can be ignored afterwards?
>
> Thank you very much for your help,
> All the best,
> Johannes
>


Re: KeywordTokenizer and multiValued field

2017-04-12 Thread Andrea Gazzarini

Hi Wunder,
I think it's the first option: if you have 3 values then the analyzer 
chain is executed three times.


Andrea

On 12/04/17 18:45, Walter Underwood wrote:

Does the KeywordTokenizer make each value into a unitary string or does it take 
the whole list of values and make that a single string?

I really hope it is the former. I can’t find this in the docs (including 
JavaDocs).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)







KeywordTokenizer and multiValued field

2017-04-12 Thread Walter Underwood
Does the KeywordTokenizer make each value into a unitary string or does it take 
the whole list of values and make that a single string?

I really hope it is the former. I can’t find this in the docs (including 
JavaDocs).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




What does the replication factor parameter in collections api do?

2017-04-12 Thread Johannes Knaus
Hi,

I am still quite new to Solr. I have the following setup:
A SolrCloud setup with 
38 nodes, 
maxShardsPerNode=2, 
implicit routing with routing field, 
and replication factor=2.

Now, I want to add replica. This works fine by first increasing the 
maxShardsPerNode to a higher number and then add replicas.
So far, so good. I can confirm changes of the maxShardsPerNode parameter and 
added replicas in the Admin UI.
However, the Solr Admin UI still is showing me a replication factor of 2.
I am a little confused about what the replicationfactor parameter actually does 
in my case:

1) What does that mean? Does Solr make use of all replicas I have or only of 
two?
2) Do I need to increase the replication factor value as well to really have 
more replicas available and usable? If this is true, do I need to 
restart/reload the collection newly upload configs to Zookeeper or anything 
alike?
3) Or is replicationfactor just a parameter that is needed for the first start 
of SolrCloud and can be ignored afterwards?

Thank you very much for your help,
All the best,
Johannes



Re: Stopping a node from receiving any requests temporarily.

2017-04-12 Thread Callum Lamb
We can do that in most cases and that's what we've been doing up until now
to prevent failed requests.

All the more incentive to get rid of those joins then I guess!

Thanks.

On Wed, Apr 12, 2017 at 4:16 PM, Erick Erickson 
wrote:

> No good ideas here with current Solr. I just raised SOLR-10484 for the
> generic ability to take a replica out of action (including the
> ADDREPLICA operation).
>
> Your understanding is correct, Solr will route requests to active
> replicas. Is it possible that you can load the "from" core first
> _then_ add the replica that references it? Or do they switch roles?
>
> Best,
> Erick
>
> On Wed, Apr 12, 2017 at 7:39 AM, Callum Lamb  wrote:
> > Forgot to mention. We're using solr 5.5.2 in Solr cloud mode. Everything
> is
> > single sharded at the moment as the collections are still quite small.
> >
> > On Wed, Apr 12, 2017 at 3:30 PM, Callum Lamb  wrote:
> >
> >> We have a Solr cluster that still takes queries that join between cores
> (I
> >> know, bad). We can't change that anytime soon however and I was hoping
> >> there was a band-aid I could use in the mean time to make deployments of
> >> new nodes cleaner.
> >>
> >> When we want to add a new node to cluster we'll have a brief moment in
> >> time where one of the cores in that join will be present, but the other
> >> won't.
> >>
> >> My understanding is that even if you stop requests from reaching the new
> >> Solr node with haproxy, Solr can route requests between nodes on
> its
> >> own behind haproxy. We've also noticed that this internal Solr routing
> is
> >> not aware of the join in the query and will route a request to a core
> that
> >> joins to another core even if the latter is not present yet (Causing the
> >> query to fail).
> >>
> >> Until we eliminate all the joins, we want to be able to have a node we
> can
> >> do things to, but *guarantee* it won't receive any requests until we
> decide
> >> it's ready to take requests. Is there an easy way to do this? We could
> try
> >> stopping the Solr's from talking to each other at the network level but
> >> this seems iffy to me and might cause something weird to happen.
> >>
> >> Any ideas?
> >>
> >>
> >>
> >
> > --
> >
> > Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
> > Registered in England: Number 1475918. | VAT Number: GB 232 9342 72
> >
> > Contact details for our other offices can be found at
> > http://www.mintel.com/office-locations.
> >
> > This email and any attachments may include content that is confidential,
> > privileged
> > or otherwise protected under applicable law. Unauthorised disclosure,
> > copying, distribution
> > or use of the contents is prohibited and may be unlawful. If you have
> > received this email in error,
> > including without appropriate authorisation, then please reply to the
> > sender about the error
> > and delete this email and any attachments.
> >
>

-- 

Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
Registered in England: Number 1475918. | VAT Number: GB 232 9342 72

Contact details for our other offices can be found at 
http://www.mintel.com/office-locations.

This email and any attachments may include content that is confidential, 
privileged 
or otherwise protected under applicable law. Unauthorised disclosure, 
copying, distribution 
or use of the contents is prohibited and may be unlawful. If you have 
received this email in error,
including without appropriate authorisation, then please reply to the 
sender about the error 
and delete this email and any attachments.



RE: Solr 6.4. Can't index MS Visio vsdx files

2017-04-12 Thread Allison, Timothy B.
The release candidate for POI was just cut...unfortunately, I think after Nick 
Burch fixed the 'PolylineTo' issue...thank you, btw, for opening that!

That'll be done within a week unless there are surprises.  Once that's out, I 
have to update a few things, but I'd think we'd have a candidate for Tika a 
week later, then a week for release.

You can get nightly builds here: https://builds.apache.org/

Please ask on the POI or Tika users lists for how to get the latest/latest 
running, and thank you, again, for opening the issue on POI's Bugzilla.

Best,

   Tim

-Original Message-
From: Gytis Mikuciunas [mailto:gyt...@gmail.com] 
Sent: Wednesday, April 12, 2017 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files

when 1.15 will be released? maybe you have some beta version and I could test 
it :)

SAX sounds interesting, and from info that I found in google it could solve my 
issues.

On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. 
wrote:

> It depends.  We've been trying to make parsers more, erm, flexible, 
> but there are some problems from which we cannot recover.
>
> Tl;dr there isn't a short answer.  :(
>
> My sense is that DIH/ExtractingDocumentHandler is intended to get 
> people up and running with Solr easily but it is not really a great 
> idea for production.  See Erick's gem: https://lucidworks.com/2012/ 
> 02/14/indexing-with-solrj/
>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause 
> the ingesting process to crash.  At most, it should fail at the file 
> level and not cause greater havoc.  In practice, if you're processing 
> millions of files from the wild, you'll run into bad behavior and need 
> to defend against permanent hangs, oom, memory leaks.
>
> Also, at the least, if there's an exception with an embedded file, 
> Tika should catch it and keep going with the rest of the file.  If 
> this doesn't happen let us know!  We are aware that some types of 
> embedded file stream problems were causing parse failures on the 
> entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't 
> let them percolate up through the parent file (they're reported in the 
> metadata though).
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I 
> thought we used to catch those in docx and log them.  I haven't been 
> able to track this down, though.  I can look more if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' 
> name 'PolylineTo' ", this problem might go away if we implemented a 
> pure SAX parser for vsdx.  We just did this for docx and pptx (coming 
> in 1.15) and these are more robust to variation because they aren't 
> requiring a match with the ooxml schema.  I haven't looked much at 
> vsdx, but that _might_ help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require 
> contributions.  However, I agree that POI shouldn't throw a Runtime 
> exception.  Perhaps open an issue in POI, or maybe we should catch 
> this special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI 
> team _might_ be able to modify the parser to ignore a stream if 
> there's an exception, but that's often a sign that something needs to 
> be fixed with the parser.  In short, the solution will come from POI.
>
> Best,
>
>  Tim
>
> -Original Message-
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Are there any possibilities to ignore parsing errors and continue indexing?
> because now solr/tika stops parsing whole document if it finds any 
> exception
>
> On Apr 11, 2017 19:51, "Allison, Timothy B."  wrote:
>
> > You might want to drop a note to the dev or user's list on Apache POI.
> >
> > I'm not extremely familiar with the vsd(x) portion of our code base.
> >
> > The first item ("PolylineTo") may be caused by a mismatch btwn your 
> > doc and the ooxml spec.
> >
> > The second item appears to be an unsupported feature.
> >
> > The third item may be an area for improvement within our 
> > codebase...I can't tell just from the stacktrace.
> >
> > You'll probably get more helpful answers over on POI.  Sorry, I 
> > can't help with this...
> >
> > Best,
> >
> >Tim
> >
> > P.S.
> > >  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> >
> > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set 
> > of poi-ooxml-schemas-3.15.jar
> >
> >
> >
>


Re: Grouped Result sort issue

2017-04-12 Thread Erick Erickson
Alessandro:

I should have been explicit that I'm hypothesizing somewhat here, so
believe me at your own risk ;)

bq: So it means that group sorting is independent of the group head sorting

that's my hypothesis, but it's _not_ based on knowing the code.

Best,
Erick

On Wed, Apr 12, 2017 at 2:05 AM, alessandro.benedetti
 wrote:
> "You're telling Solr to return the highest scoring doc in each group.
> However, you're asking to order the _groups_ in ascending score order
> (i.e. the group with the lowest scoring doc first) of _any_ doc in
> that group, not just the one(s) returned. These are two separate
> things. "
>
> This is quite interesting, and I admit I have not explored the internals yet
> so i didn't know.
> So, even if you return only the top scoring doc per group and you flat the
> groups (group.format=simple), the "invisible docs" will still regulate the
> sorting of the groups.
> I would say it is at least quite counter-intuitive.
>
> So it means that group sorting is independent of the group head sorting.
> sort = score asc -> will always sort the groups by ascending score of the
> minimum scoring doc of the group.
> sort = score desc -> will always sort the groups by descending score of the
> maximum scoring doc in the group
>
> Cheers
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Grouped-Result-sort-issue-tp4329255p4329468.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: maxDoc ten times greater than numDoc

2017-04-12 Thread Erick Erickson
Yes, this is very strange. My bet: you have something
custom, a setting, indexing code, whatever that
is getting in the way.

Second possibility (really stretching here): your
merge settings are set to 10 segments having to exist
before merging and somehow not all the docs in the
segments are replaced. So until you get to the 10th
re-index (and assuming a single segment is
produced per re-index) the older segments aren't
merged. If that were the case I'd expect to see the
number of deleted docs drop back periodically
then build up again. A real shot in the dark. One way
to test this would be to specify "segmentsPerTier" of, say,
2 rather than the default 10, see:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
If this were the case I'd expect with a setting of 2 that
your index might have 50% deleted docs, that would at
least tell us whether we're on the right track.
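
For that experiment, the setting lives in the indexConfig section of
solrconfig.xml; a minimal sketch (the 2s are only for the test, not a
recommendation):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">2</int>
    <int name="segmentsPerTier">2</int>
  </mergePolicyFactory>
</indexConfig>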

Take a look at your index on disk. If you're seeing gaps
in the numbering, you are getting merging, it may be
that they're not happening very often.

And I take it you have no custom code here and you are
doing commits? (hard commits are all that matters
for merging, it doesn't matter whether openSearcher
is set to true or false).

I just tried the "techproducts" example as follows:
1> indexed all the sample files with the bin/solr -e techproducts example
2> started re-indexing the sample docs one at a time with post.jar

It took a while, but eventually the original segments got merged away so
I doubt it's any weirdness with a small index.

Speaking of small index, why are you sharding with only
8K docs? Sharding will probably slow things down for such
a small index. This isn't germane to your question though.

Best,
Erick


On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey  wrote:
> On 4/12/2017 5:11 AM, Markus Jelsma wrote:
>> One of our 2 shard collections is rather small and gets all its entries 
>> reindexed every 20 minutes or so. Now i just noticed maxDoc is ten times 
>> greater than numDoc, the merger is never scheduled but settings are default. 
>> We just overwrite the existing entries, all of them.
>>
>> Here are the stats:
>>
>> Last Modified:12 minutes ago
>> Num Docs: 8336
>> Max Doc:82362
>> Heap Memory Usage: -1
>> Deleted Docs: 74026
>> Version: 3125
>> Segment Count: 10
>
> This discrepancy would typically mean that when you reindex, you're
> indexing MOST of the documents, but not ALL of them, so at least one
> document is still not deleted in each older segment.  When segments have
> all their documents deleted, they are automatically removed by Lucene,
> but if there's even one document NOT deleted, the segment will remain
> until it is merged.
>
> There's no information here about how large this core is, but unless the
> documents are REALLY enormous, I'm betting that an optimize would happen
> quickly.  With a document count this low and an indexing pattern that
> results in such a large maxdoc, this might be a good time to go against
> general advice and perform an optimize at least once a day.
>
> An alternate idea that would not require optimizes:  If the intent is to
> completely rebuild the index, you might want to consider issuing a
> "delete all docs by query" before beginning the indexing process.  This
> would ensure that none of the previous documents remain.  As long as you
> don't do a commit that opens a new searcher before the indexing is
> complete, clients won't ever know that everything was deleted.
>
>> This is the config:
>>
>>   6.5.0
>>   ${solr.data.dir:}
>>   > class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>>   
>>   
>>
>>   
>> ${solr.lock.type:native}
>>  false
>>   
>>
>>   
>>
>>   
>> 
>>   ${solr.ulog.dir:}
>> 
>>   
>
> Side issue: This config is missing autoCommit.  You really should have
> autoCommit with openSearcher set to false and a maxTime in the
> neighborhood of 6.  It goes inside the updateHandler section.  This
> won't change the maxDoc issue, but because of the other problems it can
> prevent, it is strongly recommended.  It can be omitted if you are
> confident that your indexing code is correctly managing hard commits.
>
> Thanks,
> Shawn
>


Re: Filtering results by minimum relevancy score

2017-04-12 Thread Erick Erickson
Well, just because ES has it doesn't mean it's A Good Thing. IMO, it's
just a "feel good" kind of thing for people who don't really
understand scoring.

From that page: "Note, most times, this does not make much sense, but
is provided for advanced use cases."

I've written enough weasel-worded caveats to read the hidden message
here (freely translated and purged of expletives):

"OK, if you insist we'll provide this, and we'll make you feel good by
saying it's for 'advanced use cases". We don't expect this to be
useful at all, but it's easy to do and we'll waste more time arguing
than just putting it in. P.S. don't call us when you find out this is
useless".

Best,
Erick

On Wed, Apr 12, 2017 at 7:37 AM, Shawn Heisey  wrote:
> On 4/10/2017 8:59 AM, David Kramer wrote:
>> I’ve done quite a bit of searching on this. Pretty much every page I
>> find says it’s a bad idea and won’t work well, but I’ve been asked to
>> at least try it to reduce the number of completely unrelated results
>> returned. We are not trying to normalize the number, or display it as
>> a percentage, and I understand why those are not mathematically sound.
>> We are relying on Solr for pagination, so we can’t just filter out low
>> scores from the results.
>
> Here's my contribution.  This boils down to nearly the same thing Erick
> said, but stated in a very different way: The absolute score value has
> zero meaning, for ANY purpose ... not just percentages or
> normalization.  If you try to use it, you're asking for disappointment.
>
> Scores only have meaning within a single query, and the only information
> that's important is whether the score of one document is higher or lower
> than the score of the rest of the documents in the same result.
> Boosting lets you influence those relative scores, but the actual
> numeric score of one document in a result doesn't reveal ANYTHING useful
> about that document.
>
> I agree with Erick's general advice:  Instead of trying to arbitrarily
> decide which documents are scoring too low to be relevant, refine the
> query so that irrelevant results are either completely excluded, or so
> relevant documents will outscore irrelevant ones and the first few pages
> will be good results.  Users must be trained to expect irrelevant (and
> slow) results if they paginate deeply.  For performance reasons, you
> should limit how many pages users can view on a result.
>
> Thanks,
> Shawn
>


Re: Stopping a node from receiving any requests temporarily.

2017-04-12 Thread Erick Erickson
No good ideas here with current Solr. I just raised SOLR-10484 for the
generic ability to take a replica out of action (including the
ADDREPLICA operation).

Your understanding is correct, Solr will route requests to active
replicas. Is it possible that you can load the "from" core first
_then_ add the replica that references it? Or do they switch roles?

Best,
Erick

On Wed, Apr 12, 2017 at 7:39 AM, Callum Lamb  wrote:
> Forgot to mention. We're using solr 5.5.2 in Solr cloud mode. Everything is
> single sharded at the moment as the collections are still quite small.
>
> On Wed, Apr 12, 2017 at 3:30 PM, Callum Lamb  wrote:
>
>> We have a Solr cluster that still takes queries that join between cores (I
>> know, bad). We can't change that anytime soon however and I was hoping
>> there was a band-aid I could use in the mean time to make deployments of
>> new nodes cleaner.
>>
>> When we want to add a new node to cluster we'll have a brief moment in
>> time where one of the cores in that join will be present, but the other
>> won't.
>>
>> My understanding is that even if you stop requests from reaching the new
>> Solr node with haproxy, Solr can route requests between nodes on its
>> own behind haproxy. We've also noticed that this internal Solr routing is
>> not aware of the join in the query and will route a request to a core that
>> joins to another core even if the latter is not present yet (Causing the
>> query to fail).
>>
>> Until we eliminate all the joins, we want to be able to have a node we can
>> do things to, but *guarantee* it won't receive any requests until we decide
>> it's ready to take requests. Is there an easy way to do this? We could try
>> stopping the Solr's from talking to each other at the network level but
>> this seems iffy to me and might cause something weird to happen.
>>
>> Any ideas?
>>
>>
>>
>
> --
>
> Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
> Registered in England: Number 1475918. | VAT Number: GB 232 9342 72
>
> Contact details for our other offices can be found at
> http://www.mintel.com/office-locations.
>
> This email and any attachments may include content that is confidential,
> privileged
> or otherwise protected under applicable law. Unauthorised disclosure,
> copying, distribution
> or use of the contents is prohibited and may be unlawful. If you have
> received this email in error,
> including without appropriate authorisation, then please reply to the
> sender about the error
> and delete this email and any attachments.
>


Re: Getting error while excuting full import

2017-04-12 Thread Shawn Heisey
On 4/10/2017 3:47 AM, ankur.168 wrote:
> Hi All,I am trying to use solr with 2 cores interacting with 2 different
> databases, one core is executing full-import successfully where as when I am
> running for 2nd one it is throwing table or view not found exception. If I
> am using the query directly It is running fine. Below is the error meassge I
> am getting.Kindly help me, not able to understand what could be the issue
> here.I am using solr 6.4.1.

> java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist 
> at
> oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at
>

You didn't include your dataimport config.  You'll probably need to
redact password information from it before you send it.

If you go to the Logging tab of the admin UI and change the level of the
JdbcDataSource class to DEBUG, then you will find the actual SQL Solr is
sending to the database in the solr.log file when you do another
import.  These logs will not show up in the Logging tab -- you will need
to find the actual logfile on disk.
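
If you'd rather make that logging change stick across restarts, the equivalent
line in server/resources/log4j.properties looks roughly like this (assuming the
stock DIH JdbcDataSource class):

    log4j.logger.org.apache.solr.handler.dataimport.JdbcDataSource=DEBUG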

Thanks,
Shawn



Re: Stopping a node from receiving any requests temporarily.

2017-04-12 Thread Callum Lamb
Forgot to mention. We're using solr 5.5.2 in Solr cloud mode. Everything is
single sharded at the moment as the collections are still quite small.

On Wed, Apr 12, 2017 at 3:30 PM, Callum Lamb  wrote:

> We have a Solr cluster that still takes queries that join between cores (I
> know, bad). We can't change that anytime soon however and I was hoping
> there was a band-aid I could use in the mean time to make deployments of
> new nodes cleaner.
>
> When we want to add a new node to cluster we'll have a brief moment in
> time where one of the cores in that join will be present, but the other
> won't.
>
> My understanding is that even if you stop requests from reaching the new
> Solr node with haproxy, Solr can route requests between nodes on its
> own behind haproxy. We've also noticed that this internal Solr routing is
> not aware of the join in the query and will route a request to a core that
> joins to another core even if the latter is not present yet (Causing the
> query to fail).
>
> Until we eliminate all the joins, we want to be able to have a node we can
> do things to, but *guarantee* it won't receive any requests until we decide
> it's ready to take requests. Is there an easy way to do this? We could try
> stopping the Solr's from talking to each other at the network level but
> this seems iffy to me and might cause something weird to happen.
>
> Any ideas?
>
>
>

-- 

Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
Registered in England: Number 1475918. | VAT Number: GB 232 9342 72

Contact details for our other offices can be found at 
http://www.mintel.com/office-locations.

This email and any attachments may include content that is confidential, 
privileged 
or otherwise protected under applicable law. Unauthorised disclosure, 
copying, distribution 
or use of the contents is prohibited and may be unlawful. If you have 
received this email in error,
including without appropriate authorisation, then please reply to the 
sender about the error 
and delete this email and any attachments.



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Shawn Heisey
On 4/10/2017 8:59 AM, David Kramer wrote:
> I’ve done quite a bit of searching on this. Pretty much every page I
> find says it’s a bad idea and won’t work well, but I’ve been asked to
> at least try it to reduce the number of completely unrelated results
> returned. We are not trying to normalize the number, or display it as
> a percentage, and I understand why those are not mathematically sound.
> We are relying on Solr for pagination, so we can’t just filter out low
> scores from the results. 

Here's my contribution.  This boils down to nearly the same thing Erick
said, but stated in a very different way: The absolute score value has
zero meaning, for ANY purpose ... not just percentages or
normalization.  If you try to use it, you're asking for disappointment.

Scores only have meaning within a single query, and the only information
that's important is whether the score of one document is higher or lower
than the score of the rest of the documents in the same result. 
Boosting lets you influence those relative scores, but the actual
numeric score of one document in a result doesn't reveal ANYTHING useful
about that document.

I agree with Erick's general advice:  Instead of trying to arbitrarily
decide which documents are scoring too low to be relevant, refine the
query so that irrelevant results are either completely excluded, or so
relevant documents will outscore irrelevant ones and the first few pages
will be good results.  Users must be trained to expect irrelevant (and
slow) results if they paginate deeply.  For performance reasons, you
should limit how many pages users can view on a result.

Thanks,
Shawn



Re: Enable Gzip compression Solr 6.0

2017-04-12 Thread Rick Leir
Hi Mahmoud
I assume you are running Solr 'behind' a web application, so Solr is not 
directly on the net.

The gzip compression is an Apache thing, and relates to your web application. 

Connections to Solr are within your infrastructure, so you might not want to 
gzip them. But maybe your setup is different?

Older versions of Solr used Tomcat which supported gzip. Newer versions use 
Zookeeper and Jetty and you prolly will find a way.
Cheers -- Rick

On April 12, 2017 8:48:45 AM EDT, Mahmoud Almokadem  
wrote:
>Hello,
>
>How can I enable Gzip compression for Solr 6.0 to save bandwidth
>between
>the server and clients?
>
>Thanks,
>Mahmoud

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Stopping a node from receiving any requests temporarily.

2017-04-12 Thread Callum Lamb
We have a Solr cluster that still takes queries that join between cores (I
know, bad). We can't change that anytime soon however and I was hoping
there was a band-aid I could use in the mean time to make deployments of
new nodes cleaner.

When we want to add a new node to cluster we'll have a brief moment in time
where one of the cores in that join will be present, but the other won't.

My understanding is that even if you stop requests from reaching the new
Solr node with haproxy, Solr can route requests between nodes on its
own behind haproxy. We've also noticed that this internal Solr routing is
not aware of the join in the query and will route a request to a core that
joins to another core even if the latter is not present yet (Causing the
query to fail).

Until we eliminate all the joins, we want to be able to have a node we can
do things to, but *guarantee* it won't receive any requests until we decide
it's ready to take requests. Is there an easy way to do this? We could try
stopping the Solr's from talking to each other at the network level but
this seems iffy to me and might cause something weird to happen.

Any ideas?

-- 

Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
Registered in England: Number 1475918. | VAT Number: GB 232 9342 72

Contact details for our other offices can be found at 
http://www.mintel.com/office-locations.

This email and any attachments may include content that is confidential, 
privileged 
or otherwise protected under applicable law. Unauthorised disclosure, 
copying, distribution 
or use of the contents is prohibited and may be unlawful. If you have 
received this email in error,
including without appropriate authorisation, then please reply to the 
sender about the error 
and delete this email and any attachments.



Re: simple matches not catching at query time

2017-04-12 Thread Mikhail Khludnev
John,

Double quotes are a sign of a phrase query (and round braces inside
double quotes are a horrible-to-think-about beast). Since the query is a
disjunction of phrases and the shingle, it has no chance to match any of the
indexed values from the screenshots. Probably you need to flip
autoGeneratePhraseQueries
(see
https://cwiki.apache.org/confluence/display/solr/Field+Type+Definitions+and+Properties
)
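
For illustration, that flag sits on the field type in the schema; a minimal
sketch (the type name and analyzer chain are made up, keep your own):

<fieldType name="text_mfr" class="solr.TextField"
           autoGeneratePhraseQueries="false" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>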

On Wed, Apr 12, 2017 at 2:52 PM, John Blythe  wrote:

> you can view some of my analyses here that has caused me grief and
> confusion: http://imgur.com/a/Fcht3
>
> here is a debug output:
>
> "rawquerystring":"\"ZIMMER:ZIMMER US\"",
> "querystring":"\"ZIMMER:ZIMMER US\"",
> "parsedquery":"(+DisjunctionMaxQuery((manufacturer_syn:\"zimmer
> zimmer\" | manufacturer_s:ZIMMER:ZIMMER US |
> manufacturer_split_syn:\"zimmer zimmer\" |
> manufacturer_syn_both:\"(zimmer_zimmer_us zimmer) zimmer\" |
> manufacturer_text:\"zimmer zimmer us\")) ())/no_coord",
> "parsedquery_toString":"+(manufacturer_syn:\"zimmer zimmer\" |
> manufacturer_s:ZIMMER:ZIMMER US | manufacturer_split_syn:\"zimmer
> zimmer\" | manufacturer_syn_both:\"(zimmer_zimmer_us zimmer) zimmer\"
> | manufacturer_text:\"zimmer zimmer us\") ()",
> "explain":{},
>
>
> is it the quotes that are getting things screwy? i'm not entirely versed on
> how to interpret the raw and parsed query data here. does \"zimmer zimmer\"
> mean that lucene is receiving that shingle rather than 'zimmer' (implicit
> OR) 'zimmer'? if so, then i'm not understanding why that's happening bc
> some of these have WDF that is generating word parts.
>
> aside: i've changed the server-side code used to send the query to split on
> the colon and send over as separate tokens wrapped in quotes. in the case
> above, field:("VENDOR:VENDOR US") becomes field:("VENDOR" "VENDOR US")
> which successfully solves my immediate problem. that said, i'd really like
> to understand better where things are going wrong w the above _and_ learn
> better how to debug my queries.
>
> i need to get the TermsComponent used to find what is being indexed so i
> can report back on that and then can share the list of items requested by
> alessandro.
>
> thanks all!
>
>
> On Wed, Apr 12, 2017 at 5:26 AM, alessandro.benedetti <
> a.benede...@sease.io>
> wrote:
>
> > hi John, I am a bit confused here.
> >
> > Let's focus on one field and one document.
> >
> > Given this parsed phrase query :
> >
> > manufacturer_split_syn:"vendor vendor"
> >
> > and the document 1 :
> > D1
> > {"id":"1"
> > "manufacturer_split_syn" : "vendor"}
> >
> > Are you expecting this to match ?
> > because it shouldn't ...
> >
> > let's try to formulate the problem in this way, with less explaining and
> > more step by step :
> >
> > Original Query :
> > Parsed Query:
> > Document indexed :
> > Terms in the index :
> >
> > Cheers
> >
> >
> >
> > -
> > ---
> > Alessandro Benedetti
> > Search Consultant, R&D Software Engineer, Director
> > Sease Ltd. - www.sease.io
> > --
> > View this message in context: http://lucene.472066.n3.nabble
> > .com/simple-matches-not-catching-at-query-time-tp4329337p4329475.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Autosuggestion

2017-04-12 Thread Andrea Gazzarini

Hi,
I think you got an old post. I would have a look at the built-in 
feature, first. These posts can help you to get a quick overview:


https://cwiki.apache.org/confluence/display/solr/Suggester
http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html
https://lucidworks.com/2015/03/04/solr-suggester/
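
As a starting point, the built-in Suggester boils down to a searchComponent
plus a request handler in solrconfig.xml; a minimal sketch (field, field type
and suggester names are placeholders):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>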

HTH,
Andrea

On 12/04/17 14:43, OTH wrote:

Hello,

Is there any recommended way to achieve auto-suggestion in textboxes using
Solr?

I'm new to Solr, but right now I have achieved this functionality by using
an example I found online, doing this:

I added a copy field, which is of the following type:

   
 
   
   
 
 
   
   
 
   

In the search box, after each character is typed, the above field is
queried, and the results are shown in a drop-down list.

However, this is performing quite slow.  I'm not sure if that has to do
with the front-end code, or because I'm not using the recommended approach
in terms of how I'm using Solr.  Is there any other recommended way to use
Solr to achieve this functionality?

Thanks





Re: maxDoc ten times greater than numDoc

2017-04-12 Thread Shawn Heisey
On 4/12/2017 5:11 AM, Markus Jelsma wrote:
> One of our 2 shard collections is rather small and gets all its entries 
> reindexed every 20 minutes or so. Now i just noticed maxDoc is ten times 
> greater than numDoc, the merger is never scheduled but settings are default. 
> We just overwrite the existing entries, all of them.
>
> Here are the stats:
>
> Last Modified:12 minutes ago 
> Num Docs: 8336
> Max Doc:82362
> Heap Memory Usage: -1
> Deleted Docs: 74026
> Version: 3125
> Segment Count: 10

This discrepancy would typically mean that when you reindex, you're
indexing MOST of the documents, but not ALL of them, so at least one
document is still not deleted in each older segment.  When segments have
all their documents deleted, they are automatically removed by Lucene,
but if there's even one document NOT deleted, the segment will remain
until it is merged.

There's no information here about how large this core is, but unless the
documents are REALLY enormous, I'm betting that an optimize would happen
quickly.  With a document count this low and an indexing pattern that
results in such a large maxdoc, this might be a good time to go against
general advice and perform an optimize at least once a day.

An alternate idea that would not require optimizes:  If the intent is to
completely rebuild the index, you might want to consider issuing a
"delete all docs by query" before beginning the indexing process.  This
would ensure that none of the previous documents remain.  As long as you
don't do a commit that opens a new searcher before the indexing is
complete, clients won't ever know that everything was deleted.
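
For reference, that delete-all can be as simple as posting the following to the
update handler before reindexing (collection name is a placeholder), holding off
on any searcher-opening commit until indexing is done:

    curl http://localhost:8983/solr/mycollection/update \
      -H 'Content-Type: text/xml' \
      --data-binary '<delete><query>*:*</query></delete>'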

> This is the config:
>
>   6.5.0
>   ${solr.data.dir:}
>class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>   
>   
>
>   
> ${solr.lock.type:native}
>  false
>   
>
>   
>
>   
> 
>   ${solr.ulog.dir:}
> 
>   

Side issue: This config is missing autoCommit.  You really should have
autoCommit with openSearcher set to false and a maxTime in the
neighborhood of 6.  It goes inside the updateHandler section.  This
won't change the maxDoc issue, but because of the other problems it can
prevent, it is strongly recommended.  It can be omitted if you are
confident that your indexing code is correctly managing hard commits.
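
A minimal sketch of that section inside updateHandler (the one-minute window is
just an example):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>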

Thanks,
Shawn



Enable Gzip compression Solr 6.0

2017-04-12 Thread Mahmoud Almokadem
Hello,

How can I enable Gzip compression for Solr 6.0 to save bandwidth between
the server and clients?

Thanks,
Mahmoud


Autosuggestion

2017-04-12 Thread OTH
Hello,

Is there any recommended way to achieve auto-suggestion in textboxes using
Solr?

I'm new to Solr, but right now I have achieved this functionality by using
an example I found online, doing this:

I added a copy field, which is of the following type:

  

  
  


  
  

  

In the search box, after each character is typed, the above field is
queried, and the results are shown in a drop-down list.

However, this is performing quite slow.  I'm not sure if that has to do
with the front-end code, or because I'm not using the recommended approach
in terms of how I'm using Solr.  Is there any other recommended way to use
Solr to achieve this functionality?

Thanks


RE: maxDoc ten times greater than numDoc

2017-04-12 Thread alessandro.benedetti
This may be incorrect, but I think that even if a merge happened and the disk
space is actually released, the  deleted docs count will still be there.
What about your index size ? is the index 10 times bigger than expected ?

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/maxDoc-ten-times-greater-than-numDoc-tp4329484p4329494.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering results by minimum relevancy score

2017-04-12 Thread Dorian Hoxha
@alessandro
Elastic-search has it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-min-score.html

On Wed, Apr 12, 2017 at 1:49 PM, alessandro.benedetti 
wrote:

> I am not completely sure that the potential benefit of merging less docs in
> sharded pagination overcomes the additional time needed to apply the
> filtering function query.
> I would need to investigate more in details the frange internals.
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Filtering-results-by-minimum-relevancy-score-
> tp4329180p4329489.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: simple matches not catching at query time

2017-04-12 Thread John Blythe
you can view some of my analyses here that has caused me grief and
confusion: http://imgur.com/a/Fcht3

here is a debug output:

"rawquerystring":"\"ZIMMER:ZIMMER US\"",
"querystring":"\"ZIMMER:ZIMMER US\"",
"parsedquery":"(+DisjunctionMaxQuery((manufacturer_syn:\"zimmer
zimmer\" | manufacturer_s:ZIMMER:ZIMMER US |
manufacturer_split_syn:\"zimmer zimmer\" |
manufacturer_syn_both:\"(zimmer_zimmer_us zimmer) zimmer\" |
manufacturer_text:\"zimmer zimmer us\")) ())/no_coord",
"parsedquery_toString":"+(manufacturer_syn:\"zimmer zimmer\" |
manufacturer_s:ZIMMER:ZIMMER US | manufacturer_split_syn:\"zimmer
zimmer\" | manufacturer_syn_both:\"(zimmer_zimmer_us zimmer) zimmer\"
| manufacturer_text:\"zimmer zimmer us\") ()",
"explain":{},


is it the quotes that are getting things screwy? i'm not entirely versed on
how to interpret the raw and parsed query data here. does \"zimmer zimmer\"
mean that lucene is receiving that shingle rather than 'zimmer' (implicit
OR) 'zimmer'? if so, then i don't understand why that's happening, because
some of these fields have a WordDelimiterFilter that generates word parts.

aside: i've changed the server-side code used to send the query to split on
the colon and send over as separate tokens wrapped in quotes. in the case
above, field:("VENDOR:VENDOR US") becomes field:("VENDOR" "VENDOR US")
which successfully solves my immediate problem. that said, i'd really like
to understand better where things are going wrong with the above _and_ learn
better how to debug my queries.

i need to get the TermsComponent set up to find what is actually being
indexed so i can report back on that and share the list of items
alessandro requested.
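
for reference, i'm assuming the lookup will be roughly this (it presumes a
/terms handler backed by the TermsComponent is registered in solrconfig):

SolrQuery terms = new SolrQuery();
terms.setRequestHandler("/terms");
terms.set("terms", true);
terms.set("terms.fl", "manufacturer_split_syn");  // repeat per field of interest
terms.set("terms.prefix", "zimmer");
terms.set("terms.limit", "50");
QueryResponse rsp = solrClient.query(terms);      // rsp.getTermsResponse() lists the indexed terms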

thanks all!


On Wed, Apr 12, 2017 at 5:26 AM, alessandro.benedetti 
wrote:

> hi John, I am a bit confused here.
>
> Let's focus on one field and one document.
>
> Given this parsed phrase query :
>
> manufacturer_split_syn:"vendor vendor"
>
> and the document 1 :
> D1
> {"id":"1"
> "manufacturer_split_syn" : "vendor"}
>
> Are you expecting this to match ?
> because it shouldn't ...
>
> let's try to formulate the problem in this way, with less explaining and
> more step by step :
>
> Original Query :
> Parsed Query:
> Document indexed :
> Terms in the index :
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble
> .com/simple-matches-not-catching-at-query-time-tp4329337p4329475.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Filtering results by minimum relevancy score

2017-04-12 Thread alessandro.benedetti
I am not completely sure that the potential benefit of merging fewer docs in
sharded pagination outweighs the additional time needed to apply the
filtering function query.
I would need to investigate the frange internals in more detail.
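
For clarity, the filtering function query in question would be something
like this (a sketch; 0.4 is just an example threshold):

SolrQuery q = new SolrQuery("the original user query");
q.addFilterQuery("{!frange l=0.4 cache=false}query($q)");  // drops docs scoring below 0.4 for $q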

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329489.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: maxDoc ten times greater than numDoc

2017-04-12 Thread Markus Jelsma
Hello - I know it includes all those deleted/overwritten documents. But having 
89.9% deleted documents is quite unreasonable, so I would expect the 
merge scheduler to kick in at least once in a while. It doesn't with default 
settings, so I am curious what is wrong.

Our large regular search cluster regularly merges segments, but that one 
receives updates and deletes more sparsely. Maybe the scheduler is fooled by the 
way I reindex. Any ideas?

Regards,
Markus

 
 
-Original message-
> From:alessandro.benedetti 
> Sent: Wednesday 12th April 2017 13:45
> To: solr-user@lucene.apache.org
> Subject: Re: maxDoc ten times greater than numDoc
> 
> Hi Markus,
> maxDocs includes deletions :
> 
> Deleted Docs: 74026 + Num Docs: 8336 = Max Doc:82362 
> 
> Cheers
> 
> 
> 
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/maxDoc-ten-times-greater-than-numDoc-tp4329484p4329487.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: maxDoc ten times greater than numDoc

2017-04-12 Thread alessandro.benedetti
Hi Markus,
maxDocs includes deletions :

Deleted Docs: 74026 + Num Docs: 8336 = Max Doc:82362 

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/maxDoc-ten-times-greater-than-numDoc-tp4329484p4329487.html
Sent from the Solr - User mailing list archive at Nabble.com.


maxDoc ten times greater than numDoc

2017-04-12 Thread Markus Jelsma
Hi,

One of our 2 shard collections is rather small and gets all its entries 
reindexed every 20 minutes or so. Now I just noticed maxDoc is ten times greater 
than numDoc; the merger is never scheduled even though the settings are default. 
We just overwrite the existing entries, all of them.

Here are the stats:

Last Modified:    12 minutes ago 
Num Docs: 8336
Max Doc:    82362
Heap Memory Usage: -1
Deleted Docs: 74026
Version: 3125
Segment Count: 10

This is the config:

  6.5.0
  ${solr.data.dir:}
  
  
  

  
    ${solr.lock.type:native}
 false
  

  

  
    
  ${solr.ulog.dir:}
    
  

Any ideas? Thanks!
Markus


RE: DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain

2017-04-12 Thread Pratik Thaker
Hi All,

I have been facing this issue for a very long time; can you please provide 
your suggestions on it?

Regards,
Pratik Thaker

-Original Message-
From: Pratik Thaker [mailto:pratik.tha...@smartstreamrdu.com]
Sent: 09 February 2017 21:24
To: 'solr-user@lucene.apache.org'
Subject: RE: DistributedUpdateProcessorFactory was explicitly disabled from 
this updateRequestProcessorChain

Hi Friends,

Can you please give me some details about the issue below?

Regards,
Pratik Thaker

From: Pratik Thaker
Sent: 07 February 2017 17:12
To: 'solr-user@lucene.apache.org'
Subject: DistributedUpdateProcessorFactory was explicitly disabled from this 
updateRequestProcessorChain

Hi All,

I am using SOLR Cloud 6.0

I am receiving below exception very frequently in solr logs,

o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: 
RunUpdateProcessor has received an AddUpdateCommand containing a document that 
appears to still contain Atomic document update operations, most likely because 
DistributedUpdateProcessorFactory was explicitly disabled from this 
updateRequestProcessorChain
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:63)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:936)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1091)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:714)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
at 
org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:93)
at 
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:97)

Can you please help me with the root cause? Below is a snapshot of the 
solrconfig,






   


  [^\w-\.]
  _





  
-MM-dd'T'HH:mm:ss.SSSZ
-MM-dd'T'HH:mm:ss,SSSZ
-MM-dd'T'HH:mm:ss.SSS
-MM-dd'T'HH:mm:ss,SSS
-MM-dd'T'HH:mm:ssZ
-MM-dd'T'HH:mm:ss
-MM-dd'T'HH:mmZ
-MM-dd'T'HH:mm
-MM-dd HH:mm:ss.SSSZ
-MM-dd HH:mm:ss,SSSZ
-MM-dd HH:mm:ss.SSS
-MM-dd HH:mm:ss,SSS
-MM-dd HH:mm:ssZ
-MM-dd HH:mm:ss
-MM-dd HH:mmZ
-MM-dd HH:mm
-MM-dd
  


  strings
  
java.lang.Boolean
booleans
  
  
java.util.Date
tdates
  
  
java.lang.Long
java.lang.Integer
tlongs
  
  
java.lang.Number
tdoubles
  


  

Regards,
Pratik Thaker



Re: simple matches not catching at query time

2017-04-12 Thread alessandro.benedetti
hi John, I am a bit confused here.

Let's focus on one field and one document.

Given this parsed phrase query :

manufacturer_split_syn:"vendor vendor"

and the document 1 :
D1
{"id":"1"
"manufacturer_split_syn" : "vendor"}

Are you expecting this to match ?
because it shouldn't ...

let's try to formulate the problem in this way, with less explaining and
more step by step :

Original Query :
Parsed Query:
Document indexed :
Terms in the index : 

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/simple-matches-not-catching-at-query-time-tp4329337p4329475.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Grouped Result sort issue

2017-04-12 Thread alessandro.benedetti
"You're telling Solr to return the highest scoring doc in each group. 
However, you're asking to order the _groups_ in ascending score order 
(i.e. the group with the lowest scoring doc first) of _any_ doc in 
that group, not just the one(s) returned. These are two separate 
things. "

This is quite interesting, and I admit I have not explored the internals yet,
so I didn't know.
So, even if you return only the top scoring doc per group and you flatten the
groups (group.format=simple), the "invisible docs" will still regulate the
sorting of the groups.
I would say it is at least quite counter-intuitive.

So it means that group sorting is independent of the group head sorting.
sort = score asc -> will always sort the groups by ascending score of the
minimum scoring doc of the group.
sort = score desc -> will always sort the groups by descending score of the
maximum scoring doc in the group
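
In practice, if the intent is "order the groups by the head doc that is
actually returned", both parameters have to point the same way, e.g. (a
sketch; the field name is invented):

SolrQuery q = new SolrQuery("some query");
q.set("group", true);
q.set("group.field", "category");
q.set("group.limit", 1);                    // one doc (the head) per group
q.set("group.sort", "score desc");          // how the head of each group is chosen
q.setSort("score", SolrQuery.ORDER.desc);   // how the groups themselves are ordered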

Cheers




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouped-Result-sort-issue-tp4329255p4329468.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using BasicAuth with SolrJ Code

2017-04-12 Thread Zheng Lin Edwin Yeo
This is what I get when I run the code.

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:8983/solr/testing: Expected mime type
application/octet-stream but got text/html. 


Error 401 require authentication

HTTP ERROR 401
Problem accessing /solr/testing/update. Reason:
require authentication



at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
at testing.indexing(testing.java:2939)
at testing.main(testing.java:329)
Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:8983/solr/testing: Expected mime type
application/octet-stream but got text/html. 


Error 401 require authentication

HTTP ERROR 401
Problem accessing /solr/testing/update. Reason:
require authentication



at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at testing.indexing(testing.java:3063)
at testing.main(testing.java:329)
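
Looking at the trace, the 401 comes from the plain solrClient.add() and
solrClient.commit() calls, which build their own requests internally and so
never carry the credentials. A sketch of what I plan to try instead
(untested; "testing" is the collection as above):

import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

UpdateRequest update = new UpdateRequest();
update.add(doc);  // the SolrInputDocument built earlier
update.setBasicAuthCredentials(userName, password);
update.process(solrClient, "testing");  // replaces solrClient.add(doc)

UpdateRequest commit = new UpdateRequest();
commit.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
commit.setBasicAuthCredentials(userName, password);
commit.process(solrClient, "testing");  // replaces solrClient.commit()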

Regards,
Edwin


On 12 April 2017 at 14:28, Noble Paul  wrote:

> can u paste the stacktrace here
>
> On Tue, Apr 11, 2017 at 1:19 PM, Zheng Lin Edwin Yeo
>  wrote:
> > I found from StackOverflow  that we should declare it this way:
> > http://stackoverflow.com/questions/43335419/using-
> basicauth-with-solrj-code
> >
> >
> > SolrRequest req = new QueryRequest(new SolrQuery("*:*"));//create a new
> > request object
> > req.setBasicAuthCredentials(userName, password);
> > solrClient.request(req);
> >
> > Is that correct?
> >
> > For this, the NullPointerException is not coming out, but the SolrJ is
> > still not able to get authenticated. I'm still getting Error Code 401
> even
> > after putting in this code.
> >
> > Any advice on which part of the SolrJ code should we place this code in?
> >
> > Regards,
> > Edwin
> >
> >
> > On 10 April 2017 at 23:50, Zheng Lin Edwin Yeo 
> wrote:
> >
> >> Hi,
> >>
> >> I have just set up the Basic Authentication Plugin in Solr 6.4.2 on
> >> SolrCloud, and I am trying to modify my SolrJ code so that the code can
> go
> >> through the authentication and do the indexing.
> >>
> >> I tried using the following code from the Solr Documentation
> >> https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+
> >> Plugin.
> >>
> >> SolrRequest req ;//create a new request object
> >> req.setBasicAuthCredentials(userName, password);
> >> solrClient.request(req);
> >>
> >> However, the code complains that the req is not initialized.
> >>
> >> If I initialized it, it will be initialize as null.
> >>
> >> SolrRequest req = null;//create a new request object
> >> req.setBasicAuthCredentials(userName, password);
> >> solrClient.request(req);
> >>
> >> This will caused a null pointer exception.
> >> Exception in thread "main" java.lang.NullPointerException
> >>
> >> How should we go about putting these codes, so that the error can be
> >> prevented?
> >>
> >> Regards,
> >> Edwin
> >>
> >>
>
>
>
> --
> -
> Noble Paul
>


Re: slow indexing when keys are verious

2017-04-12 Thread moscovig
Hi all

We have changed all the Solr configs and commit parameters that Shawn
mentioned, but still: when inserting the same 300 documents from 20 threads we
see no latency, while inserting 300 different docs from the 20 threads is very
slow, and none of the CPU/RAM/disk/network metrics are high.

I am wondering if the problem might be related to the fact that, when
inserting 300 different docs from each thread, the key is the only field that
varies while the other fields are identical. Could many identical values in
the other fields, across different keys, be causing the latency?

As for latency related to doc routing, I don't see where it could affect us.
Could ZooKeeper be the bottleneck?

Thanks!
Gilad




--
View this message in context: 
http://lucene.472066.n3.nabble.com/slow-indexing-when-keys-are-verious-tp4327681p4329451.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using BasicAuth with SolrJ Code

2017-04-12 Thread Noble Paul
can u paste the stacktrace here

On Tue, Apr 11, 2017 at 1:19 PM, Zheng Lin Edwin Yeo
 wrote:
> I found from StackOverflow  that we should declare it this way:
> http://stackoverflow.com/questions/43335419/using-basicauth-with-solrj-code
>
>
> SolrRequest req = new QueryRequest(new SolrQuery("*:*"));//create a new
> request object
> req.setBasicAuthCredentials(userName, password);
> solrClient.request(req);
>
> Is that correct?
>
> For this, the NullPointerException is not coming out, but the SolrJ is
> still not able to get authenticated. I'm still getting Error Code 401 even
> after putting in this code.
>
> Any advice on which part of the SolrJ code should we place this code in?
>
> Regards,
> Edwin
>
>
> On 10 April 2017 at 23:50, Zheng Lin Edwin Yeo  wrote:
>
>> Hi,
>>
>> I have just set up the Basic Authentication Plugin in Solr 6.4.2 on
>> SolrCloud, and I am trying to modify my SolrJ code so that the code can go
>> through the authentication and do the indexing.
>>
>> I tried using the following code from the Solr Documentation
>> https://cwiki.apache.org/confluence/display/solr/Basic+Authentication+
>> Plugin.
>>
>> SolrRequest req ;//create a new request object
>> req.setBasicAuthCredentials(userName, password);
>> solrClient.request(req);
>>
>> However, the code complains that the req is not initialized.
>>
>> If I initialized it, it will be initialize as null.
>>
>> SolrRequest req = null;//create a new request object
>> req.setBasicAuthCredentials(userName, password);
>> solrClient.request(req);
>>
>> This will caused a null pointer exception.
>> Exception in thread "main" java.lang.NullPointerException
>>
>> How should we go about putting these codes, so that the error can be
>> prevented?
>>
>> Regards,
>> Edwin
>>
>>



-- 
-
Noble Paul