Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-11-02 Thread Tom Evans
On Mon, Nov 2, 2015 at 1:38 PM, fabigol  wrote:
> Thanks,
> everything works.
> I have two last questions:
> How can I make "clean" default to false (0) during an indexation?
>
> To conclude, I want to understand:
>
>
> Requests: 7 (1/s), Fetched: 452447 (45245/s), Skipped: 0, Processed: 17433
> (1743/s)
>
> What is "Requests"?
> What is "Fetched"?
> What is "Processed"?
>
> Thanks again for your answer.
>

Depends upon how DIH is configured - different things return different
numbers. For a SqlEntityProcessor, "Requests" is the number of SQL
queries issued, "Fetched" is the number of rows read from those queries, and
"Processed" is the number of documents processed by Solr.

> For the second question, I tried:
> 
> false
> 
>
> and
> true
> false
>

Putting things in "invariants" overrides whatever is passed for that
parameter in the request parameters. By putting "false" in invariants, you are making it impossible
to clean + index as part of DIH, because "clean" is always false.
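
For the "clean" question, the usual spot is "defaults" rather than
"invariants" in solrconfig.xml, so the value can still be overridden per
request. A rough sketch (handler name and config filename are placeholders):

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <!-- default only: a request with clean=true can still override it -->
      <str name="clean">false</str>
    </lst>
  </requestHandler>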

Cheers

Tom


Re: Very high memory and CPU utilization.

2015-11-02 Thread Toke Eskildsen
On Mon, 2015-11-02 at 14:17 +0100, Toke Eskildsen wrote:
> http://rosalind:52300/solr/collection1/select?q=%22der+se*%
> 22=json=true=false=true=domain
> 
> gets expanded to
> 
> parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
> author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
> svane* | description:\"kan svane\")) ())/no_coord"

Wrong copy-paste, sorry. The correct expansion of "der se*" is

"rawquerystring": "\"der se*\"",

"querystring": "\"der se*\"",

"parsedquery": "(+DisjunctionMaxQuery((content_text:se | author:der se*
| text:se | title:se | url:der se* | description:se)) ())/no_coord",

"parsedquery_toString": "+(content_text:se | author:der se* | text:se |
title:se | url:der se* | description:se) ()",

"QParser": "ExtendedDismaxQParser",



This supports jim's claim that "foo bar*" is probably not doing what you
(Modassar) think it is doing.


- Toke Eskildsen, State and University Library, Denmark




Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-11-02 Thread fabigol
For the second question, I tried:

false


and 
true
false

in solrconfig.xml, but without success.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Document-is-missing-mandatory-uniqueKey-field-id-tp4237067p4237699.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-11-02 Thread fabigol
Thanks,
everything works.
I have two last questions:
How can I make "clean" default to false (0) during an indexation?

To conclude, I want to understand:
 

Requests: 7 (1/s), Fetched: 452447 (45245/s), Skipped: 0, Processed: 17433
(1743/s)

What is the "requests"?
What is 'Fetched"?
What is "Processed"?

Thank again for your answer



--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Document-is-missing-mandatory-uniqueKey-field-id-tp4237067p4237693.html
Sent from the Solr - User mailing list archive at Nabble.com.


ways to affect on SpanMultiTermQueryWrapper.TopTermsSpanBooleanQueryRewrite

2015-11-02 Thread Dmitry Kan
Hi solr fans,

Are there ways to affect the strategy
behind SpanMultiTermQueryWrapper.TopTermsSpanBooleanQueryRewrite?

As it seems, at the moment the rewrite method loads at most N terms that
maximize the term score. How can this be changed to load the top terms by
frequency, for example?
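
For reference, this is roughly where that rewrite gets plugged in on the
Lucene side (just a sketch; the field name, prefix and the size of 50 are
arbitrary). Changing the strategy would mean supplying your own
SpanRewriteMethod here instead of the stock top-scoring one:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PrefixQuery;
  import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
  import org.apache.lucene.search.spans.SpanQuery;

  public class TopTermsRewriteSketch {
      // Wrap a prefix query so it can be used inside span queries, capping the
      // expansion at the 50 top-scoring terms (the stock rewrite's behaviour).
      public static SpanQuery prefixAsSpan() {
          SpanMultiTermQueryWrapper<PrefixQuery> wrapper =
              new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("text", "se")));
          wrapper.setRewriteMethod(
              new SpanMultiTermQueryWrapper.TopTermsSpanBooleanQueryRewrite(50));
          return wrapper;
      }
  }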

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Problem with the Content Field during Solr Indexing

2015-11-02 Thread Susheel Kumar
Hi Shruti,

If you are looking to index images to make them searchable (Image Search)
then you will have to look at LIRE (Lucene Image Retrieval)
http://www.lire-project.net/  and can follow Lire Solr Plugin at this site
https://bitbucket.org/dermotte/liresolr.

Thanks,
Susheel

On Sat, Oct 31, 2015 at 9:46 PM, Zheng Lin Edwin Yeo 
wrote:

> Hi Shruti,
>
> From what I understand, the /update/extract handler is for indexing
> rich-text documents, and does not support ".png" files.
>
> It only supports the following file formats: pdf, doc, docx, ppt, pptx,
> xls, xlsx, odt, odp, ods, ott, otp, ots, rtf, htm, html, txt, log
> If you use the default post.jar, I believe the other formats will get
> filtered out.
>
> When I tried to index a ".png" file in my custom handler, it just indexed "
> " in the content.
>
> Regards,
> Edwin
>
>
>
> On 31 October 2015 at 09:35, Shruti Mundra  wrote:
>
> > Hi Edwin,
> >
> > The file extension of the image file is ".png" and we are following this
> > url for indexing:
> > "
> >
> >
> http://blog.thedigitalgroup.com/vijaym/wp-content/uploads/sites/11/2015/07/SolrImageExtract.png
> > "
> >
> > Thanks and Regards,
> > Shruti Mundra
> >
> > On Thu, Oct 29, 2015 at 8:33 PM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> > >
> > wrote:
> >
> > > The "\n" actually means new line as decoded by Solr from the indexed
> > > document.
> > >
> > > What is your file extension of your image file, and which method are
> you
> > > using to do the indexing?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 30 October 2015 at 04:38, Shruti Mundra  wrote:
> > >
> > > > Hi,
> > > >
> > > > When I'm trying to index an image file directly to Solr, the content
> > > > attribute consists of trails of "\n"s and not the data.
> > > > We are successful in getting the metadata for that image.
> > > >
> > > > Can anyone help us out on how we could get the content along with the
> > > > metadata?
> > > >
> > > > Thanks!
> > > >
> > > > - Shruti Mundra
> > > >
> > >
> >
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread Toke Eskildsen
On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:

> The query q=network se* is quick enough in our system too. It takes
> around 3-4 seconds for around 8 million records.
> 
> The problem is with the same query as phrase. q="network se*".

I misunderstood your query then. I tried replicating it with
q="der se*"

http://rosalind:52300/solr/collection1/select?q=%22der+se*%
22=json=true=false=true=domain

gets expanded to

parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
svane* | description:\"kan svane\")) ())/no_coord"

The result was 1,043,258,271 hits in 15,211 ms


Interestingly enough, a search for 
q="kan svane*"
resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
billion+ documents. On that note,
q=se*
resulted in -951812427 hits in 194,276 ms.

Now this is interesting. The negative number seems to be caused by
grouping, but I finally got the response time up in the minutes. Still
no memory problems though. Hits without grouping were 3,343,154,869.

For comparison,
q=http
resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
count was 7,062,516,538. Twice the hits of 'se*' in half the time.

> I changed my SolrCloud setup from 12 shard to 8 shard and given each
> shard 30 GB of RAM on the same machine with same index size
> (re-indexed) but could not see the significant improvement for the
> query given.

Strange. I would have expected the extra free memory for disk space to
help performance.

> Also can you please share your experiences with respect to RAM, GC,
> solr cache setup etc as it seems by your comment that the SolrCloud
> environment you have is kind of similar to the one I work on?
> 
There is a short write up at
https://sbdevel.wordpress.com/net-archive-search/

- Toke Eskildsen, State and University Library, Denmark





Re: Solr Keyword query on a specific field.

2015-11-02 Thread Aaron Gibbons
The input for the title field is user based so a wide range of things can
be entered there.  Quoting the title is not what I'm looking for.  I also
checked and q.op is AND and MM is 100%.  In addition to the Title field the
user can also use general keywords so setting local params (df) to
something else would not work either to my knowledge.

To give you a better idea of what I'm trying to accomplish: I have a form
to allow users to search on Title, Keywords and add a location. The correct
operators are applied between each of these and also for the main keywords
themselves.  The only issue is with the default operator being applied
within the Title section's keywords. My goal is to have the Title keywords
work the same as the general keywords but only be applied to the title
field vs the default text field.

On Fri, Oct 30, 2015 at 6:35 PM, davidphilip cherian <
davidphilipcher...@gmail.com> wrote:

> >> "Is there any way to have a single field search use the same keyword
> search logic as the default query?"
> Do a phrase search, with double quotes surrounding the multiple keywords,
> it should work.
>
> Try q=title:("Test Keywords")
>
> You could possibly try adding this q.op as local param to query as shown
> below.
>
> https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
>
> If you are using edismax query parser, check for what is mm pram
> set. q.op=AND => mm=100%; q.op=OR => mm=0%)
>
> https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29
>
>
> On Fri, Oct 30, 2015 at 3:27 PM, Aaron Gibbons <
> agibb...@synergydatasystems.com> wrote:
>
> > Is there any way to have a single field search use the same keyword
> search
> > logic as the default query? I define q.op as AND in my query which gets
> > applied to any main keywords but any keywords I'm trying to use within a
> > field do not get the same logic applied.
> > Example:
> > q=(title:(Test Keywords)) the space is treated as OR regardless of q.op
> > q=(Test Keywords) the space is defined by q.op which is AND
> >
> > Using the correct operators (AND OR * - +...) it works great as I have it
> > defined. There's just this one little caveat when you use spaces between
> > keywords expecting the q.op operator to be applied.
> > Thanks,
> > Aaron
> >
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
Well, it seems that doing q="network se*" works, but not in the way you
expect. Doing q="network se*" would not trigger a prefix query, and the
"*" character would be treated like any other character. I suspect that your
query is in fact "network se" (assuming you're using a StandardTokenizer) and
that the word "se" is very popular in your documents. That would explain
the slow response time. Bottom line: doing "network se*" will not
trigger a prefix query at all (I may be wrong, but this is the expected
behaviour for Solr up to 4.3).
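
One way to check what your own setup does (hostname, collection and field
names below are placeholders) is to look at the parsedquery in the debug
output:

  curl 'http://localhost:8983/solr/collection1/select?q=field:%22network+se*%22&debugQuery=true&rows=0&wt=json'

If the parsedquery shows a plain phrase on "network se" rather than anything
prefix-like, the "*" is indeed being treated as a literal character.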

2015-11-02 13:47 GMT+01:00 Modassar Ather :

> The problem is with the same query as phrase. q="network se*".
>
> The last . is fullstops for the sentence and the query is q=field:"network
> se*"
>
> Best,
> Modassar
>
> On Mon, Nov 2, 2015 at 6:10 PM, jim ferenczi 
> wrote:
>
> > Oops, I did not read the thread carefully.
> > *The problem is with the same query as phrase. q="network se*".*
> > I was not aware that you could do that with Solr ;). I would say this is
> > expected because in such a case, if the number of expansions for "se*" is
> > big, then you would have to check the positions for a significant number
> > of words. I don't know if there is a limitation on the number of
> > expansions for a prefix query contained in a phrase query, but I would
> > look at this parameter first (limit the number of expansions per prefix
> > search to, say, the N most significant words based on word frequency).
> >
> > 2015-11-02 13:36 GMT+01:00 jim ferenczi :
> >
> > >
> > >
> > >
> > > *I am not able to get  the above point. So when I start Solr with 28g
> > RAM,
> > > for all the activities related to Solr it should not go beyond 28g. And
> > the
> > > remaining heap will be used for activities other than Solr. Please help
> > me
> > > understand.*
> > >
> > > Well, those 28GB of heap are the memory "reserved" for your Solr
> > > application, though some parts of the index (not to say all) are
> > > retrieved via MMap (if you use the default MMapDirectory), which does
> > > not use the heap at all. This is a very important part of Lucene/Solr:
> > > the heap should be sized in a way that leaves a significant amount of
> > > RAM available for the index. If not, then you rely on the speed of your
> > > disk; if you have SSDs it's better, but reads are still significantly
> > > slower with SSDs than with direct RAM access. Another thing to keep in
> > > mind is that mmap will always try to put things in RAM, which is why I
> > > suspect that swap activity is killing your performance.
> > >
> > > 2015-11-02 11:55 GMT+01:00 Modassar Ather :
> > >
> > >> Thanks Jim for your response.
> > >>
> > >> The remaining size after you removed the heap usage should be reserved
> > for
> > >> the index (not only the other system activities).
> > >> I am not able to get  the above point. So when I start Solr with 28g
> > RAM,
> > >> for all the activities related to Solr it should not go beyond 28g.
> And
> > >> the
> > >> remaining heap will be used for activities other than Solr. Please
> help
> > me
> > >> understand.
> > >>
> > >> *Also the CPU utilization goes upto 400% in few of the nodes:*
> > >> You said that only machine is used so I assumed that 400% cpu is for a
> > >> single process (one solr node), right ?
> > >> Yes you are right that 400% is for single process.
> > >> The disks are SSDs.
> > >>
> > >> Regards,
> > >> Modassar
> > >>
> > >> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi 
> > >> wrote:
> > >>
> > >> > *if it correlates with the bad performance you're seeing. One
> > important
> > >> > thing to notice is that a significant part of your index needs to be
> > in
> > >> RAM
> > >> > (especially if you're using SSDs) in order to achieve good
> > performance.*
> > >> >
> > >> > Especially if you're not using SSDs, sorry ;)
> > >> >
> > >> > 2015-11-02 11:38 GMT+01:00 jim ferenczi :
> > >> >
> > >> > > 12 shards with 28GB for the heap and 90GB for each index means
> that
> > >> you
> > >> > > need at least 336GB for the heap (assuming you're using all of it
> > >> which
> > >> > may
> > >> > > be easily the case considering the way the GC is handling memory)
> > and
> > >> ~=
> > >> > > 1TB for the index. Let's say that you don't need your entire index
> > in
> > >> > RAM,
> > >> > > the problem as I see it is that you don't have enough RAM for your
> > >> index
> > >> > +
> > >> > > heap. Assuming your machine has 370GB of RAM there are only 34GB
> > left
> > >> for
> > >> > > your index, 1TO/34GB means that you can only have 1/30 of your
> > entire
> > >> > index
> > >> > > in RAM. I would advise you to check the swap activity on the
> machine
> > >> and
> > >> > > see if it correlates with the bad performance you're seeing. One
> > >> > important
> > >> > > thing to notice is that a significant 

Re: Queries for many terms

2015-11-02 Thread Erick Erickson
Or a really simple-minded approach: just use the frequency
as a ratio of numFound to estimate the terms matched.

Doesn't work of course if you need precise counts.

On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull
 wrote:
> How precise do you need to be?
>
> I wonder if you could efficiently approximate "number of matches" by
> getting the document frequency of each term. I realize this is an
> approximation, but the highest document frequency would be your floor.
>
> Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
> lowest.
>
> OK the following algorithm could refine your floor
> - count = t1.docfreq
> - Then issue a query for NOT t1, this eliminates many candidate documents
> to improve performance
> - Build a bloom filter or other set-membership data structure for t2...tn
> https://en.wikipedia.org/wiki/Bloom_filter
> - In a PostFilter(?) Lucene Collector(?) scan each collected/returned
> document and do a set membership test against the bloom filter. If member,
> then increment your count.
>
> It's O(numDocs that don't match t1)
>
> This is me just thinking out loud, but maybe it'll trigger thoughts in
> others...
> -Doug
>
>
> On Mon, Nov 2, 2015 at 12:14 PM, Upayavira  wrote:
>
>> I have a scenario where I want to search for documents that contain many
>> terms (maybe 100s or 1000s), and then know the number of terms that
>> matched. I'm happy to implement this as a query object/parser.
>>
>> I understand that Lucene isn't well suited to this scenario. Any
>> suggestions as to how to make this more efficient? Does the TermsQuery
>> work differently from the BooleanQuery regarding large numbers of terms?
>>
>> Upayavira
>>
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.


Re: Queries for many terms

2015-11-02 Thread Doug Turnbull
How precise do you need to be?

I wonder if you could efficiently approximate "number of matches" by
getting the document frequency of each term. I realize this is an
approximation, but the highest document frequency would be your floor.

Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
lowest.

OK the following algorithm could refine your floor
- count = t1.docfreq
- Then issue a query for NOT t1, this eliminates many candidate documents
to improve performance
- Build a bloom filter or other set-membership data structure for t2...tn
https://en.wikipedia.org/wiki/Bloom_filter
- In a PostFilter(?) Lucene Collector(?) scan each collected/returned
document and do a set membership test against the bloom filter. If member,
then increment your count.

It's O(numDocs that don't match t1)

This is me just thinking out loud, but maybe it'll trigger thoughts in
others...
-Doug
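
To make the counting step concrete, here is a toy sketch of that refinement.
It is not a Solr PostFilter: it assumes Guava for the Bloom filter and fakes
the candidate documents and t1.docfreq as in-memory data.

  import com.google.common.hash.BloomFilter;
  import com.google.common.hash.Funnels;
  import java.nio.charset.StandardCharsets;
  import java.util.List;
  import java.util.Set;

  public class MatchFloorSketch {
      public static void main(String[] args) {
          // t2..tn go into the Bloom filter
          List<String> otherTerms = List.of("t2", "t3", "t4");
          BloomFilter<String> filter = BloomFilter.create(
                  Funnels.stringFunnel(StandardCharsets.UTF_8), otherTerms.size());
          otherTerms.forEach(filter::put);

          long count = 1000;  // start the floor at t1.docfreq (made-up number)
          // stand-in for the docs collected by the "NOT t1" query
          List<Set<String>> docsWithoutT1 = List.of(Set.of("t3", "x"), Set.of("y"));
          for (Set<String> docTerms : docsWithoutT1) {
              // membership test against the Bloom filter (false positives only)
              if (docTerms.stream().anyMatch(filter::mightContain)) {
                  count++;
              }
          }
          System.out.println("refined floor = " + count);
      }
  }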


On Mon, Nov 2, 2015 at 12:14 PM, Upayavira  wrote:

> I have a scenario where I want to search for documents that contain many
> terms (maybe 100s or 1000s), and then know the number of terms that
> matched. I'm happy to implement this as a query object/parser.
>
> I understand that Lucene isn't well suited to this scenario. Any
> suggestions as to how to make this more efficient? Does the TermsQuery
> work differently from the BooleanQuery regarding large numbers of terms?
>
> Upayavira
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: contributor request

2015-11-02 Thread Erick Erickson
NP. I've occasionally taken to changing to another window and
refreshing the contributor page; it seems to come back a lot faster than
waiting, which is very weird.

On Mon, Nov 2, 2015 at 9:01 AM, Steve Rowe  wrote:
> Yes, sorry, the wiki took so long to come back after changing it to include 
> Alex’s username that I forgot to send notification…  Thanks Erick.
>
>> On Oct 31, 2015, at 11:27 PM, Erick Erickson  wrote:
>>
>> Looks like Steve added you today, you should be all set.
>>
>> On Sat, Oct 31, 2015 at 12:50 PM, Alex  wrote:
>>> Oh, shoot, forgot to include my wiki username. Its "AlexYumas" sorry about
>>> that stupid me
>>>
>>> On Sat, Oct 31, 2015 at 10:48 PM, Alex  wrote:
>>>
 Hi,

 Please kindly add me to the Solr wiki contributors list. The app we're
 developing (Jitbit Help) is using Apache Solr to power our knowledge-base
 search engine, customers love it. (we were using MS Fulltext indexing
 service before, but it's a huge PITA).

 Thanks

>


Re: warning

2015-11-02 Thread Modassar Ather
The information is not sufficient to say anything definite. You can refer to
the Solr log to find the reason for the log replay.
You can also check whether the index is as expected, e.g. the number of
documents indexed.

Regards,
Modassar

On Tue, Nov 3, 2015 at 11:11 AM, Midas A  wrote:

> Thanks Modassar for replying,
>
> Could you please elaborate on what would have happened when we were getting
> this kind of warning?
>
> Regards,
> Abhishek Tiwari
>
> On Mon, Nov 2, 2015 at 6:00 PM, Modassar Ather 
> wrote:
>
> > Normally the tlog is replayed if the Solr server crashes for some reason;
> > when restarted, it tries to recover from the crash gracefully.
> > You can look into the following documentation, which explains transaction
> > logs and related Solr internals.
> >
> >
> >
> http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > regards,
> > Modassar
> >
> > On Mon, Nov 2, 2015 at 12:22 PM, Midas A  wrote:
> >
> > > Please explain following warning
> > >
> > > Starting log replay
> > > tlog{file=/mnt/vol1/path/data/tlog/tlog.0060544 refcount=2}
> > > active=false starting pos=0
> > >
> > > Is there any harm with this error ?
> > >
> >
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread Toke Eskildsen
On Tue, 2015-11-03 at 11:09 +0530, Modassar Ather wrote:
> It is around 90GB of index (around 8 million documents) on one shard and
> there are 12 such shards. As per my understanding the sharding is required
> for this case. Please help me understand if it is not required.

Except for an internal limit of 2 billion documents/shard (or 2 billion
unique values in a field in a single shard), there are no requirements
as such.

Our shards are 900GB / 200M+ documents each and work well for our use case,
but it all depends on what you are doing. Your heaps are quite large
already, so merging into a single shard would probably require a heap so
large that you would run into trouble with garbage collection.


Your problem seems to be query processing speed. If your machine is not
maxed out by many concurrent requests, sharding should help you there:
As you have noticed, it allows the search to take advantage of multiple
processors.


- Toke Eskildsen, State and University Library, Denmark




Re: Very high memory and CPU utilization.

2015-11-02 Thread Walter Underwood
One rule of thumb for Solr is to shard after you reach 100 million documents. 
With large documents, you might want to shard sooner.

We are running an unsharded index of 7 million documents (55GB) without 
problems.

The EdgeNgramFilter generates a set of prefix terms for each term in the 
document. For the term “secondary”, it would generate:

s
se
sec
seco
secon
second
seconda
secondar
secondary

Obviously, this makes the index larger. But it makes prefix match a simple 
lookup, without needing wildcards.
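
For reference, a field type along those lines might look roughly like this in
schema.xml (type name and gram sizes are arbitrary; only the index-side
analyzer does the edge n-gramming, so the query side matches whole prefixes):

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- "secondary" is indexed as s, se, sec, ... secondary -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>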

Again, we can help you more if you describe what you are trying to do.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 2, 2015, at 9:39 PM, Modassar Ather  wrote:
> 
> Thanks Walter for your response,
> 
> It is around 90GB of index (around 8 million documents) on one shard and
> there are 12 such shards. As per my understanding the sharding is required
> for this case. Please help me understand if it is not required.
> 
> We have requirements where we need full wild card support to be provided to
> our users.
> I will try using EdgeNgramFilter. Can you please help me understand if
> EdgeNgramFilter can be a replacement of wild cards?
> There are situations where the words may be extended with some special
> characters, e.g. for se* there can be a match "secondary-school" which also
> needs to be considered.
> 
> Regards,
> Modassar
> 
> 
> 
> On Mon, Nov 2, 2015 at 10:17 PM, Walter Underwood 
> wrote:
> 
>> To back up a bit, how many documents are in this 90GB index? You might not
>> need to shard at all.
>> 
>> Why are you sending a query with a trailing wildcard? Are you matching the
>> prefix of words, for query completion? If so, look at the suggester, which
>> is designed to solve exactly that. Or you can use the EdgeNgramFilter to
>> index prefixes. That will make your index larger, but prefix searches will
>> be very fast.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 2, 2015, at 5:17 AM, Toke Eskildsen 
>> wrote:
>>> 
>>> On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:
>>> 
 The query q=network se* is quick enough in our system too. It takes
 around 3-4 seconds for around 8 million records.
 
 The problem is with the same query as phrase. q="network se*".
>>> 
>>> I misunderstood your query then. I tried replicating it with
>>> q="der se*"
>>> 
>>> http://rosalind:52300/solr/collection1/select?q=%22der+se*%
>>> 22=json=true=false=true=domain
>>> 
>>> gets expanded to
>>> 
>>> parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
>>> author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
>>> svane* | description:\"kan svane\")) ())/no_coord"
>>> 
>>> The result was 1,043,258,271 hits in 15,211 ms
>>> 
>>> 
>>> Interestingly enough, a search for
>>> q="kan svane*"
>>> resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
>>> billion+ documents. On that note,
>>> q=se*
>>> resulted in -951812427 hits in 194,276 ms.
>>> 
>>> Now this is interesting. The negative number seems to be caused by
>>> grouping, but I finally got the response time up in the minutes. Still
>>> no memory problems though. Hits without grouping were 3,343,154,869.
>>> 
>>> For comparison,
>>> q=http
>>> resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
>>> count was 7,062,516,538. Twice the hits of 'se*' in half the time.
>>> 
 I changed my SolrCloud setup from 12 shard to 8 shard and given each
 shard 30 GB of RAM on the same machine with same index size
 (re-indexed) but could not see the significant improvement for the
 query given.
>>> 
>>> Strange. I would have expected the extra free memory for disk space to
>>> help performance.
>>> 
 Also can you please share your experiences with respect to RAM, GC,
 solr cache setup etc as it seems by your comment that the SolrCloud
 environment you have is kind of similar to the one I work on?
 
>>> There is a short write up at
>>> https://sbdevel.wordpress.com/net-archive-search/
>>> 
>>> - Toke Eskildsen, State and University Library, Denmark
>>> 
>>> 
>>> 
>> 
>> 



RE: language plugin

2015-11-02 Thread Chaushu, Shani
Hi
When I make an atomic update (set field), whether on the content field or on
another field, the language field becomes generic. Meaning, detection doesn't
work on the atomic update, only on the first insert. Even if the language was
detected the first time, it just becomes generic after the update.
Any idea?

The chain is



  
title,content,text
language_t
language_all_t
generic
false 
0.8


  



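A plausible reconstruction of that chain (the archive stripped the original
XML tags; the parameter names below come from the stock LangDetect update
processor, and mapping the surviving values onto them is a guess, not a
verbatim copy):

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">title,content,text</str>
      <str name="langid.langField">language_t</str>
      <str name="langid.langsField">language_all_t</str>
      <str name="langid.fallback">generic</str>
      <bool name="langid.overwrite">false</bool>
      <float name="langid.threshold">0.8</float>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>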
Thanks,
Shani




-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Thursday, October 29, 2015 17:04
To: solr-user@lucene.apache.org
Subject: Re: language plugin

Are you trying to do an atomic update without the content field? If so, it 
sounds like Solr needs an enhancement (bug fix?) so that language detection 
would be skipped if the input field is not present. Or maybe that could be an 
option.


-- Jack Krupansky

On Thu, Oct 29, 2015 at 3:25 AM, Chaushu, Shani 
wrote:

> Hi,
>  I'm using solr language detection plugin on field name "content" 
> (solr 4.10, plugin LangDetectLanguageIdentifierUpdateProcessorFactory)
> When I'm indexing  on the first time it works fine, but if I want to 
> set one field again (regardless if it's the content or not) if goes to 
> its default language. If I'm setting other field I would like the 
> language to stay the way it was before, and o don't want to insert all 
> the content again. There is an option to set the plugin that it won't 
> calculate again the language? (put langid.overwrite to false didn't 
> work)
>
> Thanks,
> Shani
>
>
> -
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for 
> the sole use of the intended recipient(s). Any review or distribution 
> by others is strictly prohibited. If you are not the intended 
> recipient, please contact the sender and delete all copies.
>
-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: Queries for many terms

2015-11-02 Thread Upayavira
Let's say we're trying to do document to document matching (not with
MLT). We have a shingling analysis chain. The query is a document, which
is itself shingled. We then look up those shingles in the index. The %
of shingles found is in some sense a marker as to the extent to which
the documents are similar.

We could also index the number of shingles in a document, and include
that into the overall score, as a short document might be entirely
contained in a larger one but not really be a match.

I would code this in Java as a query object (because some clever person
called Doug wrote an excellent blog post on how to write query objects),
so really the important part is how to do the matching on many terms
efficiently.
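
For what it's worth, the shingling side of that might look roughly like this
in schema.xml (type name and shingle sizes are arbitrary):

  <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- emit 2-word shingles only, so docs and query docs become sets of shingles -->
      <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
              outputUnigrams="false"/>
    </analyzer>
  </fieldType>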

Upayavira

On Mon, Nov 2, 2015, at 06:47 PM, Erick Erickson wrote:
> Or a really simple-minded approach: just use the frequency
> as a ratio of numFound to estimate the terms matched.
> 
> Doesn't work of course if you need precise counts.
> 
> On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull
>  wrote:
> > How precise do you need to be?
> >
> > I wonder if you could efficiently approximate "number of matches" by
> > getting the document frequency of each term. I realize this is an
> > approximation, but the highest document frequency would be your floor.
> >
> > Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
> > lowest.
> >
> > OK the following algorithm could refine your floor
> > - count = t1.docfreq
> > - Then issue a query for NOT t1, this eliminates many candidate documents
> > to improve performance
> > - Build a bloom filter or other set-membership data structure for t2...tn
> > https://en.wikipedia.org/wiki/Bloom_filter
> > - In a PostFilter(?) Lucene Collector(?) scan each collected/returned
> > document and do a set membership test against the bloom filter. If member,
> > then increment your count.
> >
> > It's O(numDocs that don't match t1)
> >
> > This is me just thinking out loud, but maybe it'll trigger thoughts in
> > others...
> > -Doug
> >
> >
> > On Mon, Nov 2, 2015 at 12:14 PM, Upayavira  wrote:
> >
> >> I have a scenario where I want to search for documents that contain many
> >> terms (maybe 100s or 1000s), and then know the number of terms that
> >> matched. I'm happy to implement this as a query object/parser.
> >>
> >> I understand that Lucene isn't well suited to this scenario. Any
> >> suggestions as to how to make this more efficient? Does the TermsQuery
> >> work differently from the BooleanQuery regarding large numbers of terms?
> >>
> >> Upayavira
> >>
> >
> >
> >
> > --
> > *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> > , LLC | 240.476.9983
> > Author: Relevant Search 
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless
> > of whether attachments are marked as such.


Re: warning

2015-11-02 Thread Midas A
Thanks Modassar for replying,

Could you please elaborate on what would have happened when we were getting
this kind of warning?

Regards,
Abhishek Tiwari

On Mon, Nov 2, 2015 at 6:00 PM, Modassar Ather 
wrote:

> Normally the tlog is replayed if the Solr server crashes for some reason;
> when restarted, it tries to recover from the crash gracefully.
> You can look into the following documentation, which explains transaction
> logs and related Solr internals.
>
>
> http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 12:22 PM, Midas A  wrote:
>
> > Please explain following warning
> >
> > Starting log replay
> > tlog{file=/mnt/vol1/path/data/tlog/tlog.0060544 refcount=2}
> > active=false starting pos=0
> >
> > Is there any harm with this error ?
> >
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
Thanks Walter for your response,

It is around 90GB of index (around 8 million documents) on one shard and
there are 12 such shards. As per my understanding the sharding is required
for this case. Please help me understand if it is not required.

We have requirements where we need full wild card support to be provided to
our users.
I will try using EdgeNgramFilter. Can you please help me understand if
EdgeNgramFilter can be a replacement of wild cards?
There are situations where the words may be extended with some special
characters, e.g. for se* there can be a match "secondary-school" which also
needs to be considered.

Regards,
Modassar



On Mon, Nov 2, 2015 at 10:17 PM, Walter Underwood 
wrote:

> To back up a bit, how many documents are in this 90GB index? You might not
> need to shard at all.
>
> Why are you sending a query with a trailing wildcard? Are you matching the
> prefix of words, for query completion? If so, look at the suggester, which
> is designed to solve exactly that. Or you can use the EdgeNgramFilter to
> index prefixes. That will make your index larger, but prefix searches will
> be very fast.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Nov 2, 2015, at 5:17 AM, Toke Eskildsen 
> wrote:
> >
> > On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:
> >
> >> The query q=network se* is quick enough in our system too. It takes
> >> around 3-4 seconds for around 8 million records.
> >>
> >> The problem is with the same query as phrase. q="network se*".
> >
> > I misunderstood your query then. I tried replicating it with
> > q="der se*"
> >
> > http://rosalind:52300/solr/collection1/select?q=%22der+se*%
> > 22=json=true=false=true=domain
> >
> > gets expanded to
> >
> > parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
> > author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
> > svane* | description:\"kan svane\")) ())/no_coord"
> >
> > The result was 1,043,258,271 hits in 15,211 ms
> >
> >
> > Interestingly enough, a search for
> > q="kan svane*"
> > resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
> > billion+ documents. On that note,
> > q=se*
> > resulted in -951812427 hits in 194,276 ms.
> >
> > Now this is interesting. The negative number seems to be caused by
> > grouping, but I finally got the response time up in the minutes. Still
> > no memory problems though. Hits without grouping were 3,343,154,869.
> >
> > For comparison,
> > q=http
> > resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
> > count was 7,062,516,538. Twice the hits of 'se*' in half the time.
> >
> >> I changed my SolrCloud setup from 12 shard to 8 shard and given each
> >> shard 30 GB of RAM on the same machine with same index size
> >> (re-indexed) but could not see the significant improvement for the
> >> query given.
> >
> > Strange. I would have expected the extra free memory for disk space to
> > help performance.
> >
> >> Also can you please share your experiences with respect to RAM, GC,
> >> solr cache setup etc as it seems by your comment that the SolrCloud
> >> environment you have is kind of similar to the one I work on?
> >>
> > There is a short write up at
> > https://sbdevel.wordpress.com/net-archive-search/
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
> >
> >
>
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread Toke Eskildsen
On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
> I have a setup of 12 shard cluster started with 28gb memory each on a
> single server. There are no replica. The size of index is around 90gb on
> each shard. The Solr version is 5.2.1.

That is 12 machines, running a shard each?

What is the total amount of physical memory on each machine?

> When I query "network se*", the memory utilization goes upto 24-26 gb and
> the query takes around 3+ minutes to execute. Also the CPU utilization goes
> upto 400% in few of the nodes.

Well, se* probably expands to a great deal of documents, but a huge bump
in memory utilization and 3 minutes+ sounds strange.

- What are your normal query times?
- How many hits do you get from 'network se*'?
- How many results do you return (the rows-parameter)?
- If you issue a query without wildcards, but with approximately the
same amount of hits as 'network se*', how long does it take?

> Why the CPU utilization is so high and more than one core is used.
> As far as I understand querying is single threaded.

That is strange, yes. Have you checked the logs to see if something
unexpected is going on while you test?

> How can I disable replication(as it is implicitly enabled) permanently as
> in our case we are not using it but can see warnings related to leader
> election?

If you are using spinning drives and only have 32GB of RAM in total in
each machine, you are probably struggling just to keep things running.


- Toke Eskildsen, State and University Library, Denmark




Re: SSL on Solr with CA signed certificate

2015-11-02 Thread Alexandre Rafalovitch
I think (not tested) that it should be safe to select Tomcat from the
dropdown, as both use keytool (bundled with JDK) to generate the CSR.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 November 2015 at 09:53, davidphilip cherian
 wrote:
> The doc[1] on reference guide provides steps related to setting up ssl with
> self signed certificate. My employer wants me to set up and test with CA
> signed certificate.
> When I go to buy[2] a ssl certificate(just for testing), it asks for
> specific web server name and jetty is not listed on it.
>
> Is there something else that I need to look for, to enable ssl on solr,
> with CA signed certificate? Has anyone tried doing this instead of
> selfsigned one? Any further inputs? reference blogs?
>
>
> [1] https://cwiki.apache.org/confluence/display/solr/Enabling+SSL
> [2] https://www.instantssl.com/free-ssl-certificate.html


Re: Many files /dataImport in same project

2015-11-02 Thread Alexandre Rafalovitch
On 2 November 2015 at 11:30, Gora Mohanty  wrote:
> As per my last
> follow-up, there is currently no way to have DIH automatically pick up
> different data-config files without manually editing the DIH
> configuration each time.

I missed the previous discussions, but the DIH config file is given in a
query parameter. So, if there is a bunch of them on a file system, one
could probably do
find . -name "*.dihconf" | xargs curl ...
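
Spelled out a bit more (core name and URL are placeholders, and this assumes
the config really can be passed per request as described above; note that DIH
runs one import per handler at a time, so back-to-back calls may need to wait
for the previous import to finish):

  find . -name "*.dihconf" | while read f; do
    curl "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&config=$f"
  done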

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Queries for many terms

2015-11-02 Thread Upayavira
I have a scenario where I want to search for documents that contain many
terms (maybe 100s or 1000s), and then know the number of terms that
matched. I'm happy to implement this as a query object/parser.

I understand that Lucene isn't well suited to this scenario. Any
suggestions as to how to make this more efficient? Does the TermsQuery
work differently from the BooleanQuery regarding large numbers of terms?

Upayavira


Re: Many files /dataImport in same project

2015-11-02 Thread Gora Mohanty
On 2 November 2015 at 21:50, fabigol  wrote:
> Hi,
>  I have many DataImport config files.
> I want to start them all at once instead of launching DataImport for each file.
> Is it possible?

Not to be antagonistic, but did you not ask this before, and have
various people not tried to help you?

With all due respect, it seems that you need to understand your
specific setup better in order to ask more specific questions. It
would be good if you stuck to one thread for that. As per my last
follow-up, there is currently no way to have DIH automatically pick up
different data-config files without manually editing the DIH
configuration each time. This is probably unlikely to get fixed as one
can put all DIH entities into one file, and import each as needed.
Further, if what you need is complex requirements in populating Solr,
it is advisable to use SolrJ, or similar libraries for other
languages.

Regards,
Gora


Re: contributor request

2015-11-02 Thread Steve Rowe
Yes, sorry, the wiki took so long to come back after changing it to include 
Alex’s username that I forgot to send notification…  Thanks Erick.
 
> On Oct 31, 2015, at 11:27 PM, Erick Erickson  wrote:
> 
> Looks like Steve added you today, you should be all set.
> 
> On Sat, Oct 31, 2015 at 12:50 PM, Alex  wrote:
>> Oh, shoot, forgot to include my wiki username. Its "AlexYumas" sorry about
>> that stupid me
>> 
>> On Sat, Oct 31, 2015 at 10:48 PM, Alex  wrote:
>> 
>>> Hi,
>>> 
>>> Please kindly add me to the Solr wiki contributors list. The app we're
>>> developing (Jitbit Help) is using Apache Solr to power our knowledge-base
>>> search engine, customers love it. (we were using MS Fulltext indexing
>>> service before, but it's a huge PITA).
>>> 
>>> Thanks
>>> 



Many files /dataImport in same project

2015-11-02 Thread fabigol
Hi,
 I have many DataImport config files.
I want to start them all at once instead of launching DataImport for each file.
Is it possible?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-files-dataImport-in-same-project-tp4237731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Very high memory and CPU utilization.

2015-11-02 Thread Walter Underwood
To back up a bit, how many documents are in this 90GB index? You might not need 
to shard at all.

Why are you sending a query with a trailing wildcard? Are you matching the 
prefix of words, for query completion? If so, look at the suggester, which is 
designed to solve exactly that. Or you can use the EdgeNgramFilter to index 
prefixes. That will make your index larger, but prefix searches will be very 
fast.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 2, 2015, at 5:17 AM, Toke Eskildsen  wrote:
> 
> On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:
> 
>> The query q=network se* is quick enough in our system too. It takes
>> around 3-4 seconds for around 8 million records.
>> 
>> The problem is with the same query as phrase. q="network se*".
> 
> I misunderstood your query then. I tried replicating it with
> q="der se*"
> 
> http://rosalind:52300/solr/collection1/select?q=%22der+se*%
> 22=json=true=false=true=domain
> 
> gets expanded to
> 
> parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
> author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
> svane* | description:\"kan svane\")) ())/no_coord"
> 
> The result was 1,043,258,271 hits in 15,211 ms
> 
> 
> Interestingly enough, a search for 
> q="kan svane*"
> resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
> billion+ documents. On that note,
> q=se*
> resulted in -951812427 hits in 194,276 ms.
> 
> Now this is interesting. The negative number seems to be caused by
> grouping, but I finally got the response time up in the minutes. Still
> no memory problems though. Hits without grouping were 3,343,154,869.
> 
> For comparison,
> q=http
> resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
> count was 7,062,516,538. Twice the hits of 'se*' in half the time.
> 
>> I changed my SolrCloud setup from 12 shard to 8 shard and given each
>> shard 30 GB of RAM on the same machine with same index size
>> (re-indexed) but could not see the significant improvement for the
>> query given.
> 
> Strange. I would have expected the extra free memory for disk space to
> help performance.
> 
>> Also can you please share your experiences with respect to RAM, GC,
>> solr cache setup etc as it seems by your comment that the SolrCloud
>> environment you have is kind of similar to the one I work on?
>> 
> There is a short write up at
> https://sbdevel.wordpress.com/net-archive-search/
> 
> - Toke Eskildsen, State and University Library, Denmark
> 
> 
> 



creating collection with solr5 - missing config data

2015-11-02 Thread tedsolr
I'm trying to plan a migration from a standalone solr instance to the
solrcloud. I understand the basic steps but am getting tripped up just
trying to create a new collection. For simplicity, I'm testing this on a
single machine, so I was trying to use the embedded zookeeper. I can't
figure out how to upload a config set to the embedded zookeeper. (I hope to
use the embedded zookeepers on all dev environments)

1. start first node: solr start -c
2. start second node: solr start -c -p 8984 -s solr2
3. create collection using API: curl
...collections?action=CREATE=mycollection=1
ERROR - no config found
4. copy standalone core "data" folder to mycollection

So, how do I get my shared config data (I'm using configsets in my
standalone model) uploaded to zookeeper?

thanks!
Solr 5.2.1



--
View this message in context: 
http://lucene.472066.n3.nabble.com/creating-collection-with-solr5-missing-config-data-tp4237802.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud breaks and does not recover

2015-11-02 Thread Erick Erickson
Without more data, I'd guess one of two things:

1> you're seeing stop-the-world GC pauses that cause Zookeeper to
think the node is unresponsive, which puts a node into recovery and
things go bad from there.

2> Somewhere in your Solr logs you'll see OutOfMemory errors, which can
also cascade into a bunch of problems.

In general it's an anti-pattern to allocate such a large portion of
your physical memory to the JVM, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html



Best,
Erick



On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser  wrote:
> Hey there,
>
> we are running a SolrCloud, which has 4 nodes, same config. Each node
> has 8gb memory, 6GB assigned to the JVM. This is maybe too much, but
> worked for a long time.
>
> We currently run with 2 shards, 2 replicas and 11 collections. The
> complete data-dir is about 5.3 GB.
> I think we should move some JVM heap back to the OS.
>
> We are running Solr 5.2.1., as I could not see any related bugs to
> SolrCloud in the release notes for 5.3.0 and 5.3.1, we did not bother
> to upgrade first.
>
> One of our nodes (node A) reports these errors:
>
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
> version (expected 2, but 101) or the data in not in 'javabin' format
>
> Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171
>
> And shortly after (4 seconds) this happens on a *different* node (Node B):
>
> Stopping recovery for core=suggestion coreNodeName=core_node2
>
> No Stacktrace for this, but happens for all 11 collections.
>
> 6 seconds after that Node C reports these errors:
>
> org.apache.solr.common.SolrException:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /configs/customers/params.json
>
> Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8
>
> This also happens for 11 collections.
>
> And then different errors happen:
>
> OverseerAutoReplicaFailoverThread had an error in its thread work
> loop.:org.apache.solr.common.SolrException: Error reading cluster
> properties
>
> cancelElection did not find election node to remove
> /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_000112
>
> At that point the cluster is broken and stops responding to most
> queries. At the same time, ZooKeeper looks okay.
>
> The cluster cannot self-heal from that situation and we are forced to
> take manual action and restart node after node and hope that solrcloud
> eventually recovers. Which sometimes takes several minutes and several
> restarts from various nodes.
>
> We can provide more logdata if needed.
>
> Is there anything where we can start digging to find the underlying
> error for that problem?
>
> Thanks in advance
> Björn


SolrCloud breaks and does not recover

2015-11-02 Thread Björn Häuser
Hey there,

we are running a SolrCloud, which has 4 nodes, same config. Each node
has 8gb memory, 6GB assigned to the JVM. This is maybe too much, but
worked for a long time.

We currently run with 2 shards, 2 replicas and 11 collections. The
complete data-dir is about 5.3 GB.
I think we should move some JVM heap back to the OS.

We are running Solr 5.2.1., as I could not see any related bugs to
SolrCloud in the release notes for 5.3.0 and 5.3.1, we did not bother
to upgrade first.

One of our nodes (node A) reports these errors:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
version (expected 2, but 101) or the data in not in 'javabin' format

Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171

And shortly after (4 seconds) this happens on a *different* node (Node B):

Stopping recovery for core=suggestion coreNodeName=core_node2

No Stacktrace for this, but happens for all 11 collections.

6 seconds after that Node C reports these errors:

org.apache.solr.common.SolrException:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/customers/params.json

Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8

This also happens for 11 collections.

And then different errors happen:

OverseerAutoReplicaFailoverThread had an error in its thread work
loop.:org.apache.solr.common.SolrException: Error reading cluster
properties

cancelElection did not find election node to remove
/overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_000112

At that point the cluster is broken and stops responding to most
queries. At the same time, ZooKeeper looks okay.

The cluster cannot self-heal from that situation and we are forced to
take manual action and restart node after node and hope that solrcloud
eventually recovers. Which sometimes takes several minutes and several
restarts from various nodes.

We can provide more logdata if needed.

Is there anything where we can start digging to find the underlying
error for that problem?

Thanks in advance
Björn


Re: creating collection with solr5 - missing config data

2015-11-02 Thread tedsolr
Thanks Erick, that did it. I had thought the -z option was only for external
zookeepers. Using port 9983 allowed me to upload a config.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/creating-collection-with-solr5-missing-config-data-tp4237802p4237811.html
Sent from the Solr - User mailing list archive at Nabble.com.


[ANN]: Blog article: every Solr home and example in Solr 5.3

2015-11-02 Thread Alexandre Rafalovitch
If you've recently downloaded Solr 5.x and are trying to figure out which
example creates a home where, and why the example creation command uses a
configset directory but not a configset URL parameter, you may find this
useful:

http://blog.outerthoughts.com/2015/11/oh-solr-home-where-art-thou/

Regards,
   Alex.
P.s. I don't normally mention my blog articles or SolrStart updates on
this list, but I've been having so many issues wrapping my head around
the new directory layouts and consequences of start scripts
auto-magic, that I figured I'll make an exception. Hopefully, that's
not too intrusive.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: creating collection with solr5 - missing config data

2015-11-02 Thread Erick Erickson
The "new way" of doing things is to use the start
scripts, which is outlined at the start of the page I linked below.
You probably want to bite the bullet and get used to that
way of doing things, as it's likely going to be where ongoing
work is done.

If you still want to approach it the way you are, I see two issues:

The first issue is that you need to upload a configset to Zookeeper.
In SolrCloud, all configs live on the Zookeeper node, which will
eventually be a different machine so that's a necessary first step.

There's a section on this using zkcli here:
https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files


The second issue is that the second node you started is _also_
starting its own embedded Zookeeper, and these don't know about
each other. You need the -z localhost:9983 option (I think the 9983 is
the default Zookeeper port when you run embedded ZK in one of
your Solr instances).
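
Concretely, something like this should work (paths and names are placeholders;
9983 assumes the embedded ZK started by the first node on port 8983):

  # 1. upload a configset to the embedded ZooKeeper started by the first node
  server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
      -cmd upconfig -confname myconf -confdir /path/to/configsets/myconf/conf

  # 2. start the second node against the same ZooKeeper instead of its own
  bin/solr start -c -p 8984 -s solr2 -z localhost:9983

  # 3. create the collection, naming the uploaded config explicitly
  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&collection.configName=myconf"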

Best,
Erick

On Mon, Nov 2, 2015 at 1:30 PM, tedsolr  wrote:
> I'm trying to plan a migration from a standalone solr instance to the
> solrcloud. I understand the basic steps but am getting tripped up just
> trying to create a new collection. For simplicity, I'm testing this on a
> single machine, so I was trying to use the embedded zookeeper. I can't
> figure out how to upload a config set to the embedded zookeeper. (I hope to
> use the embedded zookeepers on all dev environments)
>
> 1. start first node: solr start -c
> 2. start second node: solr start -c -p 8984 -s solr2
> 3. create collection using API: curl
> ...collections?action=CREATE=mycollection=1
> ERROR - no config found
> 4. copy standalone core "data" folder to mycollection
>
> So, how do I get my shared config data (I'm using configsets in my
> standalone model) uploaded to zookeeper?
>
> thanks!
> Solr 5.2.1
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/creating-collection-with-solr5-missing-config-data-tp4237802.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
Hi Toke,
Thanks for your response. My comments in-line.

That is 12 machines, running a shard each?
No! This is a single big machine with 12 shards on it.

What is the total amount of physical memory on each machine?
Around 370 gb on the single machine.

Well, se* probably expands to a great deal of documents, but a huge bump
in memory utilization and 3 minutes+ sounds strange.

- What are your normal query times?
A few simple queries return within a couple of seconds. But the more
complex queries with proximity and wildcards have taken more than 3-4
minutes, and sometimes queries have timed out too, where the timeout is
set to 5 minutes.
- How many hits do you get from 'network se*'?
More than a million records.
- How many results do you return (the rows-parameter)?
It is the default one 10. Grouping is enabled on a field.
- If you issue a query without wildcards, but with approximately the
same amount of hits as 'network se*', how long does it take?
A query resulting in around half a million records returns within a couple of
seconds.

That is strange, yes. Have you checked the logs to see if something
unexpected is going on while you test?
Have not seen anything particularly. Will try to check again.

If you are using spinning drives and only have 32GB of RAM in total in
each machine, you are probably struggling just to keep things running.
As mentioned above, this is a big machine with 370+ GB of RAM, and Solr (12
nodes total) is assigned 336 GB. The rest is still good enough for other system
activities.

Thanks,
Modassar

On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen 
wrote:

> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
> > I have a setup of 12 shard cluster started with 28gb memory each on a
> > single server. There are no replica. The size of index is around 90gb on
> > each shard. The Solr version is 5.2.1.
>
> That is 12 machines, running a shard each?
>
> What is the total amount of physical memory on each machine?
>
> > When I query "network se*", the memory utilization goes upto 24-26 gb and
> > the query takes around 3+ minutes to execute. Also the CPU utilization
> goes
> > upto 400% in few of the nodes.
>
> Well, se* probably expands to a great deal of documents, but a huge bump
> in memory utilization and 3 minutes+ sounds strange.
>
> - What are your normal query times?
> - How many hits do you get from 'network se*'?
> - How many results do you return (the rows-parameter)?
> - If you issue a query without wildcards, but with approximately the
> same amount of hits as 'network se*', how long does it take?
>
> > Why the CPU utilization is so high and more than one core is used.
> > As far as I understand querying is single threaded.
>
> That is strange, yes. Have you checked the logs to see if something
> unexpected is going on while you test?
>
> > How can I disable replication(as it is implicitly enabled) permanently as
> > in our case we are not using it but can see warnings related to leader
> > election?
>
> If you are using spinning drives and only have 32GB of RAM in total in
> each machine, you are probably struggling just to keep things running.
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
Just to add one more point: one external ZooKeeper instance is also
running on this particular machine.

Regards,
Modassar

On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather 
wrote:

> Hi Toke,
> Thanks for your response. My comments in-line.
>
> That is 12 machines, running a shard each?
> No! This is a single big machine with 12 shards on it.
>
> What is the total amount of physical memory on each machine?
> Around 370 gb on the single machine.
>
> Well, se* probably expands to a great deal of documents, but a huge bump
> in memory utilization and 3 minutes+ sounds strange.
>
> - What are your normal query times?
> Few simple queries are returned with in a couple of seconds. But the more
> complex queries with proximity and wild cards have taken more than 3-4
> minutes and some times some queries have timed out too where time out is
> set to 5 minutes.
> - How many hits do you get from 'network se*'?
> More than a million records.
> - How many results do you return (the rows-parameter)?
> It is the default one 10. Grouping is enabled on a field.
> - If you issue a query without wildcards, but with approximately the
> same amount of hits as 'network se*', how long does it take?
> A query resulting in around half a million record return within a couple
> of seconds.
>
> That is strange, yes. Have you checked the logs to see if something
> unexpected is going on while you test?
> Have not seen anything particularly. Will try to check again.
>
> If you are using spinning drives and only have 32GB of RAM in total in
> each machine, you are probably struggling just to keep things running.
> As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
> nodes total) is assigned 336 GB. The rest is still a good for other system
> activities.
>
> Thanks,
> Modassar
>
> On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen 
> wrote:
>
>> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
>> > I have a setup of 12 shard cluster started with 28gb memory each on a
>> > single server. There are no replica. The size of index is around 90gb on
>> > each shard. The Solr version is 5.2.1.
>>
>> That is 12 machines, running a shard each?
>>
>> What is the total amount of physical memory on each machine?
>>
>> > When I query "network se*", the memory utilization goes upto 24-26 gb
>> and
>> > the query takes around 3+ minutes to execute. Also the CPU utilization
>> goes
>> > upto 400% in few of the nodes.
>>
>> Well, se* probably expands to a great deal of documents, but a huge bump
>> in memory utilization and 3 minutes+ sounds strange.
>>
>> - What are your normal query times?
>> - How many hits do you get from 'network se*'?
>> - How many results do you return (the rows-parameter)?
>> - If you issue a query without wildcards, but with approximately the
>> same amount of hits as 'network se*', how long does it take?
>>
>> > Why the CPU utilization is so high and more than one core is used.
>> > As far as I understand querying is single threaded.
>>
>> That is strange, yes. Have you checked the logs to see if something
>> unexpected is going on while you test?
>>
>> > How can I disable replication(as it is implicitly enabled) permanently
>> as
>> > in our case we are not using it but can see warnings related to leader
>> > election?
>>
>> If you are using spinning drives and only have 32GB of RAM in total in
>> each machine, you are probably struggling just to keep things running.
>>
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>
>


Re: Kate Winslet vs Winslet Kate

2015-11-02 Thread Alexandre Rafalovitch
I just had a thought that perhaps Complex Phrase parser could be useful here:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

You still need to mark that full name to search against a specific
field, so it may or may not work in a more general stream of
user-provided words.
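
A rough sketch of what such a query could look like (the field name
person_name is made up, and the inOrder local parameter was added to the
complexphrase parser after the parser itself, so check the ref guide for
your release):

  q={!complexphrase inOrder=false}person_name:"kate winslet"

With inOrder=false the terms still have to be adjacent, but their order no
longer matters; depending on how the underlying span query counts reversed
terms you may need a small slop, e.g. "kate winslet"~2.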


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 31 October 2015 at 23:52, Yangrui Guo  wrote:
> Hi today I found an interesting aspect of solr. I imported IMDB data into
> solr. The IMDB puts last name before first name for its person's name field
> eg. "Winslet, Kate". When I search "Winslet Kate" with quotation marks I
> could get the exact result. However if I search "Kate Winslet" or Kate AND
> Winslet solr seem to return me all result containing either Kate or Winslet
> which is similar to "Winslet Kate"~99. From user perspective I
> certainly want solr to treat Kate Winslet the same as Winslet Kate. Is
> there anyway to make solr score higher for terms in the same field?
>
> Yangrui


Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
12 shards with 28GB for the heap and 90GB for each index means that you
need at least 336GB for the heap (assuming you're using all of it, which may
easily be the case considering the way the GC is handling memory) and
~1TB for the index. Let's say that you don't need your entire index in RAM;
the problem as I see it is that you don't have enough RAM for your index +
heap. Assuming your machine has 370GB of RAM, there are only 34GB left for
your index, and 1TB/34GB means that you can only have 1/30 of your entire
index in RAM. I would advise you to check the swap activity on the machine and
see if it correlates with the bad performance you're seeing. One important
thing to notice is that a significant part of your index needs to be in RAM
(especially if you're using SSDs) in order to achieve good performance:



*As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
nodes total) is assigned 336 GB. The rest is still a good for other system
activities.*
The size remaining after you remove the heap usage should be reserved for
the index (not only for other system activities).


*Also the CPU utilization goes upto 400% in few of the nodes:*
You said that only one machine is used, so I assumed that 400% CPU is for a
single process (one Solr node), right?
This seems impossible if you are sure that only one query is run at a
time and no indexing is performed. The best thing to do is to dump a stack trace
of the Solr nodes during the query and check what the threads are doing.
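
For what it's worth, a rough way to do that (the pid lookup is only
illustrative, adjust to however you start your nodes):

  jps -l | grep start.jar        # find the pid of the Solr node
  jstack <solr-pid> > /tmp/solr-threads-1.txt

Taking two or three dumps a few seconds apart while the slow query runs
usually makes it clear where the time goes.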

Jim



2015-11-02 10:38 GMT+01:00 Modassar Ather :

> Just to add one more point that one external Zookeeper instance is also
> running on this particular machine.
>
> Regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather 
> wrote:
>
> > Hi Toke,
> > Thanks for your response. My comments in-line.
> >
> > That is 12 machines, running a shard each?
> > No! This is a single big machine with 12 shards on it.
> >
> > What is the total amount of physical memory on each machine?
> > Around 370 gb on the single machine.
> >
> > Well, se* probably expands to a great deal of documents, but a huge bump
> > in memory utilization and 3 minutes+ sounds strange.
> >
> > - What are your normal query times?
> > Few simple queries are returned with in a couple of seconds. But the more
> > complex queries with proximity and wild cards have taken more than 3-4
> > minutes and some times some queries have timed out too where time out is
> > set to 5 minutes.
> > - How many hits do you get from 'network se*'?
> > More than a million records.
> > - How many results do you return (the rows-parameter)?
> > It is the default one 10. Grouping is enabled on a field.
> > - If you issue a query without wildcards, but with approximately the
> > same amount of hits as 'network se*', how long does it take?
> > A query resulting in around half a million record return within a couple
> > of seconds.
> >
> > That is strange, yes. Have you checked the logs to see if something
> > unexpected is going on while you test?
> > Have not seen anything particularly. Will try to check again.
> >
> > If you are using spinning drives and only have 32GB of RAM in total in
> > each machine, you are probably struggling just to keep things running.
> > As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
> > nodes total) is assigned 336 GB. The rest is still a good for other
> system
> > activities.
> >
> > Thanks,
> > Modassar
> >
> > On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen 
> > wrote:
> >
> >> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
> >> > I have a setup of 12 shard cluster started with 28gb memory each on a
> >> > single server. There are no replica. The size of index is around 90gb
> on
> >> > each shard. The Solr version is 5.2.1.
> >>
> >> That is 12 machines, running a shard each?
> >>
> >> What is the total amount of physical memory on each machine?
> >>
> >> > When I query "network se*", the memory utilization goes upto 24-26 gb
> >> and
> >> > the query takes around 3+ minutes to execute. Also the CPU utilization
> >> goes
> >> > upto 400% in few of the nodes.
> >>
> >> Well, se* probably expands to a great deal of documents, but a huge bump
> >> in memory utilization and 3 minutes+ sounds strange.
> >>
> >> - What are your normal query times?
> >> - How many hits do you get from 'network se*'?
> >> - How many results do you return (the rows-parameter)?
> >> - If you issue a query without wildcards, but with approximately the
> >> same amount of hits as 'network se*', how long does it take?
> >>
> >> > Why the CPU utilization is so high and more than one core is used.
> >> > As far as I understand querying is single threaded.
> >>
> >> That is strange, yes. Have you checked the logs to see if something
> >> unexpected is going on while you test?
> >>
> >> > How can I disable replication(as it is implicitly enabled) permanently
> >> 

Re: Very high memory and CPU utilization.

2015-11-02 Thread Toke Eskildsen
On Mon, 2015-11-02 at 14:34 +0530, Modassar Ather wrote:

> No! This is a single big machine with 12 shards on it.
> Around 370 gb on the single machine.

Okay. I guess your observation of 400% for a single core is with top and
looking at that core's entry? If so, the 400% can be explained by
excessive garbage collection. You could turn GC logging on to check
that. With a bit of luck, GC would be the cause of the slowdown.
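
If it helps, GC logging can be enabled with the usual HotSpot flags, for
instance via GC_LOG_OPTS in solr.in.sh (variable name and log path are from
memory, so double-check against your install; recent bin/solr scripts may
already write a solr_gc.log on their own):

  GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log"

Long 'Total time for which application threads were stopped' entries during
the slow query would point at GC.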

> Few simple queries are returned with in a couple of seconds. But the
> more complex queries with proximity and wild cards have taken more
> than 3-4 minutes and some times some queries have timed out too where
> time out is set to 5 minutes.

The proximity information seems relevant here.

> - How many results do you return (the rows-parameter)?
> It is the default one 10. Grouping is enabled on a field.

If you have group.ngroups=true that would be heavy (and require a lot of
memory), but as your non-wildcard searches with many hits are fast, that
is probably not the problem here.

Toke:
> If you are using spinning drives and only have 32GB of RAM in total in
> each machine, you are probably struggling just to keep things running.
> 
> As mentioned above this is a big machine with 370+ gb of RAM and Solr
> (12 nodes total) is assigned 336 GB. The rest is still a good for
> other system activities.

Assuming the storage is spinning drives, it is quite a small machine,
measured by cache memory vs. index size: You have 30-40GB free for disk
cache and your index is 1TB, so ~3%. Unless you have a great deal of
stored content, 3% for disk caching means that there will be a high
amount of IO during a search. It works for you when the queries are
simple field:term, but I am not surprised that it doesn't work well in
other cases.

By nature, truncated queries touch a lot of terms, which means a lot
of lookups. I have no in-depth knowledge of how these lookups are
performed, but I guesstimate that they are IO-intensive.


Coincidentally we also run a machine with multiple Solrs, terabytes of
index data and not much memory (< 1%) for disk cache. One difference
being that it is backed by SSDs. I tried doing a few ad-hoc searches
with grouping turned on (search terms are Danish words):

q=ostekiks 38,646 hits, 530 ms.
q=ost* 49,713,655 hits, 2,190 ms.
q=køer mælk 1,232,445 hits, 767 ms.
q=kat mad* 10,926,107 hits, 4624 ms.
q="kaniner harer"~50 161,009 hits, 726 ms.
q=kantarel 337,279 hits, 455 ms.
q=deres kan* 245,719,036 hits, 13,565 ms.

This was with Solr 4.10. No special garbage collection activity
occurred. Heap usage stayed well below 8GB per Solr, which is the
standard behaviour of our system.

In short, I could not replicate your observed special activity based on
the queries you have described. I have no reason to believe that Solr
5.3 should perform worse in this aspect.

The SSDs are probably part of the explanation, but I suspect we are
missing something else. It should not make a difference (as your
non-truncated queries are fast), but could you try to reduce the slow
request to the simplest possible? No grouping, faceting or other special
processing, just q=network se*
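
Something along these lines, with host, collection and field names adjusted
to your setup, should be enough to isolate the query itself:

  curl 'http://localhost:8983/solr/collection1/select?q=field:"network+se*"&rows=10&debug=timing'

debug=timing also breaks the response time down per search component, which
should show whether the cost is in the query itself or in the grouping.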


- Toke Eskildsen, State and University Library, Denmark





Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
Okay. I guess your observation of 400% for a single core is with top and
looking at that core's entry? If so, the 400% can be explained by
excessive garbage collection. You could turn GC-logging on to check
that. With a bit of luck GC would be the cause of the slow down.

Yes, it is with the top command. I will check GC activity and try to relate
it to the CPU usage.

The query q=network se* is quick enough in our system too. It takes around
3-4 seconds for around 8 million records.
The problem is with the same query as phrase. q="network se*".
Can you please share your experience with such queries where the wildcard
expansion is huge, like in the query above?

I changed my SolrCloud setup from 12 shards to 8 shards and gave each shard
30 GB of RAM on the same machine with the same index size (re-indexed), but
could not see any significant improvement for the given query.

I will check the swap activity.

Also, can you please share your experience with respect to RAM, GC, Solr
cache setup etc., as it seems from your comment that the SolrCloud environment
you have is quite similar to the one I work on?

Regards,
Modassar

On Mon, Nov 2, 2015 at 5:20 PM, Toke Eskildsen 
wrote:

> On Mon, 2015-11-02 at 16:25 +0530, Modassar Ather wrote:
> > The remaining size after you removed the heap usage should be reserved
> for
> > the index (not only the other system activities).
> > I am not able to get  the above point. So when I start Solr with 28g RAM,
> > for all the activities related to Solr it should not go beyond 28g. And
> the
> > remaining heap will be used for activities other than Solr. Please help
> me
> > understand.
>
> It is described here:
> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>
> I will be quick to add that I do not agree with Shawn (the primary
> author of the page) on the stated limits and find that the page in
> general ignores that performance requirements differ a great deal.
> Nevertheless, it is very true that Solr performance is tied to the
> amount of OS disk cache:
>
> You can have a machine with 10TB of RAM, but Solr performance will still
> be poor if you use it all for JVMs.
>
> Practically all modern operating system uses free memory for disk cache.
> Free memory is the memory not used for JVMs or other programs. It might
> be that you have a lot less than 30-40GB free: If you are on a Linux
> server, try calling 'top' and see what is says under 'cached'.
>
> Related, I support jim's suggestion to inspect the swap activity:
> In the past we had problem with a machine that insisted on swapping
> excessively, although there were high IO and free memory.
>
> > The disks are SSDs.
>
> That makes your observations stranger still.
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
*if it correlates with the bad performance you're seeing. One important
thing to notice is that a significant part of your index needs to be in RAM
(especially if you're using SSDs) in order to achieve good performance.*

Especially if you're not using SSDs, sorry ;)

2015-11-02 11:38 GMT+01:00 jim ferenczi :

> 12 shards with 28GB for the heap and 90GB for each index means that you
> need at least 336GB for the heap (assuming you're using all of it which may
> be easily the case considering the way the GC is handling memory) and ~=
> 1TO for the index. Let's say that you don't need your entire index in RAM,
> the problem as I see it is that you don't have enough RAM for your index +
> heap. Assuming your machine has 370GB of RAM there are only 34GB left for
> your index, 1TO/34GB means that you can only have 1/30 of your entire index
> in RAM. I would advise you to check the swap activity on the machine and
> see if it correlates with the bad performance you're seeing. One important
> thing to notice is that a significant part of your index needs to be in RAM
> (especially if you're using SSDs) in order to achieve good performance:
>
>
>
> *As mentioned above this is a big machine with 370+ gb of RAM and Solr (12
> nodes total) is assigned 336 GB. The rest is still a good for other system
> activities.*
> The remaining size after you removed the heap usage should be reserved for
> the index (not only the other system activities).
>
>
> *Also the CPU utilization goes upto 400% in few of the nodes:*
> You said that only machine is used so I assumed that 400% cpu is for a
> single process (one solr node), right ?
> This seems impossible if you are sure that only one query is played at a
> time and no indexing is performed. Best thing to do is to dump stack trace
> of the solr nodes during the query and to check what the threads are doing.
>
> Jim
>
>
>
> 2015-11-02 10:38 GMT+01:00 Modassar Ather :
>
>> Just to add one more point that one external Zookeeper instance is also
>> running on this particular machine.
>>
>> Regards,
>> Modassar
>>
>> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather 
>> wrote:
>>
>> > Hi Toke,
>> > Thanks for your response. My comments in-line.
>> >
>> > That is 12 machines, running a shard each?
>> > No! This is a single big machine with 12 shards on it.
>> >
>> > What is the total amount of physical memory on each machine?
>> > Around 370 gb on the single machine.
>> >
>> > Well, se* probably expands to a great deal of documents, but a huge bump
>> > in memory utilization and 3 minutes+ sounds strange.
>> >
>> > - What are your normal query times?
>> > Few simple queries are returned with in a couple of seconds. But the
>> more
>> > complex queries with proximity and wild cards have taken more than 3-4
>> > minutes and some times some queries have timed out too where time out is
>> > set to 5 minutes.
>> > - How many hits do you get from 'network se*'?
>> > More than a million records.
>> > - How many results do you return (the rows-parameter)?
>> > It is the default one 10. Grouping is enabled on a field.
>> > - If you issue a query without wildcards, but with approximately the
>> > same amount of hits as 'network se*', how long does it take?
>> > A query resulting in around half a million record return within a couple
>> > of seconds.
>> >
>> > That is strange, yes. Have you checked the logs to see if something
>> > unexpected is going on while you test?
>> > Have not seen anything particularly. Will try to check again.
>> >
>> > If you are using spinning drives and only have 32GB of RAM in total in
>> > each machine, you are probably struggling just to keep things running.
>> > As mentioned above this is a big machine with 370+ gb of RAM and Solr
>> (12
>> > nodes total) is assigned 336 GB. The rest is still a good for other
>> system
>> > activities.
>> >
>> > Thanks,
>> > Modassar
>> >
>> > On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen 
>> > wrote:
>> >
>> >> On Mon, 2015-11-02 at 12:00 +0530, Modassar Ather wrote:
>> >> > I have a setup of 12 shard cluster started with 28gb memory each on a
>> >> > single server. There are no replica. The size of index is around
>> 90gb on
>> >> > each shard. The Solr version is 5.2.1.
>> >>
>> >> That is 12 machines, running a shard each?
>> >>
>> >> What is the total amount of physical memory on each machine?
>> >>
>> >> > When I query "network se*", the memory utilization goes upto 24-26 gb
>> >> and
>> >> > the query takes around 3+ minutes to execute. Also the CPU
>> utilization
>> >> goes
>> >> > upto 400% in few of the nodes.
>> >>
>> >> Well, se* probably expands to a great deal of documents, but a huge
>> bump
>> >> in memory utilization and 3 minutes+ sounds strange.
>> >>
>> >> - What are your normal query times?
>> >> - How many hits do you get from 'network se*'?
>> >> - How many results do you return (the rows-parameter)?
>> >> - 

Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
Thanks Jim for your response.

The remaining size after you removed the heap usage should be reserved for
the index (not only the other system activities).
I am not able to get the above point. So when I start Solr with 28 GB, all
the activities related to Solr should not go beyond 28 GB, and the
remaining memory will be used for activities other than Solr. Please help me
understand.

*Also the CPU utilization goes upto 400% in few of the nodes:*
You said that only machine is used so I assumed that 400% cpu is for a
single process (one solr node), right ?
Yes, you are right that 400% is for a single process.
The disks are SSDs.

Regards,
Modassar

On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi  wrote:

> *if it correlates with the bad performance you're seeing. One important
> thing to notice is that a significant part of your index needs to be in RAM
> (especially if you're using SSDs) in order to achieve good performance.*
>
> Especially if you're not using SSDs, sorry ;)
>
> 2015-11-02 11:38 GMT+01:00 jim ferenczi :
>
> > 12 shards with 28GB for the heap and 90GB for each index means that you
> > need at least 336GB for the heap (assuming you're using all of it which
> may
> > be easily the case considering the way the GC is handling memory) and ~=
> > 1TO for the index. Let's say that you don't need your entire index in
> RAM,
> > the problem as I see it is that you don't have enough RAM for your index
> +
> > heap. Assuming your machine has 370GB of RAM there are only 34GB left for
> > your index, 1TO/34GB means that you can only have 1/30 of your entire
> index
> > in RAM. I would advise you to check the swap activity on the machine and
> > see if it correlates with the bad performance you're seeing. One
> important
> > thing to notice is that a significant part of your index needs to be in
> RAM
> > (especially if you're using SSDs) in order to achieve good performance:
> >
> >
> >
> > *As mentioned above this is a big machine with 370+ gb of RAM and Solr
> (12
> > nodes total) is assigned 336 GB. The rest is still a good for other
> system
> > activities.*
> > The remaining size after you removed the heap usage should be reserved
> for
> > the index (not only the other system activities).
> >
> >
> > *Also the CPU utilization goes upto 400% in few of the nodes:*
> > You said that only machine is used so I assumed that 400% cpu is for a
> > single process (one solr node), right ?
> > This seems impossible if you are sure that only one query is played at a
> > time and no indexing is performed. Best thing to do is to dump stack
> trace
> > of the solr nodes during the query and to check what the threads are
> doing.
> >
> > Jim
> >
> >
> >
> > 2015-11-02 10:38 GMT+01:00 Modassar Ather :
> >
> >> Just to add one more point that one external Zookeeper instance is also
> >> running on this particular machine.
> >>
> >> Regards,
> >> Modassar
> >>
> >> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather 
> >> wrote:
> >>
> >> > Hi Toke,
> >> > Thanks for your response. My comments in-line.
> >> >
> >> > That is 12 machines, running a shard each?
> >> > No! This is a single big machine with 12 shards on it.
> >> >
> >> > What is the total amount of physical memory on each machine?
> >> > Around 370 gb on the single machine.
> >> >
> >> > Well, se* probably expands to a great deal of documents, but a huge
> bump
> >> > in memory utilization and 3 minutes+ sounds strange.
> >> >
> >> > - What are your normal query times?
> >> > Few simple queries are returned with in a couple of seconds. But the
> >> more
> >> > complex queries with proximity and wild cards have taken more than 3-4
> >> > minutes and some times some queries have timed out too where time out
> is
> >> > set to 5 minutes.
> >> > - How many hits do you get from 'network se*'?
> >> > More than a million records.
> >> > - How many results do you return (the rows-parameter)?
> >> > It is the default one 10. Grouping is enabled on a field.
> >> > - If you issue a query without wildcards, but with approximately the
> >> > same amount of hits as 'network se*', how long does it take?
> >> > A query resulting in around half a million record return within a
> couple
> >> > of seconds.
> >> >
> >> > That is strange, yes. Have you checked the logs to see if something
> >> > unexpected is going on while you test?
> >> > Have not seen anything particularly. Will try to check again.
> >> >
> >> > If you are using spinning drives and only have 32GB of RAM in total in
> >> > each machine, you are probably struggling just to keep things running.
> >> > As mentioned above this is a big machine with 370+ gb of RAM and Solr
> >> (12
> >> > nodes total) is assigned 336 GB. The rest is still a good for other
> >> system
> >> > activities.
> >> >
> >> > Thanks,
> >> > Modassar
> >> >
> >> > On Mon, Nov 2, 2015 at 1:30 PM, Toke Eskildsen <
> t...@statsbiblioteket.dk>
> >> > 

Re: Very high memory and CPU utilization.

2015-11-02 Thread Toke Eskildsen
On Mon, 2015-11-02 at 16:25 +0530, Modassar Ather wrote:
> The remaining size after you removed the heap usage should be reserved for
> the index (not only the other system activities).
> I am not able to get  the above point. So when I start Solr with 28g RAM,
> for all the activities related to Solr it should not go beyond 28g. And the
> remaining heap will be used for activities other than Solr. Please help me
> understand.

It is described here:
https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

I will be quick to add that I do not agree with Shawn (the primary
author of the page) on the stated limits and find that the page in
general ignores that performance requirements differ a great deal.
Nevertheless, it is very true that Solr performance is tied to the
amount of OS disk cache:

You can have a machine with 10TB of RAM, but Solr performance will still
be poor if you use it all for JVMs.

Practically all modern operating systems use free memory for disk cache.
Free memory is the memory not used for JVMs or other programs. It might
be that you have a lot less than 30-40GB free: if you are on a Linux
server, try calling 'top' and see what it says under 'cached'.
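
For example (column names vary a bit between distributions):

  free -g     # the 'cached' figure is the page cache
  top         # the same number shows up in the Mem/Swap header lines

If 'cached' is far below the 30-40GB you expect to be free, something else
is eating the memory.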

Related, I support jim's suggestion to inspect the swap activity:
in the past we had a problem with a machine that insisted on swapping
excessively, although there was both high IO and free memory.

> The disks are SSDs.

That makes your observations stranger still.


- Toke Eskildsen, State and University Library, Denmark




SSL on Solr with CA signed certificate

2015-11-02 Thread davidphilip cherian
The doc[1] in the reference guide provides steps for setting up SSL with a
self-signed certificate. My employer wants me to set up and test with a
CA-signed certificate.
When I go to buy[2] an SSL certificate (just for testing), it asks for a
specific web server name, and Jetty is not listed on it.

Is there something else that I need to look for to enable SSL on Solr
with a CA-signed certificate? Has anyone tried doing this instead of using a
self-signed one? Any further inputs? Reference blogs?


[1] https://cwiki.apache.org/confluence/display/solr/Enabling+SSL
[2] https://www.instantssl.com/free-ssl-certificate.html


Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
I monitored swap activity for the query using vmstat. The *so* and *si*
columns showed 0 until the completion of the query. Also, top showed 0 against
swap. This means there was no scarcity of physical memory, and swap activity
does not seem to be a bottleneck.
Kindly note that I ran this on an 8-node cluster with 30 GB of RAM and 140 GB
of index on each node.
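
For reference, what I watched was along the lines of (interval in seconds;
si/so are swap-in/swap-out):

  vmstat 5

and both columns stayed at 0 for the whole run.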

Regards,
Modassar

On Mon, Nov 2, 2015 at 5:27 PM, Modassar Ather 
wrote:

> Okay. I guess your observation of 400% for a single core is with top and
> looking at that core's entry? If so, the 400% can be explained by
> excessive garbage collection. You could turn GC-logging on to check
> that. With a bit of luck GC would be the cause of the slow down.
>
> Yes it is with top command. I will check GC activities and try to relate
> with CPU usage.
>
> The query q=network se* is quick enough in our system too. It takes around
> 3-4 seconds for around 8 million records.
> The problem is with the same query as phrase. q="network se*".
> Can you please share your experience with such query where the wild card
> expansion is huge like in the query above?
>
> I changed my SolrCloud setup from 12 shard to 8 shard and given each shard
> 30 GB of RAM on the same machine with same index size (re-indexed) but
> could not see the significant improvement for the query given.
>
> I will check the swap activity.
>
> Also can you please share your experiences with respect to RAM, GC, solr
> cache setup etc as it seems by your comment that the SolrCloud environment
> you have is kind of similar to the one I work on?
>
> Regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 5:20 PM, Toke Eskildsen 
> wrote:
>
>> On Mon, 2015-11-02 at 16:25 +0530, Modassar Ather wrote:
>> > The remaining size after you removed the heap usage should be reserved
>> for
>> > the index (not only the other system activities).
>> > I am not able to get  the above point. So when I start Solr with 28g
>> RAM,
>> > for all the activities related to Solr it should not go beyond 28g. And
>> the
>> > remaining heap will be used for activities other than Solr. Please help
>> me
>> > understand.
>>
>> It is described here:
>> https://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
>>
>> I will be quick to add that I do not agree with Shawn (the primary
>> author of the page) on the stated limits and find that the page in
>> general ignores that performance requirements differ a great deal.
>> Nevertheless, it is very true that Solr performance is tied to the
>> amount of OS disk cache:
>>
>> You can have a machine with 10TB of RAM, but Solr performance will still
>> be poor if you use it all for JVMs.
>>
>> Practically all modern operating system uses free memory for disk cache.
>> Free memory is the memory not used for JVMs or other programs. It might
>> be that you have a lot less than 30-40GB free: If you are on a Linux
>> server, try calling 'top' and see what is says under 'cached'.
>>
>> Related, I support jim's suggestion to inspect the swap activity:
>> In the past we had problem with a machine that insisted on swapping
>> excessively, although there were high IO and free memory.
>>
>> > The disks are SSDs.
>>
>> That makes your observations stranger still.
>>
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>
>


Re: warning

2015-11-02 Thread Modassar Ather
Normally the tlog is replayed when the Solr server has crashed for some reason;
on restart it tries to recover from the crash gracefully.
You can look at the following documentation, which explains transaction
logs and the related commit behaviour in Solr.

http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
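
In practice the article boils down to keeping a hard autoCommit with
openSearcher=false so that the tlog is rolled over regularly; a minimal
sketch of the relevant solrconfig.xml bits (values are only illustrative):

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

With that in place the replay on restart only has to cover the updates since
the last hard commit, so it should be quick.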

regards,
Modassar

On Mon, Nov 2, 2015 at 12:22 PM, Midas A  wrote:

> Please explain following warning
>
> Starting log replay
> tlog{file=/mnt/vol1/path/data/tlog/tlog.0060544 refcount=2}
> active=false starting pos=0
>
> Is there any harm with this error ?
>


Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
*I am not able to get  the above point. So when I start Solr with 28g RAM,
for all the activities related to Solr it should not go beyond 28g. And the
remaining heap will be used for activities other than Solr. Please help me
understand.*

Well, those 28GB of heap are the memory "reserved" for your Solr
application, though some parts of the index (not to say all) are retrieved
via mmap (if you use the default MMapDirectory), which does not use the heap
at all. This is a very important part of Lucene/Solr: the heap should be
sized in a way that leaves a significant amount of RAM available for the
index. If not, then you rely on the speed of your disk; if you have SSDs
it's better, but reads are still significantly slower from SSDs than from
direct RAM access. Another thing to keep in mind is that mmap will always
try to put things in RAM, which is why I suspect that swap activity is
killing your performance.
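
A quick way to see this on a running node (the pid lookup is illustrative)
is to look at its memory map:

  pmap -x <solr-pid> | grep -E '\.(fdt|tim|doc|pos)' | head

The mapped index files show up in the process' address space but not in the
Java heap, and the RSS column tells you how much of each file currently sits
in RAM.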

2015-11-02 11:55 GMT+01:00 Modassar Ather :

> Thanks Jim for your response.
>
> The remaining size after you removed the heap usage should be reserved for
> the index (not only the other system activities).
> I am not able to get  the above point. So when I start Solr with 28g RAM,
> for all the activities related to Solr it should not go beyond 28g. And the
> remaining heap will be used for activities other than Solr. Please help me
> understand.
>
> *Also the CPU utilization goes upto 400% in few of the nodes:*
> You said that only machine is used so I assumed that 400% cpu is for a
> single process (one solr node), right ?
> Yes you are right that 400% is for single process.
> The disks are SSDs.
>
> Regards,
> Modassar
>
> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi 
> wrote:
>
> > *if it correlates with the bad performance you're seeing. One important
> > thing to notice is that a significant part of your index needs to be in
> RAM
> > (especially if you're using SSDs) in order to achieve good performance.*
> >
> > Especially if you're not using SSDs, sorry ;)
> >
> > 2015-11-02 11:38 GMT+01:00 jim ferenczi :
> >
> > > 12 shards with 28GB for the heap and 90GB for each index means that you
> > > need at least 336GB for the heap (assuming you're using all of it which
> > may
> > > be easily the case considering the way the GC is handling memory) and
> ~=
> > > 1TO for the index. Let's say that you don't need your entire index in
> > RAM,
> > > the problem as I see it is that you don't have enough RAM for your
> index
> > +
> > > heap. Assuming your machine has 370GB of RAM there are only 34GB left
> for
> > > your index, 1TO/34GB means that you can only have 1/30 of your entire
> > index
> > > in RAM. I would advise you to check the swap activity on the machine
> and
> > > see if it correlates with the bad performance you're seeing. One
> > important
> > > thing to notice is that a significant part of your index needs to be in
> > RAM
> > > (especially if you're using SSDs) in order to achieve good performance:
> > >
> > >
> > >
> > > *As mentioned above this is a big machine with 370+ gb of RAM and Solr
> > (12
> > > nodes total) is assigned 336 GB. The rest is still a good for other
> > system
> > > activities.*
> > > The remaining size after you removed the heap usage should be reserved
> > for
> > > the index (not only the other system activities).
> > >
> > >
> > > *Also the CPU utilization goes upto 400% in few of the nodes:*
> > > You said that only machine is used so I assumed that 400% cpu is for a
> > > single process (one solr node), right ?
> > > This seems impossible if you are sure that only one query is played at
> a
> > > time and no indexing is performed. Best thing to do is to dump stack
> > trace
> > > of the solr nodes during the query and to check what the threads are
> > doing.
> > >
> > > Jim
> > >
> > >
> > >
> > > 2015-11-02 10:38 GMT+01:00 Modassar Ather :
> > >
> > >> Just to add one more point that one external Zookeeper instance is
> also
> > >> running on this particular machine.
> > >>
> > >> Regards,
> > >> Modassar
> > >>
> > >> On Mon, Nov 2, 2015 at 2:34 PM, Modassar Ather <
> modather1...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi Toke,
> > >> > Thanks for your response. My comments in-line.
> > >> >
> > >> > That is 12 machines, running a shard each?
> > >> > No! This is a single big machine with 12 shards on it.
> > >> >
> > >> > What is the total amount of physical memory on each machine?
> > >> > Around 370 gb on the single machine.
> > >> >
> > >> > Well, se* probably expands to a great deal of documents, but a huge
> > bump
> > >> > in memory utilization and 3 minutes+ sounds strange.
> > >> >
> > >> > - What are your normal query times?
> > >> > Few simple queries are returned with in a couple of seconds. But the
> > >> more
> > >> > complex queries with proximity and wild cards have taken more than
> 3-4
> > >> > minutes and some times some queries have timed 

Re: Very high memory and CPU utilization.

2015-11-02 Thread jim ferenczi
Oops, I did not read the thread carefully.
*The problem is with the same query as phrase. q="network se*".*
I was not aware that you could do that with Solr ;). I would say this is
expected, because in such a case, if the number of expansions for "se*" is big,
then you have to check positions for a significant number of terms. I don't
know if there is a limit on the number of expansions for a prefix
query contained in a phrase query, but I would look at this parameter
first (limit the number of expansions per prefix search to, say, the N
most significant terms based on term frequency, for instance).
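
One way to get a feel for how big that expansion is (assuming the /terms
handler is enabled in solrconfig.xml; collection and field names are
placeholders):

  curl 'http://localhost:8983/solr/collection1/terms?terms.fl=field&terms.prefix=se&terms.limit=-1&wt=json'

If that returns tens of thousands of terms, a phrase query that has to check
positions for every one of them will be expensive no matter what.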

2015-11-02 13:36 GMT+01:00 jim ferenczi :

>
>
>
> *I am not able to get  the above point. So when I start Solr with 28g RAM,
> for all the activities related to Solr it should not go beyond 28g. And the
> remaining heap will be used for activities other than Solr. Please help me
> understand.*
>
> Well those 28GB of heap are the memory "reserved" for your Solr
> application, though some parts of the index (not to say all) are retrieved
> via MMap (if you use the default MMapDirectory) which do not use the heap
> at all. This is a very important part of Lucene/Solr, the heap should be
> sized in a way that let a significant amount of RAM available for the
> index. If not then you rely on the speed of your disk, if you have SSDs
> it's better but reads are still significantly slower with SSDs than with
> direct RAM access. Another thing to keep in mind is that mmap will always
> tries to put things in RAM, this is why I suspect that you swap activity is
> killing your performance.
>
> 2015-11-02 11:55 GMT+01:00 Modassar Ather :
>
>> Thanks Jim for your response.
>>
>> The remaining size after you removed the heap usage should be reserved for
>> the index (not only the other system activities).
>> I am not able to get  the above point. So when I start Solr with 28g RAM,
>> for all the activities related to Solr it should not go beyond 28g. And
>> the
>> remaining heap will be used for activities other than Solr. Please help me
>> understand.
>>
>> *Also the CPU utilization goes upto 400% in few of the nodes:*
>> You said that only machine is used so I assumed that 400% cpu is for a
>> single process (one solr node), right ?
>> Yes you are right that 400% is for single process.
>> The disks are SSDs.
>>
>> Regards,
>> Modassar
>>
>> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi 
>> wrote:
>>
>> > *if it correlates with the bad performance you're seeing. One important
>> > thing to notice is that a significant part of your index needs to be in
>> RAM
>> > (especially if you're using SSDs) in order to achieve good performance.*
>> >
>> > Especially if you're not using SSDs, sorry ;)
>> >
>> > 2015-11-02 11:38 GMT+01:00 jim ferenczi :
>> >
>> > > 12 shards with 28GB for the heap and 90GB for each index means that
>> you
>> > > need at least 336GB for the heap (assuming you're using all of it
>> which
>> > may
>> > > be easily the case considering the way the GC is handling memory) and
>> ~=
>> > > 1TO for the index. Let's say that you don't need your entire index in
>> > RAM,
>> > > the problem as I see it is that you don't have enough RAM for your
>> index
>> > +
>> > > heap. Assuming your machine has 370GB of RAM there are only 34GB left
>> for
>> > > your index, 1TO/34GB means that you can only have 1/30 of your entire
>> > index
>> > > in RAM. I would advise you to check the swap activity on the machine
>> and
>> > > see if it correlates with the bad performance you're seeing. One
>> > important
>> > > thing to notice is that a significant part of your index needs to be
>> in
>> > RAM
>> > > (especially if you're using SSDs) in order to achieve good
>> performance:
>> > >
>> > >
>> > >
>> > > *As mentioned above this is a big machine with 370+ gb of RAM and Solr
>> > (12
>> > > nodes total) is assigned 336 GB. The rest is still a good for other
>> > system
>> > > activities.*
>> > > The remaining size after you removed the heap usage should be reserved
>> > for
>> > > the index (not only the other system activities).
>> > >
>> > >
>> > > *Also the CPU utilization goes upto 400% in few of the nodes:*
>> > > You said that only machine is used so I assumed that 400% cpu is for a
>> > > single process (one solr node), right ?
>> > > This seems impossible if you are sure that only one query is played
>> at a
>> > > time and no indexing is performed. Best thing to do is to dump stack
>> > trace
>> > > of the solr nodes during the query and to check what the threads are
>> > doing.
>> > >
>> > > Jim
>> > >
>> > >
>> > >
>> > > 2015-11-02 10:38 GMT+01:00 Modassar Ather :
>> > >
>> > >> Just to add one more point that one external Zookeeper instance is
>> also
>> > >> running on this particular machine.
>> > >>
>> > >> Regards,
>> > >> Modassar
>> > >>
>> > >> On Mon, Nov 2, 2015 at 2:34 

Re: Very high memory and CPU utilization.

2015-11-02 Thread Modassar Ather
The problem is with the same query as phrase. q="network se*".

The last '.' was just the full stop of the sentence; the actual query is
q=field:"network se*"

Best,
Modassar

On Mon, Nov 2, 2015 at 6:10 PM, jim ferenczi  wrote:

> Oups I did not read the thread carrefully.
> *The problem is with the same query as phrase. q="network se*".*
> I was not aware that you could do that with Solr ;). I would say this is
> expected because in such case if the number of expansions for "se*" is big
> then you would have to check the positions for a significant words. I don't
> know if there is a limitation in the number of expansions for a prefix
> query contained into a phrase query but I would look at this parameter
> first (limit the number of expansion per prefix search, let's say the N
> most significant words based on the frequency of the words for instance).
>
> 2015-11-02 13:36 GMT+01:00 jim ferenczi :
>
> >
> >
> >
> > *I am not able to get  the above point. So when I start Solr with 28g
> RAM,
> > for all the activities related to Solr it should not go beyond 28g. And
> the
> > remaining heap will be used for activities other than Solr. Please help
> me
> > understand.*
> >
> > Well those 28GB of heap are the memory "reserved" for your Solr
> > application, though some parts of the index (not to say all) are
> retrieved
> > via MMap (if you use the default MMapDirectory) which do not use the heap
> > at all. This is a very important part of Lucene/Solr, the heap should be
> > sized in a way that let a significant amount of RAM available for the
> > index. If not then you rely on the speed of your disk, if you have SSDs
> > it's better but reads are still significantly slower with SSDs than with
> > direct RAM access. Another thing to keep in mind is that mmap will always
> > tries to put things in RAM, this is why I suspect that you swap activity
> is
> > killing your performance.
> >
> > 2015-11-02 11:55 GMT+01:00 Modassar Ather :
> >
> >> Thanks Jim for your response.
> >>
> >> The remaining size after you removed the heap usage should be reserved
> for
> >> the index (not only the other system activities).
> >> I am not able to get  the above point. So when I start Solr with 28g
> RAM,
> >> for all the activities related to Solr it should not go beyond 28g. And
> >> the
> >> remaining heap will be used for activities other than Solr. Please help
> me
> >> understand.
> >>
> >> *Also the CPU utilization goes upto 400% in few of the nodes:*
> >> You said that only machine is used so I assumed that 400% cpu is for a
> >> single process (one solr node), right ?
> >> Yes you are right that 400% is for single process.
> >> The disks are SSDs.
> >>
> >> Regards,
> >> Modassar
> >>
> >> On Mon, Nov 2, 2015 at 4:09 PM, jim ferenczi 
> >> wrote:
> >>
> >> > *if it correlates with the bad performance you're seeing. One
> important
> >> > thing to notice is that a significant part of your index needs to be
> in
> >> RAM
> >> > (especially if you're using SSDs) in order to achieve good
> performance.*
> >> >
> >> > Especially if you're not using SSDs, sorry ;)
> >> >
> >> > 2015-11-02 11:38 GMT+01:00 jim ferenczi :
> >> >
> >> > > 12 shards with 28GB for the heap and 90GB for each index means that
> >> you
> >> > > need at least 336GB for the heap (assuming you're using all of it
> >> which
> >> > may
> >> > > be easily the case considering the way the GC is handling memory)
> and
> >> ~=
> >> > > 1TO for the index. Let's say that you don't need your entire index
> in
> >> > RAM,
> >> > > the problem as I see it is that you don't have enough RAM for your
> >> index
> >> > +
> >> > > heap. Assuming your machine has 370GB of RAM there are only 34GB
> left
> >> for
> >> > > your index, 1TO/34GB means that you can only have 1/30 of your
> entire
> >> > index
> >> > > in RAM. I would advise you to check the swap activity on the machine
> >> and
> >> > > see if it correlates with the bad performance you're seeing. One
> >> > important
> >> > > thing to notice is that a significant part of your index needs to be
> >> in
> >> > RAM
> >> > > (especially if you're using SSDs) in order to achieve good
> >> performance:
> >> > >
> >> > >
> >> > >
> >> > > *As mentioned above this is a big machine with 370+ gb of RAM and
> Solr
> >> > (12
> >> > > nodes total) is assigned 336 GB. The rest is still a good for other
> >> > system
> >> > > activities.*
> >> > > The remaining size after you removed the heap usage should be
> reserved
> >> > for
> >> > > the index (not only the other system activities).
> >> > >
> >> > >
> >> > > *Also the CPU utilization goes upto 400% in few of the nodes:*
> >> > > You said that only machine is used so I assumed that 400% cpu is
> for a
> >> > > single process (one solr node), right ?
> >> > > This seems impossible if you are sure that only one query is played
> >> at a
> >> > > time and no