Re: Index optimization takes too long

2018-11-04 Thread Toke Eskildsen
On Sat, 2018-11-03 at 21:41 -0700, Wei wrote:
> Thanks everyone! I checked the system metrics during the optimization
> process. CPU usage is quite low, there is no I/O wait, and memory
> usage is not much different from before the docValues change.  So I
> wonder what could be the bottleneck.

Are you looking at overall CPU usage or single-core? When we run force
merge, we have a single core at 100% while the rest are idle.


NB: There is currently a thread, "Static index, fastest way to do
forceMerge", on the Lucene users mailing list which seems to be quite
parallel to this thread.
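
For reference, a force merge issued from SolrJ looks roughly like this (a
sketch; the core URL is an assumption, not from the thread). The final merge
down to one segment runs as a single merge thread, which matches the
one-core-at-100% pattern:

   import org.apache.solr.client.solrj.impl.HttpSolrClient;

   public class ForceMergeExample {
     public static void main(String[] args) throws Exception {
       try (HttpSolrClient client = new HttpSolrClient.Builder(
           "http://localhost:8983/solr/mycollection").build()) {
         // optimize() triggers a force merge; maxSegments=1 rewrites
         // the entire index into a single segment.
         client.optimize(true, true, 1);
       }
     }
   }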

- Toke Eskildsen, Royal Danish Library




Re: Questions about stored fields and updates.

2018-11-04 Thread Erick Erickson
Ash:

Atomic updates are really a reindex of all the original fields. What happens is:
1> Solr gets all the stored fields from the disk
2> Solr overlays the new data
3> Solr re-indexes the entire document just as though it came from outside.

For step <3>, there's no difference at all between an atomic update
and the client having resent the entire document. There's still a doc
marked as deleted in the old segment and an entirely new document
being indexed into the current segment.

As for efficiency, in the atomic update case you have to
1> seek/read the stored data off disk
2> decompress a 16K block (minimum)

vs. in the re-index-the-whole-doc-from-outside case, where you

1> read the entire document off the wire and deserialize it

From there, everything's the same.

I haven't actually measured, but I'd guess that atomic updates are
actually more work than simply re-sending the doc from the client.
Now, all that said, and even assuming I'm right, unless you have a
pretty high indexing rate I doubt you'd notice.
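
A minimal SolrJ sketch of the two paths (collection name, field names, and
values here are illustrative assumptions, not from the thread):

   import java.util.Collections;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.common.SolrInputDocument;

   public class UpdateComparison {
     public static void main(String[] args) throws Exception {
       try (HttpSolrClient client = new HttpSolrClient.Builder(
           "http://localhost:8983/solr/mycollection").build()) {
         // Atomic update: send the id plus a modifier map. Solr reads the
         // stored fields, overlays the change, and re-indexes internally.
         SolrInputDocument atomic = new SolrInputDocument();
         atomic.addField("id", "doc1");
         atomic.addField("price", Collections.singletonMap("set", 99));
         client.add(atomic);

         // Full re-send: the client supplies every field itself, so Solr
         // skips the seek/read/decompress of the old stored data.
         SolrInputDocument full = new SolrInputDocument();
         full.addField("id", "doc1");
         full.addField("name", "Widget");
         full.addField("price", 99);
         client.add(full);

         client.commit();
       }
     }
   }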

But in general I strongly prefer re-indexing from my system of record
if at all possible, if for no other reason than that you'll have to do it
sometime anyway, when you need to make changes to your schema to
support different use-cases.

Best,
Erick

On Sun, Nov 4, 2018 at 5:10 PM Ash Ramesh  wrote:
>
> Also thanks for the information Shawn! :)
>
> On Mon, Nov 5, 2018 at 12:09 PM Ash Ramesh  wrote:
>
> > Sorry Shawn,
> >
> > I seem to have gotten my wording wrong. I meant that we wanted to move
> > away from atomic-updates to replacing/reindexing the document entirely
> > again when changes are made.
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-index-handlers.html#adding-documents
> >
> > Regards,
> >
> > Ash
> >
> > On Mon, Nov 5, 2018 at 11:29 AM Shawn Heisey  wrote:
> >
> >> On 11/3/2018 9:45 PM, Ash Ramesh wrote:
> >> > My company currently uses SOLR to completely hydrate client objects by
> >> > storing all fields (stored=true). Therefore we have 2 types of fields:
> >> >
> >> > 1. indexed=true | stored=true : For fields that will be used for
> >> > searching, sorting, etc.
> >> > 2. indexed=false | stored=true: For fields that only need hydrating for
> >> > clients
> >> >
> >> > We are re-architecting this so that we will eventually only get the id from
> >> > SOLR (fl=id) and hydrate from another data source. This means we can
> >> > obviously delete all the indexed=false | stored=true fields to reduce our
> >> > index size.
> >> >
> >> > However, when it comes to the indexed=true | stored=true fields, we are not
> >> > sure whether to also set them to be stored=false and perform in-place
> >> > updates or leave it as is and perform atomic updates. We've done a fair bit
> >> > of research on the archives of this mailing list, but are still a bit
> >> > confused:
> >> >
> >> > 1. Will having the fields be converted from indexed=true | stored=true ->
> >> > indexed=true | stored=false cause our index size to reduce? Will it also
> >> > mean that indexing will be less compute expensive due to the compression of
> >> > stored field logic?
> >>
> >> Pretty much anything you change from true to false in the schema will
> >> reduce index size.
> >>
> >> Removal of stored data will not *directly* improve query speed -- stored
> >> data is not used during the query phase.  It might *indirectly* increase
> >> query speed by removing data from the OS disk cache, leaving more room
> >> for inverted index data.
> >>
> >> The direct improvement from removing stored data will be during data
> >> retrieval (after the query itself).  It will also mean there is less
> >> data to compress, which means that indexing speed might increase.
> >>
> >> > 2. Are atomic updates preferred to in-place updates? Obviously if we move
> >> > to index only fields, then we have to do in-place updates all the time.
> >> > This isn't an issue for us, but we are a bit concerned about how SOLR's
> >> > indexing speed will suffer & deleted docs increase. Currently we perform
> >> > both.
> >>
> >> If you change stored to false, you will most likely not be able to do
> >> atomic updates.  Atomic update functionality has very specific
> >> requirements:
> >>
> >> https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#field-storage
> >>
> >> In-place updates have requirements that are even more strict than atomic
> >> updates -- the field cannot be indexed:
> >>
> >> https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#in-place-updates
> >>
> >> Thanks,
> >> Shawn
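
For reference, a sketch of a schema field that satisfies the in-place update
requirements Shawn links above -- single-valued, non-indexed, non-stored, with
docValues (the field name and type here are illustrative assumptions, not from
the thread):

   <field name="popularity" type="pint" indexed="false" stored="false"
          docValues="true" multiValued="false"/>

An in-place update then sends only the document id plus a {"set": ...} (or
{"inc": ...}) value for that field.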

Re: SolrCloud performance

2018-11-04 Thread Chuming Chen
Hi Shawn,

Thank you very much for your analysis. I currently don’t have multiple machines 
to play with. I will try the "one Solr instance and one ZK instance would be more 
efficient on a single server" setup you suggested.

Thanks again,

Chuming



On Nov 4, 2018, at 7:56 PM, Shawn Heisey  wrote:

> On 11/4/2018 8:38 AM, Chuming Chen wrote:
>> I have shared a tar ball with you (apa...@elyograg.org) from Google Drive. 
>> The tar ball includes the logs directories of 4 nodes, solrconfig.xml, 
>> solr.in.sh, and a screenshot of the TOP command. The log files are about 1 
>> day’s logs. However, I restarted the Solr cloud several times during that period.
> 
> Runtime represented in the GC log for node1 is 23 minutes. Not anywhere near 
> a full day.
> 
> Runtime represented in the GC log for node2 is just under 16 minutes.
> 
> Runtime represented in the GC log for node3 is 434 milliseconds.
> 
> Runtime represented in the GC log for node4 is 501 milliseconds.
> 
> This is not enough to even make a guess, much less a reasoned recommendation 
> about the heap size you will actually need.  There must be enough runtime 
> that there have been significant garbage collections so we can get a sense 
> about how much memory the application actually needs.
> 
>> I want to make it clear. I don’t have 4 physical machines. I have a 48-core 
>> server. All 4 Solr nodes are running on the same physical machine. Each node 
>> has 1 shard and 1 replica. I also have a ZooKeeper ensemble running on the 
>> same machine with 3 different ports.
> 
> Why?  You get absolutely no redundancy that way.  One Solr instance and one 
> ZK instance would be more efficient on a single server.  The increase in 
> efficiency probably wouldn't be significant, but it WOULD be more efficient.  
> You really can't get a sense about how separate servers will behave if all 
> the software is running on a single server.
> 
>> I am curious to know what Solr is doing when the CPU usage is 100% or more 
>> than 100%. Because for some queries, I think even just looping through all 
>> the documents without using any index might be faster.
> 
> I have no way to answer this question.  Solr will be doing whatever you asked 
> it to do.
> 
> The screenshot of the top output shows that all four of the nodes there are 
> using about 3GB of memory each (RES minus SHR).  Which would be consistent 
> with the very short runtimes noted by the GC logs.  The VIRT column reveals 
> that each node has about 100GB of index data.  So about 400GB total index 
> data.  Not much can be determined when the runtime is so small.
> 
> Thanks,
> Shawn
> 




RE: Solr OCR Support

2018-11-04 Thread Terry Steichen
+1
My experience is that you can't easily tell ahead of time whether your PDF is 
searchable or not. If it isn't, you may not even retrieve it, because there's no 
text to index.  Also, if you blindly OCR a file that has already been OCR'd, it 
can create a mess.  Most higher-end PDF editors have a batch mode to do OCR 
processing, if that works better for you.

On November 4, 2018 5:20:41 PM EST, Phil Scadden  wrote:
>I would strongly consider doing OCR offline, BEFORE loading the documents
>into Solr. The advantage of this is that you convert your OCRed PDF
>into a searchable PDF. Consider someone using Solr who has found a
>document that matches their search criteria. Once they retrieve the
>document, they will discover it has not been OCRed and they cannot
>use a text search within the document. If the document that you are
>feeding Solr is large, then this is a major pain. Setting up Tesseract
>(or whatever engine - Tesseract involves a bit of a tool chain) to OCR
>and save as searchable PDF means you can provide a much more useful
>document as the result of a Solr search. Feed that searchable PDF to
>SolrJ with OCR turned off.
>
>   PDFParserConfig pdfConfig = new PDFParserConfig();
>   pdfConfig.setExtractInlineImages(false);
>   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
>   context.set(PDFParserConfig.class, pdfConfig);
>   context.set(Parser.class, parser);
>
>-Original Message-
>From: Furkan KAMACI 
>Sent: Saturday, 3 November 2018 03:30
>To: solr-user@lucene.apache.org
>Subject: Solr OCR Support
>
>Hi All,
>
>I want to index images and PDF documents which have images into Solr. I
>tested it with my Solr 6.3.0.
>
>I've installed Tesseract on my computer (Mac). I verified that Tesseract
>works fine to extract text from an image.
>
>I indexed an image into Solr, but it has no content. However, as far as I
>know, I don't need to do anything else to integrate Tesseract with
>Solr.
>
>I've checked these but they were not useful for me:
>
>http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
>http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
>
>My question is, how can I support OCR with Solr?
>
>Kind Regards,
>Furkan KAMACI
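
For context, here is a self-contained sketch of where Phil's PDFParserConfig
snippet above fits in a Tika extraction pipeline (an illustration, assuming
Tika 1.x on the classpath and a hypothetical local file path; the extracted
text would then be sent on to Solr):

   import java.io.InputStream;
   import java.nio.file.Files;
   import java.nio.file.Paths;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.parser.Parser;
   import org.apache.tika.parser.pdf.PDFParserConfig;
   import org.apache.tika.sax.BodyContentHandler;

   public class ExtractNoOcr {
     public static void main(String[] args) throws Exception {
       AutoDetectParser parser = new AutoDetectParser();
       ParseContext context = new ParseContext();

       // Phil's configuration: skip inline images and disable OCR,
       // because the PDF is assumed to be searchable already.
       PDFParserConfig pdfConfig = new PDFParserConfig();
       pdfConfig.setExtractInlineImages(false);
       pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
       context.set(PDFParserConfig.class, pdfConfig);
       context.set(Parser.class, parser);

       BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
       Metadata metadata = new Metadata();
       try (InputStream in = Files.newInputStream(Paths.get("doc.pdf"))) {
         parser.parse(in, handler, metadata, context);
       }
       System.out.println(handler.toString());
     }
   }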



Re: migrating cores with Solr upgrade

2018-11-04 Thread Erick Erickson
Oops, fumble fingers. Anyway, I'd recommend completely reindexing into a new
collection.

On Sun, Nov 4, 2018, 12:53 Erick Erickson wrote:
> Lucene does not guarantee back compatibility over two major versions, so
> I'd recommend completely reinde
>
> On Sun, Nov 4, 2018, 02:02 Piyush Kumar Nayak  wrote:
>
>> Hi,
>>
>> What is the best way to migrate cores from an old version of Solr (say
>> 5.x) to a newer version (say 7.x)? I did not find anything pertinent to the
>> matter in the Solr reference guide.
>> Is there a tool that can do that seamlessly?
>>
>> Regards,
>> Piyush.
>>
>>
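
In practice, "reindexing into a new collection" usually looks like the
following Collections API calls (a sketch; collection and alias names are
assumptions), with all documents re-fed from the system of record in between:

   # Create the new collection on the upgraded cluster
   http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll_v2&numShards=1&replicationFactor=1

   # ... reindex every document from the system of record into mycoll_v2 ...

   # Point clients at the new collection via an alias
   http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycoll&collections=mycoll_v2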




Re: SolrCloud performance

2018-11-04 Thread Chuming Chen
Hi Shawn,

I have shared a tar ball with you (apa...@elyograg.org) from Google Drive. The 
tar ball includes the logs directories of 4 nodes, solrconfig.xml, solr.in.sh, and 
a screenshot of the TOP command. The log files are about 1 day’s logs. However, I 
restarted the Solr cloud several times during that period.

I want to make it clear. I don’t have 4 physical machines. I have a 48-core 
server. All 4 Solr nodes are running on the same physical machine. Each node 
has 1 shard and 1 replica. I also have a ZooKeeper ensemble running on the 
same machine with 3 different ports.

I am curious to know what Solr is doing when the CPU usage is 100% or more than 
100%. Because for some queries, I think even just looping through all the 
documents without using any index might be faster.

If you have problem accessing the tar ball, please let me know.

Thanks a lot!

Chuming


On Nov 2, 2018, at 6:56 PM, Shawn Heisey  wrote:

> On 11/2/2018 1:38 PM, Chuming Chen wrote:
>> I am running a Solr cloud 7.4 with 4 shards and 4 nodes (JVM "-Xms20g 
>> -Xmx40g"), and each shard has 32 million documents and is 32 GB in size.
> 
> A 40GB heap is probably completely unnecessary for an index of that size.  
> Does each machine have one replica on it or two? If you are trying for high 
> availability, then it will be at least two shard replicas per machine.
> 
> The values on -Xms and -Xmx should normally be set the same.  Java will 
> always tend to allocate the entire max heap it has been allowed, so it's 
> usually better to just let it have the whole amount right up front.
> 
>> For a given query (I use complexphrase query), typically, the first time it 
>> took a couple of seconds to return the first 20 docs. However, for the 
>> following page, or sorting by a field, even run the same query again took a 
>> lot longer to return results. I can see my 4 solr nodes running crazy with 
>> more than 100%CPU.
> 
> Can you obtain a screenshot of a process listing as described at the 
> following URL, and provide the image using a file sharing site?
> 
> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
> 
> There are separate instructions there for Windows and for Linux/UNIX 
> operating systems.
> 
> Also useful are the GC logs that are written by Java when Solr is started 
> using the included scripts.  I'm looking for logfiles that cover several days 
> of runtime.  You'll need to share them with a file sharing website -- files 
> will not normally make it to the mailing list if attached to a message.
> 
> Getting a copy of the solrconfig.xml in use on your collection can also be 
> helpful.
> 
>> My understanding is that Solr has a query cache, so running the same query 
>> should be faster.
> 
> If the query is absolutely identical in *every* way, then yes, it can be 
> satisfied from Solr caches, if their size is sufficient.  If you change 
> ANYTHING, including things like rows or start, filters, sorting, facets, and 
> other parameters, then the query probably cannot be satisfied completely from 
> cache.  At that point, Solr is very reliant on how much memory has NOT been 
> allocated to programs -- it must be a sufficient quantity of memory that the 
> Solr index data can be effectively cached.
> 
>> What could be wrong here? How do I debug? I checked solr.log in all nodes 
>> and didn’t see anything unusual. Most frequent log entry looks like this.
>> 
>> INFO  - 2018-11-02 19:32:55.189; [   ] org.apache.solr.servlet.HttpSolrCall; 
>> [admin] webapp=null path=/admin/metrics 
>> params={wt=javabin&version=2&key=solr.core.patternmatch.shard3.replica_n8:UPDATE./update.requests&key=solr.core.patternmatch.shard3.replica_n8:INDEX.sizeInBytes&key=solr.core.patternmatch.shard1.replica_n1:QUERY./select.requests&key=solr.core.patternmatch.shard1.replica_n1:INDEX.sizeInBytes&key=solr.core.patternmatch.shard1.replica_n1:UPDATE./update.requests&key=solr.core.patternmatch.shard3.replica_n8:QUERY./select.requests}
>>  status=0 QTime=7
>> INFO  - 2018-11-02 19:32:55.192; [   ] org.apache.solr.servlet.HttpSolrCall; 
>> [admin] webapp=null path=/admin/metrics 
>> params={wt=javabin&version=2&key=solr.jvm:os.processCpuLoad&key=solr.node:CONTAINER.fs.coreRoot.usableSpace&key=solr.jvm:os.systemLoadAverage&key=solr.jvm:memory.heap.used}
>>  status=0 QTime=1
> 
> That is not a query.  It is a call to the Metrics API. When I've made this 
> call on a production Solr machine, it seems to be very resource-intensive, 
> taking a long time.  I don't think it should be made frequently.  Probably no 
> more than once a minute. If you are seeing that kind of entry in your logs a 
> lot, then that might be contributing to your performance issues.
> 
> Thanks,
> Shawn
> 
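
Shawn's point above about matching -Xms and -Xmx maps to a single line in
solr.in.sh (a sketch; the 20g figure just mirrors this thread, and the right
value should come from GC-log analysis rather than guesswork):

   SOLR_JAVA_MEM="-Xms20g -Xmx20g"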



Phrase query as feature in LTR not working

2018-11-04 Thread AshB
Phrase query is not working when applied in LTR.

Feature supplied is:

  {
    "name" : "isPook",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : {
      "fq": ["{!type=edismax qf=text v=$qq}=\"${query}\""]
    }
  }

I tested this feature outside of LTR and it returns only one result (i.e., the 
phrase match), but with LTR it matches on individual terms:

http://localhost:8983/solr/techproducts/query?q=game%20of%20thrones&fl=id,name,[features%20efi.query=thrones%20of%20game],name,cat=true

"response":{"numFound":6,"start":0,"docs":[
  {
"id":"05535734023",
"cat":["book"],
"name":"A Thrones of Game",
   
"[features]":"documentRecency=0.02011838,isBook=1.0,*isPook=1.0*,originalScore=8.337603"},
  {
"id":"05535734021",
"cat":["book"],
"name":"A Game of meeting Thrones",
   
"[features]":"documentRecency=0.02011838,isBook=1.0,*isPook=1.0*,originalScore=8.179235"},

How do I set up the feature so that it scores only the first document, the one
containing the exact phrase?
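
One approach worth trying (a sketch, not from the thread): the field query
parser analyzes the whole efi value as a single phrase, so a feature like the
following should fire only on true phrase matches, assuming the same "name"
field and "query" efi key as above:

   {
     "name" : "isPhrase",
     "class" : "org.apache.solr.ltr.feature.SolrFeature",
     "params" : {
       "fq" : ["{!field f=name}${query}"]
     }
   }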






migrating cores with Solr upgrade

2018-11-04 Thread Piyush Kumar Nayak
Hi,

What is the best way to migrate cores from an old version of Solr (say 5.x) to 
a newer version (say 7.x)? I did not find anything pertinent to the matter in 
the Solr reference guide.
Is there a tool that can do that seamlessly?

Regards,
Piyush.