Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-11 Thread Rahul Goswami
I haven’t delved into the exact reason for this, but what generally helps
to avoid this situation in a cluster is
i) During shutdown (in case you need to restart the cluster), let the
overseer node be the last one to shut down.
ii) While restarting, let the Overseer node be the first one to start.
iii) Wait for 5-10 seconds between each subsequent node start.
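A quick way to confirm which node currently holds the Overseer role before orchestrating the shutdown/restart order (a hedged example; host and port are placeholders, and any Solr node can serve the request):

curl "http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json"
# The "leader" entry in the response names the node that is currently the Overseer.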

Hope this helps.

Best,
Rahul


On Thu, Feb 11, 2021 at 12:03 PM mmb1234  wrote:

> Hello,
>
> On reboot of one of the solr nodes in the cluster, we often see a
> collection's shards with
> 1. LEADER replica in DOWN state, and/or
> 2. shard with no LEADER
>
> Output from /solr/admin/collections?action=CLUSTERSTATUS is below.
>
> Even after 5 to 10 minutes, the collection often does not recover. Unclear
> why this is happening and what we can try to prevent or remedy it.
>
> ps: perReplicaState= true in solr v8.8.0 didn't work well because after a
> rebalance all replicas somehow get a "leader:true" status even though
> states.json looked ok.
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 2
>   },
>   "cluster": {
> "collections": {
>   "datacore": {
> "pullReplicas": "0",
> "replicationFactor": "0",
> "shards": {
>   "__": {
> "range": null,
> "state": "active",
> "replicas": {
>   "core_node1": {
> "core": "datacore____replica_t187",
> "base_url": "http://solr-0.solr-headless:8983/solr;,
> "node_name": "solr-0.solr-headless:8983_solr",
> "state": "down",
> "type": "TLOG",
> "force_set_state": "false",
> "property.preferredleader": "true",
> "leader": "true"
>   },
>   "core_node2": {
> "core": "datacore____replica_t188",
> "base_url": "http://solr-1.solr-headless:8983/solr;,
> "node_name": "solr-1.solr-headless:8983_solr",
> "state": "active",
> "type": "TLOG",
> "force_set_state": "false"
>   },
>   "core_node3": {
> "core": "datacore____replica_t189",
> "base_url": "http://solr-2.solr-headless:8983/solr;,
> "node_name": "solr-2.solr-headless:8983_solr",
> "state": "active",
> "type": "TLOG",
> "force_set_state": "false"
>   }
> }
>   },
>   "__j": {
> "range": null,
> "state": "active",
> "replicas": {
>   "core_node19": {
> "core": "datacore___j_replica_t187",
> "base_url": "http://solr-0.solr-headless:8983/solr;,
> "node_name": "solr-0.solr-headless:8983_solr",
> "state": "down",
> "type": "TLOG",
> "force_set_state": "false",
> "property.preferredleader": "true"
>   },
>   "core_node20": {
> "core": "datacore___j_replica_t188",
> "base_url": "http://solr-1.solr-headless:8983/solr;,
> "node_name": "solr-1.solr-headless:8983_solr",
> "state": "active",
> "type": "TLOG",
> "force_set_state": "false"
>   },
>   "core_node21": {
> "core": "datacore___j_replica_t189",
> "base_url": "http://solr-2.solr-headless:8983/solr;,
> "node_name": "solr-2.solr-headless:8983_solr",
> "state": "active",
> "type": "TLOG",
> "force_set_state": "false"
>   }
> }
>   },
>   "__": {
> "range": null,
> "state": "active",
> "replicas": {
>   "core_node4": {
> "core": "datacore____replica_t91",
> "base_url": "http://solr-0...
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: StandardTokenizerFactory doesn't split on underscore

2021-01-09 Thread Rahul Goswami
Ah ok! Thanks Adam and Xiefeng

On Sat, Jan 9, 2021 at 6:02 PM Adam Walz  wrote:

> It is expected that the StandardTokenizer will not break on underscores.
> The StandardTokenizer follows the Unicode UAX 29
> <https://unicode.org/reports/tr29/#Word_Boundaries> standard which
> specifies an underscore as an "extender" and this rule
> <https://unicode.org/reports/tr29/#WB13a> says to not break from
> extenders.
> This is why xiefengchang was suggesting to use a
> PatternReplaceFilterFactory after the StandardTokenizer in order to further
> split on underscores if that is your use case.
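For reference, a minimal schema sketch of the kind of analyzer chain being discussed, assuming you do want underscores to act as word breaks. The field type name is illustrative; it uses a PatternReplaceCharFilterFactory (the char-filter variant of the suggestion above, since a plain token filter rewrites a token but does not split it into two) to map underscores to spaces before StandardTokenizer runs:

<fieldType name="text_split_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- map "_" to a space so the tokenizer sees separate words -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>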
>
> On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami 
> wrote:
>
> > Nope. The underscore is preserved right after tokenization even before it
> > reaches any filters. You can choose the type "text_general" and try an
> > index time analysis through the "Analysis" page on Solr Admin UI.
> >
> > Thanks,
> > Rahul
> >
> > On Sat, Jan 9, 2021 at 8:22 AM xiefengchang 
> > wrote:
> >
> > > did you configure PatternReplaceFilterFactory?
> > >
> > > At 2021-01-08 12:16:06, "Rahul Goswami"  wrote:
> > > >Hello,
> > > >So recently I was debugging a problem on Solr 7.7.2 where the query
> > wasn't
> > > >returning the desired results. Turned out that the indexed terms had
> > > >underscore separated terms, but the query didn't. I was under the
> > > >impression that terms separated by underscore are also tokenized by
> > > >StandardTokenizerFactory, but turns out that's not the case. Eg:
> > > >'hello-world' would be tokenized into 'hello' and 'world', but
> > > >'hello_world' is treated as a single token.
> > > >Is this a bug or a designed behavior?
> > > >
> > > >If this is by design, it would be helpful if this behavior is included
> > in
> > > >the documentation since it is similar to the behavior with periods.
> > > >
> > > >
> > >
> >
> https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> > > >"Periods (dots) that are not followed by whitespace are kept as part
> of
> > > the
> > > >token, including Internet domain names. "
> > > >
> > > >Thanks,
> > > >Rahul
> > >
> >
>
>
> --
> Adam Walz
>


Re: StandardTokenizerFactory doesn't split on underscore

2021-01-09 Thread Rahul Goswami
Nope. The underscore is preserved right after tokenization even before it
reaches any filters. You can choose the type "text_general" and try an
index time analysis through the "Analysis" page on Solr Admin UI.

Thanks,
Rahul

On Sat, Jan 9, 2021 at 8:22 AM xiefengchang  wrote:

> did you configure PatternReplaceFilterFactory?
>
> At 2021-01-08 12:16:06, "Rahul Goswami"  wrote:
> >Hello,
> >So recently I was debugging a problem on Solr 7.7.2 where the query wasn't
> >returning the desired results. Turned out that the indexed terms had
> >underscore separated terms, but the query didn't. I was under the
> >impression that terms separated by underscore are also tokenized by
> >StandardTokenizerFactory, but turns out that's not the case. Eg:
> >'hello-world' would be tokenized into 'hello' and 'world', but
> >'hello_world' is treated as a single token.
> >Is this a bug or a designed behavior?
> >
> >If this is by design, it would be helpful if this behavior is included in
> >the documentation since it is similar to the behavior with periods.
> >
> >
> https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> >"Periods (dots) that are not followed by whitespace are kept as part of
> the
> >token, including Internet domain names. "
> >
> >Thanks,
> >Rahul
>


StandardTokenizerFactory doesn't split on underscore

2021-01-07 Thread Rahul Goswami
Hello,
So recently I was debugging a problem on Solr 7.7.2 where the query wasn't
returning the desired results. Turned out that the indexed terms had
underscore separated terms, but the query didn't. I was under the
impression that terms separated by underscore are also tokenized by
StandardTokenizerFactory, but turns out that's not the case. Eg:
'hello-world' would be tokenized into 'hello' and 'world', but
'hello_world' is treated as a single token.
Is this a bug or a designed behavior?

If this is by design, it would be helpful if this behavior is included in
the documentation since it is similar to the behavior with periods.

https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
"Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names. "

Thanks,
Rahul


Re: Need urgent help -- High cpu on solr

2020-10-16 Thread Rahul Goswami
In addition to the insightful pointers by Zisis and Erick, I would like to
mention an approach in the link below that I generally use to pinpoint
exactly which threads are causing the CPU spike. Knowing this you can
understand which aspect of Solr (search thread, GC, update thread etc) is
taking more CPU and develop a mitigation strategy accordingly. (eg: if it's
a GC thread, maybe try tuning the params or switch to G1 GC). Just helps to
take the guesswork out of the many possible causes. Of course the
suggestions received earlier are best practices and should be taken into
consideration nevertheless.

https://backstage.forgerock.com/knowledge/kb/article/a39551500

The hex number the author talks about in the link above is the native
thread id.
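For anyone who cannot reach the article, the gist of the technique is roughly the following (a sketch; PID values are placeholders and exact tool availability depends on your OS):

top -H -p <solr_pid>                         # 1. list per-thread CPU usage; note the PID of the hottest thread
printf '0x%x\n' <thread_pid>                 # 2. convert that thread PID to hex
jstack <solr_pid> > threads.txt              # 3. capture a thread dump of the Solr JVM
grep -A 20 'nid=0x<hex_value>' threads.txt   # 4. the matching "nid" entry shows which Java thread it is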

Best,
Rahul


On Wed, Oct 14, 2020 at 8:00 AM Erick Erickson 
wrote:

> Zisis makes good points. One other thing is I’d look to
> see if the CPU spikes coincide with commits. But GC
> is where I’d look first.
>
> Continuing on with the theme of caches, yours are far too large
> at first glance. The default is, indeed, size=512. Every time
> you open a new searcher, you’ll be executing 128 queries
> for autowarming the filterCache and another 128 for the queryResultCache.
> autowarming alone might be accounting for it. I’d reduce
> the size back to 512 and an autowarm count nearer 16
> and monitor the cache hit ratio. There’s little or no benefit
> in squeezing the last few percent from the hit ratio. If your
> hit ratio is small even with the settings you have, then your caches
> don’t do you much good anyway so I’d make them much smaller.
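As a rough illustration of the sizes Erick suggests, a hedged solrconfig.xml sketch (values are illustrative and should be tuned against your hit ratio; the cache class shown is the Solr 8.x default, older versions typically use solr.FastLRUCache):

<filterCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="16"/>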
>
> You haven’t told us how often your indexes are
> updated, which will be significant CPU hit due to
> your autowarming.
>
> Once you’re done with that, I’d then try reducing the heap. Most
> of the actual searching is done in Lucene via MMapDirectory,
> which resides in the OS memory space. See:
>
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Finally, if it is GC, consider G1GC if you’re not using that
> already.
>
> Best,
> Erick
>
>
> > On Oct 14, 2020, at 7:37 AM, Zisis T.  wrote:
> >
> > The values you have for the caches and the maxwarmingsearchers do not
> look
> > like the default. Cache sizes are 512 for the most part and
> > maxwarmingsearchers are 2 (if not limit them to 2)
> >
> > Sudden CPU spikes probably indicate GC issues. The #  of documents you
> have
> > is small, are they huge documents? The # of collections is OK in general
> but
> > since they are crammed in 5 Solr nodes the memory requirements might be
> > bigger. Especially if filter and the other caches get populated with 50K
> > entries.
> >
> > I'd first go through the GC activity to make sure that this is not
> causing
> > the issue. The fact that you lose some Solr servers is also an indicator
> of
> > large GC pauses that might create a problem when Solr communicates with
> > Zookeeper.
> >
> >
> >
> > --
> > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>


Re: Question about solr commits

2020-10-08 Thread Rahul Goswami
Shawn,
So if the autoCommit interval is 15 seconds, and one update request arrives
at t=0 and another at t=10 seconds, will there be two timers, one expiring at
t=15 and another at t=25 seconds? That would still amount to ONLY ONE commit,
at t=15, since that commit would include the changes from both updates.
Is this understanding correct?

Thanks,
Rahul

On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar 
wrote:

> Thank you very much both Eric and Shawn
>
> Sent from my iPhone
>
> > On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
> >
> > On 10/7/2020 4:40 PM, yaswanth kumar wrote:
> >> I have the below in my solrconfig.xml
> >> <updateHandler class="solr.DirectUpdateHandler2">
> >>   <updateLog>
> >>     <str name="dir">${solr.Data.dir:}</str>
> >>   </updateLog>
> >>   <autoCommit>
> >>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> >>     <openSearcher>false</openSearcher>
> >>   </autoCommit>
> >>   <autoSoftCommit>
> >>     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
> >>   </autoSoftCommit>
> >> </updateHandler>
> >> Does this mean even though we are always sending data with commit=false on
> >> the update solr api, the above should do the commit every minute (60000 ms)
> >> right?
> >
> > Assuming that you have not defined the "solr.autoCommit.maxTime" and/or
> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to
> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
> >
> > So five seconds after any indexing begins, Solr will do a soft commit.
> When that commit finishes, changes to the index will be visible to
> queries.  One minute after any indexing begins, Solr will do a hard commit,
> which guarantees that data is written to disk, but it will NOT open a new
> searcher, which means that when the hard commit happens, any pending
> changes to the index will not be visible.
> >
> > It's not "every five seconds" or "every 60 seconds" ... When any changes
> are made, Solr starts a timer.  When the timer expires, the commit is
> fired.  If no changes are made, no commits happen, because the timer isn't
> started.
> >
> > Thanks,
> > Shawn
>


Re: Solr 7.7 - Few Questions

2020-10-06 Thread Rahul Goswami
1. What tool they use to run Solr as a service on windows.
>> Look into procrun. After all, Solr runs inside Jetty, so you should have
a way to invoke Jetty’s Main class with the required parameters and bundle that
as a procrun service.

2. How to set up the disaster recovery?
>> You can back up your indexes at regular periods. This can be done by
taking snapshots and backing them up...and then using the appropriate
snapshot names to restore a certain commit point. For more details please
refer to this link:
https://lucene.apache.org/solr/guide/7_7/making-and-restoring-backups.html

3. How to scale up the servers for the better performance?
>> This is too open ended a question and depends on a lot of factors
specific to your environment and use-case :)

- Rahul
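Regarding point 2 above (backups for disaster recovery), a hedged sketch of the replication-handler calls described in the linked guide; host, core name and snapshot name are placeholders:

curl "http://localhost:8983/solr/<core>/replication?command=backup&name=nightly_snapshot"
curl "http://localhost:8983/solr/<core>/replication?command=details"          # check backup progress/status
curl "http://localhost:8983/solr/<core>/replication?command=restore&name=nightly_snapshot"
curl "http://localhost:8983/solr/<core>/replication?command=restorestatus"    # check restore progress/status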


On Tue, Oct 6, 2020 at 4:26 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> First of all thanks to Shawn, Rahul and Charlie for taking time to reply
> my questions and valuable information.
>
> I was very concerned about the size of the each document and on several
> follow ups got more information that the documents which have 0.5GB size
> are mp4 documents and these are not synced to Solr.
>
> @Shawn Heisey recommended NOT to use Windows because of windows license
> cost and service installer testing is done on Linux.
> I agree with him. We are using NSSM tool to run solr as a service.
>
> Are there any members here using Solr on Windows? I look forward to hear
> from them on:
>
> 1. What tool they use to run Solr as a service on windows.
> 2. How to set up the disaster recovery?
> 3. How to scale up the servers for the better performance?
>
> Thanks in advance and looking forward to hear back your experiences on
> Solr Scale up.
>
> Regards,
> Manisha Rahatadkar
>
> -Original Message-
> From: Rahul Goswami 
> Sent: Sunday, October 4, 2020 11:49 PM
> To: ch...@opensourceconnections.com; solr-user@lucene.apache.org
> Subject: Re: Solr 7.7 - Few Questions
>
> Charlie,
> Thanks for providing an alternate approach to doing this. It would be
> interesting to know how one  could go about organizing the docs in this
> case? (Nested documents?) How would join queries perform on a large
> index(200 million+ docs)?
>
> Thanks,
> Rahul
>
>
>
> On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:
>
> > Hi Rahul,
> >
> >
> >
> > In addition to the wise advice below: remember in Solr, a 'document'
> > is
> >
> > just the name for the thing that would appear as one of the results
> > when
> >
> > you search (analagous to a database record). It's not the same
> >
> > conceptually as a 'Word document' or a 'PDF document'. If your source
> >
> > documents are so big, consider how they might be broken into parts, or
> >
> > whether you really need to index all of them for retrieval purposes,
> > or
> >
> > what parts of them need to be extracted as text. Thus, the Solr
> >
> > documents don't necessarily need to be as large as your source documents.
> >
> >
> >
> > Consider an email size 20kb with ten PDF attachments, each 20MB. You
> >
> > probably shouldn't push all this data into a single Solr document, but
> >
> > you *could* index them as 11 separate Solr documents, but with
> > metadata
> >
> > to indicate that one is an email and ten are PDFs, and a shared ID of
> >
> > some kind to indicate they're related. Then at query time there are
> >
> > various ways for you to group these together, so for example if the
> >
> > query hit one of the PDFs you could show the user the original email,
> >
> > plus the 9 other attachments, using the shared ID as a key.
> >
> >
> >
> > HTH,
> >
> >
> >
> > Charlie
> >
> >
> >
> > On 02/10/2020 01:53, Rahul Goswami wrote:
> >
> > > Manisha,
> >
> > > In addition to what Shawn has mentioned above, I would also like you
> > > to
> >
> > > reevaluate your use case. Do you *need to* index the whole document ?
> eg:
> >
> > > If it's an email, the body of the email *might* be more important
> > > than
> > any
> >
> > > attachments, in which case you could choose to only index the email
> > > body
> >
> > > and ignore (or only partially index) the text from attachments. If
> > > you
> >
> > > could afford to index the documents partially, you could consider
> > > Solr's
> >
> > > "Limit token count filter": See the link below.
> >
> >

Re: Solr 7.7 - Few Questions

2020-10-04 Thread Rahul Goswami
Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one  could go about organizing the docs in this
case? (Nested documents?) How would join queries perform on a large
index(200 million+ docs)?

Thanks,
Rahul



On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:

> Hi Rahul,
>
>
>
> In addition to the wise advice below: remember in Solr, a 'document' is
>
> just the name for the thing that would appear as one of the results when
>
> you search (analagous to a database record). It's not the same
>
> conceptually as a 'Word document' or a 'PDF document'. If your source
>
> documents are so big, consider how they might be broken into parts, or
>
> whether you really need to index all of them for retrieval purposes, or
>
> what parts of them need to be extracted as text. Thus, the Solr
>
> documents don't necessarily need to be as large as your source documents.
>
>
>
> Consider an email size 20kb with ten PDF attachments, each 20MB. You
>
> probably shouldn't push all this data into a single Solr document, but
>
> you *could* index them as 11 separate Solr documents, but with metadata
>
> to indicate that one is an email and ten are PDFs, and a shared ID of
>
> some kind to indicate they're related. Then at query time there are
>
> various ways for you to group these together, so for example if the
>
> query hit one of the PDFs you could show the user the original email,
>
> plus the 9 other attachments, using the shared ID as a key.
>
>
>
> HTH,
>
>
>
> Charlie
>
>
>
> On 02/10/2020 01:53, Rahul Goswami wrote:
>
> > Manisha,
>
> > In addition to what Shawn has mentioned above, I would also like you to
>
> > reevaluate your use case. Do you *need to* index the whole document ? eg:
>
> > If it's an email, the body of the email *might* be more important than
> any
>
> > attachments, in which case you could choose to only index the email body
>
> > and ignore (or only partially index) the text from attachments. If you
>
> > could afford to index the documents partially, you could consider Solr's
>
> > "Limit token count filter": See the link below.
>
> >
>
> >
> https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
>
> >
>
> > You'll need to configure it in the schema for the "index" analyzer for
> the
>
> > data type of the field with large text.
>
> > Indexing documents of the order of half a GB will definitely come to hurt
>
> > your operations, if not now, later (think OOM, extremely slow atomic
>
> > updates, long running merges etc.).
>
> >
>
> > - Rahul
>
> >
>
> >
>
> >
>
> > On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:
>
> >
>
> >> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
>
> >>> We are using Apache Solr 7.7 on Windows platform. The data is synced to
>
> >> Solr using Solr.Net commit. The data is being synced to SOLR in batches.
>
> >> The document size is very huge (~0.5GB average) and solr indexing is
> taking
>
> >> long time. Total document size is ~200GB. As the solr commit is done as
> a
>
> >> part of API, the API calls are failing as document indexing is not
>
> >> completed.
>
> >>
>
> >> A single document is five hundred megabytes?  What kind of documents do
>
> >> you have?  You can't even index something that big without tweaking
>
> >> configuration parameters that most people don't even know about.
>
> >> Assuming you can even get it working, there's no way that indexing a
>
> >> document like that is going to be fast.
>
> >>
>
> >>> 1.  What is your advise on syncing such a large volume of data to
>
> >> Solr KB.
>
> >>
>
> >> What is "KB"?  I have never heard of this in relation to Solr.
>
> >>
>
> >>> 2.  Because of the search requirements, almost 8 fields are defined
>
> >> as Text fields.
>
> >>
>
> >> I can't figure out what you are trying to say with this statement.
>
> >>
>
> >>> 3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such
> a
>
> >> large volume of data?
>
> >>
>
> >> If just one of the documents you're sending to Solr really is five
>
> >> hundred megabytes, then 2 gigabytes would probably be just barely enough
>
> >> to index one document into an empty index ... and it would proba

Re: Solr 7.7 - Few Questions

2020-10-01 Thread Rahul Goswami
Manisha,
In addition to what Shawn has mentioned above, I would also like you to
reevaluate your use case. Do you *need to* index the whole document ? eg:
If it's an email, the body of the email *might* be more important than any
attachments, in which case you could choose to only index the email body
and ignore (or only partially index) the text from attachments. If you
could afford to index the documents partially, you could consider Solr's
"Limit token count filter": See the link below.

https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter

You'll need to configure it in the schema for the "index" analyzer for the
data type of the field with large text.
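As a rough sketch of what that looks like in the schema (the field type name and the 10000-token cap are illustrative, not a recommendation):

<fieldType name="text_capped" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- only the first maxTokenCount tokens of each value get indexed -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000" consumeAllTokens="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>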
Indexing documents of the order of half a GB will definitely come to hurt
your operations, if not now, later (think OOM, extremely slow atomic
updates, long running merges etc.).

- Rahul



On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:

> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> > We are using Apache Solr 7.7 on Windows platform. The data is synced to
> Solr using Solr.Net commit. The data is being synced to SOLR in batches.
> The document size is very huge (~0.5GB average) and solr indexing is taking
> long time. Total document size is ~200GB. As the solr commit is done as a
> part of API, the API calls are failing as document indexing is not
> completed.
>
> A single document is five hundred megabytes?  What kind of documents do
> you have?  You can't even index something that big without tweaking
> configuration parameters that most people don't even know about.
> Assuming you can even get it working, there's no way that indexing a
> document like that is going to be fast.
>
> >1.  What is your advise on syncing such a large volume of data to
> Solr KB.
>
> What is "KB"?  I have never heard of this in relation to Solr.
>
> >2.  Because of the search requirements, almost 8 fields are defined
> as Text fields.
>
> I can't figure out what you are trying to say with this statement.
>
> >3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a
> large volume of data?
>
> If just one of the documents you're sending to Solr really is five
> hundred megabytes, then 2 gigabytes would probably be just barely enough
> to index one document into an empty index ... and it would probably be
> doing garbage collection so frequently that it would make things REALLY
> slow.  I have no way to predict how much heap you will need.  That will
> require experimentation.  I can tell you that 2GB is definitely not enough.
>
> >4.  How to set up Solr in production on Windows? Currently it's set
> up as a standalone engine and client is requested to take the backup of the
> drive. Is there any other better way to do? How to set up for the disaster
> recovery?
>
> I would suggest NOT doing it on Windows.  My reasons for that come down
> to costs -- a Windows Server license isn't cheap.
>
> That said, there's nothing wrong with running on Windows, but you're on
> your own as far as running it as a service.  We only have a service
> installer for UNIX-type systems.  Most of the testing for that is done
> on Linux.
>
> >5.  How to benchmark the system requirements for such a huge data
>
> I do not know what all your needs are, so I have no way to answer this.
> You're going to know a lot more about it that any of us are.
>
> Thanks,
> Shawn
>


Re: ApacheCon at Home 2020 starts tomorrow!

2020-09-29 Thread Rahul Goswami
Thanks for sharing this Anshum. Day 1 had some really interesting sessions.
Missed out on a couple that I would have liked to listen to. Are the
recordings of these sessions available anywhere?

-Rahul

On Mon, Sep 28, 2020 at 7:08 PM Anshum Gupta  wrote:

> Hey everyone!
>
> ApacheCon at Home 2020 starts tomorrow. The event is 100% virtual, and free
> to register. What’s even better is that this year we have reintroduced the
> Lucene/Solr/Search track at ApacheCon.
>
> With 2 full days of sessions covering various Lucene, Solr, and Search, I
> hope you are able to find some time to attend the sessions and learn
> something new and interesting.
>
> There are also various other tracks that span the 3 days of the conference.
> The conference starts in just a few hours for our community in Asia and
> tomorrow morning for the Americas and Europe. Check out the complete
> schedule in the link below.
>
> Here are a few resources you may find useful if you plan to attend
> ApacheCon at Home.
>
> ApacheCon website - https://www.apachecon.com/acna2020/index.html
> Registration - https://hopin.to/events/apachecon-home
> Slack - http://s.apache.org/apachecon-slack
> Search Track - https://www.apachecon.com/acah2020/tracks/search.html
>
> See you at ApacheCon.
>
> --
> Anshum Gupta
>


Re: Delete from Solr console fails

2020-09-26 Thread Rahul Goswami
You mention high CPU usage...Can you share the thread dump (using jstack)
for both the delete by id and delete by query?
Also, an output of /solr/<collection>/schema executed on the host?
Lastly, is this standalone Solr or SolrCloud?
Attachments won’t make it to the list, so I would recommend sharing a link
to any file sharing service.
On a side note, I have observed the UI timing out requests after a certain
point even though the actual request is still being processed. In case
something like that is happening here, did you try the delete by id as an
HTTP request through a curl or Postman? Having said that I would still
expect delete by id to execute in reasonable time, so I would start by
looking at what is eating up the CPU in your request.
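For reference, a hedged example of issuing the delete-by-id directly over HTTP instead of through the UI (host, core name and id are placeholders):

curl -v "http://localhost:8983/solr/<core>/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><id>id123</id></delete>"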

-Rahul

On Sat, Sep 26, 2020 at 4:50 AM Goutham Tholpadi 
wrote:

> Thanks Dominique! I just tried deleting a single document using its id. I
>
> tried this:
>
> <delete>
>   <id>id123</id>
> </delete>
>
>
>
> and this:
>
> <delete>
>   <query>id:id123</query>
> </delete>
>
>
>
> In each case, I still get the same "Solr connection lost" error. I checked
>
> that the Solr instance has enough RAM (it was using 73% of the RAM), but it
>
> was using 110% CPU. Could this be a CPU under-allocation problem (the Solr
>
> container has 4 cores allocated to it)?
>
>
>
> Thanks
>
> Goutham
>
>
>
> On Fri, Sep 25, 2020 at 7:41 PM Dominique Bejean <
> dominique.bej...@eolya.fr>
>
> wrote:
>
>
>
> > Hi Goutham,
>
> >
>
> > I agree with Rahul, avoid large deletebyquery.
>
> > It you can, prefere one query to get all the ids first than use ids with
>
> > deletebyid
>
> >
>
> > Regards
>
> >
>
> > Dominique
>
> >
>
> >
>
> > Le ven. 25 sept. 2020 à 06:50, Goutham Tholpadi  a
>
> > écrit :
>
> >
>
> > > I spoke too soon. I am getting the "Connection lost" error again.
>
> > >
>
> > > I have never faced this problem when there are a small number of docs
> in
>
> > > the index. I was wondering if the size of the index (30M docs) has
>
> > anything
>
> > > to do with this.
>
> > >
>
> > > Thanks
>
> > > Goutham
>
> > >
>
> > > On Fri, Sep 25, 2020 at 9:55 AM Goutham Tholpadi 
>
> > > wrote:
>
> > >
>
> > > > Thanks for your response Rahul!
>
> > > >
>
> > > > Yes, all the fields I tried with were indexed=true, but it did not
>
> > work.
>
> > > >
>
> > > > Btw, when I try to today, I am no longer getting the "Connection
> lost"
>
> > > > error. The delete command returns with status=success, however the
>
> > > document
>
> > > > is not actually deleted when I check in the search console again.
>
> > > >
>
> > > > I tried using Document Type as XML just now and I see the same
>
> > behaviour
>
> > > > as above.
>
> > > >
>
> > > > Thanks
>
> > > > Goutham
>
> > > >
>
> > > > On Fri, Sep 25, 2020 at 7:17 AM Rahul Goswami  >
>
> > > > wrote:
>
> > > >
>
> > > >> Goutham,
>
> > > >> Is the field you are trying to delete by indexed=true in the schema
> ?
>
> > > >> If the uniqueKey is indexed=true, does delete by id work for you?
>
> > > >> ( uniqueKey:value)
>
> > > >> Also, instead of  "Solr Command" if you choose the Document type as
>
> > > "XML"
>
> > > >> does it make any difference?
>
> > > >>
>
> > > >> Rahul
>
> > > >>
>
> > > >> On Thu, Sep 24, 2020 at 1:04 PM Goutham Tholpadi <
> gtholp...@gmail.com
>
> > >
>
> > > >> wrote:
>
> > > >>
>
> > > >> > Hi,
>
> > > >> >
>
> > > >> > Setup:
>
> > > >> > We have a stand-alone Solr (v7.2) with around 30 million documents
>
> > and
>
> > > >> with
>
> > > >> > 4 cores, 38G of RAM, and a 1TB disk. The documents were not
> directly
>
> > > >> > indexed but came from a restore of a back from another Solr
>
> > instance.
>
> > > >> >
>
> > > >> > Problem:
>
> > > >> > Search queries seem to be working fine. However, when I try to
>
> > delete
>
> > > >> > documents from the Solr console, I get a "Connection to Solr lost"
>
> > > >> error. I
>
> > > >> > am trying by navigating to the "Documents" section of the chosen
>
> > core,
>
> > > >> > using "Solr Command" as the "Document Type", and entering
> something
>
> > > >> this in
>
> > > >> > the box below:
>
> > > >> > <delete>
> > > >> > <query>
> > > >> > field:value
> > > >> > </query>
> > > >> > </delete>
>
> > > >> > 
>
> > > >> > 
>
> > > >> >
>
> > > >> > I tried with the field being the unique key, and otherwise. I also
>
> > > tried
>
> > > >> > with values containing wild cards. I got the error in all cases.
>
> > > >> >
>
> > > >> > Any pointers on this?
>
> > > >> >
>
> > > >> > Thanks
>
> > > >> > Goutham
>
> > > >> >
>
> > > >>
>
> > > >
>
> > >
>
> >
>
>


Re: Delete from Solr console fails

2020-09-24 Thread Rahul Goswami
Goutham,
Is the field you are trying to delete by indexed=true in the schema ?
If the uniqueKey is indexed=true, does delete by id work for you?
( uniqueKey:value)
Also, instead of  "Solr Command" if you choose the Document type as "XML"
does it make any difference?

Rahul

On Thu, Sep 24, 2020 at 1:04 PM Goutham Tholpadi 
wrote:

> Hi,
>
> Setup:
> We have a stand-alone Solr (v7.2) with around 30 million documents and with
> 4 cores, 38G of RAM, and a 1TB disk. The documents were not directly
> indexed but came from a restore of a back from another Solr instance.
>
> Problem:
> Search queries seem to be working fine. However, when I try to delete
> documents from the Solr console, I get a "Connection to Solr lost" error. I
> am trying by navigating to the "Documents" section of the chosen core,
> using "Solr Command" as the "Document Type", and entering something this in
> the box below:
> <delete>
> <query>
> field:value
> </query>
> </delete>
>
> I tried with the field being the unique key, and otherwise. I also tried
> with values containing wild cards. I got the error in all cases.
>
> Any pointers on this?
>
> Thanks
> Goutham
>


Re: How to remove duplicate tokens from solr

2020-09-17 Thread Rahul Goswami
Is this for a phrase search? If yes, then the position of the token would
matter too, and it's not clear which token you would want to remove, e.g.
"tshirt hat tshirt".
Also, are you looking to save space and want this at index time? Or do you just
want to remove duplicates from the search string?

If this is at search time AND is not a phrase search, there are a couple
approaches I could think of :

1) You could either handle this in the application layer to only pass the
deduplicated string before it hits solr
2) You can write a custom search component and configure it in the
first-components list to process the search string and remove duplicates
before it hits the default search components. See here (
https://lucene.apache.org/solr/guide/7_7/requesthandlers-and-searchcomponents-in-solrconfig.html#first-components-and-last-components
).
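A minimal solrconfig sketch of option 2, assuming a hypothetical component class (com.example.DedupeTermsComponent does not exist in Solr; you would have to write it yourself):

<searchComponent name="dedupeTerms" class="com.example.DedupeTermsComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="first-components">
    <str>dedupeTerms</str>
  </arr>
</requestHandler>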

However if for search, I would still evaluate if writing those extra lines
of code is worth the investment. I say so since my assumption is that for
duplicated tokens in search string, lucene would have the intelligence to
not fetch the doc ids again, so you should not be worried about spending
computation resources to reevaluate the same tokens (Someone correct me if
I am wrong!)

-Rahul

On Thu, Sep 17, 2020 at 2:56 PM Rajdeep Sahoo 
wrote:

> If someone is searching with " tshirt tshirt tshirt tshirt tshirt tshirt"
> we need to remove the duplicates and search with tshirt.
>
>
> On Fri, 18 Sep, 2020, 12:19 am Alexandre Rafalovitch, 
> wrote:
>
> > This is not quite enough information.
> > There is
> >
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#remove-duplicates-token-filter
> > but it has specific limitations.
> >
> > What is the problem that you are trying to solve that you feel is due
> > to duplicate tokens? Why are they duplicates? Is it about storage or
> > relevancy?
> >
> > Regards,
> >Alex.
> >
> > On Thu, 17 Sep 2020 at 14:35, Rajdeep Sahoo 
> > wrote:
> > >
> > > Hi team,
> > >  Is there any way to remove duplicate tokens from solr. Is there any
> > filter
> > > for this.
> >
>


Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Rahul Goswami
I agree with Phill, Noble and Ilan above. The problematic term is "slave"
(not master) which I am all for changing if it causes less regression than
removing BOTH master and slave. Since some people have pointed out Github
changing the "master" terminology, in my personal opinion, it was not a
measured response to addressing the bigger problem we are all trying to
tackle. There is no concept of a "slave" branch, and "master" by itself is
a pretty generic term (Is someone having "mastery" over a skill a bad
thing?). I fear all it would end up achieving in the end with Github is a
mess of broken build scripts at best.
So +1 on "slave" being the problematic term IMO, not "master".

On Thu, Jun 18, 2020 at 8:19 PM Phill Campbell
 wrote:

> Master - Worker
> Master - Peon
> Master - Helper
> Master - Servant
>
> The term that is not wanted is “slave’. The term “master” is not a problem
> IMO.
>
> > On Jun 18, 2020, at 3:59 PM, Jan Høydahl  wrote:
> >
> > I support Mike Drob and Trey Grainger. We shuold re-use the
> leader/replica
> > terminology from Cloud. Even if you hand-configure a master/slave cluster
> > and orchestrate what doc goes to which node/shard, and hand-code your
> shards
> > parameter, you will still have a cluster where you’d send updates to the
> leader of
> > each shard and the replicas would replicate the index from the leader.
> >
> > Let’s instead find a new good name for the cluster type. Standalone kind
> of works
> > for me, but I see it can be confused with single-node. We have also
> discussed
> > replacing SolrCloud (which is a terrible name) with something more
> descriptive.
> >
> > Today: SolrCloud vs Master/slave
> > Alt A: SolrCloud vs Standalone
> > Alt B: SolrCloud vs Legacy
> > Alt C: Clustered vs Independent
> > Alt D: Clustered vs Manual mode
> >
> > Jan
> >
> >> 18. jun. 2020 kl. 15:53 skrev Mike Drob :
> >>
> >> I personally think that using Solr cloud terminology for this would be
> fine
> >> with leader/follower. The leader is the one that accepts updates,
> followers
> >> cascade the updates somehow. The presence of ZK or election doesn’t
> really
> >> change this detail.
> >>
> >> However, if folks feel that it’s confusing, then I can’t tell them that
> >> they’re not confused. Especially when they’re working with others who
> have
> >> less Solr experience than we do and are less familiar with the
> intricacies.
> >>
> >> Primary/Replica seems acceptable. Coordinator instead of Overseer seems
> >> acceptable.
> >>
> >> Would love to see this in 9.0!
> >>
> >> Mike
> >>
> >> On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
> >>  wrote:
> >>
> >>> While on the topic of renaming roles, I'd like to propose finding a
> better
> >>> term than "overseer" which has historical slavery connotations as well.
> >>> Director, perhaps?
> >>>
> >>>
> >>> John Gallagher
> >>>
> >>> On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski  >
> >>> wrote:
> >>>
>  +1 to rename master/slave, and +1 to choosing terminology distinct
>  from what's used for SolrCloud.  I could be happy with several of the
>  proposed options.  Since a good few have been proposed though, maybe
>  an eventual vote thread is the most organized way to aggregate the
>  opinions here.
> 
>  I'm less positive about the prospect of changing the name of our
>  primary git branch.  Most projects that contributors might come from,
>  most tutorials out there to learn git, most tools built on top of git
>  - the majority are going to assume "master" as the main branch.  I
>  appreciate the change that Github is trying to effect in changing the
>  default for new projects, but it'll be a long time before that
>  competes with the huge bulk of projects, documentation, etc. out there
>  using "master".  Our contributors are smart and I'm sure they'd figure
>  it out if we used "main" or something else instead, but having a
>  non-standard git setup would be one more "papercut" in understanding
>  how to contribute to a project that already makes that harder than it
>  should.
> 
>  Jason
> 
> 
>  On Thu, Jun 18, 2020 at 7:33 AM Demian Katz <
> demian.k...@villanova.edu>
>  wrote:
> >
> > Regarding people having a problem with the word "master" -- GitHub is
>  changing the default branch name away from "master," even in isolation
> >>> from
>  a "slave" pairing... so the terminology seems to be falling out of
> favor
> >>> in
>  all contexts. See:
> >
> >
> 
> >>>
> https://www.cnet.com/news/microsofts-github-is-removing-coding-terms-like-master-and-slave/
> >
> > I'm not here to start a debate about the semantics of that, just to
>  provide evidence that in some communities, the term "master" is
> causing
>  concern all by itself. If we're going to make the change anyway, it
> might
>  be best to get it over with and pick the most appropriate terminology
> we
>  can agree upon, 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Rahul Goswami
+1 on avoiding SolrCloud terminology. In the interest of keeping it obvious
and simple, may I I please suggest primary/secondary?

On Wed, Jun 17, 2020 at 5:14 PM Atita Arora  wrote:

> I agree avoiding using of solr cloud terminology too.
>
> I may suggest going for "prime" and "clone"
> (Short and precise as Master and Slave).
>
> Best,
> Atita
>
>
>
>
>
> On Wed, 17 Jun 2020, 22:50 Walter Underwood, 
> wrote:
>
> > I strongly disagree with using the Solr Cloud leader/follower terminology
> > for non-Cloud clusters. People in my company are confused enough without
> > using polysemous terminology.
> >
> > “This node is the leader, but it means something different than the
> leader
> > in this other cluster.” I’m dreading that conversation.
> >
> > I like “principal”. How about “clone” for the slave role? That suggests
> > that
> > it does not accept updates and that it is loosely-coupled, only depending
> > on the state of the no-longer-called-master.
> >
> > Chegg has five production Solr Cloud clusters and one production
> > master/slave
> > cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
> > production.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> > >
> > > Proposal:
> > > "A Solr COLLECTION is composed of one or more SHARDS, which each have
> one
> > > or more REPLICAS. Each replica can have a ROLE of either:
> > > 1) A LEADER, which can process external updates for the shard
> > > 2) A FOLLOWER, which receives updates from another replica"
> > >
> > > (Note: I prefer "role" but if others think it's too overloaded due to
> the
> > > overseer role, we could replace it with "mode" or something similar)
> > > ---
> > >
> > > To be explicit with the above definitions:
> > > 1) In SolrCloud, the roles of leaders and followers can dynamically
> > change
> > > based upon the status of the cluster. In standalone mode, they can be
> > > changed by manual intervention.
> > > 2) A leader does not have to have any followers (i.e. only one active
> > > replica)
> > > 3) Each shard always has one leader.
> > > 4) A follower can also pull updates from another follower instead of a
> > > leader (traditionally known as a REPEATER). A repeater is still a
> > follower,
> > > but would not be considered a leader because it can't process external
> > > updates.
> > > 5) A replica cannot be both a leader and a follower.
> > >
> > > In addition to the above roles, each replica can have a TYPE of one of:
> > > 1) NRT - which can serve in the role of leader or follower
> > > 2) TLOG - which can only serve in the role of follower
> > > 3) PULL - which can only serve in the role of follower
> > >
> > > A replica's type may be changed automatically in the event that its
> role
> > > changes.
> > >
> > > I think this terminology is consistent with the current Leader/Follower
> > > usage while also being able to easily accomodate a rename of the
> > historical
> > > master/slave terminology without mental gymnastics or the introduction
> or
> > > more cognitive load through new terminology. I think adopting the
> > > Primary/Replica terminology will be incredibly confusing given the
> > already
> > > specific and well established meaning of "replica" within Solr.
> > >
> > > All the Best,
> > >
> > > Trey Grainger
> > > Founder, Searchkernel
> > > https://searchkernel.com
> > >
> > >
> > >
> > > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
> > wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> Moving a conversation that was happening on the PMC list to the public
> > >> forum. Most of the following is just me recapping the conversation
> that
> > has
> > >> happened so far.
> > >>
> > >> Some members of the community have been discussing getting rid of the
> > >> master/slave nomenclature from Solr.
> > >>
> > >> While this may require a non-trivial effort, a general consensus so
> far
> > >> seems to be to start this process and switch over incrementally, if a
> > >> single change ends up being too big.
> > >>
> > >> There have been a lot of suggestions around what the new nomenclature
> > might
> > >> look like, a few people don’t want to overlap the naming here with
> what
> > >> already exists in SolrCloud i.e. leader/follower.
> > >>
> > >> Primary/Replica was an option that was suggested based on what other
> > >> vendors are moving towards based on Wikipedia:
> > >> https://en.wikipedia.org/wiki/Master/slave_(technology)
> > >> , however there were concerns around the use of “replica” as that
> > denotes a
> > >> very specific concept in SolrCloud. Current terminology clearly
> > >> differentiates the use of the traditional replication model from
> > SolrCloud
> > >> and reusing the names would make it difficult for that to happen.
> > >>
> > >> There were similar concerns around using Leader/follower.
> > >>
> > >> Let’s continue this 

Re: when to use docvalue

2020-05-20 Thread Rahul Goswami
Erick,
Thanks for that explanation. I have a follow-up question on that. I find
the scenario of stored=true and docValues=true to be tricky at times... I
would like to know when each of these scenarios is preferred over the other
two for primitive datatypes:

1) stored=true and docValues=false
2) stored=false and docValues=true
3) stored=true and docValues=true

Thanks,
Rahul

On Tue, May 19, 2020 at 5:55 PM Erick Erickson 
wrote:

> They are _absolutely_ able to be used together. Background:
>
> “In the bad old days”, there was no docValues. So whenever you needed
> to facet/sort/group/use function queries Solr (well, Lucene) had to take
> the inverted structure resulting from “index=true” and “uninvert” it on the
> Java heap.
>
> docValues essentially does the “uninverting” at index time and puts
> that structure in a separate file for each segment. So rather than uninvert
> the index on the heap, Lucene can just read it in from disk in
> MMapDirectory
> (i.e. OS) memory space.
>
> The downside is that your index will be bigger when you do both, that is
> the
> size on disk will be bigger. But, it’ll be much faster to load, much
> faster to
> autowarm, and will move the structures necessary to do faceting/sorting/etc
> into OS memory where the garbage collection is vastly more efficient than
> Javas.
>
> And frankly I don’t think the increased size on disk is a downside. You’ll
> have
> to have the memory anyway, and having it used on the OS memory space is
> so much more efficient than on Java’s heap that it’s a win-win IMO.
>
> Oh, and if you never sort/facet/group/use function queries, then the
> docValues structures are never even read into MMapDirectory space.
>
> So yes, freely do both.
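As a minimal schema illustration of Erick's "do both" advice for a field that is only filtered/sorted/faceted on, never free-text searched (field name and type are illustrative):

<field name="popularity" type="pint" indexed="true" stored="false" docValues="true"/>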
>
> Best,
> Erick
>
>
> > On May 19, 2020, at 5:41 PM, matthew sporleder 
> wrote:
> >
> > You can index AND docvalue?  For some reason I thought they were
> exclusive
> >
> > On Tue, May 19, 2020 at 5:36 PM Erick Erickson 
> wrote:
> >>
> >> Yes. You should also index them….
> >>
> >> Here’s the way I think of it.
> >>
> >> For questions “For term X, which docs contain that value?” means
> index=true. This is a search.
> >>
> >> For questions “Does doc X have value Y in field Z”, means
> docValues=true.
> >>
> >> what’s the difference? Well, the first one is to get the result set.
> The second is for, given a result set,
> >> count/sort/whatever.
> >>
> >> fq clauses are searches, so index=true.
> >>
> >> sorting, faceting, grouping and function queries  are “for each doc in
> the result set, what values does field Y contain?”
> >>
> >> Maybe that made things clear as mud, but it’s the way I think of it ;)
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> fq clauses are searches. Indexed=true is for searching.
> >>
> >> sort
> >>
> >>> On May 19, 2020, at 4:00 PM, matthew sporleder 
> wrote:
> >>>
> >>> I have quite a few numeric / meta-data type fields in my schema and
> >>> pretty much only use them in fq=, sort=, and friends.  Should I always
> >>> use DocValue on these if i never plan to q=search: on them?  Are there
> >>> any drawbacks?
> >>>
> >>> Thanks,
> >>> Matt
> >>
>
>


Re: Solr filter cache hits not reflecting

2020-04-20 Thread Rahul Goswami
Hoss,
Thank you for such a succinct explanation! I was not aware of the order of
lookups (queryResultCache  followed by filterCache). Makes sense now. Sorry
for the false alarm!

Rahul

On Mon, Apr 20, 2020 at 4:04 PM Chris Hostetter 
wrote:

> : 4) A query with different fq.
> :
> http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung
> ...
> : 5) A query with the same fq again (fq=manu:samsung OR manu:apple)the
> : numbers don't get update for this fq hereafter for subsequent searches
> :
> :
> http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung%20OR%20manu:apple
>
> that's not just *A* query with the same fq, it's the *exact* same request
> (q + sort + pagination + all filters)
>
> Whch means that everything solr needs to reply to this request is
> available in the *queryResultCache* -- no filterCache needed at all (if
> you had faceting enabled that would be a different issue: then the
> filterCache would still be needed in order to compute facet counts over
> the entire DocSet matching the query, not just the current page window)...
>
>
> $ bin/solr -e techproducts
> ...
>
> # mostly empty caches (techproudct has a single static warming query)
>
> $ curl -sS '
> http://localhost:8983/solr/techproducts/admin/mbeans?wt=json=true=CACHE=true'
> | grep -E
> 'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
>   "CACHE.searcher.queryResultCache.lookups":0,
>   "CACHE.searcher.queryResultCache.inserts":1,
>   "CACHE.searcher.queryResultCache.hits":0}},
>   "CACHE.searcher.filterCache.hits":0,
>   "CACHE.searcher.filterCache.lookups":0,
>   "CACHE.searcher.filterCache.inserts":0,
>
> # new q and fq: lookup & insert into both caches...
>
> $ curl -sS '
> http://localhost:8983/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung%20OR%20manu:apple'
> > /dev/null
> $ curl -sS '
> http://localhost:8983/solr/techproducts/admin/mbeans?wt=json=true=CACHE=true'
> | grep -E
> 'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
>   "CACHE.searcher.queryResultCache.lookups":1,
>   "CACHE.searcher.queryResultCache.inserts":2,
>   "CACHE.searcher.queryResultCache.hits":0}},
>   "CACHE.searcher.filterCache.hits":0,
>   "CACHE.searcher.filterCache.lookups":1,
>   "CACHE.searcher.filterCache.inserts":1,
>
> # new q, same fq:
> # lookup on both caches, hit on filter, insert on queryResultCache
>
> $ curl -sS '
> http://localhost:8983/solr/techproducts/select?q=*:*=manu:samsung%20OR%20manu:apple'
> > /dev/null
> $ curl -sS '
> http://localhost:8983/solr/techproducts/admin/mbeans?wt=json=true=CACHE=true'
> | grep -E
> 'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
>   "CACHE.searcher.queryResultCache.lookups":2,
>   "CACHE.searcher.queryResultCache.inserts":3,
>   "CACHE.searcher.queryResultCache.hits":0}},
>   "CACHE.searcher.filterCache.hits":1,
>   "CACHE.searcher.filterCache.lookups":2,
>   "CACHE.searcher.filterCache.inserts":1,
>
> # same q & fq as before:
> # hit on queryresultCache means no filterCache needed...
>
> $ curl -sS '
> http://localhost:8983/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung%20OR%20manu:apple'
> > /dev/null
> $ curl -sS '
> http://localhost:8983/solr/techproducts/admin/mbeans?wt=json=true=CACHE=true'
> | grep -E
> 'CACHE.searcher.(queryResultCache|filterCache).(inserts|hits|lookups)'
>   "CACHE.searcher.queryResultCache.lookups":3,
>   "CACHE.searcher.queryResultCache.inserts":3,
>   "CACHE.searcher.queryResultCache.hits":1}},
>   "CACHE.searcher.filterCache.hits":1,
>   "CACHE.searcher.filterCache.lookups":2,
>   "CACHE.searcher.filterCache.inserts":1,
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Solr filter cache hits not reflecting

2020-04-20 Thread Rahul Goswami
Hi Hoss,

Thanks for your detailed response. In your steps if you go a step
further and search again with the same fq, you should be able to
uncover the problem. Here are the step-by-step observations on Solr
8.5 (7.2.1 and 7.7.2 have the same issue)


1) Before any queries:

http://localhost:8984/solr/admin/metrics?group=core=CACHE.searcher.filterCache

   "solr.core.techproducts":{
  "CACHE.searcher.filterCache":{
"lookups":0,
"idleEvictions":0,
"evictions":0,
"cumulative_inserts":0,
"ramBytesUsed":1328,
"cumulative_hits":0,
"cumulative_idleEvictions":0,
"hits":0,
"cumulative_evictions":0,
"cleanupThread":false,
"size":0,
"hitratio":0.0,
"cumulative_lookups":0,
"cumulative_hitratio":0.0,
"warmupTime":0,
"maxRamMB":-1,
"inserts":0}},


2) With fq:manu:samsung OR manu:apple

http://localhost:8984/solr/techproducts/select?q=*:*=manu:samsung%20OR%20manu:apple

"solr.core.techproducts":{
  "CACHE.searcher.filterCache":{
"lookups":1,
"idleEvictions":0,
"evictions":0,
"cumulative_inserts":1,
"ramBytesUsed":4800,
"cumulative_hits":0,
"cumulative_idleEvictions":0,
"hits":0,
"cumulative_evictions":0,
"cleanupThread":false,
"size":1,
"hitratio":0.0,
"cumulative_lookups":1,
"cumulative_hitratio":0.0,
"item_manu:samsung
manu:apple":"SortedIntDocSet{size=2,ramUsed=40 bytes}",
"warmupTime":0,
"maxRamMB":-1,
"inserts":1}},

3) q changed but same fq... the hits and lookups are updated as expected:
http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung%20OR%20manu:apple

   "solr.core.techproducts":{
  "CACHE.searcher.filterCache":{
"lookups":2,
"idleEvictions":0,
"evictions":0,
"cumulative_inserts":1,
"ramBytesUsed":4800,
"cumulative_hits":1,
"cumulative_idleEvictions":0,
"hits":1,
"cumulative_evictions":0,
"cleanupThread":false,
"size":1,
"hitratio":0.5,
"cumulative_lookups":2,
"cumulative_hitratio":0.5,
"item_manu:samsung
manu:apple":"SortedIntDocSet{size=2,ramUsed=40 bytes}",
"warmupTime":0,
"maxRamMB":-1,
"inserts":1}},

4) A query with different fq.
http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung

"solr.core.techproducts":{
  "CACHE.searcher.filterCache":{
"lookups":3,
"idleEvictions":0,
"evictions":0,
"cumulative_inserts":2,
"ramBytesUsed":6076,
"cumulative_hits":1,
"cumulative_idleEvictions":0,
"hits":1,
"cumulative_evictions":0,
"cleanupThread":false,
"size":2,
"item_manu:samsung":"SortedIntDocSet{size=1,ramUsed=36 bytes}",
"hitratio":0.33,
"cumulative_lookups":3,
"cumulative_hitratio":0.33,
"item_manu:samsung
manu:apple":"SortedIntDocSet{size=2,ramUsed=40 bytes}",
"warmupTime":0,
"maxRamMB":-1,

5) A query with the same fq again (fq=manu:samsung OR manu:apple)the
numbers don't get update for this fq hereafter for subsequent searches

http://localhost:8984/solr/techproducts/select?q=popularity:[5%20TO%2012]=manu:samsung%20OR%20manu:apple

"solr.core.techproducts":{
  "CACHE.searcher.filterCache":{
"lookups":3,
"idleEvictions":0,
"evictions":0,
"cumulative_inserts":2,
"ramBytesUsed":6076,
"cumulative_hits":1,
"cumulative_idleEvictions":0,
"hits":1,
"cumulative_evictions":0,
"cleanupThread":false,
"size":2,
"item_manu:samsung":"SortedIntDocSet{size=1,ramUsed=36 bytes}",
"hitratio":0.33,
"cumulative_lookups":3,
"cumulative_hitratio":0.33,
   

Solr filter cache hits not reflecting

2020-04-20 Thread Rahul Goswami
Hello,

I was trying to analyze the filter cache performance and noticed a strange
thing. Upon searching with fq, the entry gets added to the cache the first
time. Observing from the "Stats/Plugins" tab on Solr admin UI, the 'lookup'
and 'inserts' count gets incremented.
However, if I search with the same fq again, I expect the lookup and hits
count to increase, but it doesn't. This ultimately results in an incorrect
hitratio.
I tried this scenario on Solr 7.2.1, 7.7.2 and 8.5 and observe the same
behavior on all three versions.

Is this a bug or am I missing something here?

Thanks,
Rahul


Re: Zookeeper upgrade required with Solr upgrade?

2020-02-13 Thread Rahul Goswami
Thanks Erick. Also, thanks for that little pointer about the JIRA. Just to
make sure I also checked for the upgrade JIRAs for other two intermediate
Zookeeper versions 3.4.11 and 3.4.13 between Solr 7.2.1 and Solr 7.7.2 and
they didn't seem to contain any Solr code changes either.

On Thu, Feb 13, 2020 at 9:26 AM Erick Erickson 
wrote:

> That should be OK. There were no code changes necessary for that upgrade.
> see SOLR-13363
>
> > On Feb 12, 2020, at 5:34 PM, Rahul Goswami 
> wrote:
> >
> > Hello,
> > We are running a SolrCloud (7.2.1) cluster and upgrading to Solr 7.7.2.
> We
> > run a separate multi node zookeeper ensemble which currently runs
> > Zookeeper 3.4.10.
> > Is it also required to upgrade Zookeeper (to  3.4.14 as per change.txt
> for
> > Solr 7.7.2) along with Solr ?
> >
> > I tried a few basic updates requests for a 2 node SolrCloud cluster with
> > the older (3.4.10) zookeeper and it seemed to work fine. But just want to
> > know if there are any caveats I should be aware of.
> >
> > Thanks,
> > Rahul
>
>


Zookeeper upgrade required with Solr upgrade?

2020-02-12 Thread Rahul Goswami
Hello,
We are running a SolrCloud (7.2.1) cluster and upgrading to Solr 7.7.2. We
run a separate multi node zookeeper ensemble which currently runs
Zookeeper 3.4.10.
Is it also required to upgrade Zookeeper (to  3.4.14 as per change.txt for
Solr 7.7.2) along with Solr ?

I tried a few basic updates requests for a 2 node SolrCloud cluster with
the older (3.4.10) zookeeper and it seemed to work fine. But just want to
know if there are any caveats I should be aware of.

Thanks,
Rahul


Performance comparison for wildcard searches

2020-02-03 Thread Rahul Goswami
Hello,

I am working with Solr 7.2.1 and had a question regarding the performance
of wildcard searches.

q=*:*
vs
q=id:*
vs
q=id:[* TO *]

Can someone please rank them in the order of performance with the
underlying reason?

Thanks,
Rahul


Re: How expensive is core loading?

2020-01-29 Thread Rahul Goswami
Hi Shawn,
Thanks for the inputs. I realize I could have been clearer. By "expensive",
I mean expensive in terms of memory utilization. E.g., let's say I have a
core with an index size of 10 GB which is not loaded on startup as per
configuration. If I load it in order to know the total documents and the
index size (to gather stats about the Solr server), is the amount of memory
consumed proportional to the index size in some way?

Thanks,
Rahul

On Wed, Jan 29, 2020 at 6:43 PM Shawn Heisey  wrote:

> On 1/29/2020 3:01 PM, Rahul Goswami wrote:
> > 1) How expensive is core loading if I am only getting stats like the
> total
> > docs and size of the index (no expensive queries)?
> > 2) Does the memory consumption on core loading depend on the index size ?
> > 3) What is a reasonable value for transient cache size in a production
> > setup with above configuration?
>
> What I would do is issue a RELOAD command.  For non-cloud deployments,
> I'd use the CoreAdmin API.  For cloud deployments, I'd use the
> Collections API.  To discover the answer, see how long it takes for the
> response to come back.
>
> The time interval for a RELOAD is likely different than when Solr starts
> ... but it sounds like you're more interested in the numbers for core
> loading after Solr starts than the ones during startup.
>
> Thanks,
> Shawn
>


Re: How expensive is core loading?

2020-01-29 Thread Rahul Goswami
Thanks for your response, Walter. But I could not find a Java API for Luke
for writing my tool. Is there one? I also tried using the LukeRequestHandler
that comes with Solr, but invoking it causes the Solr core to be loaded.
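
A lower-level alternative would be to open the index directory directly with
Lucene, bypassing Solr entirely. A rough sketch (the index path is a
placeholder, and it assumes nothing is rewriting the index while it runs):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.FSDirectory;

    public class IndexStats {
      public static void main(String[] args) throws Exception {
        // Point this at a core's data/index directory
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/core/data/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
          long sizeOnDisk = 0;
          for (String file : dir.listAll()) {
            sizeOnDisk += dir.fileLength(file);
          }
          System.out.println("maxDoc=" + reader.maxDoc()
              + " numDocs=" + reader.numDocs()
              + " deletedDocs=" + reader.numDeletedDocs()
              + " sizeOnDiskBytes=" + sizeOnDisk);
        }
      }
    }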

Rahul

On Wed, Jan 29, 2020 at 5:20 PM Walter Underwood 
wrote:

> You might use Luke to get that info from the index files without loading
> them
> into Solr.
>
> https://code.google.com/archive/p/luke/
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jan 29, 2020, at 2:01 PM, Rahul Goswami 
> wrote:
> >
> > Hello,
> > I am using Solr 7.2.1 on a Solr node running in standalone mode (-Xmx 8
> > GB). I wish to implement a service to monitor the server stats (like
> number
> > of docs per core, index size etc) .This would require me to load the core
> > and my concern is that for a node hosting 100+ cores, this could be
> > expensive. So here are my questions:
> >
> > 1) How expensive is core loading if I am only getting stats like the
> total
> > docs and size of the index (no expensive queries)?
> > 2) Does the memory consumption on core loading depend on the index size ?
> > 3) What is a reasonable value for transient cache size in a production
> > setup with above configuration?
> >
> > Thanks,
> > Rahul
>
>


How expensive is core loading?

2020-01-29 Thread Rahul Goswami
Hello,
I am using Solr 7.2.1 on a Solr node running in standalone mode (-Xmx 8
GB). I wish to implement a service to monitor the server stats (like number
of docs per core, index size, etc.). This would require me to load the core
and my concern is that for a node hosting 100+ cores, this could be
expensive. So here are my questions:

1) How expensive is core loading if I am only getting stats like the total
docs and size of the index (no expensive queries)?
2) Does the memory consumption on core loading depend on the index size ?
3) What is a reasonable value for transient cache size in a production
setup with above configuration?

Thanks,
Rahul


Solr indexing performance

2019-12-05 Thread Rahul Goswami
Hello,

We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
parallel threads with 5000 docs per batch. This is a test setup and all
documents are indexed on the same node. We start seeing connection timeout
issues some time into indexing. I am yet to analyze GC pauses
and other possibilities, but as a guideline I just wanted to know what
indexing rate might be "too high" for Solr, so as to consider throttling?
The documents are mostly metadata with about 25 odd fields, so not very
heavy.
Would be nice to know a baseline performance expectation for better
application design considerations.
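
For context, each of the parallel threads runs a loop conceptually like the
sketch below (field names and values are only illustrative, not our actual
code); the question is how aggressive such a loop can be before it needs to
be throttled:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
      public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/mycollection").build()) {
          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "metadata doc " + i);
            batch.add(doc);
            if (batch.size() == 5000) {   // one update request per 5000 docs
              client.add(batch);
              batch.clear();
              // an optional Thread.sleep(...) here would be one way to throttle
            }
          }
          if (!batch.isEmpty()) {
            client.add(batch);
          }
          client.commit();                // production would rely on autoCommit instead
        }
      }
    }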

Thanks,
Rahul


Re: [ANNOUNCE] Apache Solr 8.3.1 released

2019-12-04 Thread Rahul Goswami
Thanks Ishan. I was just going through the list of fixes in 8.3.1
(published in changes.txt) and couldn't see the below JIRA.

SOLR-13971 <http://issues.apache.org/jira/browse/SOLR-13971>: Velocity
response writer's resource loading now possible only through startup
parameters.

Is it linked appropriately? Or is it some access rights issue for non-PMC
members like me ?

Thanks,
Rahul


On Wed, Dec 4, 2019 at 7:12 AM Noble Paul  wrote:

> Thanks ishan
>
> On Wed, Dec 4, 2019, 3:32 PM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com>
> wrote:
>
> > ## 3 December 2019, Apache Solr™ 8.3.1 available
> >
> > The Lucene PMC is pleased to announce the release of Apache Solr 8.3.1.
> >
> > Solr is the popular, blazing fast, open source NoSQL search platform
> > from the Apache Lucene project. Its major features include powerful
> > full-text search, hit highlighting, faceted search, dynamic
> > clustering, database integration, rich document handling, and
> > geospatial search. Solr is highly scalable, providing fault tolerant
> > distributed search and indexing, and powers the search and navigation
> > features of many of the world's largest internet sites.
> >
> > Solr 8.3.1 is available for immediate download at:
> >
> >   <https://lucene.apache.org/solr/downloads.html>
> >
> > ### Solr 8.3.1 Release Highlights:
> >
> >   * JavaBinCodec has concurrent modification of CharArr resulting in
> > corrupt internode updates
> >   * findRequestType in AuditEvent is more robust
> >   * CoreContainer.auditloggerPlugin is volatile now
> >   * Velocity response writer's resource loading now possible only
> > through startup parameters
> >
> >
> > Please read CHANGES.txt for a full list of changes:
> >
> >   <https://lucene.apache.org/solr/8_3_1/changes/Changes.html>
> >
> > Solr 8.3.1 also includes improvements and bugfixes in the corresponding Apache
> > Lucene release:
> >
> >   <https://lucene.apache.org/core/8_3_1/changes/Changes.html>
> >
> > Note: The Apache Software Foundation uses an extensive mirroring network
> > for
> > distributing releases. It is possible that the mirror you are using may
> > not have
> > replicated the release yet. If that is the case, please try another
> mirror.
> > This also applies to Maven access.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >
>


Re: Solr 8.2 indexing issues

2019-11-21 Thread Rahul Goswami
Hi Sujatha,

How did you upgrade your cluster ? Did you restart each node in the cluster
one by one after upgrade (while other nodes were running on 6.6.2) or did
you bring down the entire cluster and bring up one upgraded node at a time?

Thanks,
Rahul

On Thu, Nov 14, 2019 at 7:03 AM Paras Lehana 
wrote:

> Hi Sujatha,
>
> Apologies that I am not addressing your bug directly but have you tried 8.3
> <https://lucene.apache.org/solr/downloads.html> that has just been
> released?
>
> On Wed, 13 Nov 2019 at 02:12, Sujatha Arun  wrote:
>
> > We recently migrated from 6.6.2 to 8.2. We are seeing issues with
> indexing
> > where the leader and the replica document counts do not match. We get
> > different results every time we do a *:* search.
> >
> > The only issue we see in the logs is Jira issue : Solr-13293
> >
> > Has anybody seen similar issues?
> >
> > Thanks
> >
>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> --
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.
>


Re: Upgrade solr from 7.2.1 to 8.2

2019-11-19 Thread Rahul Goswami
Hello,

Just wanted to follow up in case my question fell through the cracks :)
Would appreciate help on this.

Thanks,
Rahul

On Fri, Nov 15, 2019 at 5:32 PM Rahul Goswami  wrote:

> Hello,
>
> We are planning to upgrade our SolrCloud cluster from 7.2.1 (hosted on
> Windows server) to 8.2.
> I read the documentation
> <https://lucene.apache.org/solr/guide/8_2/major-changes-in-solr-8.html#upgrade-prerequisites>
> which mentions that I need to be on Solr 7.3 and higher to be able to
> upgrade to 8.x. I want to know if this is a hard requirement or a
> requirement for rolling upgrades (?).
> Let's say I am fine with bringing the whole cluster down and upgrade all
> the nodes to Solr 8.2, and then bring up one node at a time. Will it be ok
> to upgrade directly from 7.2.1 to 8.2 in that case?
>
> Thanks in advance!
>
> Regards,
> Rahul
>


Upgrade solr from 7.2.1 to 8.2

2019-11-15 Thread Rahul Goswami
Hello,

We are planning to upgrade our SolrCloud cluster from 7.2.1 (hosted on
Windows server) to 8.2.
I read the documentation
<https://lucene.apache.org/solr/guide/8_2/major-changes-in-solr-8.html#upgrade-prerequisites>
which mentions that I need to be on Solr 7.3 and higher to be able to
upgrade to 8.x. I want to know if this is a hard requirement or a
requirement for rolling upgrades (?).
Let's say I am fine with bringing the whole cluster down and upgrade all
the nodes to Solr 8.2, and then bring up one node at a time. Will it be ok
to upgrade directly from 7.2.1 to 8.2 in that case?

Thanks in advance!

Regards,
Rahul


Re: Custom update processor not kicking in

2019-09-19 Thread Rahul Goswami
Eric,
The 200 million docs are all large as they are content indexed. Also it
would be hard to convince the customer to rebuild their index. But more
than that, I also want to clear my understanding on this topic and know if
it’s an expected behaviour for a distributed update processor to not call
any further custom processors other than the run update processor in
standalone mode? Alternatively, is there a way I can get a handle on a
complete document once it’s reconstructed from an atomic update?

Thanks,
Rahul

On Thu, Sep 19, 2019 at 7:06 AM Erick Erickson 
wrote:

> _Why_ is reindexing not an option? 200M doc isn't that many.
> Since you have Atomic updates working, you could easily
> write a little program that pulled the docs from you existing
> collection and pushed them to a new one with the new schema.
>
> Do use CursorMark if you try that You have to be ready to
> reindex as time passes, either to upgrade to a major version
> 2 greater than what you're using now or because the requirements
> change yet again.
>
> Best,
> Erick
>
> On Thu, Sep 19, 2019 at 12:36 AM Rahul Goswami 
> wrote:
> >
> > Eric, Markus,
> > Thank you for your inputs. I made sure that the jar file is found
> correctly
> > since the core reloads fine and also prints the log lines from my
> processor
> > during an update request (getInstance() method of the update factory). The
> > reason why I want to insert the processor between distributed update
> > processor (DUP) and run update processor (RUP) is because there are
> certain
> > fields which were indexed against a dynamic field “*” and later the
> schema
> > was patched to remove the * field, causing atomic updates to fail for
> such
> > documents. Reindexing is not an option since the index has nearly 200
> million
> > docs. My understanding is that the atomic updates are stitched back to a
> > complete document in the DUP before being reindexed by RUP. Hence if I am
> > able to access the document before being indexed and check for fields
> which
> > are not defined in the schema, I can remove them from the stitched back
> > document so that the atomic update can happen successfully for such docs.
> > The documentation below mentions that even if I don’t include the DUP in
> my
> > chain it is automatically inserted just before RUP.
> >
> >
> https://lucene.apache.org/solr/guide/7_2/update-request-processors.html#custom-update-request-processor-chain
> >
> >
> > I tried both approaches viz. explicitly specifying my processor after DUP
> > in the chain and also tried using the “post-processor” option in the
> chain,
> > to have the custom processor execute after DUP. Still looks like the
> > processor is just short circuited. I have defined my logic in the
> > processAdd() of the  processor. Is this an expected behavior?
> >
> > Regards,
> > Rahul
> >
> >
> > On Wed, Sep 18, 2019 at 5:28 PM Erick Erickson 
> > wrote:
> >
> > > It Depends (tm). This is a little confused. Why do you have
> > > distributed processor in stand-alone Solr? Stand-alone doesn't, well,
> > > distribute updates so that seems odd. Do try switching it around and
> > > putting it on top, this should be OK since distributed is irrelevant.
> > >
> > > You can also just set a breakpoint and see for instance, the
> > > instructions in the "IntelliJ" section here:
> > > https://cwiki.apache.org/confluence/display/solr/HowToContribute
> > >
> > > One thing I'd do is make very, very sure that my jar file was being
> > > found. IIRC, the -v startup option will log exactly where solr looks
> > > for jar files. Be sure your custom jar is in one of them and is picked
> > > up. I've set a lib directive to one place only to discover that
> > > there's an old copy lying around someplace else
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Sep 18, 2019 at 5:08 PM Markus Jelsma
> > >  wrote:
> > > >
> > > > Hello Rahul,
> > > >
> > > > I don't know why you don't see your logs lines, but if i remember
> > > correctly, you must put all custom processors above Log, Distributed
> and
> > > Run, at least i remember i read it somewhere a long time ago.
> > > >
> > > > We put all our custom processors on top of the three default
> processors
> > > and they run just fine.
> > > >
> > > > Try it.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > 

Re: Custom update processor not kicking in

2019-09-18 Thread Rahul Goswami
Eric, Markus,
Thank you for your inputs. I made sure that the jar file is found correctly
since the core reloads fine and also prints the log lines from my processor
during an update request (getInstance() method of the update factory). The
reason why I want to insert the processor between distributed update
processor (DUP) and run update processor (RUP) is because there are certain
fields which were indexed against a dynamic field “*” and later the schema
was patched to remove the * field, causing atomic updates to fail for such
documents. Reindexing is not an option since the index has nearly 200 million
docs. My understanding is that the atomic updates are stitched back to a
complete document in the DUP before being reindexed by RUP. Hence if I am
able to access the document before being indexed and check for fields which
are not defined in the schema, I can remove them from the stitched back
document so that the atomic update can happen successfully for such docs.
The documentation below mentions that even if I don’t include the DUP in my
chain it is automatically inserted just before RUP.

https://lucene.apache.org/solr/guide/7_2/update-request-processors.html#custom-update-request-processor-chain


I tried both approaches viz. explicitly specifying my processor after DUP
in the chain and also tried using the “post-processor” option in the chain,
to have the custom processor execute after DUP. Still looks like the
processor is just short-circuited. I have defined my logic in the
processAdd() of the processor. Is this the expected behavior?
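
For reference, the processor is conceptually along the lines of the sketch
below (simplified, with made-up class names; not the exact code). It drops any
incoming field that the schema no longer defines before passing the document on:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.schema.IndexSchema;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class RemoveUndefinedFieldsProcessorFactory extends UpdateRequestProcessorFactory {

      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
        final IndexSchema schema = req.getSchema();
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            List<String> undefined = new ArrayList<>();
            for (String field : doc.getFieldNames()) {
              // getFieldOrNull() returns null when neither an explicit field
              // nor a dynamic field pattern matches the name
              if (schema.getFieldOrNull(field) == null) {
                undefined.add(field);
              }
            }
            for (String field : undefined) {
              doc.removeField(field);
            }
            super.processAdd(cmd);
          }
        };
      }
    }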

Regards,
Rahul


On Wed, Sep 18, 2019 at 5:28 PM Erick Erickson 
wrote:

> It Depends (tm). This is a little confused. Why do you have
> distributed processor in stand-alone Solr? Stand-alone doesn't, well,
> distribute updates so that seems odd. Do try switching it around and
> putting it on top, this should be OK since distributed is irrelevant.
>
> You can also just set a breakpoint and see for instance, the
> instructions in the "IntelliJ" section here:
> https://cwiki.apache.org/confluence/display/solr/HowToContribute
>
> One thing I'd do is make very, very sure that my jar file was being
> found. IIRC, the -v startup option will log exactly where solr looks
> for jar files. Be sure your custom jar is in one of them and is picked
> up. I've set a lib directive to one place only to discover that
> there's an old copy lying around someplace else
>
> Best,
> Erick
>
> On Wed, Sep 18, 2019 at 5:08 PM Markus Jelsma
>  wrote:
> >
> > Hello Rahul,
> >
> > I don't know why you don't see your logs lines, but if i remember
> correctly, you must put all custom processors above Log, Distributed and
> Run, at least i remember i read it somewhere a long time ago.
> >
> > We put all our custom processors on top of the three default processors
> and they run just fine.
> >
> > Try it.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> > > From:Rahul Goswami 
> > > Sent: Wednesday 18th September 2019 22:20
> > > To: solr-user@lucene.apache.org
> > > Subject: Custom update processor not kicking in
> > >
> > > Hello,
> > >
> > > I am using solr 7.2.1 in a standalone mode. I created a custom update
> > > request processor and placed it between the distributed processor and
> run
> > > update processor in my chain. I made sure the chain is invoked since I
> see
> > > log lines from the getInstance() method of my processor factory. But I
> > > don’t see any log lines from the processAdd() method.
> > >
> > > Any inputs on why the processor is getting skipped if placed after
> > > distributed processor?
> > >
> > > Thanks,
> > > Rahul
> > >
>


Custom update processor not kicking in

2019-09-18 Thread Rahul Goswami
Hello,

I am using solr 7.2.1 in a standalone mode. I created a custom update
request processor and placed it between the distributed processor and run
update processor in my chain. I made sure the chain is invoked since I see
log lines from the getInstance() method of my processor factory. But I
don’t see any log lines from the processAdd() method.

Any inputs on why the processor is getting skipped if placed after
distributed processor?

Thanks,
Rahul


java.lang.OutOfMemoryError: Java heap space

2019-07-24 Thread Mandava, Rahul
I am using SOLR version 6.6.0 and the heap size is set to 512 MB, which I believe 
is the default. We have almost 10 million documents in the index, we perform 
frequent updates (we are doing a soft commit on every update; the heap issue was 
seen with and without soft commits), and we obviously search heavily. We have 
experienced a heap space out-of-memory exception twice so far in the year since 
we started using SOLR. Since we are just using the default heap size, I am 
thinking of increasing it, and I do know that a large heap can slow down 
performance due to GC pauses. As we can't really come up with an ideal number 
that works for every scenario, I want to increase it to just 1 GB.

I did some reading around this, and learned that a lot of parameters can 
contribute to this issue and that there is no perfect way to address it. I also 
read that increasing the heap size above 2 GB is where we would definitely be in 
the danger zone. Since I am only increasing it to 1 GB, and will monitor the 
consumption on a daily basis for a while, I should be able to resolve the heap 
memory issue. Is that a safe assumption?

Has anyone experienced a similar issue? Any thoughts or suggestions?


Below are the heap usages, if it helps. Usage was almost 490 MB, which makes me 
feel that with the load we have, 512 MB is not enough, and we should be good if I 
increase it to 1 GB.


[heap usage screenshots attached to the original message]



Thanks


Re: SolrCloud indexing triggers merges and timeouts

2019-07-12 Thread Rahul Goswami
Upon further investigation on this issue, I see the below log lines during
the indexing process:

2019-06-06 22:24:56.203 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87
x:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623_shard22_replica_n84]
org.apache.solr.update.LoggingInfoStream [FP][qtp1169794610-5652]: trigger
flush: activeBytes=352402600 deleteBytes=279 vs limit=104857600
2019-06-06 22:24:56.203 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87
x:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623_shard22_replica_n84]
org.apache.solr.update.LoggingInfoStream [FP][qtp1169794610-5652]: thread
state has 352402600 bytes; docInRAM=1
2019-06-06 22:24:56.204 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87
x:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623_shard22_replica_n84]
org.apache.solr.update.LoggingInfoStream [FP][qtp1169794610-5652]: 1 in-use
non-flushing threads states
2019-06-06 22:24:56.204 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87

I have the below questions:
1) The log line which says "thread state has 352402600 bytes; docInRAM=1",
does it mean that the buffer was flushed to disk with only one huge
document?
2) If yes, does this flush create a segment with just one document?
3) Heap dump analysis shows large (>350 MB) instances of
DocumentsWriterPerThread. Does one instance of this class correspond to one
document?


Help is much appreciated.

Thanks,
Rahul


On Fri, Jul 5, 2019 at 2:11 AM Rahul Goswami  wrote:

> Shawn,Erick,
> Thank you for the explanation. The merge scheduler params make sense now.
>
> Thanks,
> Rahul
>
> On Wed, Jul 3, 2019 at 11:30 AM Erick Erickson 
> wrote:
>
>> Two more tidbits to add to Shawn’s explanation:
>>
>> There are heuristics built in to ConcurrentMergeScheduler.
>> From the Javadocs:
>> * If it's an SSD,
>> *  {@code maxThreadCount} is set to {@code max(1, min(4,
>> cpuCoreCount/2))},
>> *  otherwise 1.  Note that detection only currently works on
>> *  Linux; other platforms will assume the index is not on an SSD.
>>
>> Second, TieredMergePolicy (the default) merges in “tiers” that
>> are of similar size. So you can have multiple merges going on
>> at the same time on disjoint sets of segments.
>>
>> Best,
>> Erick
>>
>> > On Jul 3, 2019, at 7:54 AM, Shawn Heisey  wrote:
>> >
>> > On 7/2/2019 10:53 PM, Rahul Goswami wrote:
>> >> Hi Shawn,
>> >> Thank you for the detailed suggestions. Although, I would like to
>> >> understand the maxMergeCount and maxThreadCount params better. The
>> >> documentation
>> >> <
>> https://lucene.apache.org/solr/guide/7_3/indexconfig-in-solrconfig.html#mergescheduler
>> >
>> >> mentions
>> >> that
>> >> maxMergeCount : The maximum number of simultaneous merges that are
>> allowed.
>> >> maxThreadCount : The maximum number of simultaneous merge threads that
>> >> should be running at once
>> >> Since one thread can only do 1 merge at any given point of time, how
>> does
>> >> maxMergeCount being greater than maxThreadCount help anyway? I am
>> having
>> >> difficulty wrapping my head around this, and would appreciate if you
>> could
>> >> help clear it for me.
>> >
>> > The maxMergeCount setting controls the number of merges that can be
>> *scheduled* at the same time.  As soon as that number of merges is reached,
>> the indexing thread(s) will be paused until the number of merges in the
>> schedule drops below this number.  This ensures that no more merges will be
>> scheduled.
>> >
>> > By setting maxMergeCount higher than the number of merges that are
>> expected in the schedule, you can ensure that indexing will never be
>> paused.  It would require very atypical merge policy settings for the
>> number of scheduled merges to ever reach six.  On my own indexing, I
>> reached three scheduled merges quite frequently.  The default setting for
>> maxMergeCount is three.
>> >
>> > The maxThreadCount setting controls how many of the scheduled merges
>> will be simultaneously executed. With index data on standard spinning
>> disks, you do not want to increase this number beyond 1, or you will have a
>> performance problem due to thrashing disk heads.  If your data is on SSD,
>> you can make it larger than 1.
>> >
>> > Thanks,
>> > Shawn
>>
>>


Re: SolrCloud indexing triggers merges and timeouts

2019-07-05 Thread Rahul Goswami
Shawn, Erick,
Thank you for the explanation. The merge scheduler params make sense now.

Thanks,
Rahul

On Wed, Jul 3, 2019 at 11:30 AM Erick Erickson 
wrote:

> Two more tidbits to add to Shawn’s explanation:
>
> There are heuristics built in to ConcurrentMergeScheduler.
> From the Javadocs:
> * If it's an SSD,
> *  {@code maxThreadCount} is set to {@code max(1, min(4, cpuCoreCount/2))},
> *  otherwise 1.  Note that detection only currently works on
> *  Linux; other platforms will assume the index is not on an SSD.
>
> Second, TieredMergePolicy (the default) merges in “tiers” that
> are of similar size. So you can have multiple merges going on
> at the same time on disjoint sets of segments.
>
> Best,
> Erick
>
> > On Jul 3, 2019, at 7:54 AM, Shawn Heisey  wrote:
> >
> > On 7/2/2019 10:53 PM, Rahul Goswami wrote:
> >> Hi Shawn,
> >> Thank you for the detailed suggestions. Although, I would like to
> >> understand the maxMergeCount and maxThreadCount params better. The
> >> documentation
> >> <
> https://lucene.apache.org/solr/guide/7_3/indexconfig-in-solrconfig.html#mergescheduler
> >
> >> mentions
> >> that
> >> maxMergeCount : The maximum number of simultaneous merges that are
> allowed.
> >> maxThreadCount : The maximum number of simultaneous merge threads that
> >> should be running at once
> >> Since one thread can only do 1 merge at any given point of time, how
> does
> >> maxMergeCount being greater than maxThreadCount help anyway? I am having
> >> difficulty wrapping my head around this, and would appreciate if you
> could
> >> help clear it for me.
> >
> > The maxMergeCount setting controls the number of merges that can be
> *scheduled* at the same time.  As soon as that number of merges is reached,
> the indexing thread(s) will be paused until the number of merges in the
> schedule drops below this number.  This ensures that no more merges will be
> scheduled.
> >
> > By setting maxMergeCount higher than the number of merges that are
> expected in the schedule, you can ensure that indexing will never be
> paused.  It would require very atypical merge policy settings for the
> number of scheduled merges to ever reach six.  On my own indexing, I
> reached three scheduled merges quite frequently.  The default setting for
> maxMergeCount is three.
> >
> > The maxThreadCount setting controls how many of the scheduled merges
> will be simultaneously executed. With index data on standard spinning
> disks, you do not want to increase this number beyond 1, or you will have a
> performance problem due to thrashing disk heads.  If your data is on SSD,
> you can make it larger than 1.
> >
> > Thanks,
> > Shawn
>
>


Re: SolrCloud indexing triggers merges and timeouts

2019-07-02 Thread Rahul Goswami
Hi Shawn,

Thank you for the detailed suggestions. Although, I would like to
understand the maxMergeCount and maxThreadCount params better. The
documentation
<https://lucene.apache.org/solr/guide/7_3/indexconfig-in-solrconfig.html#mergescheduler>
mentions
that

maxMergeCount : The maximum number of simultaneous merges that are allowed.
maxThreadCount : The maximum number of simultaneous merge threads that
should be running at once

Since one thread can only do 1 merge at any given point of time, how does
maxMergeCount being greater than maxThreadCount help anyway? I am having
difficulty wrapping my head around this, and would appreciate it if you could
help clear it for me.
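
For reference, these two settings map onto Lucene's ConcurrentMergeScheduler.
A rough sketch of the underlying API (the values are just examples, and this is
Lucene-level code rather than anything one would normally write for Solr):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriterConfig;

    public class MergeSchedulerExample {
      public static void main(String[] args) {
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        // Up to 6 merges may be registered (pending) at once, but only 1 merge
        // thread actually runs; the surplus merges simply wait their turn
        // instead of forcing the indexing threads to pause.
        cms.setMaxMergesAndThreads(6, 1);

        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergeScheduler(cms);
        // iwc would then be passed to an IndexWriter constructor
      }
    }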

Thanks,
Rahul

On Thu, Jun 13, 2019 at 7:33 AM Shawn Heisey  wrote:

> On 6/6/2019 9:00 AM, Rahul Goswami wrote:
> > *OP Reply* : Total 48 GB per node... I couldn't see another software
> using
> > a lot of memory.
> > I am honestly not sure about the reason for change of directory factory
> to
> > SimpleFSDirectoryFactory. But I was told that with mmap at one point we
> > started to see the shared memory usage on Windows go up significantly,
> > intermittently freezing the system.
> > Could the choice of DirectoryFactory here be a factor for the long
> > updates/frequent merges?
>
> With about 24GB of RAM to cache 1.4TB of index data, you're never going
> to have good performance.  Any query you do is probably going to read
> more than 24GB of data from the index, which means that it cannot come
> from memory, some of it must come from disk, which is incredibly slow
> compared to memory.
>
> MMap is more efficient than "simple" filesystem access.  I do not know
> if you would see markedly better performance, but getting rid of the
> DirectoryFactory config and letting Solr choose its default might help.
>
> > How many total documents (maxDoc, not numDoc) are in that 1.4 TB of
> > space?
> > *OP Reply:* Also, there are nearly 12.8 million total docs (maxDoc, NOT
> > numDoc) in that 1.4 TB space
>
> Unless you're doing faceting or grouping on fields with extremely high
> cardinality, which I find to be rarely useful except for data mining,
> 24GB of heap for 12.8 million docs seems very excessive.  I was
> expecting this number to be something like 500 million or more ... that
> small document count must mean each document is HUGE.  Can you take
> steps to reduce the index size, perhaps by setting stored, indexed,
> and/or docValues to "false" on some of your fields, and having your
> application go to the system of record for full details on each
> document?  You will have to reindex after making changes like that.
>
> >> Can you share the GC log that Solr writes?
> > *OP Reply:*  Please find the GC logs and thread dumps at this location
> > https://drive.google.com/open?id=1slsYkAcsH7OH-7Pma91k6t5T72-tIPlw
>
> The larger GC log was unrecognized by both gcviwer and gceasy.io ... the
> smaller log shows heap usage about 10GB, but it only covers 10 minutes,
> so it's not really conclusive for diagnosis.  The first thing I can
> suggest to try is to reduce the heap size to 12GB ... but I do not know
> if that's actually going to work.  Indexing might require more memory.
> The idea here is to make more memory available to the OS disk cache ...
> with your index size, you're probably going to need to add memory to the
> system (not the heap).
>
> > Another observation is that the CPU usage reaches around 70% (through
> > manual monitoring) when the indexing starts and the merges are observed.
> It
> > is well below 50% otherwise.
>
> Indexing will increase load, and that increase is often very
> significant.  Adding memory to the system is your best bet for better
> performance.  I'd want 1TB of memory for a 1.4TB index ... but I know
> that memory sizes that high are extremely expensive, and for most
> servers, not even possible.  512GB or 256GB is more attainable, and
> would have better performance than 48GB.
>
> > Also, should something be altered with the mergeScheduler setting ?
> > "mergeScheduler":{
> >  "class":"org.apache.lucene.index.ConcurrentMergeScheduler",
> >  "maxMergeCount":2,
> >  "maxThreadCount":2},
>
> Do not configure maxThreadCount beyond 1 unless your data is on SSD.  It
> will slow things down a lot due to the fact that standard disks must
> move the disk head to read/write from different locations, and head
> moves take time.  SSD can do I/O from any location without pauses, so
> more threads would probably help performance rather than hurt it.
>
> Increase maxMergeCount to 6 -- at 2, large merges will probably stop
> indexing entirely.  With a larger number, Solr can keep indexing even
> when there's a huge segment merge happening.
>
> Thanks,
> Shawn
>


Re: Configuration recommendation for SolrCloud

2019-07-01 Thread Rahul Goswami
Hi Toke,

Thank you for following up. Reading back, I surely could have explained
better. Thanks for asking again.

>> What is a cluster? Is it a fully separate SolrCloud?
Yes, by cluster I mean a fully separate SolrCloud.


>> If so, does that mean you can divide your collection into (at least) 4
independent parts, where the indexing flow and the clients knows which
cluster to use?
So we can divide the documents across 4 SolrClouds each with multiple
nodes. The clients would know which SolrCloud to index to. So the answer to
your question is yes.


>>  Can it be divided further?
For the sake of maintainability and ease of configuration, we wouldn't want
to go beyond 4 SolrClouds. So at this point I would say no. But open to
ideas if you think it would be greatly advantageous.


So if we go with the 3rd configuration option we would be roughly indexing
1 billion documents (with an analyzed 'content' field possibly containing
large text) per SolrCloud.

Also I later got to know additional configurations and updated hardware
specs, so let me revise that. We would index with a replication factor of
2. Hence each SolrCloud would have 4x2=8 nodes and 1 billion x 2 =2 billion
documents indexed (with an analyzed 'content' field possibly containing
large text). We would have up to 12 GB heap space allocated per node. By
node I mean an individual Solr instance running on a certain port. Hence to
break down the specs:

For each SolrCloud:

8 nodes, each with 12 GB heap for Solr. Each node hosting 16 replicas
(cores).
2 billion documents (replication factor=2. So 1 billion unique documents)

Would SolrCloud scale well with the given configuration for a
moderate-heavy indexing and search load ?

Additional consideration: We have 4 beefy physical servers at our disposal for
this deployment. If we go with 4 SolrClouds then we would have 4x8=32 nodes
(Solr instances) running across these 4 physical servers.

Any issues that you might see with this configuration or additional
considerations that I might be missing?

Thanks,
Rahul







On Sat, Jun 29, 2019 at 1:13 PM Toke Eskildsen  wrote:

> Rahul Goswami  wrote:
> > We are running Solr 7.2.1 and planning for a deployment which will grow
> to
> > 4 billion documents over time. We have 16 nodes at disposal.I am thinking
> > between 3 configurations:
> >
> > 1 cluster - 16 nodes
> > vs
> > 2 clusters - 8 nodes each
> > vs
> > 4 clusters -4 nodes each
>
> You haven't got any answers. Maybe because it is a bit unclear what you're
> asking. What is a cluster? Is it a fully separate SolrCloud? If so, does
> that mean you can divide your collection into (at least) 4 independent
> parts, where the indexing flow and the clients knows which cluster to use?
> Can it be divided further?
>
> - Toke Eskildsen
>


Configuration recommendation for SolrCloud

2019-06-25 Thread Rahul Goswami
Hello,
We are running Solr 7.2.1 and planning for a deployment which will grow to
4 billion documents over time. We have 16 nodes at disposal.I am thinking
between 3 configurations:

1 cluster - 16 nodes
vs
2 clusters - 8 nodes each
vs
4 clusters -4 nodes each

Irrespective of the configuration, each node would host 8 shards (eg: a
cluster with 16 nodes would have 16*8=128 shards; similarly, 32 shards in a
4 node cluster). These 16 nodes will be hosted across 4 beefy servers each
with 128 GB RAM. So we can allocate 32 GB RAM (not heap space) to each
node. what configuration would be most efficient for our use case
considering moderate-heavy indexing and search load? Would also like to
know the tradeoffs involved if any. Thanks in advance!

Regards,
Rahul


Re: SolrCloud: Configured socket timeouts not reflecting

2019-06-24 Thread Rahul Goswami
Hi Gus,

Have created a pull request for JIRA 12550
<https://issues.apache.org/jira/browse/SOLR-12550> and updated the affected
Solr version (7.2.1) in the comments. The provided fix is on branch_7_2. I
haven't tried reproducing the issue on the latest version, but see that the
code for this part is different on the master.

Regards,
Rahul

On Thu, Jun 20, 2019 at 8:22 PM Rahul Goswami  wrote:

> Hi Gus,
> Thanks for the response and referencing the umbrella JIRA for these kind
> of issues. I see that it won't solve the problem since the builder object
> which is used to instantiate a ConcurrentUpdateSolrClient itself doesn't
> contain the timeout values. I did create a local solr-core binary to try
> the patch nevertheless, but it didn't help as I anticipated. I'll update
> the JIRA and submit a patch.
>
> Thank you,
> Rahul
>
> On Thu, Jun 20, 2019 at 11:35 AM Gus Heck  wrote:
>
>> Hi Rahul,
>>
>> Did you try the patch int that issue? Also food for thought:
>> https://issues.apache.org/jira/browse/SOLR-13457
>>
>> -Gus
>>
>> On Tue, Jun 18, 2019 at 5:52 PM Rahul Goswami 
>> wrote:
>>
>> > Hello,
>> >
>> > I was looking into the code to try to get to the root of this issue.
>> Looks
>> > like this is an issue after all (as of 7.2.1 which is the version we are
>> > using), but wanted to confirm on the user list before creating a JIRA. I
>> > found that the soTimeout property of ConcurrentUpdateSolrClient class
>> (in
>> > the code referenced below) remains null and hence the default of 600000
>> ms
>> > is set as the timeout in HttpPost class instance variable "method".
>> >
>> >
>> https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L334
>> >
>> >
>> > When the call is finally made in the below line, the Httpclient does
>> > contain the configured timeout (as in solr.xml or
>> -DdistribUpdateSoTimeout)
>> > but gets overridden by the hard default of 600000 in the "method"
>> parameter
>> > of the execute call.
>> >
>> >
>> >
>> https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L348
>> >
>> >
>> > The hard default of 600000 is set here:
>> >
>> >
>> https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L333
>> >
>> >
>> > I tried to create a local patch with the below fix which works fine:
>> >
>> >
>> https://github.com/apache/lucene-solr/blob/86fe24cbef238d2042d68494bd94e2362a2d996e/solr/core/src/java/org/apache/solr/update/StreamingSolrClients.java#L69
>> >
>> >
>> >
>> > client = new ErrorReportingConcurrentUpdateSolrClient.Builder(url, req,
>> > errors)
>> >   .withHttpClient(httpClient)
>> >   .withQueueSize(100)
>> >   .withSocketTimeout(getSocketTimeout(req))
>> >   .withThreadCount(runnerCount)
>> >   .withExecutorService(updateExecutor)
>> >   .alwaysStreamDeletes()
>> >   .build();
>> >
>> > private int getSocketTimeout(SolrCmdDistributor.Req req) {
>> > if(req==null) {
>> >   return UpdateShardHandlerConfig.DEFAULT_DISTRIBUPDATESOTIMEOUT;
>> > }
>> >
>> > return
>> >
>> >
>> req.cmd.req.getCore().getCoreContainer().getConfig().getUpdateShardHandlerConfig().getDistributedSocketTimeout();
>> >   }
>> >
>> > I found this open JIRA on this issue:
>> >
>> >
>> >
>> https://issues.apache.org/jira/browse/SOLR-12550?jql=text%20~%20%22distribUpdateSoTimeout%22
>> >
>> >
>> > Should I update the JIRA with this ?
>> >
>> > Thanks,
>> > Rahul
>> >
>> >
>> >
>> >
>> > On Thu, Jun 13, 2019 at 12:00 AM Rahul Goswami 
>> > wrote:
>> >
>> > > Hello,
>> > >
>> > > I am running Solr 7.2.1 in cloud mode. To overcome a setup hardware
>> > > bottleneck, I tried to configure distribUpdateSoTimeout and
>> socketTimeout
>> > > to a value greater than the default 10 mins. I did this by passing
>> these
>> > as
>> > > system properties at Solr start up time (-DdistribUpdateSoTimeout and
>> > > -DsocketTimeout  ). The Solr admin UI shows these values in the
>> Dashboard
>> > > args section. As a test, I tried setting each of them to one hour
>> > > (3600000). However I start seeing socket read timeouts within a few
>> mins.
>> > > Looks like the values are not taking effect. What am I missing? If
>> this
>> > is
>> > > a known issue, is there a JIRA for it ?
>> > >
>> > > Thanks,
>> > > Rahul
>> > >
>> >
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>


Re: SolrCloud: Configured socket timeouts not reflecting

2019-06-20 Thread Rahul Goswami
Hi Gus,
Thanks for the response and referencing the umbrella JIRA for these kind of
issues. I see that it won't solve the problem since the builder object
which is used to instantiate a ConcurrentUpdateSolrClient itself doesn't
contain the timeout values. I did create a local solr-core binary to try
the patch nevertheless, but it didn't help as I anticipated. I'll update
the JIRA and submit a patch.

Thank you,
Rahul

On Thu, Jun 20, 2019 at 11:35 AM Gus Heck  wrote:

> Hi Rahul,
>
> Did you try the patch int that issue? Also food for thought:
> https://issues.apache.org/jira/browse/SOLR-13457
>
> -Gus
>
> On Tue, Jun 18, 2019 at 5:52 PM Rahul Goswami 
> wrote:
>
> > Hello,
> >
> > I was looking into the code to try to get to the root of this issue.
> Looks
> > like this is an issue after all (as of 7.2.1 which is the version we are
> > using), but wanted to confirm on the user list before creating a JIRA. I
> > found that the soTimeout property of ConcurrentUpdateSolrClient class (in
> > the code referenced below) remains null and hence the default of 600000
> ms
> > is set as the timeout in HttpPost class instance variable "method".
> >
> >
> https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L334
> >
> >
> > When the call is finally made in the below line, the Httpclient does
> > contain the configured timeout (as in solr.xml or
> -DdistribUpdateSoTimeout)
> > but gets overridden by the hard default of 600000 in the "method"
> parameter
> > of the execute call.
> >
> >
> >
> https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L348
> >
> >
> > The hard default of 600000 is set here:
> >
> >
> https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L333
> >
> >
> > I tried to create a local patch with the below fix which works fine:
> >
> >
> https://github.com/apache/lucene-solr/blob/86fe24cbef238d2042d68494bd94e2362a2d996e/solr/core/src/java/org/apache/solr/update/StreamingSolrClients.java#L69
> >
> >
> >
> > client = new ErrorReportingConcurrentUpdateSolrClient.Builder(url, req,
> > errors)
> >   .withHttpClient(httpClient)
> >   .withQueueSize(100)
> >   .withSocketTimeout(getSocketTimeout(req))
> >   .withThreadCount(runnerCount)
> >   .withExecutorService(updateExecutor)
> >   .alwaysStreamDeletes()
> >   .build();
> >
> > private int getSocketTimeout(SolrCmdDistributor.Req req) {
> > if(req==null) {
> >   return UpdateShardHandlerConfig.DEFAULT_DISTRIBUPDATESOTIMEOUT;
> > }
> >
> > return
> >
> >
> req.cmd.req.getCore().getCoreContainer().getConfig().getUpdateShardHandlerConfig().getDistributedSocketTimeout();
> >   }
> >
> > I found this open JIRA on this issue:
> >
> >
> >
> https://issues.apache.org/jira/browse/SOLR-12550?jql=text%20~%20%22distribUpdateSoTimeout%22
> >
> >
> > Should I update the JIRA with this ?
> >
> > Thanks,
> > Rahul
> >
> >
> >
> >
> > On Thu, Jun 13, 2019 at 12:00 AM Rahul Goswami 
> > wrote:
> >
> > > Hello,
> > >
> > > I am running Solr 7.2.1 in cloud mode. To overcome a setup hardware
> > > bottleneck, I tried to configure distribUpdateSoTimeout and
> socketTimeout
> > > to a value greater than the default 10 mins. I did this by passing
> these
> > as
> > > system properties at Solr start up time (-DdistribUpdateSoTimeout and
> > > -DsocketTimeout  ). The Solr admin UI shows these values in the
> Dashboard
> > > args section. As a test, I tried setting each of them to one hour
> > > (3600000). However I start seeing socket read timeouts within a few
> mins.
> > > Looks like the values are not taking effect. What am I missing? If this
> > is
> > > a known issue, is there a JIRA for it ?
> > >
> > > Thanks,
> > > Rahul
> > >
> >
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>


Re: SolrCloud: Configured socket timeouts not reflecting

2019-06-18 Thread Rahul Goswami
Hello,

I was looking into the code to try to get to the root of this issue. Looks
like this is an issue after all (as of 7.2.1 which is the version we are
using), but wanted to confirm on the user list before creating a JIRA. I
found that the soTimeout property of ConcurrentUpdateSolrClient class (in
the code referenced below) remains null and hence the default of 600000 ms
is set as the timeout in HttpPost class instance variable "method".
https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L334


When the call is finally made in the below line, the Httpclient does
contain the configured timeout (as in solr.xml or -DdistribUpdateSoTimeout)
but gets overridden by the hard default of 600000 in the "method" parameter
of the execute call.

https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L348


The hard default of 600000 is set here:
https://github.com/apache/lucene-solr/blob/e6f6f352cfc30517235822b3deed83df1ee144c6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java#L333


I tried to create a local patch with the below fix which works fine:
https://github.com/apache/lucene-solr/blob/86fe24cbef238d2042d68494bd94e2362a2d996e/solr/core/src/java/org/apache/solr/update/StreamingSolrClients.java#L69



// Proposed fix in StreamingSolrClients: pass the configured distributed
// socket timeout to the ConcurrentUpdateSolrClient builder instead of
// leaving it unset (which makes the client fall back to its hard-coded default).
client = new ErrorReportingConcurrentUpdateSolrClient.Builder(url, req, errors)
    .withHttpClient(httpClient)
    .withQueueSize(100)
    .withSocketTimeout(getSocketTimeout(req)) // new: propagate the configured timeout
    .withThreadCount(runnerCount)
    .withExecutorService(updateExecutor)
    .alwaysStreamDeletes()
    .build();

// Reads distribUpdateSoTimeout from the node config (solr.xml / system property),
// falling back to the default when no request context is available.
private int getSocketTimeout(SolrCmdDistributor.Req req) {
  if (req == null) {
    return UpdateShardHandlerConfig.DEFAULT_DISTRIBUPDATESOTIMEOUT;
  }
  return req.cmd.req.getCore().getCoreContainer().getConfig()
      .getUpdateShardHandlerConfig().getDistributedSocketTimeout();
}

I found this open JIRA on this issue:

https://issues.apache.org/jira/browse/SOLR-12550?jql=text%20~%20%22distribUpdateSoTimeout%22


Should I update the JIRA with this ?

Thanks,
Rahul




On Thu, Jun 13, 2019 at 12:00 AM Rahul Goswami 
wrote:

> Hello,
>
> I am running Solr 7.2.1 in cloud mode. To overcome a setup hardware
> bottleneck, I tried to configure distribUpdateSoTimeout and socketTimeout
> to a value greater than the default 10 mins. I did this by passing these as
> system properties at Solr start up time (-DdistribUpdateSoTimeout and
> -DsocketTimeout  ). The Solr admin UI shows these values in the Dashboard
> args section. As a test, I tried setting each of them to one hour
> (3600000). However I start seeing socket read timeouts within a few mins.
> Looks like the values are not taking effect. What am I missing? If this is
> a known issue, is there a JIRA for it ?
>
> Thanks,
> Rahul
>


SolrCloud: Configured socket timeouts not reflecting

2019-06-12 Thread Rahul Goswami
Hello,

I am running Solr 7.2.1 in cloud mode. To overcome a setup hardware
bottleneck, I tried to configure distribUpdateSoTimeout and socketTimeout
to a value greater than the default 10 mins. I did this by passing these as
system properties at Solr start up time (-DdistribUpdateSoTimeout and
-DsocketTimeout  ). The Solr admin UI shows these values in the Dashboard
args section. As a test, I tried setting each of them to one hour
(3600000). However I start seeing socket read timeouts within a few mins.
Looks like the values are not taking effect. What am I missing? If this is
a known issue, is there a JIRA for it ?

Thanks,
Rahul


Re: SolrCloud indexing triggers merges and timeouts

2019-06-12 Thread Rahul Goswami
Updating the thread with further findings:

So it turns out that the nodes hosting Solr are VMs with virtual disks.
Additionally, a Windows system process (the infamous PID 4) is hogging a
lot of disk. This is indicated by disk response times in excess of 100 ms
and a disk drive queue length of 5, which would be considered very high. The
indexing is running in two parallel threads, each sending a batch of 50 docs
per request. I would like to believe this is not too high (?). The docs are
not too heavy either, only containing metadata fields. So the disk IO seems
to be the bottleneck at this point, causing commits and merges to take more
time than they should. This in turn causes update routing to the leader replica
to take more than 10 mins, resulting in read timeouts and, eventually,
failed updates.
I could not find anything alarming in the GC logs I shared earlier.

Will update the thread with more findings as I have them and the attempted
solutions. At this point I am considering increasing the Jetty timeout and
increasing the distribUpdateConnTimeout to a higher value to let the
indexing proceed slowly but successfully. In the meantime, would greatly
appreciate any other ideas/measures.

Thanks,
Rahul


On Thu, Jun 6, 2019 at 11:00 AM Rahul Goswami  wrote:

> Thank you for your responses. Please find additional details about the
> setup below:
>
> We are using Solr 7.2.1
>
> > I have a solrcloud setup on Windows server with below config:
> > 3 nodes,
> > 24 shards with replication factor 2
> > Each node hosts 16 cores.
>
> 16 CPU cores, or 16 Solr cores?  The info may not be all that useful
> either way, but just in case, it should be clarified.
>
> *OP Reply:* 16 Solr cores (i.e. replicas)
>
> > Index size is 1.4 TB per node
> > Xms 8 GB , Xmx 24 GB
> > Directory factory used is SimpleFSDirectoryFactory
>
> How much total memory in the server?  Is there other software using
> significant levels of memory?
>
> *OP Reply* : Total 48 GB per node... I couldn't see another software
> using a lot of memory.
> I am honestly not sure about the reason for change of directory factory to
> SimpleFSDirectoryFactory. But I was told that with mmap at one point we
> started to see the shared memory usage on Windows go up significantly,
> intermittently freezing the system.
> Could the choice of DirectoryFactory here be a factor for the long
> updates/frequent merges?
>
> > How many total documents (maxDoc, not numDoc) are in that 1.4 TB of
> space?
> *OP Reply:* Also, there are nearly 12.8 million total docs (maxDoc, NOT
> numDoc) in that 1.4 TB space
>
> > Can you share the GC log that Solr writes?
> *OP Reply:*  Please find the GC logs and thread dumps at this location
> https://drive.google.com/open?id=1slsYkAcsH7OH-7Pma91k6t5T72-tIPlw
>
> Another observation is that the CPU usage reaches around 70% (through
> manual monitoring) when the indexing starts and the merges are observed. It
> is well below 50% otherwise.
>
> Also, should something be altered with the mergeScheduler setting ?
> "mergeScheduler":{
> "class":"org.apache.lucene.index.ConcurrentMergeScheduler",
> "maxMergeCount":2,
> "maxThreadCount":2},
>
> Thanks,
> Rahul
>
>
> On Wed, Jun 5, 2019 at 4:24 PM Shawn Heisey  wrote:
>
>> On 6/5/2019 9:39 AM, Rahul Goswami wrote:
>> > I have a solrcloud setup on Windows server with below config:
>> > 3 nodes,
>> > 24 shards with replication factor 2
>> > Each node hosts 16 cores.
>>
>> 16 CPU cores, or 16 Solr cores?  The info may not be all that useful
>> either way, but just in case, it should be clarified.
>>
>> > Index size is 1.4 TB per node
>> > Xms 8 GB , Xmx 24 GB
>> > Directory factory used is SimpleFSDirectoryFactory
>>
>> How much total memory in the server?  Is there other software using
>> significant levels of memory?
>>
>> Why did you opt to change the DirectoryFactory away from Solr's default?
>>   The default is chosen with care ... any other choice will probably
>> result in lower performance.  The default in recent versions of Solr is
>> NRTCachingDirectoryFactory, which uses MMap for file access.
>>
>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>
>> The screenshot described here might become useful for more in-depth
>> troubleshooting:
>>
>>
>> https://wiki.apache.org/solr/SolrPerformanceProblems#Process_listing_on_Windows
>>
>> How many total documents (maxDoc, not numDoc) are in that 1.4 TB of space?
>>
>> > The cloud is all nice and green for the most part. Only when we start
>&

Re: SolrCloud indexing triggers merges and timeouts

2019-06-06 Thread Rahul Goswami
Thank you for your responses. Please find additional details about the
setup below:

We are using Solr 7.2.1

> I have a solrcloud setup on Windows server with below config:
> 3 nodes,
> 24 shards with replication factor 2
> Each node hosts 16 cores.

16 CPU cores, or 16 Solr cores?  The info may not be all that useful
either way, but just in case, it should be clarified.

*OP Reply:* 16 Solr cores (i.e. replicas)

> Index size is 1.4 TB per node
> Xms 8 GB , Xmx 24 GB
> Directory factory used is SimpleFSDirectoryFactory

How much total memory in the server?  Is there other software using
significant levels of memory?

*OP Reply*: Total 48 GB per node... I couldn't see any other software using
a lot of memory.
I am honestly not sure about the reason for change of directory factory to
SimpleFSDirectoryFactory. But I was told that with mmap at one point we
started to see the shared memory usage on Windows go up significantly,
intermittently freezing the system.
Could the choice of DirectoryFactory here be a factor for the long
updates/frequent merges?

> How many total documents (maxDoc, not numDoc) are in that 1.4 TB of
space?
*OP Reply:* Also, there are nearly 12.8 million total docs (maxDoc, NOT
numDoc) in that 1.4 TB space

> Can you share the GC log that Solr writes?
*OP Reply:*  Please find the GC logs and thread dumps at this location
https://drive.google.com/open?id=1slsYkAcsH7OH-7Pma91k6t5T72-tIPlw

Another observation is that the CPU usage reaches around 70% (through
manual monitoring) when the indexing starts and the merges are observed. It
is well below 50% otherwise.

Also, should something be altered with the mergeScheduler setting ?
"mergeScheduler":{
"class":"org.apache.lucene.index.ConcurrentMergeScheduler",
"maxMergeCount":2,
"maxThreadCount":2},

Thanks,
Rahul


On Wed, Jun 5, 2019 at 4:24 PM Shawn Heisey  wrote:

> On 6/5/2019 9:39 AM, Rahul Goswami wrote:
> > I have a solrcloud setup on Windows server with below config:
> > 3 nodes,
> > 24 shards with replication factor 2
> > Each node hosts 16 cores.
>
> 16 CPU cores, or 16 Solr cores?  The info may not be all that useful
> either way, but just in case, it should be clarified.
>
> > Index size is 1.4 TB per node
> > Xms 8 GB , Xmx 24 GB
> > Directory factory used is SimpleFSDirectoryFactory
>
> How much total memory in the server?  Is there other software using
> significant levels of memory?
>
> Why did you opt to change the DirectoryFactory away from Solr's default?
>   The default is chosen with care ... any other choice will probably
> result in lower performance.  The default in recent versions of Solr is
> NRTCachingDirectoryFactory, which uses MMap for file access.
>
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> The screenshot described here might become useful for more in-depth
> troubleshooting:
>
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Process_listing_on_Windows
>
> How many total documents (maxDoc, not numDoc) are in that 1.4 TB of space?
>
> > The cloud is all nice and green for the most part. Only when we start
> > indexing, within a few seconds, I start seeing Read timeouts and socket
> > write errors and replica recoveries thereafter. We are indexing in 2
> > parallel threads in batches of 50 docs per update request. After
> examining
> > the thread dump, I see segment merges happening. My understanding is that
> > this is the cause, and the timeouts and recoveries are the symptoms. Is
> my
> > understanding correct? If yes, what steps could I take to help the
> > situation. I do see that the difference between "Num Docs" and "Max Docs"
> > is about 20%.
>
> Segment merges are a completely normal part of Lucene's internal
> operation.  They should never cause problems like you have described.
>
> My best guess is that a 24GB heap is too small.  Or possibly WAY too
> large, although with the index size you have mentioned, that seems
> unlikely.
>
> Can you share the GC log that Solr writes?  The problem should occur
> during the timeframe covered by the log, and the log should be as large
> as possible.  You'll need to use a file sharing site -- attaching it to
> an email is not going to work.
>
> What version of Solr?
>
> Thanks,
> Shawn
>


SolrCloud indexing triggers merges and timeouts

2019-06-05 Thread Rahul Goswami
Hello,
I have a solrcloud setup on Windows server with below config:
3 nodes,
24 shards with replication factor 2
Each node hosts 16 cores.

Index size is 1.4 TB per node
Xms 8 GB , Xmx 24 GB
Directory factory used is SimpleFSDirectoryFactory

The cloud is all nice and green for the most part. Only when we start
indexing, within a few seconds, I start seeing Read timeouts and socket
write errors and replica recoveries thereafter. We are indexing in 2
parallel threads in batches of 50 docs per update request. After examining
the thread dump, I see segment merges happening. My understanding is that
this is the cause, and the timeouts and recoveries are the symptoms. Is my
understanding correct? If yes, what steps could I take to help the
situation. I do see that the difference between "Num Docs" and "Max Docs"
is about 20%.

Would appreciate your help.

Thanks,
Rahul


Re: Graph query extremely slow

2019-06-01 Thread Rahul Goswami
Hi Toke,

Thanks for sharing the sanity check results. I am setting rows=100. The
graph fq in my case gives a numFound of a little over 1 million. The total
number of docs is ~4 million.
I am using the graph query in an fq. Could the performance differ between
having it in an fq vs. in q? Also, since the parameters of this fq don't
change, shouldn't I expect to gain some advantage from the filterCache?
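
(For what it's worth, I have been checking whether the filter is actually being
cached by pulling the filterCache stats before and after repeating the query and
comparing the hits/inserts counters -- assuming I am interpreting those numbers
correctly; host and core name below are placeholders:

curl "http://localhost:8983/solr/mycore/admin/mbeans?cat=CACHE&stats=true&wt=json"
)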

Thanks,
Rahul

On Wed, May 22, 2019 at 7:40 AM Toke Eskildsen  wrote:

> On Wed, 2019-05-15 at 21:37 -0400, Rahul Goswami wrote:
> > fq={!graph from=from_field to=to_field returnRoot=false}
> >
> > Executing _only_ the graph filter query takes about 64.5 seconds. The
> > total number of documents from this filter query is a little over 1
> > million.
>
> I tried building an index in Solr 7.6 with 4M simple records with every
> 4th record having a from_field and a to_field, each containing a random
> number from 0-65535 as a String.
>
>
> Asking for the first 10 results:
>
> time curl -s '
>
> http://localhost:8983/solr/gettingstarted/select?rows=10={!graph+from=from_field+to=to_field+returnRoot=true}+from_field:*
> <http://localhost:8983/solr/gettingstarted/select?rows=10=%7B!graph+from=from_field+to=to_field+returnRoot=true%7D+from_field:*>
> '
>  | jq .response.numFound
> 100
>
> real0m0.018s
> user0m0.011s
> sys 0m0.005s
>
>
> Asking for 1M results (ignoring that export or streaming should be used
> for exports of that size):
>
> time curl -s '
>
> http://localhost:8983/solr/gettingstarted/select?rows=100={!graph+from=from_field+to=to_field+returnRoot=true}+from_field:*
> <http://localhost:8983/solr/gettingstarted/select?rows=100=%7B!graph+from=from_field+to=to_field+returnRoot=true%7D+from_field:*>
> '
>  | jq .response.numFound
> 100
>
> real0m10.101s
> user0m3.344s
> sys 0m0.419s
>
> > Is this performance expected out of graph query ?
>
> As the sanity check above shows, there is a huge difference between
> evaluating a graph query (any query really) and asking for 1M results
> to be returned. With that in mind, what do you set rows to?
>
>
> - Toke Eskildsen, Royal Danish Library
>
>
>


Solr exception while retrieving documents

2019-05-31 Thread Mandava, Rahul
Hi,


I am using Solr 6.6.0 and real-time get to retrieve documents. Randomly I am 
seeing NullPointerExceptions in the Solr log files, which in turn break the 
application workflow. Below is the stack trace.

I am thinking this could be related to real-time get: when transforming child 
documents during the get, maybe the child documents are not available due to 
pending/uncommitted transactions to the Solr index. The same retrieval works fine 
after some time, which makes me think that this could be related to committing 
the changes to the index, or maybe a transaction issue. But I couldn't figure out 
exactly where the bottleneck is. Has anyone faced a similar issue or is familiar 
with this exception?

The same exception is being logged by two different loggers, HttpSolrCall and 
RequestHandlerBase, in the Solr log file at the same timestamp.



null:java.lang.NullPointerException at 
org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformerFactory.java:136)
 at 
org.apache.solr.handler.component.RealTimeGetComponent.process(RealTimeGetComponent.java:253)
 at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:296)
 at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
 at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) 
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) 
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) 
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) 
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
 at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) 
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) 
at org.eclipse.jetty.server.Server.handle(Server.java:534) at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
 at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) 
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
 at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) 
at java.lang.Thread.run(Thread.java:748)

NOTE: I added an additional parameter of softCommit=true when I save/update 
documents to the index. After this change the frequency and volume of these errors 
has decreased, but I am still seeing a few (1-2) occurrences of the same exception 
in the Solr log files. I am thinking that seeing the error in the log files 
doesn't hurt as long as the updates and gets work fine, but I would still like 
to know how to eradicate these errors altogether.
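
For reference, the update call now looks something along these lines (collection 
name, document id and field below are placeholders, not our actual ones):

curl "http://localhost:8983/solr/mycollection/update?softCommit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id":"parent-1", "title_s":"some value"}]'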


Thanks
Rahul Mandava



Re: Graph query extremely slow

2019-05-19 Thread Rahul Goswami
Hello experts,

Just following up in case my previous email got lost in the big stack of
queries. Would appreciate any help on optimizing a graph query. Or any
pointers on  the direction to investigate.

Thanks,
Rahul

On Wed, May 15, 2019 at 9:37 PM Rahul Goswami  wrote:

> Hello,
>
> I am running Solr 7.2.1 in standalone mode with 8GB heap. I have an index
> with ~4 million documents. Not too big. I am using a graph query parser to
> filter out some documents as below:
>
> fq={!graph from=from_field to=to_field returnRoot=false}
>
> Both from_field and to_field are indexed and of type string. This is part
> of a bigger query which is taking around 65 seconds to execute. Executing
> _only_ the graph filter query takes about 64.5 seconds. The total number of
> documents from this filter query is a little over 1 million.
>
> Is this performance expected out of graph query ? Any optimizations that I
> could try?
>
>
> Thanks,
> Rahul
>


Graph query extremely slow

2019-05-15 Thread Rahul Goswami
Hello,

I am running Solr 7.2.1 in standalone mode with 8GB heap. I have an index
with ~4 million documents. Not too big. I am using a graph query parser to
filter out some documents as below:

fq={!graph from=from_field to=to_field returnRoot=false}

Both from_field and to_field are indexed and of type string. This is part
of a bigger query which is taking around 65 seconds to execute. Executing
_only_ the graph filter query takes about 64.5 seconds. The total number of
documents from this filter query is a little over 1 million.

Is this performance expected out of graph query ? Any optimizations that I
could try?


Thanks,
Rahul


Re: Delay searches till log replay finishes

2019-03-21 Thread Rahul Goswami
Erick, Shawn,

Apologies for the late update on this thread, and thank you for your inputs.
My assumption about the number of segments increasing came from an incomplete
understanding of the TieredMergePolicy, but I get it now. Another concern
was a slowing indexing rate due to constant merges. This is from reading the
documentation:
"Choosing the best merge factors is generally a trade-off of indexing speed
vs. searching speed. Having fewer segments in the index generally
accelerates searches, because there are fewer places to look. It can also
result in fewer physical files on disk. But to keep the number of
segments low, merges will occur more often, which can add load to the
system and slow down updates to the index"

Taking your suggestions, we have reduced the hard commit interval
(openSearcher=false) from 10 mins to 1 min to begin with. Also, our servers
are on Windows, so that could be a cause of the service getting killed
before being able to shut down gracefully. The cascading effect is stale
results while tlogs are being replayed on startup. I understand that although
not foolproof, reducing the autoCommit interval should help mitigate the
problem, and we'll continue to monitor this for now.
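
In case it is useful to anyone else: I believe the same change can be made
through the Config API rather than editing solrconfig.xml by hand (sketch below;
the collection name is a placeholder and 60000 ms corresponds to our 1 min
interval):

curl "http://localhost:8983/solr/mycollection/config" \
  -H "Content-Type: application/json" \
  -d '{"set-property": {"updateHandler.autoCommit.maxTime": 60000, "updateHandler.autoCommit.openSearcher": false}}'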

Thanks,
Rahul

On Fri, Mar 8, 2019 at 2:14 PM Erick Erickson 
wrote:

> (1) no, and Shawn’s comments are well taken.
>
> (2) bq.  is the number of segments would drastically increase
>
> Not true. First of all, TieredMergePolicy will take care of merging “like
> sized” segments for you. You’ll have the same number (or close) no matter
> how short the autocommit interval. Second, new segments are created
> whenever the internal indexing buffer is filled up, default 100M anyway so
> just because you have a long autocommit interval doesn’t say much about the
> number of segments that are created.
>
> This is really not something you should be concerned about, certainly not
> something you should accept other problems because. Solr runs quite well
> with 15 second autocommit and very high indexing rates, why do you think
> your situation is different? Do you have any evidence that would be a
> problem at all?
>
> Best,
> Erick
>
>
> > On Mar 8, 2019, at 11:05 AM, Shawn Heisey  wrote:
> >
> > On 3/8/2019 10:44 AM, Rahul Goswami wrote:
> >> 1) Is there currently a configuration setting in Solr that will trigger
> the
> >> first option you mentioned ? Which is to not serve any searches until
> tlogs
> >> are played. If not, since instances shutting down abruptly is not very
> >> uncommon, would a JIRA to implement this configuration be warranted?
> >
> > In what setup is an abrupt shutdown *expected*?  If that's really
> common, then your setup is, in my opinion, very broken.  It is our intent
> that abrupt death of the Solr process should be quite rare.  We do still
> have a problem on Windows where the wait for clean shutdown is only five
> seconds -- nowhere near enough.  The Windows script still needs a lot of
> work, but most of us are not adept at Windows scripting.
> >
> > There is an issue for the timeout interval in bin\solr.cmd on Windows:
> >
> > https://issues.apache.org/jira/browse/SOLR-9698
> >
> >> 2) We have a setup with moderate indexing rate and moderate search rate.
> >> Currently the auto commit interval is 10 mins. What should be a
> recommended
> >> hard commit interval for such a setup? Our concern with going too low on
> >> that autoCommit interval (with openSearcher=false) is the number of
> >> segments that would drastically increase, eventually causing
> merges,slower
> >> searches etc.
> >
> > Solr has shipped with a 15 second autoCommit, where openSearcher is set
> to false, for a while now.  This is a setting that works quite well.  As
> long as you're not opening a new searcher, commits are quite fast.  I
> personally would use 60 seconds, but 15 seconds does work well.  It is
> usually autoSoftCommit where you need to be concerned about short
> intervals, because a soft commit opens a searcher.
> >
> > Thanks,
> > Shawn
>
>


Re: Delay searches till log replay finishes

2019-03-08 Thread Rahul Goswami
Erick,

Thanks for the detailed response... I have two follow-up questions:

1) Is there currently a configuration setting in Solr that will trigger the
first option you mentioned ? Which is to not serve any searches until tlogs
are played. If not, since instances shutting down abruptly is not very
uncommon, would a JIRA to implement this configuration be warranted?
2) We have a setup with moderate indexing rate and moderate search rate.
Currently the auto commit interval is 10 mins. What should be a recommended
hard commit interval for such a setup? Our concern with going too low on
that autoCommit interval (with openSearcher=false) is the number of
segments that would drastically increase, eventually causing merges,slower
searches etc.

Thanks,
Rahul

On Fri, Mar 8, 2019 at 12:08 PM Erick Erickson 
wrote:

> Yes, you’ll get stale values. There’s no way I know of to change that,
> it’s a fundamental result of Lucene’s design.
>
> There’s a “segment_n” file that contains pointers to the current valid
> segments. When a commit happens, segments are closed and the very last
> operation is to update that file.
>
> In the abnormal termination case, that file has not been updated with docs
> that are in the tlog. So when Solr opens a searcher, it has no record at
> all of any new segments created since the last hard commit. So there are
> only two choices:
>
> 1> refuse to seve any searches at all
> 2> allow searches on the last snapshot of the index while the tlog replays
>
> The latter is the choice we’ve made and I agree with it. While
> theoretically you could refuse to open a searcher while the tlog was
> replaying, I’d rather get some results than none at all.
>
> Especially when this only happens when Solr is abnormally terminated.
>
> You can mitigate the time frame here by setting your hard commit interval
> to, say, 15 seconds, which should be the upper bound of getting stale docs
> when the tlog is replayed.
>
> It’s also good practice to have the autocommit interval relatively short
> for a variety of reasons, not the least of which is that it’ll grow
> infinitely until a hard commit happens.
>
> Best,
> Erick
>
> > On Mar 8, 2019, at 8:48 AM, Rahul Goswami  wrote:
> >
> > What I am observing is that Solr is fully started up even before it has
> > finished playing the tlog. In the logs I see that a searcher is
> registered
> > first and the "Log replay finished" appears later. During that time if I
> > search, I do get stale values. Below are the log lines that I captured :
> >
> > WARN  - 2019-03-08 16:33:42.126; [   x:techproducts]
> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> >
> tlog{file=C:\Work\Solr\solr-7.2.1\Installation\solr-7.2.1\example\techproducts\solr\techproducts\data\tlog\tlog.009
> > refcount=2} active=false starting pos=0 inSortedOrder=false
> > INFO  - 2019-03-08 16:33:42.141; [   x:techproducts]
> > org.apache.solr.core.SolrCore; [techproducts]  webapp=null path=null
> >
> params={q=static+firstSearcher+warming+in+solrconfig.xml=false=firstSearcher}
> > hits=3 status=0 QTime=37
> > INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
> > org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
> > INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
> >
> org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener;
> > Loading spell index for spellchecker: default
> > INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
> >
> org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener;
> > Loading spell index for spellchecker: wordbreak
> > INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
> > org.apache.solr.core.SolrCore; [techproducts] Registered new searcher
> > Searcher@63c5631[techproducts]
> >
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.2.1):C32/1:delGen=1)
> > Uninverting(_5(7.2.1):C1)))}
> > INFO  - 2019-03-08 16:34:07.373; [   x:techproducts]
> > org.apache.solr.core.SolrCore; [techproducts]  webapp=/solr path=/select
> > params={q=id:SP2514N=json&_=1552062352063} hits=1 status=0 QTime=2
> > INFO  - 2019-03-08 16:34:08.818; [   x:techproducts]
> > org.apache.solr.core.SolrCore; [techproducts]  webapp=/solr path=/select
> > params={q=id:SP2514N=json&_=1552062352063} hits=1 status=0 QTime=1
> > INFO  - 2019-03-08 16:34:14.422; [   x:techproducts]
> > org.apache.solr.update.DirectUpdateHandler2; start
> >
> commit{flags=2,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO  - 2019-03-08 16:34:16.353; [   x:techproducts]
> >

Re: Delay searches till log replay finishes

2019-03-08 Thread Rahul Goswami
What I am observing is that Solr is fully started up even before it has
finished playing the tlog. In the logs I see that a searcher is registered
first and the "Log replay finished" appears later. During that time if I
search, I do get stale values. Below are the log lines that I captured :

WARN  - 2019-03-08 16:33:42.126; [   x:techproducts]
org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
tlog{file=C:\Work\Solr\solr-7.2.1\Installation\solr-7.2.1\example\techproducts\solr\techproducts\data\tlog\tlog.009
refcount=2} active=false starting pos=0 inSortedOrder=false
INFO  - 2019-03-08 16:33:42.141; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts]  webapp=null path=null
params={q=static+firstSearcher+warming+in+solrconfig.xml=false=firstSearcher}
hits=3 status=0 QTime=37
INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener;
Loading spell index for spellchecker: default
INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener;
Loading spell index for spellchecker: wordbreak
INFO  - 2019-03-08 16:33:42.157; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts] Registered new searcher
Searcher@63c5631[techproducts]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.2.1):C32/1:delGen=1)
Uninverting(_5(7.2.1):C1)))}
INFO  - 2019-03-08 16:34:07.373; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts]  webapp=/solr path=/select
params={q=id:SP2514N=json&_=1552062352063} hits=1 status=0 QTime=2
INFO  - 2019-03-08 16:34:08.818; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts]  webapp=/solr path=/select
params={q=id:SP2514N=json&_=1552062352063} hits=1 status=0 QTime=1
INFO  - 2019-03-08 16:34:14.422; [   x:techproducts]
org.apache.solr.update.DirectUpdateHandler2; start
commit{flags=2,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2019-03-08 16:34:16.353; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts]  webapp=/solr path=/select
params={q=id:SP2514N=json&_=1552062352063} hits=1 status=0 QTime=1
INFO  - 2019-03-08 16:34:18.948; [   x:techproducts]
org.apache.solr.update.SolrIndexWriter; Calling setCommitData with
IW:org.apache.solr.update.SolrIndexWriter@266e1192 commitCommandVersion:0
INFO  - 2019-03-08 16:34:19.040; [   x:techproducts]
org.apache.solr.search.SolrIndexSearcher; Opening
[Searcher@5c6044f1[techproducts]
main]
INFO  - 2019-03-08 16:34:19.040; [   x:techproducts]
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2019-03-08 16:34:19.040; [   x:techproducts]
org.apache.solr.core.QuerySenderListener; QuerySenderListener sending
requests to Searcher@5c6044f1[techproducts]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.2.1):C32/1:delGen=1)
Uninverting(_6(7.2.1):C1)))}
INFO  - 2019-03-08 16:34:19.040; [   x:techproducts]
org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
INFO  - 2019-03-08 16:34:19.040; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts] Registered new searcher
Searcher@5c6044f1[techproducts]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.2.1):C32/1:delGen=1)
Uninverting(_6(7.2.1):C1)))}
INFO  - 2019-03-08 16:34:19.056; [   x:techproducts]
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor;
[techproducts] {add=[SP2514N (1627455755076501504)]} 0 36923
WARN  - 2019-03-08 16:34:19.056; [   x:techproducts]
org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0
positionOfStart=0}
INFO  - 2019-03-08 16:34:23.523; [   x:techproducts]
org.apache.solr.core.SolrCore; [techproducts]  webapp=/solr path=/select
params={q=id:SP2514N=json&_=1552062352063} hits=1 status=0 QTime=1

On Thu, Mar 7, 2019 at 11:36 PM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> Do you mean that when you startup Solr, it will automatically do the search
> request even before the Solr is fully started up?
>
> Regards,
> Edwin
>
>
> On Fri, 8 Mar 2019 at 10:13, Rahul Goswami  wrote:
>
> > Hello Solr gurus,
> >
> > I am using Solr 7.2.1 (non-SolrCloud). I have a situation where Solr got
> > killed before it could commit updates to the disk resulting in log replay
> > on startup. During this interval, I observe that a searcher is opened
> even
> > before log replay has finished, resulting in some stale results, which in
> > turn has a cascading effect on other parts of the application. Is there a
> > setting in Solr which would prevent Solr from serving search requests
> > before log replay has finished?
> >
> > Thanks,
> > Rahul
> >
>


Delay searches till log replay finishes

2019-03-07 Thread Rahul Goswami
Hello Solr gurus,

I am using Solr 7.2.1 (non-SolrCloud). I have a situation where Solr got
killed before it could commit updates to the disk resulting in log replay
on startup. During this interval, I observe that a searcher is opened even
before log replay has finished, resulting in some stale results, which in
turn has a cascading effect on other parts of the application. Is there a
setting in Solr which would prevent Solr from serving search requests
before log replay has finished?

Thanks,
Rahul


Re: Full index replication upon service restart

2019-02-21 Thread Rahul Goswami
Erick,
Thanks for the insight. We are looking at tuning the architecture. We are
also stopping the indexing application before we bring down the Solr nodes
for maintenance. However, when both nodes are up and one replica is falling
behind too much, we want to throttle the requests. Is there an API in Solr
to know whether a replica is falling behind the leader?
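
One thing I was considering as a stopgap (not sure if it is a sanctioned
approach) is to periodically ask the leader core and the follower core for
their most recent update version via real-time get, e.g.

curl "http://indexnode1:8983/solr/<core_name>/get?getVersions=1&wt=json"

and treat a large gap between the highest versions reported by the two cores
as the replica falling behind.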

Thanks,
Rahul

On Mon, Feb 11, 2019 at 10:28 PM Erick Erickson 
wrote:

> bq. To answer your question about index size on
> disk, it is 3 TB on every node. As mentioned it's a 32 GB machine and I
> allocated 24GB to Java heap.
>
> This is massively undersized in terms of RAM in my experience. You're
> trying to cram 3TB of index into 32GB of memory. Frankly, I don't think
> there's much you can do to increase stability in this situation, too many
> things are going on. In particular, you're indexing during node restart.
>
> That means that
> 1> you'll almost inevitably get a full sync on start given your update
>  rate.
> 2> while you're doing the full sync, all new updates are sent to the
>   recovering replica and put in the tlog.
> 3> When the initial replication is done, the documents sent to the
>  tlog while recovering are indexed. This is 7 hours of accumulated
>  updates.
> 4> If much goes wrong in this situation, then you're talking another full
>  sync.
> 5> rinse, repeat.
>
> There are no magic tweaks here. You really have to rethink your
> architecture. I'm actually surprised that your queries are performant.
> I expect you're getting a _lot_ of I/O, that is the relevant parts of your
> index are swapping in and out of the OS memory space. A _lot_.
> Or you're only using a _very_ small bit of your index.
>
> Sorry to be so negative, but this is not a situation that's amenable to
> a quick fix.
>
> Best,
> Erick
>
>
>
>
> On Mon, Feb 11, 2019 at 4:10 PM Rahul Goswami 
> wrote:
> >
> > Thanks for the response Eric. To answer your question about index size on
> > disk, it is 3 TB on every node. As mentioned it's a 32 GB machine and I
> > allocated 24GB to Java heap.
> >
> > Further monitoring the recovery, I see that when the follower node is
> > recovering, the leader node (which is NOT recovering) almost freezes with
> > 100% CPU usage and 80%+ memory usage. Follower node's memory usage is
> 80%+
> > but CPU is very healthy. Also Follower node's log is filled up with
> updates
> > forwarded from the leader ("...PRE_UPDATE FINISH
> > {update.distrib=FROMLEADER=...") and replication starts much
> > afterwards.
> > There have been instances when complete recovery took 10+ hours. We have
> > upgraded to a 4 Gbps NIC between the nodes to see if it helps.
> >
> > Also, a few followup questions:
> >
> > 1) Is  there a configuration which would start throttling update requests
> > if the replica falls behind a certain number of updates so as to not
> > trigger an index replication later?  If not, would it be a worthy
> > enhancement?
> > 2) What would be a recommended hard commit interval for this kind of
> setup
> > ?
> > 3) What are some of the improvements in 7.5 with respect to recovery as
> > compared to 7.2.1?
> > 4) What do the below peersync failure logs lines mean?  This would help
> me
> > better understand the reasons for peersync failure and maybe devise some
> > alert mechanism to start throttling update requests from application
> > program if feasible.
> >
> > *PeerSync Failure type 1*:
> > --
> > 2019-02-04 20:43:50.018 INFO
> > (recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> > s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> > org.apache.solr.update.PeerSync Fingerprint comparison: 1
> >
> > 2019-02-04 20:43:50.018 INFO
> > (recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> > s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> > org.apache.solr.update.PeerSync Other fingerprint:
> > {maxVersionSpecified=1624579878580912128,
> > maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128

Re: Full index replication upon service restart

2019-02-11 Thread Rahul Goswami
Thanks for the response, Erick. To answer your question about index size on
disk, it is 3 TB on every node. As mentioned it's a 32 GB machine and I
allocated 24GB to Java heap.

Further monitoring the recovery, I see that when the follower node is
recovering, the leader node (which is NOT recovering) almost freezes with
100% CPU usage and 80%+ memory usage. The follower node's memory usage is 80%+
but its CPU is very healthy. Also, the follower node's log is filled up with
updates forwarded from the leader ("...PRE_UPDATE FINISH
{update.distrib=FROMLEADER=...") and replication only starts much
later.
There have been instances when complete recovery took 10+ hours. We have
upgraded to a 4 Gbps NIC between the nodes to see if it helps.

Also, a few followup questions:

1) Is  there a configuration which would start throttling update requests
if the replica falls behind a certain number of updates so as to not
trigger an index replication later?  If not, would it be a worthy
enhancement?
2) What would be a recommended hard commit interval for this kind of setup
?
3) What are some of the improvements in 7.5 with respect to recovery as
compared to 7.2.1?
4) What do the peersync failure log lines below mean?  This would help me
better understand the reasons for peersync failure and maybe devise some
alert mechanism to start throttling update requests from the application
program if feasible.

*PeerSync Failure type 1*:
--
2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Fingerprint comparison: 1

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Other fingerprint:
{maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128,
versionsHash=-8308981502886241345, numVersions=32966082, numDocs=32966165,
maxDoc=1828452}, Our fingerprint: {maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128,
versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165,
maxDoc=1828452}

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 url=
http://indexnode1:8983/solr DONE. sync failed

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not successful
- trying replication.


*PeerSync Failure type 2*:
-
2019-02-02 20:26:56.256 WARN
(recoveryExecutor-4-thread-11-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=
http://indexnode1:2/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates


Regards,
Rahul

On Thu, Feb 7, 2019 at 12:59 PM Erick Erickson 
wrote:

> bq. We have a heavy indexing load of about 10,000 documents every 150
> seconds.
> Not so heavy query load.
>
> It's unlikely that changing numRecordsToKeep will help all that much if
> your
> maintenance window is very large. Rather, that number would have to be
> _very_
> high.
>
> 7 hours is huge. How big are your indexes on disk? You're essentially
> going to get a
> full copy from the 

Full index replication upon service restart

2019-02-05 Thread Rahul Goswami
Hello Solr gurus,

So I have a scenario where on Solr cluster restart the replica node goes
into full index replication for about 7 hours. Both replica nodes are
restarted around the same time for maintenance. Also, during usual times,
if one node goes down for whatever reason, upon restart it again does index
replication. In certain instances, some replicas just fail to recover.

*SolrCloud 7.2.1 *cluster configuration*:*

16 shards - replication factor=2

Per server configuration:
==
32GB machine - 16GB heap space for Solr
Index size : 3TB per server

autoCommit (openSearcher=false) of 3 minutes

We have a heavy indexing load of about 10,000 documents every 150 seconds.
Not so heavy query load.

Reading through some of the threads on a similar topic, I suspect it is
the disparity in the number of updates (>100) between the replicas that
is causing this (courtesy of our indexing load). One of the suggestions I saw
was using numRecordsToKeep.
However, as Erick mentioned in one of the threads, that's a band-aid measure,
and I am trying to eliminate some of the fundamental issues that might
exist.

1) Is the heap too small for that index size? If yes, what would be a
recommended max heap size?
2) Is there a general guideline to estimate the required max heap based on
index size on disk?
3) What would be a recommended autoCommit and autoSoftCommit interval?
4) Any configurations that would help improve the restart time and avoid
full replication?
5) Does Solr retain "numRecordsToKeep" number of documents in the tlog *per
replica*?
6) The reasons for the peersync failures in the logs below are not completely
clear to me. Can someone please elaborate?

*PeerSync fails with* :

Failure type 1:
-
2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Fingerprint comparison: 1

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Other fingerprint:
{maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128,
versionsHash=-8308981502886241345, numVersions=32966082, numDocs=32966165,
maxDoc=1828452}, Our fingerprint: {maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128,
versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165,
maxDoc=1828452}

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 url=
http://indexnode1:8983/solr DONE. sync failed

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not successful
- trying replication.


Failure type 2:
--
2019-02-02 20:26:56.256 WARN
(recoveryExecutor-4-thread-11-processing-n:indexnode1:2_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=
http://indexnode1:2/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates


Thanks,
Rahul


Re: SPLITSHARD not working as expected

2019-01-30 Thread Rahul Goswami
Hello,
I have a followup question on SPLITSHARD behavior. I understand that after
a split, the leader replicas of the sub shards would reside on the same
node as the leader of the parent. However, is there an expected behavior
for the follower replicas of the sub shards as to where they will be
created post split?

Regards,
Rahul



On Wed, Jan 30, 2019 at 1:18 AM Rahul Goswami  wrote:

> Thanks for the reply Jan. I have been referring to documentation for
> SPLITSHARD on 7.2.1
> <https://lucene.apache.org/solr/guide/7_2/collections-api.html#splitshard> 
> which
> seems to be missing some important information present in 7.6
> <https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard>.
> Especially these two pieces of information.:
> "When using splitMethod=rewrite (default) you must ensure that the node
> running the leader of the parent shard has enough free disk space i.e.,
> more than twice the index size, for the split to succeed "
>
> "The first replicas of resulting sub-shards will always be placed on the
> shard leader node"
>
> The idea of having an entire shard (both the replicas of it) present on
> the same node did come across as an unexpected behavior at the beginning.
> Anyway, I guess I am going to have to take care of the rebalancing with
> MOVEREPLICA following a SPLITSHARD.
>
> Thanks for the clarification.
>
>
> On Mon, Jan 28, 2019 at 3:40 AM Jan Høydahl  wrote:
>
>> This is normal. Please read
>> https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard
>> PS: Images won't make it to the list, but don't think you need a
>> screenshot here, what you describe is the default behaviour.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> > 28. jan. 2019 kl. 09:05 skrev Rahul Goswami :
>> >
>> > Hello,
>> > I am using Solr 7.2.1. I created a two node example collection on the
>> same machine. Two shards with two replicas each. I then called SPLITSHARD
>> on shard2 and expected the split shards to have one replica on each node.
>> However I see that for shard2_1, both replicas reside on the same node. Is
>> this a valid behavior?  Unless I am missing something, this could be
>> potentially fatal.
>> >
>> > Here's the query and the cluster state post split:
>> >
>> http://localhost:8983/solr/admin/collections?action=SPLITSHARD=gettingstarted=shard2=true
>>
>> >
>> >
>> >
>> > Thanks,
>> > Rahul
>>
>>


Re: Error using collapse parser with /export

2019-01-29 Thread Rahul Goswami
I checked again and it looks like all documents with the same "id_field"
reside on the same shard, in which case I would expect the collapse parser to
work. Here is my complete query:

http://localhost:8983/solr/mycollection/stream/?expr=search(mycollection
,sort="field1 asc,field2
asc",fl="fileld1,field2,field3",qt="/export",q="*:*",fq="((field4:1)
OR (field4:2))",fq="{!collapse field=id_field sort='field3 desc'}")

The same query with the "select" handler does return the collapsed result fine.
Looks like this might be a bug after all (when working with /export)?
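
(For clarity, by "the same query with the select handler" I mean essentially the
expression above with qt="/select" and a rows value large enough to cover the
result set -- the rows value below is just a placeholder:

http://localhost:8983/solr/mycollection/stream/?expr=search(mycollection,sort="field1 asc,field2 asc",fl="field1,field2,field3",qt="/select",rows=2000000,q="*:*",fq="((field4:1) OR (field4:2))",fq="{!collapse field=id_field sort='field3 desc'}")
)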

Thanks,
Rahul


On Sun, Jan 27, 2019 at 9:55 PM Rahul Goswami  wrote:

> Hi Joel,
>
> Thanks for responding to the query.
>
> Answers to your questions:
> 1) After collapsing is it not possible to use the /select handler?  - The
> collapsing itself is causing the failure (or did I not understand your
> question right?)
> 2) After exporting is it possible to unique the records using the
> unique  Streaming Expression?   (This can't be done since we require the
> unique document in a group subject to a sort order as in the query above.
> Looking at the Streaming API, 'unique' streaming expression doesn't give
> the capability to sort within a group. Or is there a way to do this?)
>
> I re-read the documentation
> <https://lucene.apache.org/solr/guide/7_2/collapse-and-expand-results.html>
> :
> "In order to use these features with SolrCloud, the documents must be
> located on the same shard."
>
> Looks like the "id_field"  in the collapse criteria above is coming from
> documents not present in the same shard. I'll verify this tomorrow and
> update the thread.
>
> Thanks,
> Rahul
>
> On Mon, Jan 21, 2019 at 2:26 PM Joel Bernstein  wrote:
>
>> I haven't had time to look into the details of this issue but it's not
>> clear that these two features will be able to be used together. Although
>> that it would be nice if the could.
>>
>> A couple of questions about your use case:
>>
>> 1) After collapsing is it not possible to use the /select handler?
>> 2) After exporting is it possible to unique the records using the unique
>> Streaming Expression?
>>
>> Either of those cases would be the typical uses of these features.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Sun, Jan 20, 2019 at 10:13 PM Rahul Goswami 
>> wrote:
>>
>> > Hello,
>> >
>> > Following up on my query. I know this might be too specific an issue.
>> But I
>> > just want to know that it's a legitimate bug and the supported
>> operation is
>> > allowed with the /export handler. If someone has an idea about this and
>> > could confirm, that would be great.
>> >
>> > Thanks,
>> > Rahul
>> >
>> > On Thu, Jan 17, 2019 at 4:58 PM Rahul Goswami 
>> > wrote:
>> >
>> > > Hello,
>> > >
>> > > I am using SolrCloud on Solr 7.2.1.
>> > > I get the NullPointerException in the Solr logs (in ExportWriter.java)
>> > > when the /stream handler is invoked with a search() streaming
>> expression
>> > > with qt="/export" containing fq="{!collapse field=id_field sort="time
>> > > desc"} (among other fq's. I tried eliminating one fq at a time to find
>> > the
>> > > problematic one. The one with collapse parser is what makes it fail).
>> > >
>> > >
>> > > I see an open JIRA for this issue (with a submitted patch which has
>> not
>> > > yet been accepted):
>> > >
>> > > https://issues.apache.org/jira/browse/SOLR-8291
>> > >
>> > >
>> > >
>> > > In my case useFilterForSortedQuery=false
>> > >
>> > > org.apache.solr.servlet.HttpSolrCall
>> null:java.lang.NullPointerException
>> > > at
>> org.apache.lucene.util.BitSetIterator.(BitSetIterator.java:61)
>> > > at
>> org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:243)
>> > > at
>> > >
>> org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:222)
>> > > at
>> > >
>> >
>> org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
>> > > at
>> > >
>> >
>> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
>> > > at
>> org.apache.solr.respon

Re: SPLITSHARD not working as expected

2019-01-29 Thread Rahul Goswami
Thanks for the reply, Jan. I have been referring to the documentation for
SPLITSHARD on 7.2.1
<https://lucene.apache.org/solr/guide/7_2/collections-api.html#splitshard>
which
seems to be missing some important information present in 7.6
<https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard>.
Especially these two pieces of information.:
"When using splitMethod=rewrite (default) you must ensure that the node
running the leader of the parent shard has enough free disk space i.e.,
more than twice the index size, for the split to succeed "

"The first replicas of resulting sub-shards will always be placed on the
shard leader node"

The idea of having an entire shard (both of its replicas) present on the
same node did come across as unexpected behavior at first.
Anyway, I guess I am going to have to take care of the rebalancing with
MOVEREPLICA following a SPLITSHARD.
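
For anyone else following this, the kind of call I am planning to use for that
rebalancing is below (the replica name and target node are placeholders taken
from the example setup above):

http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=gettingstarted&replica=core_node9&targetNode=localhost:7574_solr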

Thanks for the clarification.


On Mon, Jan 28, 2019 at 3:40 AM Jan Høydahl  wrote:

> This is normal. Please read
> https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard
> PS: Images won't make it to the list, but don't think you need a
> screenshot here, what you describe is the default behaviour.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 28. jan. 2019 kl. 09:05 skrev Rahul Goswami :
> >
> > Hello,
> > I am using Solr 7.2.1. I created a two node example collection on the
> same machine. Two shards with two replicas each. I then called SPLITSHARD
> on shard2 and expected the split shards to have one replica on each node.
> However I see that for shard2_1, both replicas reside on the same node. Is
> this a valid behavior?  Unless I am missing something, this could be
> potentially fatal.
> >
> > Here's the query and the cluster state post split:
> >
> http://localhost:8983/solr/admin/collections?action=SPLITSHARD=gettingstarted=shard2=true
>
> >
> >
> >
> > Thanks,
> > Rahul
>
>


SPLITSHARD not working as expected

2019-01-28 Thread Rahul Goswami
Hello,
I am using Solr 7.2.1. I created a two node example collection on the same
machine. Two shards with two replicas each. I then called SPLITSHARD on
shard2 and expected the split shards to have one replica on each node.
However I see that for shard2_1, both replicas reside on the same node. Is
this a valid behavior?  Unless I am missing something, this could be
potentially fatal.

Here's the query and the cluster state post split:
http://localhost:8983/solr/admin/collections?action=SPLITSHARD=gettingstarted=shard2=true


[image: image.png]

Thanks,
Rahul


Re: Error using collapse parser with /export

2019-01-27 Thread Rahul Goswami
Hi Joel,

Thanks for responding to the query.

Answers to your questions:
1) After collapsing is it not possible to use the /select handler? - The
collapsing itself is causing the failure (or did I not understand your
question right?)
2) After exporting is it possible to unique the records using the
unique Streaming Expression? (This can't be done since we require the
unique document in a group subject to a sort order, as in the query above.
Looking at the Streaming API, the 'unique' streaming expression doesn't give
the capability to sort within a group. Or is there a way to do this?)
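
(Unless something like sorting on the collapse field first and the tie-break
field second, and then applying unique over the collapse field, would get close?
Roughly -- field names as in my earlier mail, and I have not verified whether
this behaves the same as collapse at our scale:

unique(search(mycollection, qt="/export", q="*:*", fl="id_field,field3", sort="id_field asc,field3 desc"), over="id_field")
)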

I re-read the documentation
<https://lucene.apache.org/solr/guide/7_2/collapse-and-expand-results.html>:
"In order to use these features with SolrCloud, the documents must be
located on the same shard."

Looks like the "id_field"  in the collapse criteria above is coming from
documents not present in the same shard. I'll verify this tomorrow and
update the thread.

Thanks,
Rahul

On Mon, Jan 21, 2019 at 2:26 PM Joel Bernstein  wrote:

> I haven't had time to look into the details of this issue but it's not
> clear that these two features will be able to be used together. Although
> that it would be nice if the could.
>
> A couple of questions about your use case:
>
> 1) After collapsing is it not possible to use the /select handler?
> 2) After exporting is it possible to unique the records using the unique
> Streaming Expression?
>
> Either of those cases would be the typical uses of these features.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Sun, Jan 20, 2019 at 10:13 PM Rahul Goswami 
> wrote:
>
> > Hello,
> >
> > Following up on my query. I know this might be too specific an issue.
> But I
> > just want to know that it's a legitimate bug and the supported operation
> is
> > allowed with the /export handler. If someone has an idea about this and
> > could confirm, that would be great.
> >
> > Thanks,
> > Rahul
> >
> > On Thu, Jan 17, 2019 at 4:58 PM Rahul Goswami 
> > wrote:
> >
> > > Hello,
> > >
> > > I am using SolrCloud on Solr 7.2.1.
> > > I get the NullPointerException in the Solr logs (in ExportWriter.java)
> > > when the /stream handler is invoked with a search() streaming
> expression
> > > with qt="/export" containing fq="{!collapse field=id_field sort="time
> > > desc"} (among other fq's. I tried eliminating one fq at a time to find
> > the
> > > problematic one. The one with collapse parser is what makes it fail).
> > >
> > >
> > > I see an open JIRA for this issue (with a submitted patch which has not
> > > yet been accepted):
> > >
> > > https://issues.apache.org/jira/browse/SOLR-8291
> > >
> > >
> > >
> > > In my case useFilterForSortedQuery=false
> > >
> > > org.apache.solr.servlet.HttpSolrCall
> null:java.lang.NullPointerException
> > > at org.apache.lucene.util.BitSetIterator.(BitSetIterator.java:61)
> > > at
> org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:243)
> > > at
> > >
> org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:222)
> > > at
> > >
> >
> org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
> > > at
> > >
> >
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
> > > at
> org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
> > > at
> > >
> org.apache.solr.handler.ExportWriter.lambda$null$2(ExportWriter.java:222)
> > > at
> > >
> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
> > > at
> > >
> >
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:198)
> > > at
> org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
> > > at
> > >
> >
> org.apache.solr.handler.ExportWriter.lambda$write$3(ExportWriter.java:220)
> > > at
> > >
> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
> > > at org.apache.solr.handler.ExportWriter.write(ExportWriter.java:218)
> > > at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2627)
> > > at
> > >
> >
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
> > > at
> > >
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:788)
> > > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:525)
> > >
> > >
> > > Above is a smaller trace; I can provide the complete stacktrace if it
> > > helps. Before considering making a fix in ExportWriter.java and
> > rebuilding
> > > Solr as a last resort, I want to make sure I am not using something
> which
> > > is not supported on SolrCloud. Can anybody please help?
> > >
> > >
> > >
> >
>


Re: Error using collapse parser with /export

2019-01-20 Thread Rahul Goswami
Hello,

Following up on my query. I know this might be too specific an issue, but I
just want to confirm whether this is a legitimate bug and whether the operation
is supposed to be supported with the /export handler. If someone has an idea
about this and could confirm, that would be great.

Thanks,
Rahul

On Thu, Jan 17, 2019 at 4:58 PM Rahul Goswami  wrote:

> Hello,
>
> I am using SolrCloud on Solr 7.2.1.
> I get the NullPointerException in the Solr logs (in ExportWriter.java)
> when the /stream handler is invoked with a search() streaming expression
> with qt="/export" containing fq="{!collapse field=id_field sort="time
> desc"} (among other fq's. I tried eliminating one fq at a time to find the
> problematic one. The one with collapse parser is what makes it fail).
>
>
> I see an open JIRA for this issue (with a submitted patch which has not
> yet been accepted):
>
> https://issues.apache.org/jira/browse/SOLR-8291
>
>
>
> In my case useFilterForSortedQuery=false
>
> org.apache.solr.servlet.HttpSolrCall null:java.lang.NullPointerException
> at org.apache.lucene.util.BitSetIterator.(BitSetIterator.java:61)
> at org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:243)
> at
> org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:222)
> at
> org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
> at org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
> at
> org.apache.solr.handler.ExportWriter.lambda$null$2(ExportWriter.java:222)
> at
> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:198)
> at org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
> at
> org.apache.solr.handler.ExportWriter.lambda$write$3(ExportWriter.java:220)
> at
> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
> at org.apache.solr.handler.ExportWriter.write(ExportWriter.java:218)
> at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2627)
> at
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
> at
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:788)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:525)
>
>
> Above is a smaller trace; I can provide the complete stacktrace if it
> helps. Before considering making a fix in ExportWriter.java and rebuilding
> Solr as a last resort, I want to make sure I am not using something which
> is not supported on SolrCloud. Can anybody please help?
>
>
>


Error using collapse parser with /export

2019-01-17 Thread Rahul Goswami
Hello,

I am using SolrCloud on Solr 7.2.1.
I get the NullPointerException in the Solr logs (in ExportWriter.java) when
the /stream handler is invoked with a search() streaming expression with
qt="/export" containing fq="{!collapse field=id_field sort="time desc"}
(among other fq's. I tried eliminating one fq at a time to find the
problematic one. The one with collapse parser is what makes it fail).


I see an open JIRA for this issue (with a submitted patch which has not yet
been accepted):

https://issues.apache.org/jira/browse/SOLR-8291



In my case useFilterForSortedQuery=false

org.apache.solr.servlet.HttpSolrCall null:java.lang.NullPointerException
at org.apache.lucene.util.BitSetIterator.(BitSetIterator.java:61)
at org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:243)
at org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:222)
at
org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
at
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
at org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
at org.apache.solr.handler.ExportWriter.lambda$null$2(ExportWriter.java:222)
at org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
at
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:198)
at org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
at
org.apache.solr.handler.ExportWriter.lambda$write$3(ExportWriter.java:220)
at org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
at org.apache.solr.handler.ExportWriter.write(ExportWriter.java:218)
at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2627)
at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:788)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:525)


Above is a smaller trace; I can provide the complete stacktrace if it
helps. Before considering making a fix in ExportWriter.java and rebuilding
Solr as a last resort, I want to make sure I am not using something which
is not supported on SolrCloud. Can anybody please help?


Re: Able to search with indexed=false and docvalues=true

2018-11-20 Thread Rahul Goswami
Erick and Toke,

Thank you for the replies. I am surprised there already isn’t a JIRA for
this. In my opinion, this should be an error condition on search or
alternatively should simply be giving zero results. That would be a defined
behavior, as opposed to now, where the searches are not particularly
functional for any industry-sized load anyway.

Thanks,
Rahul

On Tue, Nov 20, 2018 at 3:37 AM Toke Eskildsen  wrote:

> On Mon, 2018-11-19 at 22:19 -0500, Rahul Goswami wrote:
> > I am using SolrCloud 7.2.1. My understanding is that setting
> > docvalues=true would optimize faceting, grouping and sorting; but for
> > a field to be searchable it needs to be indexed=true.
>
> Erick explained the search thing, so I'll just note that faceting on a
> DocValues=true indexed=false field on a multi-shard index also has a
> performance penalty as the field will be slow-searched (using the
> DocValues) in the secondary fine-counting phase.
>
> - Toke Eskildsen, Royal Danish Library
>
>
>


Re: Error:Missing Required Fields for Atomic Updates

2018-11-19 Thread Rahul Goswami
What is the router name for your collection? Is it "implicit"? (You can
tell from the "Overview" of your collection in the admin UI.) If yes,
what is the router.field parameter the collection was created with?

Rahul


On Mon, Nov 19, 2018 at 11:19 PM Rajeswari Kolluri <
rajeswari.koll...@oracle.com> wrote:

>
> Hi Rahul
>
> Below is part of the schema; entityid is my unique id field. I am getting the
> exception "missing required field" for "category" during atomic updates.
>
>
> entityid
>  required="true" multiValued="false" />
>  required="false" multiValued="false" />
>  stored="true" required="false" multiValued="false" />
>  stored="true" required="false" multiValued="false" />
>  stored="true" required="false" multiValued="false" />
>  stored="true" required="false" multiValued="false" />
>  stored="true" required="false" multiValued="false" />
>  required="true" docValues="true" />
>  required="false" multiValued="true" />
>
>
>
> Thanks
> Rajeswari
>
> -Original Message-
> From: Rahul Goswami [mailto:rahul196...@gmail.com]
> Sent: Tuesday, November 20, 2018 9:33 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Error:Missing Required Fields for Atomic Updates
>
> What’s your update query?
>
> You need to provide the unique id field of the document you are updating.
>
> Rahul
>
> On Mon, Nov 19, 2018 at 10:58 PM Rajeswari Kolluri <
> rajeswari.koll...@oracle.com> wrote:
>
> > Hi,
> >
> >
> >
> >
> >
> > Using Solr 7.5.0.  While performing atomic updates on a document on
> > Solr Cloud using SolrJ, I am getting the exception "Missing Required Field".
> >
> >
> >
> > Please let me know the solution; I would not want to update the
> > required fields during atomic updates.
> >
> >
> >
> >
> >
> > Thanks
> >
> > Rajeswari
> >
>


Re: Error:Missing Required Fields for Atomic Updates

2018-11-19 Thread Rahul Goswami
What’s your update query?

You need to provide the unique id field of the document you are updating.
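For illustration, an atomic update only needs the uniqueKey plus the fields being
changed (the field names here are assumptions based on this thread — a uniqueKey
called entityid and a category field):

  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/yourcollection/update?commit=true' \
    --data-binary '[{"entityid":"doc-123","category":{"set":"new-value"}}]'

Note that atomic updates also require the document's other fields to be stored (or
have docValues) so Solr can rebuild the full document behind the scenes.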

Rahul

On Mon, Nov 19, 2018 at 10:58 PM Rajeswari Kolluri <
rajeswari.koll...@oracle.com> wrote:

> Hi,
>
>
>
>
>
> Using Solr 7.5.0.  While performing atomic updates on a document on Solr
> Cloud using SolrJ, I am getting the exception "Missing Required Field".
>
>
>
> Please let me know the solution; I would not want to update the required
> fields during atomic updates.
>
>
>
>
>
> Thanks
>
> Rajeswari
>


Able to search with indexed=false and docvalues=true

2018-11-19 Thread Rahul Goswami
I am using SolrCloud 7.2.1. My understanding is that setting docvalues=true
would optimize faceting, grouping and sorting; but for a field to be
searchable it needs to be indexed=true. However I was dumbfounded today
when I executed a successful search on a field with below configuration:
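(The field definition was stripped somewhere along the way; it was along these lines,
with the field and type names here being only placeholders:)

  <field name="some_field" type="string" indexed="false" stored="false" docValues="true" />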

However the searches don't always complete and often time out.

My question is...
Is searching on docValues=true and indexed=false fields supported? If yes,
in which cases?
What are the pitfalls (as I see that searches, although sometimes
successful, are atrociously slow and quite often time out)?


Re: Explode kind of function in Solr

2018-09-14 Thread Rahul Singh
https://github.com/bazaarvoice/jolt
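
For the Solr-side route Joel describes below, a cartesianProduct streaming expression
over the example document would look roughly like this (the collection name is a
placeholder; use qt="/export" only if id and phone have docValues):

  cartesianProduct(
    search(mycollection, q="id:1", fl="id,phone", sort="id asc", qt="/export"),
    phone,
    productSort="phone asc"
  )

That emits one tuple per phone value: {id:1, phone:11}, {id:1, phone:22}, {id:1, phone:33}.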

On Thu, Sep 13, 2018 at 9:18 AM Joel Bernstein  wrote:

> Solr Streaming Expressions allow you to do this with the cartesianProduct
> function:
>
>
> http://lucene.apache.org/solr/guide/7_4/stream-decorator-reference.html#cartesianproduct
>
> The structure of the expression is:
>
> cartesianProduct(search(...))
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Sep 13, 2018 at 6:21 AM Rushikesh Garadade <
> rushikeshgarad...@gmail.com> wrote:
>
> > Hello All,
> > Is there any functionality in solr that can convert (explode) results from
> > 1 document to many documents?
> > *Example: *
> > Lets say I have doc:
> > {
> > id:1,
> > phone: [11,22,33]
> > }
> >
> > when I query to solr with id=1 I want result as below:
> > [{
> > id:1,
> > phone:11
> > },
> > {
> > id:1,
> > phone:22
> > },
> > {
> > id:1,
> > phone:33
> > }]
> >
> > Please let me know if this is possible in Solr , if Yes how?
> >
> > Thanks,
> > Rushikesh Garadade
> >
>


Re: 20180913 - Clarification about Limitation

2018-09-13 Thread Rahul Singh
Depends on whether you are using standalone Solr or SolrCloud. SolrCloud distributes data
into shards, so it increases overall capacity.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 13, 2018, 4:50 AM -0400, Rekha , wrote:
> Hi Solr Team,
> I am new to SOLR. I need the following clarifications from you.
> How many documents can be stored in one core? Is there any limit on the number
> of fields per document? How many cores can be created in one SOLR? Is
> there any other limitation based on disk storage size? I mean,
> some databases have a 10 GB limit, so I am asking along those lines. Can we use
> SOLR as a database?
> Thanks, Rekha Karthick


Re: parent/child rows in solr

2018-09-13 Thread Rahul Singh
What’s your SLA? It seems that you have two problems - finding correlated 
information that’s in a hierarchy and potentially displaying it.

I feel your desire to conflate the two is forcing you down a specific path. 
Often times in complex scenarios I’ve found that an index like Solr is better 
for the search and not necessarily the storage or display.

The question I have is : what’s your application workflow? Who is querying this 
data? How are they expecting to see it? How fast do they need that data?

I understand what you’ve described (which seems to be a non-functional
requirement) of what you want to do, but in order to help it would be useful
for me at least to know how the data is ingested,
enhanced, and retrieved.

In terms of data volume, you may consider indexing all data, but not storing
all of it. This makes sure you aren’t duplicating data, which would be an awful
waste of space.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 11, 2018, 11:23 PM -0400, John Smith , wrote:
> On Tue, Sep 11, 2018 at 11:05 PM Walter Underwood 
> wrote:
>
> > Have you tried modeling it with multivalued fields?
> >
> >
> That's an interesting idea, but I don't think that would work. We would
> lose the concept of "rows". So let's say child1 has col "a" and col "b",
> both are turned into multi-value fields in the solr index. Normally in sql
> we can query for a specific value in col "a", and then see what the
> associated value in col "b" would be, but we can't do that if we stuff the
> col values in multi-value; we can no longer see which value from col "a"
> corresponds to which value in col "b". I'm probably explaining that poorly,
> but I just don't see how that would work.


Re: Boost only first 10 records

2018-09-03 Thread Rahul Singh
I agree, the two-query solution is the simplest to implement and you have much
more control on the UI as well. It seems you want to have a “featured” set of
results above and separate from the organic results from the index.

You could choose to request only specific fields in the “featured” query.
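
As a rough sketch of the two-query approach (the genre filter and row counts are just
assumptions taken from the example below):

  /solr/books/select?q=<user query>&fq=genre:comedy&rows=3
  /solr/books/select?q=<user query>&fq=-genre:comedy&rows=10

The UI then renders the first result set as the “featured” block above the organic
results from the second.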

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 3, 2018, 6:29 AM -0400, Emir Arnautović , 
wrote:
> Hi,
> The requirement is not 100% clear or logical. If user selects filter 
> type:comedy, it does not make sense to show anything else. You might have 
> “Other categories relevant results” and that can be done as a separate query. 
> It seems that you want to prefer comedy, but you have an issue with boosting 
> it too much results in only comedy top results and boosting it too little 
> does not result in comedy being top hit all the time. Boosting is usually 
> used to prefer one type if there are similar results but that does not 
> guarantee that they will be top all the time. Your options are:
> 1. tune boost parameter so the results are as expected in most times (it will 
> never be all the times)
> 2. use collapse (group) feature to make sure you get results from all 
> categories
> 3. have two queries and combine results on UI side
> 4. use faceting in combination with query and let user choose genre.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Sep 2018, at 08:48, mama  wrote:
> >
> > Hi
> > We have requirement to boost only first few records & rest of result should
> > be as per search.
> > e.g. if I have books of different genres & if a user searches for some book
> > (interested in genre: comedy) then
> > we want to show, say, the first 3 records of genre:comedy and the rest of the results
> > should be of different genres.
> > Reason for this is , we have lots of books in db , if we boost comedy genre
> > then first 100s of records will be comedy and user may not be aware of other
> > books.
> > is it possible ?
> >
> > Query for boosting genre comedy
> > genre:comedy^0.5
> >
> > can someone help with requirement of limiting boost to first few records ?
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Metrics for a healthy Solr cluster

2018-08-17 Thread Rahul Singh
I wrote something related to this topic a while ago.

https://www.google.com/amp/s/blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/amp/
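
As a quick starting point, the metrics API Jan mentions below can also be queried
directly, e.g. (host and port assumed):

  curl "http://localhost:8983/solr/admin/metrics?group=jvm,jetty,core"

That returns JVM, Jetty and per-core metrics as JSON, which most monitoring agents can scrape.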

Rahul
On Aug 16, 2018, 3:35 PM -0700, Jan Høydahl , wrote:
> Check out the Reference Guide chapter on monitoring with open source 
> Prometheus and Grafana.
> https://lucene.apache.org/solr/guide/7_4/monitoring-solr-with-prometheus-and-grafana.html
>  
> <https://lucene.apache.org/solr/guide/7_4/monitoring-solr-with-prometheus-and-grafana.html>
>
> You'll get a nice dashboard with key metrics and be able to tweak thresholds, 
> alerts etc.
> Note that Solr now has a rich REST based metrics API so you don't need JMX 
> anymore.
> Also, solr has now got some Metrics History capabilities built-in, see 
> https://lucene.apache.org/solr/guide/7_4/metrics-history.html 
> <https://lucene.apache.org/solr/guide/7_4/metrics-history.html>
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 16. aug. 2018 kl. 17:24 skrev Greenhorn Techie :
> >
> > Hi,
> >
> > Solr provides numerous JMX metrics for monitoring the health of the
> > cluster. We are setting up a SolrCloud cluster and hence wondering what are
> > the important parameters / metrics to look into, to ascertain that the
> > cluster health is good. Obvious things comes to my mind are CPU utilisation
> > and memory utilisation.
> >
> > However, wondering what are the other parameters to look into from the
> > health of the cluster? Are there any best practices?
> >
> > Thanks
>


Re: Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Rahul Singh
Bjarke,

I am imagining that at some point you may need to shard that data if it grows. 
Or do you imagine this data to remain stagnant?

Generally you want to add SolrCloud to do three things: 1. Increase availability
with replicas 2. Increase available data capacity via shards 3. Increase fault tolerance
with leaders and replicas being spread around the cluster.

You would be bypassing general High availability / distributed computing 
processes by trying to not reindex.

Rahul
On Aug 7, 2018, 7:06 AM -0400, Bjarke Buur Mortensen , 
wrote:
> Hi List,
>
> is there a cookbook recipe for moving an existing solr core to a solr cloud
> collection.
>
> We currently have a single machine with a large core (~150gb), and we would
> like to move to solr cloud.
>
> I haven't been able to find anything that reuses an existing index, so any
> pointers much appreciated.
>
> Thanks,
> Bjarke


RE: create collection from existing managed-schema

2018-07-26 Thread Rahul Chhiber
Hi,

If you want to share schema and/or other configurations between collections, 
you need to create a configset. Then, specify this configset while creating any 
collections.

Any changes made to that configset or schema will reflect in all collections 
that are using it.

By default, Solr has the _default configset for any collections created without 
explicit configset.
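
A minimal sketch of that flow on SolrCloud (the ZooKeeper address, config directory and
names are placeholders):

  # upload the config directory (the one containing managed-schema) as a configset
  bin/solr zk upconfig -n shared_config -d /path/to/conf -z localhost:2181

  # create a new collection that uses it
  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=newcollection&numShards=1&collection.configName=shared_config"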

Regards,
Rahul Chhiber

-Original Message-
From: Chuming Chen [mailto:chumingc...@gmail.com] 
Sent: Thursday, July 26, 2018 11:35 PM
To: solr-user@lucene.apache.org
Subject: create collection from existing managed-schema

Hi All,

From Solr Admin interface, I have created a collection and added field
definitions. I can get its managed-schema from the Admin interface.

Can I use this managed-schema to create a new collection? How?

Thanks,

Chuming




Re: Silk from LucidWorks

2018-07-15 Thread Rahul Singh
Their commercial offering still has something like it. You can always try 
Grafana.

Rahul
On Jul 13, 2018, 9:59 AM -0400, rgummadi , wrote:
> Is SiLK from LucidWorks still an active project? I looked at their github and
> it does not seem to be active. If so are there any alternative solutions.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Text Similarity

2018-07-15 Thread Rahul Singh
How do you define similarity? There are various methods that work for
different situations. In Solr, depending on which index-time analyzer / tokenizer 
you are using, it will treat one company name as similar in one scenario and 
not in another.

This seems like a case of data deduplication — the join I’m pretty sure works 
on exact matches.

Consider creating an “identity” collection where you map the different names to 
a unique identity key. This could then be technically be joined on two datasets 
and then those could be joined again.

Rahul
On Jul 11, 2018, 4:42 PM -0400, Aroop Ganguly 
, wrote:
> Hi Team
>
> This is what I want to do:
> 1. I have 2 datasets of the schema id-number and company-name
> 2. I want to ultimately be able to link (join or any other means) the 2 data 
> sets based on the similarity between the company-name fields of the 2 data 
> set.
>
> Example:
>
> Dataset 1
> 
> Id | Company Name
> ---|------------------------
> 1 | Aroop Inc
> 2 | Ganguly & Ganguly Corp
>
>
> Dataset 2
> 
> Yo Revenue | Company Name
> -----------|-------------
> 1K | aroop and sons
> 2K | Ganguly Corp
> 3K | Ganguly and Ganguly
> 2K | Aroop Inc.
> 6K | Ganguly Corporation
>
>
>
> I want to be able to get a join in the end, based on a smart similarity score 
> between the company names in the 2 data sets.
>
> Final Dataset
> Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
> ---|------------------------|---------|------------------------------------|-----------------
> 1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
> 2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%
>
> How should I proceed? (I have preprocessed the data sets to lowercase it and 
> remove non essential words like pronouns and acronyms like LTD or Co. )
>
> Thanks
> Aroop


Regarding pdf indexing issue

2018-07-11 Thread Rahul Prasad Dwivedi
Hello Team,

I am using Solr for indexing and searching PDF documents.

I have gone through your website documentation and installed Solr, but I am unable
to index and search the documents.

For example: suppose we have a PDF file which has a number of paragraphs, each with a
separate heading.

So if I search for a heading in the indexed PDF, the result should contain
the paragraph to which that heading belongs.

I am unable to perform this task.

I have run the below command to upload the pdf

bin/post -c gettingstarted pdf-sample.pdf

and for searching I am running the command

curl http://localhost:8983/solr/gettingstarted/select?q='*'

Please suggest me anything and let me know if I am missing anything

Thanks,

Rahul


Re: Delta import not working with Oracle in Solr

2018-07-10 Thread Rahul Singh
Agreed. DIH is not an industrial-grade ETL tool, so you may want to consider other
options. You may want to look into Kafka Connect as an alternative. It has
connectors for JDBC into Kafka, and from Kafka into Solr.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 9, 2018, 6:14 AM -0500, Alexandre Rafalovitch , 
wrote:
> I think you are moving so fast it is hard to understand where you need help.
>
> Can you setup one clean smallest issue (maybe as test) and try our original
> suggestions.
>
> Otherwise, nobody has enough attention energy to figure out what is
> happening.
>
> And even then, this list is voluntary help, we are just trying to give you
> pointers the best we can. It is quite possible you have outgrown DIH and
> need to move up to a propper stand alone ETL tool.
>
> Regards,
> Alex
>
> On Sun, Jul 8, 2018, 11:49 PM shruti suri,  wrote:
>
> > Still not working, same issue documents are not getting pushed to index.
> >
> >
> >
> > -
> > Regards
> > Shruti
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >


RE: cmd to enable debug logs

2018-07-09 Thread Rahul Chhiber
Use -v option in the bin/solr start command.

Regards,
Rahul Chhiber


-Original Message-
From: Prateek Jain J [mailto:prateek.j.j...@ericsson.com] 
Sent: Monday, July 09, 2018 4:26 PM
To: solr-user@lucene.apache.org
Subject: cmd to enable debug logs


Hi All,

What's the command (from the CLI) to enable debug logs for a core in Solr? To be
precise, I am using Solr 4.8.1. I looked into the admin guide and it talks about
how to do it from the UI, but nothing from a CLI perspective. Any pointers would
be of help.

Note: I can't update solrconfig.xml.


Regards,
Prateek Jain



Re: How to know the name(url) of documents that data import handler skipped

2018-07-08 Thread Rahul Singh
Have you tried changing the log level?
https://lucene.apache.org/solr/guide/7_2/configuring-logging.html


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 8, 2018, 8:54 PM -0500, Yasufumi Mizoguchi , 
wrote:
> Hi,
>
> I am trying to indexing files into Solr 7.2 using data import handler with
> onError=skip option.
> But, I am struggling with determining the skipped documents as logs do not
> tell which file was bad.
> So, how can I know those files?
>
> Thanks,
> Yasufumi


Resources for Monitoring Cassandra, Spark, Solr

2018-07-02 Thread Rahul Singh
Folks,
We often get questions on monitoring here so I assembled this post with 
articles from those in the community as well as links to the component tools to 
give folks a more comprehensive listing.

https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/
This is a work in progress and I'll update this with screenshots as well as 
with links from other contributors.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation


Re: Drive Change for Solr Setup

2018-06-21 Thread Rahul Singh
If it’s windows it may be using a tool called NSSM to manage the solr service.

Look at Windows services and the task scheduler and understand if the Solr services are
being managed by Windows via services or the task scheduler — or just batch
files.

Rahul
On Jun 20, 2018, 11:34 AM -0400, Shawn Heisey , wrote:
> On 6/20/2018 5:03 AM, Srinivas Muppu (US) wrote:
> > Hi Solr Team, My Solr project installation setup and instances (including
> > clustered Solr, ZK services and indexing job schedulers) are available on the
> > Windows 'E:\' drive in the production environment. As the business needs to remove
> > the E:\ drive, going forward the D:\ drive will be used and operational. Is
> > there any possible solution/steps for moving the Solr installation setup
> > from the 'E' drive to the 'D' drive without any impact to the existing
> > application (it should not require re-indexing again)?
>
> Exactly what needs to be done will be highly dependent on how you
> installed Solr on your system.  The project doesn't have any specific
> installation steps for Windows, so we have absolutely no idea what you
> have done.  Whoever set up your Solr install is going to know a LOT more
> about it than we ever can.
>
> At a high level, without any information specific to your setup, here's
> the steps you need:
>
>  * Stop Solr
>  * Move or copy files to the new location
>  * Change the solr home and possibly other config
>  * Start Solr.
>
> Thanks,
> Shawn
>


Re: Solr Cloud 7.3.1 backups

2018-05-31 Thread Rahul Singh
Greg,

Is SolR your main system of record or is it a secondary index to a primary data 
store?

Depending on the answer to that question I would recommend different options.

If primary, then I would ask what is the underlying compute infrastructure. Is 
it container, VM , or bare metal.

There are some decent distributed shared file system services that could be 
leveraged depending on the number of compute nodes.

A shared file system is the best way to keep it consistent, but it comes with its
drawbacks. You can always back up locally and asynchronously sync to a shared FS
too.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On May 30, 2018, 5:16 PM -0400, Greg Roodt , wrote:
> Thanks for the confirmation Shawn. Distributed systems are hard, so this
> makes sense.
>
> I have a large, stable cluster (stable in terms of leadership and
> performance) with a single shard. The cluster scales up and down with
> additional PULL replicas over the day with the traffic curve.
>
> It's going to take a bit of coordination to get all nodes to mount a shared
> volume when we take a backup and then unmount when done.
>
> Any idea what happens if a node joins or leaves during a backup?
>
>
>
>
>
>
>
>
>
> On Thu, 31 May 2018 at 06:14, Shawn Heisey  wrote:
>
> > On 5/29/2018 3:01 PM, Greg Roodt wrote:
> > > What is the best way to perform a backup of a Solr Cloud cluster? Is
> > there
> > > a way to backup only the leader? From my tests with the collections admin
> > > BACKUP command, all nodes in the cluster need to have access to a shared
> > > filesystem. Surely that isn't necessary if you are backing up the leader
> > or
> > > TLOG replica?
> >
> > If you have more than one Solr instance in your cloud, then all of those
> > instances must have access to the same filesystem accessed from the same
> > mount point. Together, they will write the entire collection to various
> > subdirectories in that location.
> >
> > I can't find any mention of whether backups are load balanced across the
> > cloud, or if they always use leaders. I would assume the former. If
> > that's how it works, then you don't know which machine is going to do
> > the backup of a given shard. Even if the backup always uses leaders,
> > you can't always be sure of where a leader is. It can change from
> > moment to moment, especially if you're having stability problems with
> > your cloud.
> >
> > At restore time, there's a similar situation. You don't know which
> > machine(s) in the cloud are going to be actually loading index data from
> > the backup location. So they all need to have access to the same data.
> >
> > Thanks,
> > Shawn
> >
> >


Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Rahul Singh
Right,
That’s why you need a place to persist the task list / graph. If you use a
table, you can set a “processed” / “unprocessed” value … or with a queue, each item is
delivered only once … otherwise you have to check the indexed date from Solr, and
waste a Solr call.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 24, 2018, 12:54 PM -0500, Adhyan Arizki <a.ari...@gmail.com>, wrote:
> You will still need to devise a way to partition the data source even if you 
> are scheduling multiple jobs; otherwise, you might end up digesting the same 
> data again and again.
>
> > On Fri, May 25, 2018 at 12:46 AM, Raymond Xie <xie3208...@gmail.com> wrote:
> > > Thank you all for the suggestions. I'm now tending towards not using
> > > traditional parallel indexing. My data are JSON files with metadata
> > > extracted from raw data received and archived into our data server
> > > cluster.
> > > Those data come in various flows and reside in their respective folders;
> > > splitting them might introduce unnecessary extra work and could end up
> > > with trouble. So instead of that, maybe it would be easier to simply schedule
> > > multiple indexing jobs separately?
> > >
> > > Thanks.
> > >
> > > Raymond
> > >
> > >
> > > Rahul Singh <rahul.xavier.si...@gmail.com> 于 2018年5月24日周四 上午11:23写道:
> > >
> > > > Resending to list to help more people..
> > > >
> > > > This is an architectural pattern to solve the same issue that arises 
> > > > over
> > > > and over again.. The queue can be anything — a table in a database, 
> > > > even a
> > > > collection solr.
> > > >
> > > > And yes I have implemented it —  I did it in C# before using a SQL 
> > > > Server
> > > > table based queue -- (http://github.com/appleseed/search-stack) — and
> > > > then made the indexer be able to write to lucene, elastic or solr 
> > > > depending
> > > > config. Im not actively maintaining this right now ,but will consider
> > > > porting it to Kafka + Spark + Kafka Connect based system when I find 
> > > > time.
> > > >
> > > > In Kafka however, you have a lot of potential with Kafka Connect . Here 
> > > > is
> > > > an example using Cassandra..
> > > > But the premise is the same Kafka Connect has libraries of connectors 
> > > > for
> > > > different source / sinks … may not work for files but for pure raw data,
> > > > Kafka Connect is good.
> > > >
> > > > Here’s a project that may guide you best.
> > > >
> > > >
> > > > http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/
> > > >
> > > > I dont know where this guys code went.. but the content is there with 
> > > > code
> > > > samples.
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, 
> > > > wrote:
> > > >
> > > > Thank you Rahul despite that's very high level.
> > > >
> > > > With no offense, do you have a successful implementation or it is just
> > > > your unproven idea? I never used Rabbit nor Kafka before but would be 
> > > > very
> > > > interested in knowing more detail on the Kafka idea as Kafka is 
> > > > available
> > > > in my environment.
> > > >
> > > > Thank you again and look forward to hearing more from you or anyone in
> > > > this Solr community.
> > > >
> > > >
> > > > **
> > > > *Sincerely yours,*
> > > >
> > > >
> > > > *Raymond*
> > > >
> > > > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh 
> > > > <rahul.xavier.si...@gmail.com
> > > > > wrote:
> > > >
> > > >> Enumerate the file locations (map) , put them in a queue like rabbit or
> > > >> Kafka (Persist the map), have a bunch of threads , workers, containers,
> > > >> whatever pop off the queue , process the item (reduce).
> > > >>
> > > >>
> > > >> --
> > > >> Rahul Singh
> > > >> rahul.si...@anant.us
> > > >>
> > > >> Anant Corporation
> > > >>
> > > >> On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>,
> > > >> wrote:
> > > >>
> > > >> I know how to do indexing on file system like single file or folder, 
> > > >> but
> > > >> how do I do that in a parallel way? The data I need to index is of huge
> > > >> volume and can't be put on HDFS.
> > > >>
> > > >> Thank you
> > > >>
> > > >> **
> > > >> *Sincerely yours,*
> > > >>
> > > >>
> > > >> *Raymond*
> > > >>
> > > >>
> > > >
>
>
>
> --
>
> Best regards,
> Adhyan Arizki


Re: How to do parallel indexing on files (not on HDFS)

2018-05-24 Thread Rahul Singh
Resending to list to help more people..

This is an architectural pattern to solve the same issue that arises over and
over again. The queue can be anything — a table in a database, even a
collection in Solr.

And yes I have implemented it — I did it in C# before using a SQL Server table-
based queue (http://github.com/appleseed/search-stack) — and then made the
indexer able to write to Lucene, Elastic or Solr depending on config. I'm not
actively maintaining this right now, but will consider porting it to a Kafka +
Spark + Kafka Connect based system when I find time.

In Kafka, however, you have a lot of potential with Kafka Connect. Here is an
example using Cassandra.
But the premise is the same: Kafka Connect has libraries of connectors for
different sources / sinks … it may not work for files, but for pure raw data, Kafka
Connect is good.

Here’s a project that may guide you best.

http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/

I don't know where this guy's code went, but the content is there with code
samples.




--

On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, wrote:
> Thank you Rahul, though that's very high level.
>
> With no offense, do you have a successful implementation or it is just your 
> unproven idea? I never used Rabbit nor Kafka before but would be very 
> interested in knowing more detail on the Kafka idea as Kafka is available in 
> my environment.
>
> Thank you again and look forward to hearing more from you or anyone in this 
> Solr community.
>
>
> 
> Sincerely yours,
>
>
> Raymond
>
> > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh <rahul.xavier.si...@gmail.com> 
> > wrote:
> > > Enumerate the file locations (map) , put them in a queue like rabbit or 
> > > Kafka (Persist the map), have a bunch of threads , workers, containers, 
> > > whatever pop off the queue , process the item (reduce).
> > >
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> > > > I know how to do indexing on file system like single file or folder, but
> > > > how do I do that in a parallel way? The data I need to index is of huge
> > > > volume and can't be put on HDFS.
> > > >
> > > > Thank you
> > > >
> > > > **
> > > > *Sincerely yours,*
> > > >
> > > >
> > > > *Raymond*
>


Re: How to do parallel indexing on files (not on HDFS)

2018-05-23 Thread Rahul Singh
Enumerate the file locations (map), put them in a queue like Rabbit or Kafka
(persist the map), and have a bunch of threads, workers, containers, whatever,
pop items off the queue and process them (reduce).
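
A minimal in-process sketch of that flow with SolrJ (the collection URL, paths and field
names are placeholders; swap the in-memory queue for Rabbit/Kafka if the task list has
to survive restarts):

  import java.nio.charset.StandardCharsets;
  import java.nio.file.*;
  import java.util.concurrent.*;
  import java.util.stream.Stream;
  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelFileIndexer {
      public static void main(String[] args) throws Exception {
          // map: enumerate the file locations into a queue
          BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
          try (Stream<Path> paths = Files.walk(Paths.get("/data/json"))) {
              paths.filter(Files::isRegularFile).forEach(queue::add);
          }
          // ConcurrentUpdateSolrClient batches adds and sends them on background threads
          ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                  "http://localhost:8983/solr/mycollection")
                  .withQueueSize(1000).withThreadCount(4).build();
          ExecutorService workers = Executors.newFixedThreadPool(8);
          for (int i = 0; i < 8; i++) {
              workers.submit(() -> {
                  Path p;
                  while ((p = queue.poll()) != null) {
                      try {
                          // reduce: turn one file into one document (real parsing left out)
                          SolrInputDocument doc = new SolrInputDocument();
                          doc.addField("id", p.toString());
                          doc.addField("content_txt",
                                  new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                          solr.add(doc);
                      } catch (Exception e) {
                          e.printStackTrace(); // or push to a "failed" list for retry
                      }
                  }
              });
          }
          workers.shutdown();
          workers.awaitTermination(1, TimeUnit.HOURS);
          solr.blockUntilFinished();
          solr.commit();
          solr.close();
      }
  }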


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> I know how to do indexing on file system like single file or folder, but
> how do I do that in a parallel way? The data I need to index is of huge
> volume and can't be put on HDFS.
>
> Thank you
>
> **
> *Sincerely yours,*
>
>
> *Raymond*


Re: Multi threading indexing

2018-05-16 Thread Rahul Singh
Can try to leverage Spark to index. Or Kafka Connect with SolR.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 14, 2018, 2:03 AM -0500, Mikhail Khludnev <m...@apache.org>, wrote:
> A few years ago I provided server side concurrency "booster"
> https://issues.apache.org/jira/browse/SOLR-3585.
> But now, I'd rather suppose it's client-side (or ETL) duty.
>
> On Mon, May 14, 2018 at 6:39 AM, Raymond Xie <xie3208...@gmail.com> wrote:
>
> > Hello,
> >
> > I have a huge amount of data (TB level) to be indexed, I am wondering if
> > anyone can share your idea/code to do the multithreading indexing?
> >
> > **
> > *Sincerely yours,*
> >
> >
> > *Raymond*
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev


Re: SolrCloud

2018-05-16 Thread Rahul Singh
Having concurrent DIH runs, for example from the same source on different cluster
nodes, may cause duplicate work. But yes, ZK is what distributes the conf.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 16, 2018, 4:55 AM -0500, Jon Morisi <jon.mor...@hsc.utah.edu>, wrote:
> Hi All,
> I'm looking for additional information on how to configure an encrypted 
> password for the DIH Configuration File, when using solrcloud:
> https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#configuring-the-dih-configuration-file
>
> Is this compatible with solrcloud / zookeeper?
> What values are to be used for encryptKeyFile when running in SolrCloud?
> Is this a reference to a local directory?
> Is this a reference to a zookeeper directory?
> Should I put the file in my collections zookeeper conf dir, using the file 
> name only?
>
> Thanks,
> Jon


Re: Apache SOLR Design Query

2018-05-13 Thread Rahul Singh
This is a good start. Few things to consider.

1. Extract the contents via Tika externally or via Tika Server.
2. Create a canonical “Item” document schema which would have title, metadata, 
contents, imagePreview (something to consider) , etc.
3. Use the extracted Tika data to populate your index.
4. Unless you need highlighting, only index the actual contents, and store the 
rest of the fields.
5. Shared file storage is probably ok, but you may want to add a caching layer
later via Nginx and serve files through it. That way you don’t hit the disk
every time.
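
A rough sketch of points 1-4 with the Tika Java API plus SolrJ (the URL, collection,
paths and field names are placeholders; the same idea works against a standalone Tika
Server over HTTP):

  import java.io.InputStream;
  import java.nio.file.*;
  import java.util.List;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class ExtractAndIndex {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://localhost:8983/solr/items").build();
          AutoDetectParser parser = new AutoDetectParser();
          List<Path> files;
          try (Stream<Path> s = Files.walk(Paths.get("/shared/docs"))) {
              files = s.filter(Files::isRegularFile).collect(Collectors.toList());
          }
          for (Path p : files) {
              BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
              Metadata meta = new Metadata();
              try (InputStream in = Files.newInputStream(p)) {
                  parser.parse(in, handler, meta, new ParseContext()); // extraction happens outside Solr
              }
              // canonical "Item" document: index the text, keep a pointer back to the file
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", p.toString());
              doc.addField("title", meta.get("title") != null ? meta.get("title") : p.getFileName().toString());
              doc.addField("content", handler.toString()); // index for search; store only if highlighting is needed
              solr.add(doc);
          }
          solr.commit();
          solr.close();
      }
  }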


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 12, 2018, 10:54 AM -0400, NetUser MSUser <msusernetarchit...@gmail.com>, 
wrote:
> Hi team,
>
>
> We have a business case like the below one.
>
>
> There are nearly 150 GB of doc (pdf/ppt/word/xl/msg) files which are
> stored in a network path as of now. To implement text search on these, we are
> planning to use Solr search on them. Listed below is the plan.
>
> 1)Using a high configuration Windows server(16 GB RAM , 1 TB Disk etc)
> 2)Keep all the files in this server.
> 3)Index all the above docs to solr server(Solr installed in the same
> windows server). Will use solr post command to post documents to this
> server.
> 4)Using a Web application user can further add or remove files to/from
> shared path in this server.
> 5)Web UI to search the text from these docs. and display the file Names.
> User can click and download the files
>
> Listed are the queries what we have.
>
> 1) Since we cannot index specific fields here (as the search is across all text in the
> docs of various types; a user can search for any text and it might be in XL
> or in DOC or in PPT or in .MSG files), will querying (REST API from the
> Web) the search data have any performance hit?
>
> 2)Is it a right decision to keep the physical files in the Shared folder of
> Server itself(as a shared drive) instead of storing it in a DB or any other
> storage?
>
>
> Regards,
> MS


Re: Team please help

2018-04-29 Thread Rahul Singh
Furthermore, Azure Search is based on Elastic. You can always host your own
Solr — which, if you are doing it with Apache Solr, may be slightly different
from Cloudera Search, which I believe is a variant of Apache Solr on Hadoop /
HDFS.

My recommendation augments Doug’s.

1. Decide on whether you will manage your Solr or use a managed search index on
Azure (there are options) — the reason I bring this up is that HDI is a
managed HortonWorks variant, and you probably want a managed option. Otherwise
you’d have moved Cloudera onto Azure VMs.

2. Reverse engineer your sources and sinks from the morphlines config and take 
inventory of the map and the flows.

3. Map your sources and sinks into Azure components — beyond HDI itself — which 
may include their managed infrastructure component offerings.

4. Look into Kafka Connect, Spark, or maybe a HDI / Hadoop based ingestion 
pipeline.

Best,

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 29, 2018, 6:27 AM -0700, Doug Turnbull 
<dturnb...@opensourceconnections.com>, wrote:
> Morphlines is a cloudera specific tool. I suspect moving Solr platforms
> will require you to rework your indexing somewhat. You may need to step
> back and think about the requirements of what you’re doing and design how
> it would work with Solr/Azure tooling.
> On Sat, Apr 28, 2018 at 8:58 PM Erick Erickson <erickerick...@gmail.com
> wrote:
>
> > It's rather impolite to cross post in three different places, in this
> > case the dev list, the user's list and Solr's JIRA. Additionally this
> > question is much better directed at Cloudera's support system.
> >
> > Best,
> > Erick
> >
> > On Sat, Apr 28, 2018 at 11:43 AM, Sujeet Singh
> > <sujeet.si...@cloudmoyo.com> wrote:
> > > Team I am facing an issue right now. I am working ahead to migrate
> > cloudera to HDI Azure. Now cloudera has Solr implementation and using the
> > below jar
> > > search-mr-1.0.0-cdh5.7.0-job.jar
> > org.apache.solr.hadoop.MapReduceIndexerTool
> > >
> > > While looking into all option I found "solr-map-reduce-4.9.0.jar" and
> > tried using it with class "org.apache.solr.hadoop.MapReduceIndexerTool". I
> > tried adding lib details in solrconfig.xml but did not worked . Getting
> > error
> > > "Caused by: java.lang.ClassNotFoundException:
> > org.apache.solr.morphlines.solr.DocumentLoader"
> > >
> > > Please let me know the right way to use MapReduceIndexerTool class.
> > >
> > > Regards,
> > > 
> > > Sujeet Singh | Sr. Software Analyst | cloudmoyo | E.
> > sujeet.si...@cloudmoyo.com<mailto:sujeet.si...@cloudmoyo.com> | M. +91
> > 9860586055
> > >
> > > www.cloudmoyo.com
> > >
> > >
> > >
> > >
> > >
> > > IMPORTANT NOTICE: This communication, including any attachment, contains
> > information that may be confidential or privileged, and is intended solely
> > for the entity or individual to whom it is addressed. If you are not the
> > intended recipient, you should delete this message and are hereby notified
> > that any disclosure, copying, or distribution of this message is strictly
> > prohibited. Nothing in this email, including any attachment, is intended to
> > be a legally binding signature.
> > >
> > >
> >
> --
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug


Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Rahul Singh
Lucene (the major underlying tech in Solr) can handle any data, but it’s
optimized to be an index, not a file store. Better to put that in another DB
or file system like Cassandra, S3, etc. (better than Solr).

In our experience , leveraging the tika binary / microservice as a pre-index 
process can improve the overall stability of the SolR service.


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey <apa...@elyograg.org>, wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
> >
> > Ok. Are the reasons for:
> >
> > Performance. I think we have rather modest index requirement (1000 a day...
> > on a busy day)
> >
> > Security. The index workflow is, upload files to public facing server with
> > auth. Files written to disk, scanned and copied to internal server and
> > ingested into index via here.
> >
> > other reasons we should worry about ?
>
> Tika is the underlying technology in solr-cell.  Tika is a separate
> Apache product designed for parsing common rich-text formats, like
> Microsoft, PDF, etc.
>
> http://tika.apache.org/
>
> The problems that can result are related to running Tika inside of Solr,
> which is what solr-cell does.
>
> The Tika authors try very hard to make sure that Tika doesn't misbehave,
> but the very nature of what Tika does means it is somewhat prone to
> misbehaving.  Many of the file formats that Tika processes are
> undocumented, or any documentation that is available is not available to
> open source developers.  Also, sometimes documents in those formats will
> be constructed in a way that the Tika authors have never seen before, or
> they may completely violate what conventions the authors DO know about.
>
> Long story short -- Tika can encounter documents that can cause it to
> crash, or to consume all the memory in the system, or misbehave in other
> ways.  If Tika is running inside Solr, then when it has a problem, Solr
> itself can blow up and have a problem too.
>
> For this reason, and because Tika can sometimes use a lot of resources
> even when it is working correctly, we recommend running it outside of
> Solr in another program that takes its output and sends it to Solr.
> Ideally, it will be running on a completely different machine than Solr
> is running on.
>
> Thanks,
> Shawn
>


Re: DIH with huge data

2018-04-12 Thread Rahul Singh

CSV -> Spark -> SolR

https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc
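
The core of that example boils down to something like this (shown with the Java Spark
API; zkhost, collection and the CSV path are placeholders, and the spark-solr artifact
has to be on the classpath):

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class CsvToSolr {
      public static void main(String[] args) {
          SparkSession spark = SparkSession.builder().appName("csv-to-solr").getOrCreate();
          // read the exported CSV chunks in parallel
          Dataset<Row> df = spark.read()
                  .option("header", "true")
                  .csv("hdfs:///exports/mytable/*.csv");
          // write to SolrCloud through the spark-solr data source
          df.write()
            .format("solr")
            .option("zkhost", "zk1:2181,zk2:2181,zk3:2181")
            .option("collection", "mycollection")
            .option("batch_size", "5000")
            .mode("overwrite")
            .save();
          spark.stop();
      }
  }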

If speed is not an issue there are other methods. Spring Batch / Spring Data 
might have all the tools you need to get speed without Spark.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>, wrote:
> Thanks Rahul. Data source is JdbcDataSource with MySQL database. Data size
> is around 100GB.
> I am not very familiar with Spark, but are you suggesting that we should
> create documents by merging distinct RDBMS tables using RDDs?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <rahul.xavier.si...@gmail.com
> wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache but as data size increases we
> > > need to provide more heap memory to solr JVM.
> > > Can we use multiple CSV file instead of database queries and later data
> > in
> > > CSV files can be joined using zipper? So bottom line is to create CSV
> > files
> > > for each of entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> > > not good to use with MMapDirectoryFactory and causes to exhaust physical
> > > memory on machine.
> > > Please suggest how can we handle use case of importing huge amount of
> > data
> > > into solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh
If you want speed, Spark is the fastest and easiest way. You can connect to
relational tables directly and import, or export to CSV / JSON and import from a
distributed filesystem like S3 or HDFS.

Combining a DFS with Spark and a highly available Solr, you are maximizing all
threads.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>, wrote:
> Thanks Rahul. Data source is JdbcDataSource with MySQL database. Data size
> is around 100GB.
> I am not much familiar with spark but are you suggesting that we should
> create document by merging distinct RDBMS tables in using RDD?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <rahul.xavier.si...@gmail.com
> wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache but as data size increases we
> > > need to provide more heap memory to solr JVM.
> > > Can we use multiple CSV file instead of database queries and later data
> > in
> > > CSV files can be joined using zipper? So bottom line is to create CSV
> > files
> > > for each of entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> > > not good to use with MMapDirectoryFactory and causes to exhaust physical
> > > memory on machine.
> > > Please suggest how can we handle use case of importing huge amount of
> > data
> > > into solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: DIH with huge data

2018-04-12 Thread Rahul Singh
How much data and what is the database source? Spark is probably the fastest 
way.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>, wrote:
> Hi,
>
> We are using DIH with SortedMapBackedCache but as data size increases we
> need to provide more heap memory to solr JVM.
> Can we use multiple CSV file instead of database queries and later data in
> CSV files can be joined using zipper? So bottom line is to create CSV files
> for each of entity in data-config.xml and join these CSV files using
> zipper.
> We also tried EHCache based DIH cache but since EHCache uses MMap IO its
> not good to use with MMapDirectoryFactory and causes to exhaust physical
> memory on machine.
> Please suggest how can we handle use case of importing huge amount of data
> into solr.
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669


Re: Text in images are not extracted and indexed to content

2018-04-10 Thread Rahul Singh
May need to extract outside SolR and index pure text with an external ingestion 
process. You have much more control over the Tika attributes and behaviors.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo <edwinye...@gmail.com>, 
wrote:
> Hi,
>
> Currently I am facing an issue whereby the text in image files like jpg, bmp
> is not being extracted and indexed. After the indexing, Tika did
> extract all the metadata and index it under the fields attr_*.
> However, the content field is always empty for image files. For other types
> of document files like .doc, the content is extracted correctly.
>
> I have already updated the tika-parsers-1.17.jar, under
> \org\apache\tika\parser\pdf\ to set extractInlineImages to true.
>
>
> What could be the reason?
>
> I have just upgraded to Solr 7.3.0.
>
> Regards,
> Edwin


Re: Using Solr to build a product matcher, with learning to rank

2018-03-29 Thread Rahul Singh
Maybe overthinking this. There is a “more like this” feature that basically does 
this. Give that a try before digging deeper into the LTR methods. It may be 
good enough for rock and roll.
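
For example, a request along these lines (collection and field names are placeholders)
returns the documents whose titles look most like the title of document 1:

  /solr/products/select?q=id:1&mlt=true&mlt.fl=title&mlt.mintf=1&mlt.mindf=1&mlt.count=3

The neighbours come back in the moreLikeThis section of the response, and the majority
identifier among them would be your match.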

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 28, 2018, 12:25 PM -0400, Xavier Schepler 
<xavier.schep...@recommerce.com>, wrote:
> Hello,
>
> I'm considering using Solr with learning to rank to build a product matcher.
> For example, it should match the titles:
> - Apple iPhone 6 16 Gb,
> - iPhone 6 16 Gb,
> - Smartphone IPhone 6 16 Gb,
> - iPhone 6 black 16 Gb,
> to the same internal reference, an unique identifier.
>
> With Solr, each document would then have a field for the product title and
> one for its class, which is the unique identifier of the product.
> Solr would then be used to perform matching as follows.
>
> 1. A search is performed with a given product title.
> 2. The first three results are considered (this requires an initial
> product title database).
> 3. The most frequent identifier is returned.
>
> This method corresponds roughly to a k-Nearest Neighbor approach with the
> cosine metric, k = 3, and a TF-IDF model.
>
> I've done some preliminary tests with Sci-kit learn and the results are
> good, but not as good as the ones of more sophisticated learning algorithms.
>
> Then, I noticed that there exists learning to rank with Solr.
>
> First, do you think that such an use of Solr makes sense?
> Second, is there a relatively simple way to build a learning model using a
> sparse representation of the query TF-IDF vector?
>
> Kind regards,
>
> Xavier Schepler


RE: Solr or Elasticsearch

2018-03-22 Thread Rahul Singh
I have the same experience as Daphne. I’ve used SolR for more “document” / 
“content” / “Knowledge” search and Elastic as a Log store or Mongo replacement. 
SolR has more ways to return/ingest data such as XML, JSON, or even CSV, which 
is appealing. The binary protocol in SolrJ is also appealing because the 
updates / selects are fast.

Ultimately I think SolR is like an 18-wheel tractor trailer and Elastic is like
U-Haul trucks, and you can chain a bunch of them up to do what SolR does.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 22, 2018, 9:04 AM -0500, Liu, Daphne <daphne@cevalogistics.com>, 
wrote:
> I used Solr + Cassandra for Document search. Solr works very well with 
> document indexing.
> For big data visualization, I use Elasticsearch + Grafana.
> As for today, Grafana is not supporting Solr.
> Elasticseach is very friendly and easy to use on multi-dimensional Group by 
> and its real-time query performance is very good.
> Grafana dashboard solution can be viewed @ 
> https://grafana.com/dashboards/5204/edit
>
>
> Kind regards,
>
> Daphne Liu
> BI Architect Big Data - Matrix SCM
>
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
> USA / www.cevalogistics.com
> T 904.9281448 / F 904.928.1525 / daphne@cevalogistics.com
>
> Making business flow
>
> -Original Message-
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Thursday, March 22, 2018 9:14 AM
> To: solr-user@lucene.apache.org
> Subject: Solr or Elasticsearch
>
> Hi everyone,
>
> There are some good write ups on the internet comparing the two and the one 
> thing that keeps coming up about Elasticsearch being superior to Solr is its 
> analytic capability. However, I cannot find what those analytic capabilities 
> are and why they cannot be done using Solr. Can someone help me with this 
> question?
>
> Personally, I'm a Solr user and the thing that concerns me about 
> Elasticsearch is the fact that it is owned by a company that can any day 
> decide to stop making Elasticsearch available under the Apache license and even 
> completely close free access to it.
>
> So, this is a 2 part question:
>
> 1) What are the analytic capability of Elasticsearch that cannot be done 
> using Solr? I want to see a complete list if possible.
> 2) Should an Elasticsearch user be worried that Elasticsearch may close its
> open-source policy at any time or that outsiders have no say about its road
> map?
>
> Thanks,
>
> Steve
>
> NVOCC Services are provided by CEVA as agents for and on behalf of Pyramid 
> Lines Limited trading as Pyramid Lines.
> This e-mail message is intended for the above named recipient(s) only. It may 
> contain confidential information that is privileged. If you are not the 
> intended recipient, you are hereby notified that any dissemination, 
> distribution or copying of this e-mail and any attachment(s) is strictly 
> prohibited. If you have received this e-mail by error, please immediately 
> notify the sender by replying to this e-mail and deleting the message 
> including any attachment(s) from your system. Thank you in advance for your 
> cooperation and assistance. Although the company has taken reasonable 
> precautions to ensure no viruses are present in this email, the company 
> cannot accept responsibility for any loss or damage arising from the use of 
> this email or attachments.


RE: Question liste solr

2018-03-20 Thread Rahul Singh
Parallel processing in any way will help, including Spark w/ a DFS like S3 or 
HDFS. Your three machines could end up being a bottleneck and you may need more 
nodes.
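
On the autoCommit plan mentioned below, the solrconfig.xml block would look roughly
like this (the values only illustrate "every 10 minutes, no client commits"; a maxDocs
element can also cap it by row count):

  <autoCommit>
    <maxTime>600000</maxTime>          <!-- hard commit every 10 minutes -->
    <openSearcher>false</openSearcher> <!-- don't open a new searcher on every hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>          <!-- make documents visible on the same schedule -->
  </autoSoftCommit>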

On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext 
, wrote:
> The CSV file is 5 GB approx. for 29 million rows.
>
> As you say Christopher, at the beginning we thought that reading chunk by 
> chunk from Oracle and writing to Solr
> was the best strategy.
>
> But, from our tests we've remarked:
>
> CSV creation via PL/SQL is really really fast. 40 minutes for the full 
> dataset (with bulk collect).
> Multiple SELECT calls from Java slow down the process. I think Oracle is the 
> bottleneck here.
>
> Any other ideas/alternatives?
>
> Some other points to remark:
>
> We are going to enable autoCommit for every 10 minutes / 1 rows. No 
> commit from client.
> During indexing, we always call a front-end load-balancer that
> redirects calls to the 3-node cluster.
>
> Thanks in advance!!
>
> ==>Great maillist and really awesome tool!!
>
> -Message d'origine-
> De : Christopher Schultz [mailto:ch...@christopherschultz.net]
> Envoyé : lundi 19 mars 2018 18:05
> À : solr-user@lucene.apache.org
> Objet : Re: Question liste solr
>
> Mariano,
>
> On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> > Hello
> >
> > We have an index Solr with 3 nodes, 1 shard et 2 replicas.
> >
> > Our goal is to index 42 millions rows. Indexing time is important.
> > The data source is an oracle database.
> >
> > Our indexing strategy is :
> >
> > * Reading from Oracle to a big CSV file.
> >
> > * Reading from 4 files (big file chunked) and injection via
> > ConcurrentUpdateSolrClient
> >
> > Is it the optimal way of injecting such mass of data into Solr ?
> >
> > For information, estimated time for our solution is 6h.
>
> How big are the CSV files? If most of the time is taken performing the 
> various SELECT operations, then it's probably a good strategy.
>
> However, you may find that using the disk as a buffer slows everything down 
> because disk-writes can be very slow.
>
> Why not perform your SELECT(s) and write directly to Solr using one of the 
> APIs (either a language-specific API, or through the HTTP API)?
>
> Hope that helps,
> -chris


Re: Securying ONLY the web interface console

2018-03-19 Thread Rahul Singh
Use a proxy server that only gives access to the update / select handlers 
(URLs). You can do it with numerous programming languages or with a simple proxy 
in nginx.

The whole web server running SolR is not supposed to be out in the open. You 
are opening yourself up to too many issues.
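
A minimal nginx sketch of that idea (the upstream address and handler paths are
assumptions; it exposes only select and update and blocks everything else, including
the admin UI):

  location ~ ^/solr/[^/]+/(select|update) {
      proxy_pass http://127.0.0.1:8983;
  }
  location / {
      return 403;
  }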

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 19, 2018, 12:19 PM -0500, Jesus Olivan <jesus.oli...@letgo.com>, wrote:
> hi!
>
> I'm trying to password protect only the Solr web interface (not queries
> launched from my app). I'm currently using SolrCloud 6.6.0 with external
> zookeepers. I've read tons of docs about it, but I couldn't find a proper
> way to secure ONLY the web admin console. Can anybody give me some light
> about it, please? =)
>
> Thanks in advance!

