Re: Reindexing major upgrades

2020-10-06 Thread Bram Van Dam
On 05/10/2020 16:02, Rafael Sousa wrote:
> Having things reindexed from scratch is not
> an option, so, is there a way of creating a 8.6.2 index from a pre-existing
> 6.5 index or something like that?

Sadly, there is no such way. If all your fields are stored, you might be
able to whip up something that reads all the data from the old Solr and
writes it to the new Solr without having to reread all your documents. But
that's still pretty painful.
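
Something along these lines might work as a starting point (an untested
sketch only: it assumes every field you need is stored, that "id" is the
uniqueKey, that jq is installed, and that the hosts and core names below
are placeholders):

OLD="http://oldhost:8983/solr/mycore"
NEW="http://newhost:8983/solr/mycore"
CURSOR="*"
while true; do
  PAGE=$(curl -s -G "$OLD/select" \
      --data-urlencode 'q=*:*' --data-urlencode 'fl=*' \
      --data-urlencode 'sort=id asc' --data-urlencode 'rows=1000' \
      --data-urlencode 'wt=json' --data-urlencode "cursorMark=$CURSOR")
  # drop _version_ (and any stored copyField targets) before re-posting
  echo "$PAGE" | jq '[.response.docs[] | del(._version_)]' \
    | curl -s -X POST -H 'Content-Type: application/json' "$NEW/update" --data-binary @-
  NEXT=$(echo "$PAGE" | jq -r '.nextCursorMark')
  [ "$NEXT" = "$CURSOR" ] && break
  CURSOR="$NEXT"
done
curl -s "$NEW/update?commit=true"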

 - Bram


Java GC issue investigation

2020-10-06 Thread Karol Grzyb
Hi,

I'm involved in the investigation of an issue involving huge GC overhead
during performance tests on Solr nodes. The Solr version is 6.1. The last
tests were done on a staging environment, and we ran into problems at
<100 requests/second.

The index itself is ~200MB (~50K docs).
The index receives small updates every 15 minutes.



Queries involve sorting and faceting.

I've gathered some heap dumps, and I can see from them that most of the
heap memory is retained by objects of the following classes:

-org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
(>4G, 91% of heap)
-org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
-org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
-org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
(>3.7G 76% of heap)



Based on the information above, is there anything generic that can be
looked at as a source of potential improvement without diving deeply
into the schema and queries (which may be very difficult to change at
this moment)? I don't see docValues being enabled - could that help? If
I read the docs correctly, it's specifically helpful when there is a lot
of sorting/grouping/faceting.
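
For reference, this is roughly the kind of schema change I have in mind
(a sketch only - the field name and type are placeholders, and as far as
I understand, enabling docValues requires a full reindex):

<field name="category" type="string" indexed="true" stored="true" docValues="true"/>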

Additionally, I see that many threads are blocked on LRUCache.get;
should I recommend switching to FastLRUCache?
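
If it helps, the change I have in mind is something like this in
solrconfig.xml (filterCache shown as an example - the same applies to the
queryResultCache/documentCache entries; the sizes are placeholders, not a
recommendation):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>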

Also, I wonder whether -Xmx12288m for the Java heap is too much for 16G of
memory? I see some page faults (~5/s) in Dynatrace during the heaviest
traffic.

Thank you very much for any help,
Kind regards,
Karol


Re: Non Deterministic Results from /admin/luke

2020-10-06 Thread Andrzej Białecki
You may want to check the COLSTATUS Collections API command added in 8.1
(https://lucene.apache.org/solr/guide/8_6/collection-management.html#colstatus).

This reports much of the information returned by /admin/luke, but it can
also report it for all shard leaders in a collection.
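
For example, something along these lines (a sketch from memory - the
parameter names are as I recall them from the 8.x ref guide, and the host
and collection name are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=COLSTATUS&collection=myCollection&fieldInfo=true&sizeInfo=true"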

> On 2 Oct 2020, at 01:06, Shawn Heisey  wrote:
> 
> On 10/1/2020 4:24 AM, Nussbaum, Ronen wrote:
>> We are using the Luke API in order to get all dynamic field names from our 
>> collection:
>> /solr/collection/admin/luke?wt=csv&numTerms=0
>> This worked fine in 6.2.1 but it's non deterministic anymore (8.6.1) - looks 
>> like it queries a random single shard.
>> I've tried using /solr/collection/select?q=*:*&wt=csv&rows=0&facet but it 
>> behaves the same.
>> Can it be configured to query all shards?
>> Is there another way to achieve this?
> 
> The Luke handler (usually at /admin/luke) is not SolrCloud aware.  It is 
> designed to operate on a single core.  So if you send the request to the 
> collection and not a specific core, Solr must forward the request to a core 
> in order for you to get ANY result.  The core selection will be random.
> 
> The software called Luke (which is where the Luke handler gets its name) 
> operates on a Lucene index -- each Solr core is based around a Lucene index.  
> It would be a LOT of work to make the handler SolrCloud aware.
> 
> Depending on how your collection is set up, you may need to query the Luke 
> handler on multiple cores in order to get a full picture of all fields 
> present in the Lucene indexes.  I am not aware of any other way to do it.
> 
> Thanks,
> Shawn
> 



Re: Java GC issue investigation

2020-10-06 Thread matthew sporleder
You have a 12G heap for a 200MB index?  Can you just try changing Xmx
to, like, 1g ?

On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
>
> Hi,
>
> I'm involved in investigation of issue that involves huge GC overhead
> that happens during performance tests on Solr Nodes. Solr version is
> 6.1. Last test were done on staging env, and we run into problems for
> <100 requests/second.
>
> The size of the index itself is ~200MB ~ 50K docs
> Index has small updates every 15min.
>
>
>
> Queries involve sorting and faceting.
>
> I've gathered some heap dumps, I can see from them that most of heap
> memory is retained because of object of following classes:
>
> -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
> (>4G, 91% of heap)
> -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
> -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
> -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
> (>3.7G 76% of heap)
>
>
>
> Based on information above is there anything generic that can been
> looked at as source of potential improvement without diving deeply
> into schema and queries (which may be very difficlut to change at this
> moment)? I don't see docvalues being enabled - could this help, as if
> I get the docs correctly, it's specifically helpful when there are
> many sorts/grouping/facets? Or I
>
> Additionaly I see, that many threads are blocked on LRUCache.get,
> should I recomend switching to FastLRUCache?
>
> Also, I wonder if -Xmx12288m for java heap is not too much for 16G
> memory? I see some (~5/s) page faults in Dynatrace during the biggest
> traffic.
>
> Thank you very much for any help,
> Kind regards,
> Karol


Re: Java GC issue investigation

2020-10-06 Thread Karol Grzyb
Hi Matthew,

Thank you for the answer. I cannot reproduce the setup locally, but I'll
try to convince them to reduce Xmx; I guess they won't agree to 1GB, but
certainly to something less than 12G.
I'll also push for a proper dev setup, because for now we can only test
on prod or staging, which are difficult to adjust.

Is getting stuck in GC common behaviour under heavier load when the index
is small compared to the available heap? I was more worried about the
ratio of heap to total host memory.

Regards,
Karol


wt., 6 paź 2020 o 14:39 matthew sporleder  napisał(a):
>
> You have a 12G heap for a 200MB index?  Can you just try changing Xmx
> to, like, 1g ?
>
> On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
> >
> > Hi,
> >
> > I'm involved in investigation of issue that involves huge GC overhead
> > that happens during performance tests on Solr Nodes. Solr version is
> > 6.1. Last test were done on staging env, and we run into problems for
> > <100 requests/second.
> >
> > The size of the index itself is ~200MB ~ 50K docs
> > Index has small updates every 15min.
> >
> >
> >
> > Queries involve sorting and faceting.
> >
> > I've gathered some heap dumps, I can see from them that most of heap
> > memory is retained because of object of following classes:
> >
> > -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
> > (>4G, 91% of heap)
> > -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
> > -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
> > -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
> > (>3.7G 76% of heap)
> >
> >
> >
> > Based on information above is there anything generic that can been
> > looked at as source of potential improvement without diving deeply
> > into schema and queries (which may be very difficlut to change at this
> > moment)? I don't see docvalues being enabled - could this help, as if
> > I get the docs correctly, it's specifically helpful when there are
> > many sorts/grouping/facets? Or I
> >
> > Additionaly I see, that many threads are blocked on LRUCache.get,
> > should I recomend switching to FastLRUCache?
> >
> > Also, I wonder if -Xmx12288m for java heap is not too much for 16G
> > memory? I see some (~5/s) page faults in Dynatrace during the biggest
> > traffic.
> >
> > Thank you very much for any help,
> > Kind regards,
> > Karol


RE: Master/Slave

2020-10-06 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
> it better not ever be depreciated.  it has been the most reliable mechanism 
> for its purpose

I would like to know whether that is the consensus of Solr developers.

We had been scrambling to move from Master/Slave to CDCR based on the assertion 
that CDCR support would last far longer than Master/Slave support.

Can we now safely assume that this assertion is completely moot? Can we 
safely assume that Master/Slave is likely to be supported for the foreseeable 
future? Or are we forced to assume that Master/Slave support will evaporate 
shortly after the now-evaporated CDCR support?

-Original Message-
From: David Hastings  
Sent: Wednesday, September 30, 2020 3:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Master/Slave

>whether we should expect Master/Slave replication also to be deprecated

it better not ever be depreciated.  it has been the most reliable mechanism
for its purpose, solr cloud isnt going to replace standalone, if it does,
thats when I guess I stop upgrading or move to elastic

On Wed, Sep 30, 2020 at 2:58 PM Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:

> Based on the thread below (reading "legacy" as meaning "likely to be
> deprecated in later versions"), we have been working to extract ourselves
> from Master/Slave replication
>
> Most of our collections need to be in two data centers (a read/write copy
> in one local data center: the disaster-recovery-site SolrCloud could be
> read-only). We also need redundancy within each data center for when one
> host or another is unavailable. We implemented this by having different
> SolrClouds in the different data centers; with Master/Slave replication
> pulling data from one of the read/write replicas to each of the Slave
> replicas in the disaster-recovery-site read-only SolrCloud. Additionally,
> for some collections, there is a desire to have local read-only replicas
> remain unchanged for querying during the loading process: for these
> collections, there is a local read/write loading SolrCloud, a local
> read-only querying SolrCloud (normally configured for Master/Slave
> replication from one of the replicas of the loader SolrCloud to both
> replicas of the query SolrCloud, but with Master/Slave disabled when the
> load was in progress on the loader SolrCloud, and with Master/Slave resumed
> after the loaded data passes QA checks).
>
> Based on the thread below, we made an attempt to switch to CDCR. The main
> reason for wanting to change was that CDCR was said to be the supported
> mechanism, and the replacement for Master/Slave replication.
>
> After multiple unsuccessful attempts to get CDCR to work, we ended up with
> reproducible cases of CDCR loosing data in transit. In June, I initiated a
> thread in this group asking for clarification of how/whether CDCR could be
> made reliable. This seemed to me to be met with deafening silence until the
> announcement in July of the release of Solr8.6 and the deprecation of CDCR.
>
> So we are left with the question whether we should expect Master/Slave
> replication also to be deprecated; and if so, with what is it expected to
> be replaced (since not with CDCR)? Or is it now sufficiently safe to assume
> that Master/Slave replication will continue to be supported after all
> (since the assertion that it would be replaced by CDCR has been
> discredited)? In either case, are there other suggested implementations of
> having a read-only SolrCloud receive data from a read/write SolrCloud?
>
>
> Thanks
>
> -Original Message-
> From: Shawn Heisey 
> Sent: Tuesday, May 21, 2019 11:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud (7.3) and Legacy replication slaves
>
> On 5/21/2019 8:48 AM, Michael Tracey wrote:
> > Is it possible set up an existing SolrCloud cluster as the master for
> > legacy replication to a slave server or two?   It looks like another
> option
> > is to use Uni-direction CDCR, but not sure what is the best option in
> this
> > case.
>
> You're asking for problems if you try to combine legacy replication with
> SolrCloud.  The two features are not guaranteed to work together.
>
> CDCR is your best bet.  This replicates from one SolrCloud cluster to
> another.
>
> Thanks,
> Shawn
>


Re: Java GC issue investigation

2020-10-06 Thread matthew sporleder
Your index is so small that it should easily get cached into OS memory
as it is accessed.  Having a too-big heap is a known problem
situation.

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?

On Tue, Oct 6, 2020 at 9:44 AM Karol Grzyb  wrote:
>
> Hi Matthew,
>
> Thank you for the answer, I cannot reproduce the setup locally I'll
> try to convince them to reduce Xmx, I guess they will rather not agree
> to 1GB but something less than 12G for sure.
> And have some proper dev setup because for now we could only test prod
> or stage which are difficult to adjust.
>
> Is being stuck in GC common behaviour when the index is small compared
> to available heap during bigger load? I was more worried about the
> ratio of heap to total host memory.
>
> Regards,
> Karol
>
>
> wt., 6 paź 2020 o 14:39 matthew sporleder  napisał(a):
> >
> > You have a 12G heap for a 200MB index?  Can you just try changing Xmx
> > to, like, 1g ?
> >
> > On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
> > >
> > > Hi,
> > >
> > > I'm involved in investigation of issue that involves huge GC overhead
> > > that happens during performance tests on Solr Nodes. Solr version is
> > > 6.1. Last test were done on staging env, and we run into problems for
> > > <100 requests/second.
> > >
> > > The size of the index itself is ~200MB ~ 50K docs
> > > Index has small updates every 15min.
> > >
> > >
> > >
> > > Queries involve sorting and faceting.
> > >
> > > I've gathered some heap dumps, I can see from them that most of heap
> > > memory is retained because of object of following classes:
> > >
> > > -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
> > > (>4G, 91% of heap)
> > > -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
> > > -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
> > > -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
> > > (>3.7G 76% of heap)
> > >
> > >
> > >
> > > Based on information above is there anything generic that can been
> > > looked at as source of potential improvement without diving deeply
> > > into schema and queries (which may be very difficlut to change at this
> > > moment)? I don't see docvalues being enabled - could this help, as if
> > > I get the docs correctly, it's specifically helpful when there are
> > > many sorts/grouping/facets? Or I
> > >
> > > Additionaly I see, that many threads are blocked on LRUCache.get,
> > > should I recomend switching to FastLRUCache?
> > >
> > > Also, I wonder if -Xmx12288m for java heap is not too much for 16G
> > > memory? I see some (~5/s) page faults in Dynatrace during the biggest
> > > traffic.
> > >
> > > Thank you very much for any help,
> > > Kind regards,
> > > Karol


Re: Java GC issue investigation

2020-10-06 Thread Erick Erickson
12G is not that huge, it’s surprising that you’re seeing this problem.

However, there are a couple of things to look at:

1> If you’re saying that you have 16G total physical memory and are allocating 
12G to Solr, that’s an anti-pattern. See: 
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If at all possible, you should allocate between 25% and 50% of your physical 
memory to Solr...

2> what garbage collector are you using? G1GC might be a better choice.
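
As a very rough sketch (this is the solr.in.sh form - solr.in.cmd uses
"set" - and the values are illustrative, not a recommendation):

SOLR_HEAP="4g"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"

That leaves the rest of the 16G for the OS page cache that Lucene relies on.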

> On Oct 6, 2020, at 10:44 AM, matthew sporleder  wrote:
> 
> Your index is so small that it should easily get cached into OS memory
> as it is accessed.  Having a too-big heap is a known problem
> situation.
> 
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?
> 
> On Tue, Oct 6, 2020 at 9:44 AM Karol Grzyb  wrote:
>> 
>> Hi Matthew,
>> 
>> Thank you for the answer, I cannot reproduce the setup locally I'll
>> try to convince them to reduce Xmx, I guess they will rather not agree
>> to 1GB but something less than 12G for sure.
>> And have some proper dev setup because for now we could only test prod
>> or stage which are difficult to adjust.
>> 
>> Is being stuck in GC common behaviour when the index is small compared
>> to available heap during bigger load? I was more worried about the
>> ratio of heap to total host memory.
>> 
>> Regards,
>> Karol
>> 
>> 
>> wt., 6 paź 2020 o 14:39 matthew sporleder  napisał(a):
>>> 
>>> You have a 12G heap for a 200MB index?  Can you just try changing Xmx
>>> to, like, 1g ?
>>> 
>>> On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
 
 Hi,
 
 I'm involved in investigation of issue that involves huge GC overhead
 that happens during performance tests on Solr Nodes. Solr version is
 6.1. Last test were done on staging env, and we run into problems for
 <100 requests/second.
 
 The size of the index itself is ~200MB ~ 50K docs
 Index has small updates every 15min.
 
 
 
 Queries involve sorting and faceting.
 
 I've gathered some heap dumps, I can see from them that most of heap
 memory is retained because of object of following classes:
 
 -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
 (>4G, 91% of heap)
 -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
 -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
 -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
 (>3.7G 76% of heap)
 
 
 
 Based on information above is there anything generic that can been
 looked at as source of potential improvement without diving deeply
 into schema and queries (which may be very difficlut to change at this
 moment)? I don't see docvalues being enabled - could this help, as if
 I get the docs correctly, it's specifically helpful when there are
 many sorts/grouping/facets? Or I
 
 Additionaly I see, that many threads are blocked on LRUCache.get,
 should I recomend switching to FastLRUCache?
 
 Also, I wonder if -Xmx12288m for java heap is not too much for 16G
 memory? I see some (~5/s) page faults in Dynatrace during the biggest
 traffic.
 
 Thank you very much for any help,
 Kind regards,
 Karol



Re: Order of applying tokens/filter

2020-10-06 Thread Walter Underwood
Synonyms only need to be done once. Generally, expand synonyms at index time 
only.

Also, consider the StandardTokenizer. It is a bit smarter, and that can be 
useful.
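
As a rough sketch of what that could look like (based on the chain in your
mail below, with the StandardTokenizer swapped in and synonyms applied only
at index time; the synonyms.txt filename is a placeholder, and the ICU char
filter needs the analysis-extras jars):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesFilterFactory"/>
  </analyzer>
</fieldType>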

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 5, 2020, at 9:08 PM, Jayadevan Maymala  
> wrote:
> 
>> 
>> ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
>> WhitespaceTokenizerFactory
>> SynonymGraphFilterFactory
>> FlattenGraphFilterFactory
>> KStemFilterFactory
>> RemoveDuplicatesFilterFactory
>> 
>> One doubt related to this. Ideally, the same sequence should be followed
> for indexing and querying, right?
> Regards,
> Jayadevan



timeAllowed default value

2020-10-06 Thread Steven White
Hi everyone,

What is the default value for timeAllowed to make it behave as if it is not
set?  Is it "-1" or some other number?

Rather than writing my code to conditionally include timeAllowed in the
query parameters, I'd rather have it be part of my query all the time and
only change its value, so I get the behaviour I want: wait indefinitely,
or give up based on the value timeAllowed is set to.

Thanks

Steven


Re: Using streaming expressions with shards filter

2020-10-06 Thread Joel Bernstein
There is a parameter in streaming expressions for this, but it is not
available for use in every stream source. The search expression should
honor it, though.

If you pass the .shard=shard1,shard2,shard3...

The search stream will honor this.

This work was originally done to support non-SolrCloud streaming
expressions but has not been fully realized yet.


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Oct 1, 2020 at 11:31 AM Gael Jourdan-Weil <
gael.jourdan-w...@kelkoogroup.com> wrote:

> Hello,
>
> I am trying to use a Streaming Expression to query only a subset of the
> shards of a collection.
> I expected to be able to use the "shards" parameter like on a regular
> query on "/select" for instance but this appear to not work or I don't know
> how to do it.
>
> Is this somehow a feature/restriction of Streaming expressions?
> Or am I missing something?
>
> Note that the Streaming Expression I use is actually using the "/export"
> request handler.
>
> Example of the streaming expression:
> curl -X POST -v --data-urlencode
> 'expr=search(myCollection,q="*:*",fl="id",sort="id asc",qt="/export")' '
> http://myserver/solr/myCollection/stream'
>
> Solr version: 8.4
>
> Best regards,
> Gaël


Re: Using streaming expressions with shards filter

2020-10-06 Thread Joel Bernstein
Actually it's:

.shards=shard1,shard2,shard3...
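
Building on the earlier curl example, I'd expect something along these
lines (my reading is that the parameter is prefixed with the collection
name; the shard names are placeholders):

curl -X POST --data-urlencode 'expr=search(myCollection,q="*:*",fl="id",sort="id asc",qt="/export")' \
     --data-urlencode 'myCollection.shards=shard1,shard2' \
     'http://myserver/solr/myCollection/stream'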



Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 6, 2020 at 2:38 PM Joel Bernstein  wrote:

>
> There is a parameter in streaming expressions for this but it is not
> available for use in every stream source. The search expression should
> honor it though.
>
> If you pass the .shard=shard1,shard2,shard3...
>
> The search stream will honor this.
>
> This work was originally done for supporting no-SolrCloud streaming
> expressions but was not fully realized yet.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Oct 1, 2020 at 11:31 AM Gael Jourdan-Weil <
> gael.jourdan-w...@kelkoogroup.com> wrote:
>
>> Hello,
>>
>> I am trying to use a Streaming Expression to query only a subset of the
>> shards of a collection.
>> I expected to be able to use the "shards" parameter like on a regular
>> query on "/select" for instance but this appear to not work or I don't know
>> how to do it.
>>
>> Is this somehow a feature/restriction of Streaming expressions?
>> Or am I missing something?
>>
>> Note that the Streaming Expression I use is actually using the "/export"
>> request handler.
>>
>> Example of the streaming expression:
>> curl -X POST -v --data-urlencode
>> 'expr=search(myCollection,q="*:*",fl="id",sort="id asc",qt="/export")' '
>> http://myserver/solr/myCollection/stream'
>>
>> Solr version: 8.4
>>
>> Best regards,
>> Gaël
>
>


RE: Solr 7.7 - Few Questions

2020-10-06 Thread Manisha Rahatadkar
Hi All

First of all, thanks to Shawn, Rahul and Charlie for taking the time to reply 
to my questions and for the valuable information.

I was very concerned about the size of each document, and on several follow-ups 
I learned that the documents with a size of 0.5GB are mp4 files, and those are 
not synced to Solr.

@Shawn Heisey recommended NOT using Windows because of the Windows license 
cost and because service installer testing is done on Linux.
I agree with him. We are using NSSM tool to run solr as a service.

Are there any members here using Solr on Windows? I look forward to hearing 
from them on:

1. What tool they use to run Solr as a service on windows.
2. How to set up the disaster recovery?
3. How to scale up the servers for the better performance?

Thanks in advance; I look forward to hearing back about your experiences 
scaling up Solr.

Regards,
Manisha Rahatadkar

-Original Message-
From: Rahul Goswami 
Sent: Sunday, October 4, 2020 11:49 PM
To: ch...@opensourceconnections.com; solr-user@lucene.apache.org
Subject: Re: Solr 7.7 - Few Questions

Charlie,
Thanks for providing an alternate approach to doing this. It would be 
interesting to know how one  could go about organizing the docs in this case? 
(Nested documents?) How would join queries perform on a large
index(200 million+ docs)?

Thanks,
Rahul



On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:

> Hi Rahul,
>
>
>
> In addition to the wise advice below: remember in Solr, a 'document'
> is
>
> just the name for the thing that would appear as one of the results
> when
>
> you search (analagous to a database record). It's not the same
>
> conceptually as a 'Word document' or a 'PDF document'. If your source
>
> documents are so big, consider how they might be broken into parts, or
>
> whether you really need to index all of them for retrieval purposes,
> or
>
> what parts of them need to be extracted as text. Thus, the Solr
>
> documents don't necessarily need to be as large as your source documents.
>
>
>
> Consider an email size 20kb with ten PDF attachments, each 20MB. You
>
> probably shouldn't push all this data into a single Solr document, but
>
> you *could* index them as 11 separate Solr documents, but with
> metadata
>
> to indicate that one is an email and ten are PDFs, and a shared ID of
>
> some kind to indicate they're related. Then at query time there are
>
> various ways for you to group these together, so for example if the
>
> query hit one of the PDFs you could show the user the original email,
>
> plus the 9 other attachments, using the shared ID as a key.
>
>
>
> HTH,
>
>
>
> Charlie
>
>
>
> On 02/10/2020 01:53, Rahul Goswami wrote:
>
> > Manisha,
>
> > In addition to what Shawn has mentioned above, I would also like you
> > to
>
> > reevaluate your use case. Do you *need to* index the whole document ? eg:
>
> > If it's an email, the body of the email *might* be more important
> > than
> any
>
> > attachments, in which case you could choose to only index the email
> > body
>
> > and ignore (or only partially index) the text from attachments. If
> > you
>
> > could afford to index the documents partially, you could consider
> > Solr's
>
> > "Limit token count filter": See the link below.
>
> >
>
> >
> https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limi
> t-token-count-filter
>
> >
>
> > You'll need to configure it in the schema for the "index" analyzer
> > for
> the
>
> > data type of the field with large text.
>
> > Indexing documents of the order of half a GB will definitely come to
> > hurt
>
> > your operations, if not now, later (think OOM, extremely slow atomic
>
> > updates, long running merges etc.).
>
> >
>
> > - Rahul
>
> >
>
> >
>
> >
>
> > On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:
>
> >
>
> >> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
>
> >>> We are using Apache Solr 7.7 on Windows platform. The data is
> >>> synced to
>
> >> Solr using Solr.Net commit. The data is being synced to SOLR in batches.
>
> >> The document size is very huge (~0.5GB average) and solr indexing
> >> is
> taking
>
> >> long time. Total document size is ~200GB. As the solr commit is
> >> done as
> a
>
> >> part of API, the API calls are failing as document indexing is not
>
> >> completed.
>
> >>
>
> >> A single document is five hundred megabytes?  What kind of
> >> documents do
>
> >> you have?  You can't even index something that big without tweaking
>
> >> configuration parameters that most people don't even know about.
>
> >> Assuming you can even get it working, there's no way that indexing
> >> a
>
> >> document like that is going to be fast.
>
> >>
>
> >>> 1.  What is your advise on syncing such a large volume of data
> >>> to
>
> >> Solr KB.
>
> >>
>
> >> What is "KB"?  I have never heard of this in relation to Solr.
>
> >>
>
> >>> 2.  Because of the search requirements, almost 8 fields are
> >>> defined
>
> >> as Text fields.
>
> >>
>
> >> I can't figure out what you are trying to say

Re: Solr 7.7 - Few Questions

2020-10-06 Thread Rahul Goswami
1. What tool they use to run Solr as a service on windows.
>> Look into procrun. After all, Solr runs inside Jetty, so you should have
a way to invoke Jetty's main class with the required parameters and bundle
that as a procrun service.

2. How to set up the disaster recovery?
>> You can back up your indexes at regular intervals. This can be done by
taking snapshots and backing them up, and then using the appropriate
snapshot name to restore a certain commit point. For more details, please
refer to this link:
https://lucene.apache.org/solr/guide/7_7/making-and-restoring-backups.html
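
For example, one possible flow on a standalone core looks roughly like this
(a sketch only - the core name, backup name and location are placeholders):

curl "http://localhost:8983/solr/mycore/replication?command=backup&name=nightly&location=/backups"
# later, to roll back to that commit point:
curl "http://localhost:8983/solr/mycore/replication?command=restore&name=nightly&location=/backups"
curl "http://localhost:8983/solr/mycore/replication?command=restorestatus"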

3. How to scale up the servers for the better performance?
>> This is too open-ended a question and depends on a lot of factors
specific to your environment and use-case :)

- Rahul


On Tue, Oct 6, 2020 at 4:26 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> First of all thanks to Shawn, Rahul and Charlie for taking time to reply
> my questions and valuable information.
>
> I was very concerned about the size of the each document and on several
> follow ups got more information that the documents which have 0.5GB size
> are mp4 documents and these are not synced to Solr.
>
> @Shawn Heisey recommended NOT to use Windows because of windows license
> cost and service installer testing is done on Linux.
> I agree with him. We are using NSSM tool to run solr as a service.
>
> Are there any members here using Solr on Windows? I look forward to hear
> from them on:
>
> 1. What tool they use to run Solr as a service on windows.
> 2. How to set up the disaster recovery?
> 3. How to scale up the servers for the better performance?
>
> Thanks in advance and looking forward to hear back your experiences on
> Solr Scale up.
>
> Regards,
> Manisha Rahatadkar
>
> -Original Message-
> From: Rahul Goswami 
> Sent: Sunday, October 4, 2020 11:49 PM
> To: ch...@opensourceconnections.com; solr-user@lucene.apache.org
> Subject: Re: Solr 7.7 - Few Questions
>
> Charlie,
> Thanks for providing an alternate approach to doing this. It would be
> interesting to know how one  could go about organizing the docs in this
> case? (Nested documents?) How would join queries perform on a large
> index(200 million+ docs)?
>
> Thanks,
> Rahul
>
>
>
> On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:
>
> > Hi Rahul,
> >
> >
> >
> > In addition to the wise advice below: remember in Solr, a 'document'
> > is
> >
> > just the name for the thing that would appear as one of the results
> > when
> >
> > you search (analagous to a database record). It's not the same
> >
> > conceptually as a 'Word document' or a 'PDF document'. If your source
> >
> > documents are so big, consider how they might be broken into parts, or
> >
> > whether you really need to index all of them for retrieval purposes,
> > or
> >
> > what parts of them need to be extracted as text. Thus, the Solr
> >
> > documents don't necessarily need to be as large as your source documents.
> >
> >
> >
> > Consider an email size 20kb with ten PDF attachments, each 20MB. You
> >
> > probably shouldn't push all this data into a single Solr document, but
> >
> > you *could* index them as 11 separate Solr documents, but with
> > metadata
> >
> > to indicate that one is an email and ten are PDFs, and a shared ID of
> >
> > some kind to indicate they're related. Then at query time there are
> >
> > various ways for you to group these together, so for example if the
> >
> > query hit one of the PDFs you could show the user the original email,
> >
> > plus the 9 other attachments, using the shared ID as a key.
> >
> >
> >
> > HTH,
> >
> >
> >
> > Charlie
> >
> >
> >
> > On 02/10/2020 01:53, Rahul Goswami wrote:
> >
> > > Manisha,
> >
> > > In addition to what Shawn has mentioned above, I would also like you
> > > to
> >
> > > reevaluate your use case. Do you *need to* index the whole document ?
> eg:
> >
> > > If it's an email, the body of the email *might* be more important
> > > than
> > any
> >
> > > attachments, in which case you could choose to only index the email
> > > body
> >
> > > and ignore (or only partially index) the text from attachments. If
> > > you
> >
> > > could afford to index the documents partially, you could consider
> > > Solr's
> >
> > > "Limit token count filter": See the link below.
> >
> > >
> >
> > >
> > https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limi
> > t-token-count-filter
> >
> > >
> >
> > > You'll need to configure it in the schema for the "index" analyzer
> > > for
> > the
> >
> > > data type of the field with large text.
> >
> > > Indexing documents of the order of half a GB will definitely come to
> > > hurt
> >
> > > your operations, if not now, later (think OOM, extremely slow atomic
> >
> > > updates, long running merges etc.).
> >
> > >
> >
> > > - Rahul
> >
> > >
> >
> > >
> >
> > >
> >
> > > On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey 
> wrote:
> >
> > >
> >
> > >> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:

Daylight savings time issue using NOW in Solr 6.1.0

2020-10-06 Thread vishal patel
Hi

I am using Solr 6.1.0. My SOLR_TIMEZONE=UTC  in solr.in.cmd.
My current Solr server machine time zone is also UTC.

One of my collections has the below field in its schema:


Suppose my current Solr server machine time is 2020-10-01 10:00:00.000. I have 
one document in that collection, and in that document action_date is 
2020-10-01T09:45:46Z.
When I search in Solr with action_date:[2020-10-01T08:00:00Z TO NOW], that 
record is not returned. I checked my Solr log and found that the time differed 
between the Solr log and the Solr server machine (almost 1 hour difference).

Why don't I get the result? Why is NOW not taken as 2020-10-01T10:00:00Z?
Which time does "NOW" use? Is the difference due to daylight saving time? 
How can I configure or change the timezone so that it accounts for daylight 
saving time?


RE: Using streaming expressions with shards filter

2020-10-06 Thread Gael Jourdan-Weil
Thanks Joel.
I will try it in the future if I still need it (for now I went for another 
solution that fits my needs).

Gaël

Re: Daylight savings time issue using NOW in Solr 6.1.0

2020-10-06 Thread Bernd Fehling
Hi,

Because you are using solr.in.cmd, I guess you are on Windows.
I don't know much about Solr on Windows, but you can check your Windows,
Jetty and Solr times by looking at your solr-8983-console.log file after
starting Solr:
first the timestamp of the file itself, then the timestamp leading each
log message, and finally the timestamp within the log message reporting
the "Start time:".

Regards
Bernd


Am 07.10.20 um 08:12 schrieb vishal patel:
> Hi
> 
> I am using Solr 6.1.0. My SOLR_TIMEZONE=UTC  in solr.in.cmd.
> My current Solr server machine time zone is also UTC.
> 
> My one collection has below one field in schema.
>  docValues="true"/>
>  positionIncrementGap="0"/>
> Suppose my current Solr server machine time is 2020-10-01 10:00:00.000. I 
> have one document in that collection and in that document action_date is 
> 2020-10-01T09:45:46Z.
> When I search in Solr action_date:[2020-10-01T08:00:00Z TO NOW] , I cannot 
> return that record. I check my solr log and found that time was different 
> between Solr log time and solr server machine time.(almost 1 hours difference)
> 
> Why I cannot get the result? Why NOW is not taking the 2020-10-01T10:00:00Z?
> "NOW" takes which time? Is there difference due to daylight saving 
> time? How can I configure 
> or change timezone which consider daylight saving time?
>