Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-11 Thread Zheng Lin Edwin Yeo
Hi,

Has anyone else faced the same issue before?
So far, none of the regex patterns we have tried in this thread has
resolved the issue.

Regards,
Edwin
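
For reference, the two-step replacement discussed in this thread can be tried outside Solr with plain java.util.regex. This is only a sketch under stated assumptions: the "br" in the archived patterns is assumed to stand for a literal <br> tag, Matcher.quoteReplacement is intended to mimic literalReplacement=true, and the class and constant names are made up for illustration. It does not go through Solr's update chain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrCollapseSketch {
    // Assumption: the "br" in the archived patterns stands for a literal <br> tag.
    public static final String BR = "<br>";

    // Step 1: optional horizontal whitespace followed by one line break.
    public static final Pattern LINE_BREAK = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");

    // Step 2 candidates that appear in the thread.
    public static final Pattern STRICT = Pattern.compile("(<br>){3,}");
    public static final Pattern DOUBLED = Pattern.compile("(<br><br>){3,}");
    public static final Pattern LENIENT = Pattern.compile("(<br>[ \\t\\x0b\\f]*){3,}");
    // The variant with the extra ']': one whitespace character is now mandatory.
    public static final Pattern STRAY_BRACKET = Pattern.compile("(<br>[ \\t\\x0b\\f]]*){3,}");

    // Replace every match, treating the replacement text literally.
    public static String replace(Pattern pattern, String input, String replacement) {
        return pattern.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Run step 1, then collapse runs of tags with the given step 2 pattern.
    public static String collapse(Pattern second, String input) {
        String tagged = replace(LINE_BREAK, input, BR);
        return replace(second, tagged, BR + BR);
    }

    public static void main(String[] args) {
        // Sample taken from "Original content" in Example 1 of this thread.
        String sample = "Dear Sir,  \n\n \n \n\n I am terminating";
        System.out.println(collapse(STRICT, sample));
        System.out.println(collapse(DOUBLED, sample));
        System.out.println(collapse(LENIENT, sample));

        // Hypothetical input where spaces sit between the tags.
        String spaced = "a<br> <br> <br> <br>b";
        System.out.println(replace(STRICT, spaced, BR + BR));
        System.out.println(replace(LENIENT, spaced, BR + BR));
        System.out.println(replace(STRAY_BRACKET, spaced, BR + BR));
    }
}
```

Under these assumptions the strict and lenient patterns both reduce the sample to two tags, so the sketch does not reproduce the four-tag result reported from the index. The doubled pattern leaves five tags, which is in line with what was reported for Example 1. On the spaced input only the lenient pattern collapses the whole run; the strict pattern matches nothing, and the variant with the extra ']' leaves three tags because it requires exactly one whitespace character after each tag.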

On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo 
wrote:

> Hi Paul,
>
> Sorry, I realized there is an extra ']' in the pattern provided, which is
> why there are so many <br> in the output.
>
> The output is exactly the same as previously (previous index result) if we
> remove the extra ']', as shown in the configuration below.
>
>  
>content
>[ \t\x0b\f]*\r?\n
>br
>true
>  
>  
>content
>(br[ \t\x0b\f]*){3,}
>brbr
>true
>  
>
> Regards,
> Edwin
>
>
>
> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo 
> wrote:
>
>> Hi Paul,
>>
>> Thanks for the reply.
>>
>> For the 2nd pattern, if we use the pattern (br[ \t\x0b\f]]*){3,}, as in
>> the configuration below:
>>
>> 
>>content
>>[ \t\x0b\f]*\r?\n
>>br
>>true
>> 
>> 
>>content
>>(br[ \t\x0b\f]]*){3,}
>>brbr
>>true
>> 
>>
>> It is not able to reduce every run of more than 3 <br> to 2 <br>.
>>
>> We end up with many <br> in the output, like the example below:
>>
>>  http://www.concorded.com/  
>> 
>>  On Tue, Dec 18, 2018
>>
>>
>> Regards,
>> Edwin
>>
>>
>>
>>
>> On Thu, 7 Mar 2019 at 20:44,  wrote:
>>
>>> Hi Edwin
>>>
>>>
>>>
>>> I can’t understand why the pattern is not working and where the spaces
>>> between the <br> are coming from. However, it should be possible to allow
>>> for spaces between the <br> in the second match pattern, i.e. 2nd pattern
>>>
>>>
>>>
>>> (br[ \t\x0b\f]]*){3,}
>>>
>>>
>>>
>>> /Paul
>>>
>>>
>>>
>>> Sent from Mail for Windows 10
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo
>>> Sent: Wednesday, 6 March 2019 16:28
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi Paul,
>>>
>>> I have tried setting the first match pattern to
>>> [ \t\x0b\f]*\r?\n, as in the configuration below:
>>>
>>> 
>>>content
>>>[ \t\x0b\f]*\r?\n
>>>br
>>>true
>>> 
>>> 
>>>content
>>>(br){3,}
>>>brbr
>>>true
>>> 
>>>
>>> However, the result is still the same as before (previous index results),
>>> with the 4 <br>.
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On Wed, 6 Mar 2019 at 18:23,  wrote:
>>>
>>> > Hi Edwin
>>> >
>>> >
>>> >
>>> > You are correct re the 2nd pattern – my bad. Looking at the 4 <br>,
>>> > it’s actually the sequence «  »? So perhaps the first match
>>> > pattern could be [ \t\x0b\f]*\r?\n
>>> >
>>> >
>>> >
>>> > i.e. [space tab vertical-tab formfeed]
>>> >
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Paul
>>> >
>>> >
>>> >
>>> > Sent from Mail for Windows 10
>>> >
>>> >
>>> >
>>> > From: Zheng Lin Edwin Yeo
>>> > Sent: Wednesday, 6 March 2019 07:44
>>> > To: solr-user@lucene.apache.org
>>> > Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >
>>> >
>>> >
>>> > Hi Paul,
>>> >
>>> > I have modified the second pattern to be (br){3,}, instead of
>>> > (brbr){3,}. The pattern (brbr){3,} actually looks for 6 or more <br>
>>> > instead of 3, because <br> appears twice inside the repeated group.
>>> > That is why there were more <br> in the result: runs of fewer than
>>> > 6 <br> were not replaced, so we ended up with up to 5 <br> in the index.
>>> >
>>> > Modified configuration:
>>> >  
>>> >content
>>> >(br){3,}
>>> >brbr
>>> >true
>>> >  
>>> >
>>> > This brings us back to the result of the previous index content,
>>> > meaning the issue of having the 4 <br> is still there.
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> >
>>> >
>>> > On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo >> >
>>> > wrote:
>>> >
>>> > > Hi Paul,
>>> > >
>>> > > Further to my previous email, in which there was an extra "}" in the
>>> > > configuration, I have changed to the configuration below, based on
>>> > > your suggestion.
>>> > >
>>> > > 
>>> > >content
>>> > >[ \t]*\r?\n
>>> > >br
>>> > >true
>>> > > 
>>> > > 
>>> > >content
>>> > >(brbr){3,}
>>> > >brbr
>>> > >true
>>> > > 
>>> > >
>>> > > However, the result that I get still has more than 2 <br>. In fact, the
>>> > > result became worse, as you can see from the comparison below.
>>> > >
>>> > > Example 1: A sentence on which the regex pattern used to work correctly.
>>> > > With the latest pattern, it has now changed from 2 <br> to 5 <br>,
>>> > > which is wrong.
>>> > > *Original content in EML file:*
>>> > > Dear Sir,
>>> > >
>>> > >
>>> > > I am terminating
>>> > > *Original content:*Dear Sir,  \n\n \n \n\n I am terminating
>>> > > *Previous Index content: *Dear Sir,  I am 

RE: ClassCastException in SolrJ 7.6+

2019-03-11 Thread Gerald Bonfiglio
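import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.json.NestableJsonFacet;

// Wrapper class added so the snippet compiles; the name is arbitrary.
public class JsonFacetRepro {
    public static void main(String[] args) {
        try {
            // Match all documents, return no rows, and request one terms facet.
            SolrQuery solrQuery = new SolrQuery("*:*");
            solrQuery.setRows(0);
            solrQuery.setParam("json.facet",
                "{grp_0:{field:evntnm,limit:-1,type:terms,mincount:1,sort:{index:asc}}}");

            List<String> solrHosts = new ArrayList<>(1);
            solrHosts.add("http://localhost:8983/solr");
            CloudSolrClient solrServer = new CloudSolrClient.Builder(solrHosts).build();
            solrServer.setIdField("_uniqueKey");

            QueryResponse solrResult = solrServer.query("events", solrQuery);
            // Per the stack trace, the ClassCastException originates in this call.
            NestableJsonFacet jsonFacets = solrResult.getJsonFacetingResponse();
            System.out.println(jsonFacets.toString());
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}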
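
The exception message itself can be illustrated without Solr. The following is a minimal sketch, not the SolrJ source: the thread does not say which value is involved at NestableJsonFacet.java:52, so the value 8 and the class and method names are hypothetical. It only shows that a value held as Object and backed by a Long cannot be cast to Integer, while going through Number accepts both.

```java
class FacetCountCastSketch {
    // Unsafe: assumes the value is always an Integer.
    public static int countViaIntegerCast(Object raw) {
        return (Integer) raw;
    }

    // Safer: accept any Number and widen it to long.
    public static long countViaNumber(Object raw) {
        return ((Number) raw).longValue();
    }

    public static void main(String[] args) {
        Object count = Long.valueOf(8L); // hypothetical value, for illustration only

        System.out.println(countViaNumber(count));

        try {
            countViaIntegerCast(count);
        } catch (ClassCastException e) {
            // The exact message text depends on the JDK version.
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```

Whether this is what happens inside SolrJ is an assumption; the stack trace only shows that a Long reached a cast to Integer.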

-----Original Message-----
From: Jason Gerlowski [mailto:gerlowsk...@gmail.com]
Sent: Monday, March 11, 2019 1:24 PM
To: solr-user@lucene.apache.org
Subject: Re: ClassCastException in SolrJ 7.6+

Hi Gerald,

That looks like it might be a bug in SolrJ's JSON faceting support.
Do you have a small code snippet that reproduces the problem?  That'll
help us confirm it's a bug, and get us started on fixing it.

Best,

Jason

On Mon, Mar 11, 2019 at 10:29 AM Gerald Bonfiglio  wrote:
>
> I'm seeing the following Exception using JSON Facet API in SolrJ 7.6, 7.7, 
> 7.7.1:
>
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
>   at org.apache.solr.client.solrj.response.json.NestableJsonFacet.<init>(NestableJsonFacet.java:52)
>   at org.apache.solr.client.solrj.response.QueryResponse.extractJsonFacetingInfo(QueryResponse.java:200)
>   at org.apache.solr.client.solrj.response.QueryResponse.getJsonFacetingResponse(QueryResponse.java:571)
>
>
>
>
>
> [Nastel  Technologies]
>
> The information contained in this e-mail and in any attachment is 
> confidential and
> is intended solely for the use of the individual or entity to which it is 
> addressed.
> Access, copying, disclosure or use of such information by anyone else is 
> unauthorized.
> If you are not the intended recipient, please delete the e-mail and refrain 
> from use of such information.






Re: ClassCastException in SolrJ 7.6+

2019-03-11 Thread Jason Gerlowski
Hi Gerald,

That looks like it might be a bug in SolrJ's JSON faceting support.
Do you have a small code snippet that reproduces the problem?  That'll
help us confirm it's a bug, and get us started on fixing it.

Best,

Jason

On Mon, Mar 11, 2019 at 10:29 AM Gerald Bonfiglio  wrote:
>
> I'm seeing the following Exception using JSON Facet API in SolrJ 7.6, 7.7, 
> 7.7.1:
>
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
>   at org.apache.solr.client.solrj.response.json.NestableJsonFacet.<init>(NestableJsonFacet.java:52)
>   at org.apache.solr.client.solrj.response.QueryResponse.extractJsonFacetingInfo(QueryResponse.java:200)
>   at org.apache.solr.client.solrj.response.QueryResponse.getJsonFacetingResponse(QueryResponse.java:571)
>
>
>
>
>


Apache Solr Reference Guide 7.7 Released

2019-03-11 Thread Jason Gerlowski
The Lucene PMC is pleased to announce that the Solr Reference Guide
for 7.7 is now available.

This 1,431-page PDF is the definitive guide to using Apache Solr, the
search server built on Lucene.

The PDF Guide can be downloaded from:
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-7.7.pdf.

It is also available online at https://lucene.apache.org/solr/guide/7_7.


Re: solr 7 optimize with Tlog/Pull replicas

2019-03-11 Thread Erick Erickson
Do _not_ turn off hard commits, even when bulk indexing. Set openSearcher to 
false in your config. This is for two reasons:
1> the only time the transaction log is rolled over is when a hard commit 
happens. If you turn off commits it’ll grow to a very large size.
2> If, for any reason, the node restarts, it’ll replay the transaction log from 
the last hard commit point, potentially taking hours if you haven’t committed.

And you should probably open a new searcher occasionally, even while bulk 
indexing. For Real Time Get there are some internal structures that grow in 
proportion to the docs indexed since the last searcher was opened.

And for your other questions:
<1> I believe so, try it and look at your solr log.

<2> Yes. Have you looked at Mike’s TieredMergePolicy video (the third one down) here: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html? 
The merge policy combines like-sized segments. It’s wasteful to rewrite, say, 
a 19G segment just to add a 1G segment, so having multiple segments < 20G is 
perfectly normal.

Best,
Erick

> On Mar 10, 2019, at 10:36 PM, Wei  wrote:
> 
> A side question, for heavy bulk indexing, what's the recommended setting
> for auto commit? As there is no query needed during the bulking indexing
> process, I have auto soft commit disabled. Is there any side effect if I
> also disable auto commit?
> 
> On Sun, Mar 10, 2019 at 10:22 PM Wei  wrote:
> 
>> Thanks Erick.
>> 
>> 1> TLOG replicas shouldn’t optimize on the follower. They should optimize
>> on the leader then replicate the entire index to the follower.
>> 
>> Does that mean the follower will ignore the optimize request? Or shall I
>> send the optimize request only to one of the leaders?
>> 
>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>> numSegments on the optimize command.
>> 
>> -- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
>> solrconfig.xml I used these settings:
>> 
>> 
>>   100
>>   10
>>   10
>>   20480
>> 
>> 
>> But in the end I see multiple segments much smaller than the 20GB limit.
>> In 7.6 is it required to explicitly set the number of segments to 1? e.g
>> shall I use
>> 
>> /update?optimize=true=false=1
>> 
>> Best,
>> Wei
>> 
>> 
>> On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson 
>> wrote:
>> 
>>> This is very odd for at least two reasons:
>>> 
>>> 1> TLOG replicas shouldn’t optimize on the follower. They should optimize
>>> on the leader then replicate the entire index to the follower.
>>> 
>>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>>> numSegments on the optimize command.
>>> 
>>> So if you can reliably reproduce this, it’s probably worth a JIRA…...
>>> 
>>>> On Mar 8, 2019, at 11:21 AM, Wei  wrote:
>>>>
>>>> Hi,
>>>>
>>>> Recently I encountered a strange issue with optimize in Solr 7.6. The cloud
>>>> is created with 4 shards with 2 Tlog replicas per shard. After a batch index
>>>> update I issue an optimize command to a randomly picked replica in the
>>>> cloud. After a while when I check, all the non-leader Tlog replicas
>>>> finished optimization to a single segment, however all the leader replicas
>>>> still have multiple segments. Previously, in the all-NRT replica cloud, I
>>>> saw optimization triggered on all nodes. Is the optimization process
>>>> different with Tlog/Pull replicas?
>>>>
>>>> Best,
>>>> Wei
>>> 
>>> 



ClassCastException in SolrJ 7.6+

2019-03-11 Thread Gerald Bonfiglio
I'm seeing the following Exception using JSON Facet API in SolrJ 7.6, 7.7, 
7.7.1:

Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
  at org.apache.solr.client.solrj.response.json.NestableJsonFacet.<init>(NestableJsonFacet.java:52)
  at org.apache.solr.client.solrj.response.QueryResponse.extractJsonFacetingInfo(QueryResponse.java:200)
  at org.apache.solr.client.solrj.response.QueryResponse.getJsonFacetingResponse(QueryResponse.java:571)







Re: Solrj, Json Facets, (Date) stats facets

2019-03-11 Thread Jason Gerlowski
Hi Andrea,

It looks like you've stumbled on a bug in NestableJsonFacet.  I
clearly wasn't thinking about Date stats when I first wrote it; it
looks like it doesn't detect/parse them correctly in the current
iteration.  I'll try to fix this in a subsequent release.  But in the
meantime, unfortunately your only option is to use the NamedList
structures directly to retrieve the stat value.

Thanks for bringing it to our attention.

Best,

Jason
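
The behaviour Andrea describes, where only Number values are kept as stats, can be sketched with plain Java collections. This is an illustration only: a java.util.Map stands in for the NamedList structures, the class and method names are made up, and it is not the NestableJsonFacet code. In SolrJ the raw structure would be read from QueryResponse.getResponse() instead.

```java
import java.time.Instant;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

class StatFilterSketch {
    // Keep only entries whose value is a Number, as described in the quoted message.
    public static Map<String, Object> numberStatsOnly(Map<String, Object> facets) {
        Map<String, Object> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : facets.entrySet()) {
            if (entry.getValue() instanceof Number) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    // Workaround: read the raw value directly, whatever its type.
    public static Object rawStat(Map<String, Object> facets, String name) {
        return facets.get(name);
    }

    // Builds a stand-in for the "facets" section shown in the quoted message.
    public static Map<String, Object> sampleFacets() {
        Map<String, Object> facets = new LinkedHashMap<>();
        facets.put("count", 8L);
        facets.put("x", Date.from(Instant.parse("1973-09-20T17:33:18.700Z")));
        return facets;
    }

    public static void main(String[] args) {
        Map<String, Object> facets = sampleFacets();
        // The Date-valued stat "x" is dropped by the Number-only filter.
        System.out.println(numberStatsOnly(facets).keySet());
        // Reading the raw structure still returns it.
        System.out.println(rawStat(facets, "x") instanceof Date);
    }
}
```

With the sample above, the Number-only filter keeps "count" and drops "x", while the direct lookup still returns the Date.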

On Fri, Mar 8, 2019 at 4:42 AM Andrea Gazzarini  wrote:
>
> Good morning guys, I have a question about SolrJ and JSON facets.
>
> I'm using Solr 7.7.1 and I'm sending a request like this:
>
> json.facet={x:'max(iterationTimestamp)'}
>
> where "iterationTimestamp" is a solr.DatePointField. The JSON response
> correctly includes what I'm expecting:
>
>  "facets": {
>  "count": 8,
>  "x": "1973-09-20T17:33:18.700Z"
>  }
>
> but Solrj doesn't. Specifically, the jsonFacetingResponse contains only
> the domainCount attribute (8).
> Looking at the code I see that in NestableJsonFacet a stat is taken into
> account only if the corresponding value is an instance of Number (and x
> in the example above is a java.util.Date).
>
> Is that expected? Is there a way (other than dealing with nested
> NamedLists) for retrieving that value?
>
> Cheers,
> Andrea


Re: child docs

2019-03-11 Thread Mikhail Khludnev
Hello, John.
The choice is guided by the form of the search results. Ask yourself what
you need to paginate and what numFound should count; index as top-level
docs whatever you need to find.

On Thu, Mar 7, 2019 at 9:08 PM John Blythe  wrote:

> hi all!
>
> curious about how child docs and performance interact.
>
> i'll have a bunch of transactions coming in from various entities. i'm
> debating nesting them all under a single, 'master' parent entity or to have
> the parent and children be entity specific.
>
> so either:
>
> [platonic ideal parent item]
> child1: {entity1, tranx1}
> child2: {entity1, tranx2}
> child3: {entity2, tranx3}
> child4: {entity3, tranx4}
>
> VS.
>
> [entity1's parent item]
> child1: {tranx1}
> child2: {tranx2}
> [entity2's parent item]
> child1: {tranx3}
> [entity3's parent item]
> child1: {tranx4}
>
> could be up to several hundred child docs per entity, though usually will
> be double digits only (per entity), sometimes as low as < 10.
>
> hope this makes sense. thanks for any insight!
>
> best,
> --
> John Blythe
>


-- 
Sincerely yours
Mikhail Khludnev