Re: Question on solr metrics

2020-10-27 Thread Emir Arnautović
Hi,
In order to see time range metrics, you’ll need to collect metrics periodically, 
send them to some storage, and then query/visualise them. Solr has exporters for 
some popular backends, or you can use a cloud-based solution. One such solution 
is ours: https://sematext.com/integrations/solr-monitoring/ and we’ve also just 
added a Solr logs integration, so you can collect/visualise/alert on both 
metrics and logs.
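
As a rough illustration of the collection step, here is a minimal polling sketch 
in Python (standard library only; the host, interval and metric keys are 
assumptions to adapt, and the storage write is left as a stub):

import json, time, urllib.request

SOLR = "http://localhost:8983/solr"   # assumption: adjust to your node

def jvm_metrics():
    # group=jvm limits the response to the node-level JVM/OS registry
    with urllib.request.urlopen(SOLR + "/admin/metrics?group=jvm&wt=json") as resp:
        return json.load(resp)["metrics"]["solr.jvm"]

while True:
    m = jvm_metrics()
    point = {
        "ts": time.time(),
        "cpu": m.get("os.systemCpuLoad"),       # key names can differ per JVM/Solr version
        "heap_used": m.get("memory.heap.used"),
    }
    print(json.dumps(point))                    # replace with a write to your metrics store
    time.sleep(60)                              # collection interval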

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Oct 2020, at 22:08, yaswanth kumar  wrote:
> 
> Can we get the metrics for a particular time range? I know metrics history
> was not enabled, so I will only have data from when the solr node was last
> started, but even so, can we do a date range, for example to see CPU usage
> over a particular time range?
> 
> Note: Solr version: 8.2
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: Question on metric values

2020-10-26 Thread Andrzej Białecki
The “requests” metric is a simple counter. Please see the documentation in the 
Reference Guide on the available metrics and their meaning. This counter is 
initialised when the replica starts up, and it’s not persisted (so if you 
restart this Solr node it will reset to 0).


If by “frequency” you mean rate of requests over a time period then the 1-, 5- 
and 15-min rates are available from “QUERY./select.requestTimes”
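
For example, a small sketch of reading those rates over HTTP (assuming a local 
node; the exact key names such as 1minRate come from the underlying Dropwizard 
timer and may vary slightly by version):

import json, urllib.request

url = ("http://localhost:8983/solr/admin/metrics"
       "?prefix=QUERY./select.requestTimes&wt=json")
with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)["metrics"]

# One registry per core, e.g. "solr.core.mycoll.shard1.replica_n1"
for registry, values in metrics.items():
    timer = values.get("QUERY./select.requestTimes")
    if not timer:
        continue
    rates = {k: v for k, v in timer.items() if "Rate" in k or k == "count"}
    print(registry, rates)   # count plus meanRate / 1minRate / 5minRate / 15minRate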

—

Andrzej Białecki

> On 26 Oct 2020, at 17:25, yaswanth kumar  wrote:
> 
> I am new to the metrics api in solr. When I try
> solr/admin/metrics?prefix=QUERY./select.requests it returns numbers
> against each collection that I have. I understand those are the
> requests coming in against each collection, but over what time window?
> Are those numbers from the time the collection went live, or are they
> for the last n minutes, or config based? Also, what's the default
> window when we don't configure anything?
> 
> Note: I am using solr 8.2
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: Question about solr commits

2020-10-08 Thread Erick Erickson
This is a bit confused. There will be only one timer that starts at time T when
the first doc comes in. At T+15 seconds, all docs that have been received since
time T will be committed. The first doc to hit Solr _after_ T+15 seconds starts
a single new timer and the process repeats.
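
A toy illustration of that timer behaviour (plain Python, not Solr code; the 
15-second window and arrival times are just Rahul's example):

def commit_times(arrivals, max_time):
    # The first uncommitted update starts a single timer; updates that arrive
    # while the timer is running ride along with that commit.
    commits = []
    pending = None
    for t in sorted(arrivals):
        if pending is None or t > pending:
            pending = t + max_time          # first doc after the window starts a new timer
            commits.append(pending)
    return commits

print(commit_times([0, 10], 15))       # -> [15]     one commit covers both updates
print(commit_times([0, 10, 20], 15))   # -> [15, 35] the doc at t=20 starts a new timer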

Best,
Erick

> On Oct 8, 2020, at 2:26 PM, Rahul Goswami  wrote:
> 
> Shawn,
> So if the autoCommit interval is 15 seconds, and one update request arrives
> at t=0 and another at t=10 seconds, then will there be two timers one
> expiring at t=15 and another at t=25 seconds, but this would amount to ONLY
> ONE commit at t=15 since that one would include changes from both updates.
> Is this understanding correct ?
> 
> Thanks,
> Rahul
> 
> On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar 
> wrote:
> 
>> Thank you very much both Eric and Shawn
>> 
>> Sent from my iPhone
>> 
>>> On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
>>> 
>>> On 10/7/2020 4:40 PM, yaswanth kumar wrote:
>>>> I have the below in my solrconfig.xml
>>>> 
>>>> <updateLog>
>>>>   <str name="dir">${solr.Data.dir:}</str>
>>>> </updateLog>
>>>> <autoCommit>
>>>>   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>>>   <openSearcher>false</openSearcher>
>>>> </autoCommit>
>>>> <autoSoftCommit>
>>>>   <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
>>>> </autoSoftCommit>
>>>> 
>>>> Does this mean even though we are always sending data with commit=false on
>>>> update solr api, the above should do the commit every minute (60000 ms)
>>>> right?
>>> 
>>> Assuming that you have not defined the "solr.autoCommit.maxTime" and/or
>> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to
>> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
>>> 
>>> So five seconds after any indexing begins, Solr will do a soft commit.
>> When that commit finishes, changes to the index will be visible to
>> queries.  One minute after any indexing begins, Solr will do a hard commit,
>> which guarantees that data is written to disk, but it will NOT open a new
>> searcher, which means that when the hard commit happens, any pending
>> changes to the index will not be visible.
>>> 
>>> It's not "every five seconds" or "every 60 seconds" ... When any changes
>> are made, Solr starts a timer.  When the timer expires, the commit is
>> fired.  If no changes are made, no commits happen, because the timer isn't
>> started.
>>> 
>>> Thanks,
>>> Shawn
>> 



Re: Question about solr commits

2020-10-08 Thread Rahul Goswami
Shawn,
So if the autoCommit interval is 15 seconds, and one update request arrives
at t=0 and another at t=10 seconds, then will there be two timers one
expiring at t=15 and another at t=25 seconds, but this would amount to ONLY
ONE commit at t=15 since that one would include changes from both updates.
Is this understanding correct ?

Thanks,
Rahul

On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar 
wrote:

> Thank you very much both Eric and Shawn
>
> Sent from my iPhone
>
> > On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
> >
> > On 10/7/2020 4:40 PM, yaswanth kumar wrote:
> >> I have the below in my solrconfig.xml
> >> 
> >> <updateLog>
> >>   <str name="dir">${solr.Data.dir:}</str>
> >> </updateLog>
> >> <autoCommit>
> >>   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> >>   <openSearcher>false</openSearcher>
> >> </autoCommit>
> >> <autoSoftCommit>
> >>   <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
> >> </autoSoftCommit>
> >> 
> >> Does this mean even though we are always sending data with commit=false on
> >> update solr api, the above should do the commit every minute (60000 ms)
> >> right?
> >
> > Assuming that you have not defined the "solr.autoCommit.maxTime" and/or
> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to
> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
> >
> > So five seconds after any indexing begins, Solr will do a soft commit.
> When that commit finishes, changes to the index will be visible to
> queries.  One minute after any indexing begins, Solr will do a hard commit,
> which guarantees that data is written to disk, but it will NOT open a new
> searcher, which means that when the hard commit happens, any pending
> changes to the index will not be visible.
> >
> > It's not "every five seconds" or "every 60 seconds" ... When any changes
> are made, Solr starts a timer.  When the timer expires, the commit is
> fired.  If no changes are made, no commits happen, because the timer isn't
> started.
> >
> > Thanks,
> > Shawn
>


Re: Question about solr commits

2020-10-07 Thread yaswanth kumar
Thank you very much both Eric and Shawn

Sent from my iPhone

> On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
> 
> On 10/7/2020 4:40 PM, yaswanth kumar wrote:
>> I have the below in my solrconfig.xml
>> 
>> <updateLog>
>>   <str name="dir">${solr.Data.dir:}</str>
>> </updateLog>
>> <autoCommit>
>>   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
>> </autoSoftCommit>
>> 
>> Does this mean even though we are always sending data with commit=false on
>> update solr api, the above should do the commit every minute (60000 ms)
>> right?
> 
> Assuming that you have not defined the "solr.autoCommit.maxTime" and/or 
> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to 
> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
> 
> So five seconds after any indexing begins, Solr will do a soft commit. When 
> that commit finishes, changes to the index will be visible to queries.  One 
> minute after any indexing begins, Solr will do a hard commit, which 
> guarantees that data is written to disk, but it will NOT open a new searcher, 
> which means that when the hard commit happens, any pending changes to the 
> index will not be visible.
> 
> It's not "every five seconds" or "every 60 seconds" ... When any changes are 
> made, Solr starts a timer.  When the timer expires, the commit is fired.  If 
> no changes are made, no commits happen, because the timer isn't started.
> 
> Thanks,
> Shawn


Re: Question about solr commits

2020-10-07 Thread Shawn Heisey

On 10/7/2020 4:40 PM, yaswanth kumar wrote:

I have the below in my solrconfig.xml

<updateLog>
  <str name="dir">${solr.Data.dir:}</str>
</updateLog>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
</autoSoftCommit>

Does this mean even though we are always sending data with commit=false on
update solr api, the above should do the commit every minute (60000 ms)
right?


Assuming that you have not defined the "solr.autoCommit.maxTime" and/or 
"solr.autoSoftCommit.maxTime" properties, this config has autoCommit set 
to 60 seconds without opening a searcher, and autoSoftCommit set to 5 
seconds.


So five seconds after any indexing begins, Solr will do a soft commit. 
When that commit finishes, changes to the index will be visible to 
queries.  One minute after any indexing begins, Solr will do a hard 
commit, which guarantees that data is written to disk, but it will NOT 
open a new searcher, which means that when the hard commit happens, any 
pending changes to the index will not be visible.


It's not "every five seconds" or "every 60 seconds" ... When any changes 
are made, Solr starts a timer.  When the timer expires, the commit is 
fired.  If no changes are made, no commits happen, because the timer 
isn't started.
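
As an illustration, an update sent without an explicit commit only becomes 
searchable once that soft commit fires; a sketch (the collection name "mycoll" 
and the title_s field are placeholders):

import json, urllib.request

SOLR = "http://localhost:8983/solr/mycoll"   # placeholder collection

doc = [{"id": "doc-1", "title_s": "hello"}]
req = urllib.request.Request(
    SOLR + "/update?commit=false",           # rely on autoCommit / autoSoftCommit
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req).read()

# Right after the request the doc is normally not yet searchable; query again
# once the autoSoftCommit window (5 seconds in the config above) has passed.
with urllib.request.urlopen(SOLR + "/select?q=id:doc-1&wt=json") as resp:
    print(json.load(resp)["response"]["numFound"])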


Thanks,
Shawn


Re: Question about solr commits

2020-10-07 Thread Erick Erickson
Yes.

> On Oct 7, 2020, at 6:40 PM, yaswanth kumar  wrote:
> 
> I have the below in my solrconfig.xml
> 
> <updateLog>
>   <str name="dir">${solr.Data.dir:}</str>
> </updateLog>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
> </autoSoftCommit>
> 
> Does this mean even though we are always sending data with commit=false on
> update solr api, the above should do the commit every minute (60000 ms)
> right?
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: Question on sorting

2020-07-23 Thread Saurabh Sharma
Hi,
It is because the field is a string, so the values are sorted
lexicographically. It has nothing to do with the number of digits.
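
A quick illustration of the difference (plain Python): the usual fix is to 
reindex TRACK_ID, or a copyField of it, into a numeric field type such as plong 
(assuming a recent Solr version) and sort on that field instead.

ids = ["84806", "124561"]

# How Solr sorts a StrField: character by character, so "8..." > "1..."
print(sorted(ids, reverse=True))            # ['84806', '124561']

# The order you expect needs a numeric comparison, i.e. a numeric field type
print(sorted(ids, key=int, reverse=True))   # ['124561', '84806']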

Thanks
Saurabh


On Thu, Jul 23, 2020, 11:24 AM Srinivas Kashyap
 wrote:

> Hello,
>
> I have schema and field definition as shown below:
>
>  omitNorms="true"/>
>
>
>   />
>
> TRACK_ID field contains "NUMERIC VALUE".
>
> When I use sort on track_id (TRACK_ID desc) it is not working properly.
>
> ->I have below values in Track_ID
>
> Doc1: "84806"
> Doc2: "124561"
>
> Ideally, when I use sort command, query result should be
>
> Doc2: "124561"
> Doc1: "84806"
>
> But I'm getting:
>
> Doc1: "84806"
> Doc2: "124561"
>
> Is this because, field type is string and doc1 has 5 digits and doc2 has 6
> digits?
>
> Please provide solution for this.
>
> Thanks,
> Srinivas
>
>
> 
>


Re: Question regarding replica leader

2020-07-20 Thread Vishal Vaibhav
So how do we recover from such a state? When I am trying addreplica, it
returns a 503. Also my node has multiple replicas and most of them are
dead. How do we get rid of those dead replicas via a script? Is that a
possibility?

On Mon, 20 Jul 2020 at 11:00 AM, Radu Gheorghe 
wrote:

> Hi Vishal,
>
> I think that’s true, yes. The cluster has a leader (overseer), but this
> particular shard doesn’t seem to have a leader (yet). Logs should give you
> some pointers about why this happens (it may be, for example, that each
> replica is waiting for the other to become a leader, because each missed
> some updates).
>
> Best regards,
> Radu
> --
> Sematext Cloud - Full Stack Observability - https://sematext.com
> Solr and Elasticsearch Consulting, Training and Production Support
>
> > On 20 Jul 2020, at 04:17, Vishal Vaibhav  wrote:
> >
> > Hi any pointers on this ?
> >
> > On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav 
> wrote:
> >
> >> Hi Solr folks,
> >>
> >> I am using solr cloud 8.4.1 . I am using*
> >> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
> >> get a list of replicas in which one is active but neither of them is
> >> leader. Something like this
> >>
> >> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
> >> node_name": "node3 base url","state": "active","type": "NRT","
> >> force_set_state": "false"},"core_node74": {"core":
> >> "rules_shard1_replica_n73","base_url": "node1","node_name":
> >> "node1_base_url","state": "down","type": "NRT","force_set_state":
> "false"}
> >> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
> >> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
> >> znodeVersion": 276,"configName": "rules"}},"live_nodes":
> ["node1","node2",
> >> "node3","node4"] And when i see overseer status
> >> solr/admin/collections?action=OVERSEERSTATUS I get response like this
> which
> >> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
> >> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
> >> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
> >>
> >> Does it mean the cluster is having a leader node but there is no leader
> >> replica as of now? And why the leader election is not happening?
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>
>


Re: Question regarding replica leader

2020-07-19 Thread Radu Gheorghe
Hi Vishal,

I think that’s true, yes. The cluster has a leader (overseer), but this 
particular shard doesn’t seem to have a leader (yet). Logs should give you some 
pointers about why this happens (it may be, for example, that each replica is 
waiting for the other to become a leader, because each missed some updates).
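
If it helps, here is a small sketch that walks CLUSTERSTATUS and flags shards 
where no active replica is marked as leader (a local node and the "rules" 
collection from your snippet are assumptions):

import json, urllib.request

SOLR = "http://localhost:8983/solr"
COLLECTION = "rules"

url = f"{SOLR}/admin/collections?action=CLUSTERSTATUS&collection={COLLECTION}&wt=json"
with urllib.request.urlopen(url) as resp:
    shards = json.load(resp)["cluster"]["collections"][COLLECTION]["shards"]

for shard, info in shards.items():
    leaders = [name for name, r in info["replicas"].items()
               if r.get("leader") == "true" and r.get("state") == "active"]
    states = {name: r["state"] for name, r in info["replicas"].items()}
    print(shard, "leader:", leaders or "NONE", "replica states:", states)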

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 20 Jul 2020, at 04:17, Vishal Vaibhav  wrote:
> 
> Hi any pointers on this ?
> 
> On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav  wrote:
> 
>> Hi Solr folks,
>> 
>> I am using solr cloud 8.4.1 . I am using*
>> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
>> get a list of replicas in which one is active but neither of them is
>> leader. Something like this
>> 
>> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
>> node_name": "node3 base url","state": "active","type": "NRT","
>> force_set_state": "false"},"core_node74": {"core":
>> "rules_shard1_replica_n73","base_url": "node1","node_name":
>> "node1_base_url","state": "down","type": "NRT","force_set_state": "false"}
>> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
>> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
>> znodeVersion": 276,"configName": "rules"}},"live_nodes": ["node1","node2",
>> "node3","node4"] And when i see overseer status
>> solr/admin/collections?action=OVERSEERSTATUS I get response like this which
>> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
>> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
>> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
>> 
>> Does it mean the cluster is having a leader node but there is no leader
>> replica as of now? And why the leader election is not happening?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 



Re: Question regarding replica leader

2020-07-19 Thread Vishal Vaibhav
Hi any pointers on this ?

On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav  wrote:

> Hi Solr folks,
>
> I am using solr cloud 8.4.1 . I am using*
> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
> get a list of replicas in which one is active but neither of them is
> leader. Something like this
>
> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
> node_name": "node3 base url","state": "active","type": "NRT","
> force_set_state": "false"},"core_node74": {"core":
> "rules_shard1_replica_n73","base_url": "node1","node_name":
> "node1_base_url","state": "down","type": "NRT","force_set_state": "false"}
> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
> znodeVersion": 276,"configName": "rules"}},"live_nodes": ["node1","node2",
> "node3","node4"] And when i see overseer status
> solr/admin/collections?action=OVERSEERSTATUS I get response like this which
> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
>
> Does it mean the cluster is having a leader node but there is no leader
> replica as of now? And why the leader election is not happening?
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Question about Atomic Update

2020-06-15 Thread david . davila
Hi Erick,


Thank you for your answer.

Unfortunately our most important field is that text field, so we need to 
index it. We will have to accept that big documents take a long time to 
index.


Best,

David




David Dávila Atienza
AEAT - Departamento de Informática Tributaria
Subdirección de Tecnologías de Análisis de la Información e Investigación 
del Fraude
Teléfono: 915828763
Extensión: 36763



De: "Erick Erickson" 
Para:   solr-user@lucene.apache.org
Fecha:  15/06/2020 14:27
Asunto: Re: Question about Atomic Update



All Atomic Updates do is 
1> read all the stored fields from the record being updated
2> overlay your updates
3> re-index the document.

At <3> it’s exactly as though you sent the entire document
again, so your observation that the whole document is 
re-indexed is accurate.

If the fields you want to update are single-valued, docValues=true
numeric fields you can update those without the whole doc being
re-indexed. But if you need to search on those fields it’ll probably
be unacceptably slow. However, if you _do_ need to search,
sometimes you can get creative with function queries. OK, this
last is opaque but say you have a “quantity” field and only want to
find docs that have quantity > 0. You can add a function query
to your query (either q or fq) that returns the value of that field,
which means the score is 0 for docs where quantity==0 and the
doc drops out of the result set.

It?s not clear whether you search the text field, but if not you can
store it somewhere else and only fetch it as needed.

Best,
Erick

> On Jun 15, 2020, at 7:55 AM, david.dav...@correo.aeat.es wrote:
> 
> Hi,
> 
> I have a question related with atomic update in Solr.
> 
> In our collection,  documents have a lot of fields, most of them small. 
> However, there is one of them that includes the text of the document. 
> Sometimes, fortunately not many, this text is very long, more than 3 or 4 
> MB of plain text. We use different analyzers such as synonyms, etc. and 
> this causes that index time in that documents is long, about 15 seconds.
> 
> Sometimes, we should update some small fields, and it is a big problem for 
> us because of the time that it consumes. We have been testing with atomic 
> update, but time is exactly the same than sending the document again. We 
> expected that with atomic update only the updated fields were indexed and 
> time would reduce. But it seems that internally Solr gets the whole 
> document and reindex all the fields.
> 
> Does it work in that way? Am I wrong, any advice?
> 
> We have tested with Solr 7.4 and Solr 4.10
> 
> Thanks,
> 
> David 






Re: Question about Atomic Update

2020-06-15 Thread Erick Erickson
All Atomic Updates do is 
1> read all the stored fields from the record being updated
2> overlay your updates
3> re-index the document.

At <3> it’s exactly as though you sent the entire document
again, so your observation that the whole document is 
re-indexed is accurate.

If the fields you want to update are single-valued, docValues=true
numeric fields you can update those without the whole doc being
re-indexed. But if you need to search on those fields it’ll probably
be unacceptably slow. However, if you _do_ need to search,
sometimes you can get creative with function queries. OK, this
last is opaque but say you have a “quantity” field and only want to
find docs that have quantity > 0. You can add a function query
to your query (either q or fq) that returns the value of that field,
which means the score is 0 for docs where quantity==0 and the
doc drops out of the result set.

It’s not clear whether you search the text field, but if not you can
store it somewhere else and only fetch it as needed.
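
For reference, a minimal atomic-update request looks like the sketch below (the 
collection name "docs" and field "status_s" are placeholders; as described 
above, Solr still re-reads the stored fields and re-indexes the whole document 
unless the field qualifies for an in-place docValues update):

import json, urllib.request

SOLR = "http://localhost:8983/solr/docs"   # placeholder collection

# Only the listed field carries a modifier; everything else is rebuilt from
# the stored fields of the existing document.
payload = [{"id": "doc-42", "status_s": {"set": "archived"}}]

req = urllib.request.Request(
    SOLR + "/update?commit=true",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())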

Best,
Erick

> On Jun 15, 2020, at 7:55 AM, david.dav...@correo.aeat.es wrote:
> 
> Hi,
> 
> I have a question related with atomic update in Solr.
> 
> In our collection,  documents have a lot of fields, most of them small. 
> However, there is one of them that includes the text of the document. 
> Sometimes, fortunately not many, this text is very long, more than 3 or 4 
> MB of plain text. We use different analyzers such as synonyms, etc. and 
> this causes that index time in that documents is long, about 15 seconds.
> 
> Sometimes, we should update some small fields, and it is a big problem for 
> us because of the time that it consumes. We have been testing with atomic 
> update, but time is exactly the same than sending the document again. We 
> expected that with atomic update only the updated fields were indexed and 
> time would reduce. But it seems that internally Solr gets the whole 
> document and reindex all the fields.
> 
> Does it work in that way? Am I wrong, any advice?
> 
> We have tested with Solr 7.4 and Solr 4.10
> 
> Thanks,
> 
> David 



Re: question about setup for maximizing solr performance

2020-06-01 Thread Shawn Heisey

On 6/1/2020 9:29 AM, Odysci wrote:

Hi,
I'm looking for some advice on improving performance of our solr setup.




Does anyone have any insights on what would be better for maximizing
throughput on multiple searches being done at the same time?
thanks!


In almost all cases, adding memory will provide the best performance 
boost.  This is because memory is faster than disks, even SSD.  I have 
put relevant information on a wiki page so that it is easy for people to 
find and digest:


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems

Thanks,
Shawn


Re: Question about the max num of solr node

2020-01-03 Thread Jörn Franke
Why do you want to set up so many? What are your designs in terms of volumes / 
no of documents etc? 


> Am 03.01.2020 um 10:32 schrieb Hongxu Ma :
> 
> Hi community
> I plan to set up a 128 host cluster: 2 solr nodes on each host.
> But I have a little concern about whether solr can support so many nodes.
> 
> I searched on wiki and found:
> https://cwiki.apache.org/confluence/display/SOLR/2019-11+Meeting+on+SolrCloud+and+project+health
> "If you create thousands of collections, it’ll lock up and become inoperable. 
>  Scott reported that If you boot up a 100+ node cluster, SolrCloud won’t get 
> to a happy state; currently you need to start them gradually."
> 
> I wonder to know:
> Besides the quoted items, does solr have known issues in a big cluster?
> And does solr have a hard limit on the max number of nodes?
> 
> Thanks.


Re: Question about Luke

2019-11-20 Thread Tomoko Uchida
Hello,

> Is it different from checkIndex -exorcise option?
> (As far as I recently learned, checkIndex -exorcise will delete unreadable 
> indices. )

If you mean desktop app Luke, "Repair" is just a wrapper of
CheckIndex.exorciseIndex(). There is no difference between doing
"Repair" from Luke GUI and calling "CheckIndex -exorcise" from CLI.


On Mon, 11 Nov 2019 at 20:36, Kayak28  wrote:
>
> Hello, Community:
>
> I am using Solr7.4.0 currently, and I was testing how Solr actually behaves 
> when it has a corrupted index.
> And I used Luke to fix the broken index from GUI.
> I just came up with the following questions.
> Is it possible to use the repair index tool from CLI? (in the case, Solr was 
> on AWS for example.)
> Is it different from checkIndex -exorcise option?
> (As far as I recently learned, checkIndex -exorcise will delete unreadable 
> indices. )
>
> If anyone gives me a reply, I would be very thankful.
>
> Sincerely,
> Kaya Ota


Re: Question about startup memory usage

2019-11-14 Thread Shawn Heisey

On 11/14/2019 1:46 AM, Hongxu Ma wrote:

Thank you @Shawn Heisey , you help me many times.

My -xms=1G
When restart solr, I can see the progress of memory increasing (from 1G to 9G, 
took near 10s).

I have a guess: maybe solr is loading some needed files into heap memory, e.g. 
*.tip : term index file. What's your thoughts?


Solr's basic operation involves quite a lot of Java memory allocation. 
Most of what gets allocated turns into garbage almost immediately, but 
Java does not reuse that memory right away ... it can only be reused 
after garbage collection on the appropriate memory region runs.


The algorithms in Java that decide between either grabbing more memory 
(up to the configured heap limit) or running garbage collection are 
beyond my understanding.  For programs with heavy memory allocation, 
like Solr, the preference does seem to lean towards allocating more 
memory if it's available than performing garbage collection.


I can imagine that initial loading of indexes containing billions of 
documents will require quite a bit of heap.  I do not know what data is 
stored in that memory.
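
If you want to see where the resident memory is going after startup, the 
node-level metrics API exposes the JVM heap gauges; a small sketch (a local node 
is assumed, and the key names may vary slightly by version):

import json, urllib.request

with urllib.request.urlopen(
        "http://localhost:8983/solr/admin/metrics?group=jvm&wt=json") as resp:
    jvm = json.load(resp)["metrics"]["solr.jvm"]

for key in ("memory.heap.used", "memory.heap.committed", "memory.heap.max"):
    value = jvm.get(key)
    if value is not None:
        print(f"{key}: {value / 1024 ** 3:.2f} GB")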


Thanks,
Shawn


Re: Question about startup memory usage

2019-11-14 Thread Hongxu Ma
Thank you @Shawn Heisey, you have helped me many times.

My -xms=1G
When restart solr, I can see the progress of memory increasing (from 1G to 9G, 
took near 10s).

I have a guess: maybe solr is loading some needed files into heap memory, e.g. 
*.tip : term index file. What's your thoughts?

thanks.



From: Shawn Heisey 
Sent: Thursday, November 14, 2019 1:15
To: solr-user@lucene.apache.org 
Subject: Re: Question about startup memory usage

On 11/13/2019 2:03 AM, Hongxu Ma wrote:
> I have a solr-cloud cluster with a big collection, after startup (no any 
> search/index operations), its jvm memory usage is 9GB (via top: RES).
>
> Cluster and collection info:
> each host: total 64G mem, two solr nodes with -xmx=15G
> collection: total 9 billion docs (but each doc is very small: only some 
> bytes), total size 3TB.
>
> My question is:
> Is the 9G mem usage after startup normal? If so, I am worried that the follow 
> up index/search operations will cause an OOM error.
> And how can I reduce the memory usage? Maybe I should introduce more host 
> with nodes, but besides this, is there any other solution?

With the "-Xmx=15G" option, you've told Java that it can use up to 15GB
for heap.  Its total resident memory usage is eventually going to reach
a little over 15GB and probably never go down.  This is how Java works.

The amount of memory that Java allocates immediately on program startup
is related to the -Xms setting.  Normally Solr uses the same number for
both -Xms and -Xmx, but that can be changed if you desire.  We recommend
using the same number.  If -Xms is smaller than -Xmx, Java may allocate
less memory as soon as it starts, then Solr is going to run through its
startup procedure.  We will not know exactly how much memory allocation
is going to occur when that happens ... but with billions of documents,
it's not going to be small.

Thanks,
Shawn


Re: Question about startup memory usage

2019-11-13 Thread Shawn Heisey

On 11/13/2019 2:03 AM, Hongxu Ma wrote:

I have a solr-cloud cluster with a big collection, after startup (no any 
search/index operations), its jvm memory usage is 9GB (via top: RES).

Cluster and collection info:
each host: total 64G mem, two solr nodes with -xmx=15G
collection: total 9 billion docs (but each doc is very small: only some 
bytes), total size 3TB.

My question is:
Is the 9G mem usage after startup normal? If so, I am worried that the follow 
up index/search operations will cause an OOM error.
And how can I reduce the memory usage? Maybe I should introduce more host with 
nodes, but besides this, is there any other solution?


With the "-Xmx=15G" option, you've told Java that it can use up to 15GB 
for heap.  Its total resident memory usage is eventually going to reach 
a little over 15GB and probably never go down.  This is how Java works.


The amount of memory that Java allocates immediately on program startup 
is related to the -Xms setting.  Normally Solr uses the same number for 
both -Xms and -Xmx, but that can be changed if you desire.  We recommend 
using the same number.  If -Xms is smaller than -Xmx, Java may allocate 
less memory as soon as it starts, then Solr is going to run through its 
startup procedure.  We will not know exactly how much memory allocation 
is going to occur when that happens ... but with billions of documents, 
it's not going to be small.


Thanks,
Shawn


Re: Question about memory usage and file handling

2019-11-11 Thread Erick Erickson
(1) No. The internal RAM buffer will pretty much limit the amount of heap used, 
however.

(2) You actually have several segments. “.cfs” stands for “Compound File”, see: 

https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html
"An optional "virtual" file consisting of all the other index files for systems 
that frequently run out of file handles.”

IOW, _0.cfs is a complete segment. _1.cfs is a different, complete segment etc. 
The merge policy (TieredMergePolicy) controls when these are used .vs. the 
segment being kept in separate files.

New segments are created whenever the ram buffer is flushed or whenever you do 
a commit (closing the IW also creates a segment IIUC). However, under control 
of the merge policy, segments are merged. See: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

You’re confusing closing a writer with merging segments. Essentially, every 
time a commit happens, the merge policy is called to determine if segments 
should be merged, see Mike’s blog above.

Additionally, you say "I was hoping there would be only _0.cfs file”. This’ll 
pretty much never happen. Segment names always increase, at best you’d have 
something like _ab.cfs, if not 10-15 _ab* files.

Lucene likes file handles, essentially when searching a file handle will be 
open for _every_ file in your index all the time.

All that said, counting the number of files seems like a waste of time. If 
you’re running on a *nix box, the usual (Solr I’ll admit, but I think it 
applies to Lucene as well) is to set the limit to 65K or so.

And if you’re truly concerned, and since you say this is an immutable, you can 
do a forceMerge. Prior to Lucene 7.5, this would by default form exactly one 
segment. For Lucene 7.5 and later, it’ll respect max segment size (a parameter 
in TMP, defaults to 5g) unless you specify a segment count of 1.
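
If you do want to see the actual counts anyway, here is a small sketch that 
groups an index directory's files by segment (the path is a placeholder):

import os
from collections import Counter

INDEX_DIR = "/path/to/index"   # placeholder: the Lucene index directory

files = os.listdir(INDEX_DIR)
per_segment = Counter(
    f.split(".")[0].split("_")[1]   # "_0.cfs" or "_0_Lucene50_0.doc" -> segment "0"
    for f in files if f.startswith("_")
)

print("total files:", len(files))
for seg, count in sorted(per_segment.items()):
    print(f"segment _{seg}: {count} files")
# An open searcher keeps a handle on every one of these files, hence the
# usual advice of a ~65K open-files limit.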

Best,
Erick

> On Nov 11, 2019, at 5:47 PM, Shawn Heisey  wrote:
> 
> On 11/11/2019 1:40 PM, siddharth teotia wrote:
>> I have a few questions about Lucene indexing and file handling. It would be
>> great if someone can help with these. I had earlier asked these questions
>> on gene...@lucene.apache.org but was asked to seek help here.
> 
> This mailing list (solr-user) is for Solr.  Questions about Lucene do not 
> belong on this list.
> 
> You should ask on the java-user mailing list, which is for questions related 
> to the core (Java) version of Lucene.
> 
> http://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg
> 
> I have put the original sender address in the BCC field just in case you are 
> not subscribed here.
> 
> Thanks,
> Shawn



Re: Question about memory usage and file handling

2019-11-11 Thread Shawn Heisey

On 11/11/2019 1:40 PM, siddharth teotia wrote:

I have a few questions about Lucene indexing and file handling. It would be
great if someone can help with these. I had earlier asked these questions
on gene...@lucene.apache.org but was asked to seek help here.


This mailing list (solr-user) is for Solr.  Questions about Lucene do 
not belong on this list.


You should ask on the java-user mailing list, which is for questions 
related to the core (Java) version of Lucene.


http://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg

I have put the original sender address in the BCC field just in case you 
are not subscribed here.


Thanks,
Shawn


Re: Question regarding subqueries

2019-10-03 Thread Bram Biesbrouck
Hi Mikhail,

You're right, I'm probably over-complicating things. I was stuck trying to
combine a function in a regular query using a local variable, but Solr
doesn't seem to bend the way my mind did ;-)
Anyway, I worked around it using your suggestion and/or a slightly modified
prefix parser plugin.
Thanks for taking the time to reply, btw!

best,

b.

On Wed, Oct 2, 2019 at 9:05 PM Mikhail Khludnev  wrote:

> Hello, Bram.
>
> Something like that is possible in principle, but it will take enormous
> efforts to tackle exact syntax.
> Why not something like children.fq=-parent:true ?
>
> On Wed, Oct 2, 2019 at 8:52 PM Bram Biesbrouck <
> bram.biesbro...@reinvention.be> wrote:
>
> > Hi all,
> >
> > I'm struggling with a little period-sign difficulty and instead of
> pulling
> > out my hair, I wonder if any of you could help me out...
> >
> > Here's the query:
> > q=uri:"/en/blah"=id,uri,children:[subquery]={!prefix f=id
> v=$
> > row.id}=*
> >
> > It just searches for a document with the field "uri" set to "/en/blah".
> > For every hit (just one), it tries to manually fetch the subdocuments
> using
> > the id field of the hit since its children have id's like
> > ..
> > Note that I know this should be done with nested documents and the
> > ChildDocTransformer... this is just an exercise to train my brain...
> >
> > The query above works fine. However, it also returns the parent document,
> > because the prefix search includes it as well, of course. However, if I'm
> > changing the subquery to something along the lines of this:
> >
> > {!prefix f=id v=concat($row.id,".")}
> > or
> > {!prefix f=id v="$row.id\.")}
> > or
> > {!query defType=lucene v=concat("id:",$row.id,".")}
> >
> > I get no results back.
> >
> > I feel like I'm missing only a simple thing here, but can't seem to
> > pinpoint it.
> >
> > Any help?
> >
> > b.
> >  *We do video technology*
> > Visit our new website!  *Bram Biesbrouck*
> > bram.biesbro...@reinvention.be
> > +32 486 118280 <0032%20486%20118280>
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Question regarding subqueries

2019-10-02 Thread Mikhail Khludnev
Hello, Bram.

Something like that is possible in principle, but it will take enormous
efforts to tackle exact syntax.
Why not something like children.fq=-parent:true ?
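
For completeness, roughly how the pieces fit together as request parameters (a 
sketch: it assumes a boolean "parent" flag field as above, and a placeholder 
collection name):

import json, urllib.parse, urllib.request

params = urllib.parse.urlencode({
    "q": 'uri:"/en/blah"',
    "fl": "id,uri,children:[subquery]",
    "children.q": "{!prefix f=id v=$row.id}",   # prefix search seeded with each parent's id
    "children.fq": "-parent:true",              # drop the parent doc itself
    "children.rows": "100",
    "wt": "json",
})
url = "http://localhost:8983/solr/mycoll/select?" + params   # placeholder collection
with urllib.request.urlopen(url) as resp:
    print(json.dumps(json.load(resp), indent=2))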

On Wed, Oct 2, 2019 at 8:52 PM Bram Biesbrouck <
bram.biesbro...@reinvention.be> wrote:

> Hi all,
>
> I'm struggling with a little period-sign difficulty and instead of pulling
> out my hair, I wonder if any of you could help me out...
>
> Here's the query:
> q=uri:"/en/blah"=id,uri,children:[subquery]={!prefix f=id v=$
> row.id}=*
>
> It just searches for a document with the field "uri" set to "/en/blah".
> For every hit (just one), it tries to manually fetch the subdocuments using
> the id field of the hit since its children have id's like
> ..
> Note that I know this should be done with nested documents and the
> ChildDocTransformer... this is just an exercise to train my brain...
>
> The query above works fine. However, it also returns the parent document,
> because the prefix search includes it as well, of course. However, if I'm
> changing the subquery to something along the lines of this:
>
> {!prefix f=id v=concat($row.id,".")}
> or
> {!prefix f=id v="$row.id\.")}
> or
> {!query defType=lucene v=concat("id:",$row.id,".")}
>
> I get no results back.
>
> I feel like I'm missing only a simple thing here, but can't seem to
> pinpoint it.
>
> Any help?
>
> b.
>  *We do video technology*
> Visit our new website!  *Bram Biesbrouck*
> bram.biesbro...@reinvention.be
> +32 486 118280 <0032%20486%20118280>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Question about "No registered leader" error

2019-09-19 Thread Hongxu Ma
@Shawn @Erick Thanks for your kind help!

No OOM log and I confirm there was no OOM happened.

My ZK ticktime is set to 5000, so 5000*20 = 100s > 60s, and I checked the solr 
code: the leader waiting time (4000ms) is a constant and is not configurable. 
(Why isn't it a configurable param?)

My solr version is 7.3.1, xmx = 3MB (via solr UI, peak memory is 22GB)
I have already used CMS GC tuning (param has a little difference from your wiki 
page).

I will try the following advice:

  *   lower heap size
  *   turn to G1 (the same param as wiki)
  *   try to restart one SOLR node when this error happens.

Thanks again.


From: Shawn Heisey 
Sent: Wednesday, September 18, 2019 20:21
To: solr-user@lucene.apache.org 
Subject: Re: Question about "No registered leader" error

On 9/18/2019 6:11 AM, Shawn Heisey wrote:
> On 9/17/2019 9:35 PM, Hongxu Ma wrote:
>> My questions:
>>
>>*   Is this error possible caused by "long gc pause"? my solr
>> zkClientTimeout=60000
>
> It's possible.  I can't say for sure that this is the issue, but it
> might be.

A followup.  I was thinking about the interactions here.  It looks like
Solr only waits four seconds for the leader election, and both of the
pauses you mentioned are longer than that.

Four seconds is probably too short a time to wait, and I do not think
that timeout is configurable anywhere.

> What version of Solr do you have, and what is your max heap?  The CMS
> garbage collection that Solr 5.0 and later incorporate by default is
> pretty good.  My G1 settings might do slightly better, but the
> improvement won't be dramatic unless your existing commandline has
> absolutely no gc tuning at all.

That question will be important.  If you already have our CMS GC tuning,
switching to G1 probably is not going to solve this.  Lowering the max
heap might be the only viable solution in that case, and depending on
what you're dealing with, it will either be impossible or it will
require more servers.

Thanks,
Shawn


Re: Question about "No registered leader" error

2019-09-18 Thread Erick Erickson
Check whether the oom killer script was called. If so, there will be
log files obviously relating to that. I've seen nodes mysteriously
disappear as a result of this with no message in the regular solr
logs. If that's the case, you need to increase your heap.

Erick

On Wed, Sep 18, 2019 at 8:21 AM Shawn Heisey  wrote:
>
> On 9/18/2019 6:11 AM, Shawn Heisey wrote:
> > On 9/17/2019 9:35 PM, Hongxu Ma wrote:
> >> My questions:
> >>
> >>*   Is this error possible caused by "long gc pause"? my solr
> >> zkClientTimeout=60000
> >
> > It's possible.  I can't say for sure that this is the issue, but it
> > might be.
>
> A followup.  I was thinking about the interactions here.  It looks like
> Solr only waits four seconds for the leader election, and both of the
> pauses you mentioned are longer than that.
>
> Four seconds is probably too short a time to wait, and I do not think
> that timeout is configurable anywhere.
>
> > What version of Solr do you have, and what is your max heap?  The CMS
> > garbage collection that Solr 5.0 and later incorporate by default is
> > pretty good.  My G1 settings might do slightly better, but the
> > improvement won't be dramatic unless your existing commandline has
> > absolutely no gc tuning at all.
>
> That question will be important.  If you already have our CMS GC tuning,
> switching to G1 probably is not going to solve this.  Lowering the max
> heap might be the only viable solution in that case, and depending on
> what you're dealing with, it will either be impossible or it will
> require more servers.
>
> Thanks,
> Shawn


Re: Question about "No registered leader" error

2019-09-18 Thread Shawn Heisey

On 9/18/2019 6:11 AM, Shawn Heisey wrote:

On 9/17/2019 9:35 PM, Hongxu Ma wrote:

My questions:

   *   Is this error possible caused by "long gc pause"? my solr 
zkClientTimeout=60000


It's possible.  I can't say for sure that this is the issue, but it 
might be.


A followup.  I was thinking about the interactions here.  It looks like 
Solr only waits four seconds for the leader election, and both of the 
pauses you mentioned are longer than that.


Four seconds is probably too short a time to wait, and I do not think 
that timeout is configurable anywhere.


What version of Solr do you have, and what is your max heap?  The CMS 
garbage collection that Solr 5.0 and later incorporate by default is 
pretty good.  My G1 settings might do slightly better, but the 
improvement won't be dramatic unless your existing commandline has 
absolutely no gc tuning at all.


That question will be important.  If you already have our CMS GC tuning, 
switching to G1 probably is not going to solve this.  Lowering the max 
heap might be the only viable solution in that case, and depending on 
what you're dealing with, it will either be impossible or it will 
require more servers.


Thanks,
Shawn


Re: Question about "No registered leader" error

2019-09-18 Thread Shawn Heisey

On 9/17/2019 9:35 PM, Hongxu Ma wrote:

My questions:

   *   Is this error possible caused by "long gc pause"? my solr 
zkClientTimeout=60000


It's possible.  I can't say for sure that this is the issue, but it 
might be.



   *   If so, how can I prevent this error happen? My thoughts: using G1 
collector (as 
https://cwiki.apache.org/confluence/display/SOLR/ShawnHeisey#ShawnHeisey-GCTuningforSolr)
 or enlarge zkClientTimeout again, what's your idea?


If your ZK server ticktime setting is the typical value of 2000, that 
means that the largest value you can use for the ZK timeout (which 
Solr's zkClientTimeout value ultimately gets used to set) is 40 seconds 
-- 20 times the ticktime is the biggest value ZK will allow.


So if your ZK server ticktime is 2000 milliseconds, you're not actually 
getting 60 seconds, and I don't know what happens when you try ... I 
would expect ZK to either just use its max value or ignore the setting 
entirely, and I do not know which it is.  That's something we should ask 
the ZK mailing list and/or do testing on.


Dealing with the "no registered leader" problem probably will 
involve restarting at least one of the Solr server JVMs in the cloud, 
and if that doesn't work, restart all of them.


What version of Solr do you have, and what is your max heap?  The CMS 
garbage collection that Solr 5.0 and later incorporate by default is 
pretty good.  My G1 settings might do slightly better, but the 
improvement won't be dramatic unless your existing commandline has 
absolutely no gc tuning at all.


Thanks,
Shawn


Re: Question: Solr perform well with thousands of replicas?

2019-09-04 Thread Hongxu Ma
Hi Erick
Thanks for your help.

Before visiting the wiki/mailing list, I already knew solr is unstable with 1000+ 
collections and should be safe with 10~100 collections.
But in a specific env, at what exact number does solr begin to become 
unstable? I don't know.

So I deployed a test cluster to get that number and try to push it higher (to 
save cost).
That's my purpose: quantitative analysis --> how many replicas can be supported 
in my env?
After getting it, I will adjust my application to (when it's near the max 
number) prevent the creation of too many indexes or give a warning message to 
the user.


From: Erick Erickson 
Sent: Monday, September 2, 2019 21:20
To: solr-user@lucene.apache.org 
Subject: Re: Question: Solr perform well with thousands of replicas?

> why so many collection/replica: it's our customer needs, for example: each 
> database table mappings a collection.

I always cringe when I see statements like this. What this means is that your 
customer doesn’t understand search and needs guidance in the proper use of any 
search technology, Solr included.

Solr is _not_ an RDBMS. Simply mapping the DB tables onto collections will 
almost certainly result in a poor experience. Next the customer will want to 
ask Solr to do the same thing a DB does, i.e. run a join across 10 tables etc., 
which will be abysmal. Solr isn’t designed for that. Some brilliant RDBMS 
people have spent many years making DBs do what they do and do it well.

That said, RDBMSs have poor search capabilities, they aren’t built to solve the 
search problem.

I suspect the time you spend making Solr load a thousand cores will be wasted. 
Once you do get them loaded, performance will be horrible. IMO you’d be far 
better off helping the customer define their problem so they properly model 
their search problem. This may mean that the result will be a hybrid where Solr 
is used for the free-text search and the RDBMS uses the results of the search 
to do something. Or vice versa.

FWIW
Erick

> On Sep 2, 2019, at 5:55 AM, Hongxu Ma  wrote:
>
> Thanks @Jörn and @Erick
> I enlarged my JVM memory, so far it's stable (but used many memory).
> And I will check lower-level errors according to your suggestion if error 
> happens.
>
> About my scenario:
>
>  *   why so many collection/replica: it's our customer needs, for example: 
> each database table mappings a collection.
>  *   this env is just a test cluster: I want to verify the max collection 
> number solr can support stably.
>
>
> 
> From: Erick Erickson 
> Sent: Friday, August 30, 2019 20:05
> To: solr-user@lucene.apache.org 
> Subject: Re: Question: Solr perform well with thousands of replicas?
>
> “no registered leader” is the effect of some problem usually, not the root 
> cause. In this case, for instance, you could be running out of file handles 
> and see other errors like “too many open files”. That’s just one example.
>
> One common problem is that Solr needs a lot of file handles and the system 
> defaults are too low. We usually recommend you start with 65K file handles 
> (ulimit) and bump up the number of processes to 65K too.
>
> So to throw some numbers out. With 1,000 replicas, and let’s say you have 50 
> segments in the index in each replica. Each segment consists of multiple 
> files (I’m skipping “compound files” here as an advanced topic), so each 
> segment has, let’s say, 10 files. 1,000 * 50 * 10 would require 500,000 
> file handles on your system.
>
> Bottom line: look for other, lower-level errors in the log to try to 
> understand what limit you’re running into.
>
> All that said, there’ll be a number of “gotchas” when running that many 
> replicas on a particular node, I second Jörn’s question...
>
> Best,
> Erick
>
>> On Aug 30, 2019, at 3:18 AM, Jörn Franke  wrote:
>>
>> What is the reason for this number of replicas? Solr should work fine, but 
>> maybe it is worth to consolidate some collections to avoid also 
>> administrative overhead.
>>
>>> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
>>>
>>> Hi
>>> I have a solr-cloud cluster, but it's unstable when collection number is 
>>> big: 1000 replica/core per solr node.
>>>
>>> To solve this issue, I have read the performance guide:
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>>>
>>> I noted there is a sentence on solr-cloud section:
>>> "Recent Solr versions perform well with thousands of replicas."
>>>
>>> I want to know does it mean a single solr node can handle thousands of 
>>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>>>
>>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>>>
>>> Thanks for you help.
>>>
>



Re: Question: Solr perform well with thousands of replicas?

2019-09-02 Thread Erick Erickson
> why so many collection/replica: it's our customer needs, for example: each 
> database table mappings a collection.

I always cringe when I see statements like this. What this means is that your 
customer doesn’t understand search and needs guidance in the proper use of any 
search technology, Solr included.

Solr is _not_ an RDBMS. Simply mapping the DB tables onto collections will 
almost certainly result in a poor experience. Next the customer will want to 
ask Solr to do the same thing a DB does, i.e. run a join across 10 tables etc., 
which will be abysmal. Solr isn’t designed for that. Some brilliant RDBMS 
people have spent many years making DBs to what they do and do it well. 

That said, RDBMSs have poor search capabilities, they aren’t built to solve the 
search problem.

I suspect the time you spend making Solr load a thousand cores will be wasted. 
Once you do get them loaded, performance will be horrible. IMO you’d be far 
better off helping the customer define their problem so they properly model 
their search problem. This may mean that the result will be a hybrid where Solr 
is used for the free-text search and the RDBMS uses the results of the search 
to do something. Or vice versa.

FWIW
Erick

> On Sep 2, 2019, at 5:55 AM, Hongxu Ma  wrote:
> 
> Thanks @Jörn and @Erick
> I enlarged my JVM memory, so far it's stable (but used many memory).
> And I will check lower-level errors according to your suggestion if error 
> happens.
> 
> About my scenario:
> 
>  *   why so many collection/replica: it's our customer needs, for example: 
> each database table mappings a collection.
>  *   this env is just a test cluster: I want to verify the max collection 
> number solr can support stably.
> 
> 
> 
> From: Erick Erickson 
> Sent: Friday, August 30, 2019 20:05
> To: solr-user@lucene.apache.org 
> Subject: Re: Question: Solr perform well with thousands of replicas?
> 
> “no registered leader” is the effect of some problem usually, not the root 
> cause. In this case, for instance, you could be running out of file handles 
> and see other errors like “too many open files”. That’s just one example.
> 
> One common problem is that Solr needs a lot of file handles and the system 
> defaults are too low. We usually recommend you start with 65K file handles 
> (ulimit) and bump up the number of processes to 65K too.
> 
> So to throw some numbers out. With 1,000 replicas, and let’s say you have 50 
> segments in the index in each replica. Each segment consists of multiple 
> files (I’m skipping “compound files” here as an advanced topic), so each 
> segment has, let’s say, 10 files. 1,000 * 50 * 10 would require 500,000 
> file handles on your system.
> 
> Bottom line: look for other, lower-level errors in the log to try to 
> understand what limit you’re running into.
> 
> All that said, there’ll be a number of “gotchas” when running that many 
> replicas on a particular node, I second Jörn’s question...
> 
> Best,
> Erick
> 
>> On Aug 30, 2019, at 3:18 AM, Jörn Franke  wrote:
>> 
>> What is the reason for this number of replicas? Solr should work fine, but 
>> maybe it is worth to consolidate some collections to avoid also 
>> administrative overhead.
>> 
>>> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
>>> 
>>> Hi
>>> I have a solr-cloud cluster, but it's unstable when collection number is 
>>> big: 1000 replica/core per solr node.
>>> 
>>> To solve this issue, I have read the performance guide:
>>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>>> 
>>> I noted there is a sentence on solr-cloud section:
>>> "Recent Solr versions perform well with thousands of replicas."
>>> 
>>> I want to know does it mean a single solr node can handle thousands of 
>>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>>> 
>>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>>> 
>>> Thanks for you help.
>>> 
> 



Re: Question: Solr perform well with thousands of replicas?

2019-09-02 Thread Hongxu Ma
Thanks @Jörn and @Erick
I enlarged my JVM memory, so far it's stable (but used many memory).
And I will check lower-level errors according to your suggestion if error 
happens.

About my scenario:

  *   why so many collection/replica: it's our customer needs, for example: 
each database table mappings a collection.
  *   this env is just a test cluster: I want to verify the max collection 
number solr can support stably.



From: Erick Erickson 
Sent: Friday, August 30, 2019 20:05
To: solr-user@lucene.apache.org 
Subject: Re: Question: Solr perform well with thousands of replicas?

“no registered leader” is the effect of some problem usually, not the root 
cause. In this case, for instance, you could be running out of file handles and 
see other errors like “too many open files”. That’s just one example.

One common problem is that Solr needs a lot of file handles and the system 
defaults are too low. We usually recommend you start with 65K file handles 
(ulimit) and bump up the number of processes to 65K too.

So to throw some numbers out. With 1,000 replicas, and let’s say you have 50 
segments in the index in each replica. Each segment consists of multiple files 
(I’m skipping “compound files” here as an advanced topic), so each segment has, 
let’s say, 10 files. 1,000 * 50 * 10 would require 500,000 file handles on 
your system.

Bottom line: look for other, lower-level errors in the log to try to understand 
what limit you’re running into.

All that said, there’ll be a number of “gotchas” when running that many 
replicas on a particular node, I second Jörn’s question...

Best,
Erick

> On Aug 30, 2019, at 3:18 AM, Jörn Franke  wrote:
>
> What is the reason for this number of replicas? Solr should work fine, but 
> maybe it is worth to consolidate some collections to avoid also 
> administrative overhead.
>
>> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
>>
>> Hi
>> I have a solr-cloud cluster, but it's unstable when collection number is 
>> big: 1000 replica/core per solr node.
>>
>> To solve this issue, I have read the performance guide:
>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>>
>> I noted there is a sentence on solr-cloud section:
>> "Recent Solr versions perform well with thousands of replicas."
>>
>> I want to know does it mean a single solr node can handle thousands of 
>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>>
>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>>
>> Thanks for you help.
>>



Re: Question: Solr perform well with thousands of replicas?

2019-08-30 Thread Erick Erickson
 “no registered leader” is the effect of some problem usually, not the root 
cause. In this case, for instance, you could be running out of file handles and 
see other errors like “too many open files”. That’s just one example.

One common problem is that Solr needs a lot of file handles and the system 
defaults are too low. We usually recommend you start with 65K file handles 
(ulimit) and bump up the number of processes to 65K too.

So to throw some numbers out. With 1,000 replicas, and let’s say you have 50 
segments in the index in each replica. Each segment consists of multiple files 
(I’m skipping “compound files” here as an advanced topic), so each segment has, 
let’s say, 10 files. 1,000 * 50 * 10 would require 500,000 file handles on 
your system.
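
As a quick sanity check of that arithmetic against the suggested limit (the 
per-replica numbers here are rough assumptions, not measurements):

replicas, segments_per_replica, files_per_segment = 1000, 50, 10
needed = replicas * segments_per_replica * files_per_segment
print(needed)            # 500000
print(needed > 65536)    # True: well past a 65K open-files ulimit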

Bottom line: look for other, lower-level errors in the log to try to understand 
what limit you’re running into.

All that said, there’ll be a number of “gotchas” when running that many 
replicas on a particular node, I second Jörn’s question...

Best,
Erick

> On Aug 30, 2019, at 3:18 AM, Jörn Franke  wrote:
> 
> What is the reason for this number of replicas? Solr should work fine, but 
> maybe it is worth to consolidate some collections to avoid also 
> administrative overhead.
> 
>> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
>> 
>> Hi
>> I have a solr-cloud cluster, but it's unstable when collection number is 
>> big: 1000 replica/core per solr node.
>> 
>> To solve this issue, I have read the performance guide:
>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>> 
>> I noted there is a sentence on solr-cloud section:
>> "Recent Solr versions perform well with thousands of replicas."
>> 
>> I want to know does it mean a single solr node can handle thousands of 
>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>> 
>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>> 
>> Thanks for you help.
>> 



Re: Question: Solr perform well with thousands of replicas?

2019-08-30 Thread Jörn Franke
What is the reason for this number of replicas? Solr should work fine, but 
maybe it is worth consolidating some collections to also avoid administrative 
overhead.

> Am 29.08.2019 um 05:27 schrieb Hongxu Ma :
> 
> Hi
> I have a solr-cloud cluster, but it's unstable when collection number is big: 
> 1000 replica/core per solr node.
> 
> To solve this issue, I have read the performance guide:
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
> 
> I noted there is a sentence on solr-cloud section:
> "Recent Solr versions perform well with thousands of replicas."
> 
> I want to know does it mean a single solr node can handle thousands of 
> replicas? or a solr cluster can (if so, what's the size of the cluster?)
> 
> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
> 
> Thanks for you help.
> 


Re: Question: Solr perform well with thousands of replicas?

2019-08-30 Thread Hongxu Ma
Hi guys
Thanks for your helpful replies!

More details about my env.
Cluster:
A cluster of 4 GCP (Google Cloud) hosts; each host: 16-core CPU, 60GB mem, 2TB HDD.
I set up 2 Solr nodes on each host and there are 1000+ replicas on each Solr 
node.
(Sorry for forgetting this before: with 2 Solr nodes on each host, there are 2000+ 
replicas on each host...)
ZooKeeper has 3 instances, reusing the Solr hosts (on a separate disk).
Workload:
just indexing tens of millions of records (total size near 100GB) into dozens (nearly 
100) of indexes, 30 concurrent, no search operations at the same time (I will do a 
search test later).
Error:
"unstable" means there are many Solr errors in the log and Solr requests fail,
e.g. "No registered leader was found after waiting for 4000ms , collection ..."

@ Hendrik
after seeing your reply, I realized my replica count is too big, so I adjusted it to 720 
replicas on each host (reduced the shard count), and then all my index requests were 
successful. (happy)
but I saw the JVM peak memory usage is 24GB (via the Solr web UI), which is big enough to be 
risky in the future (my JVM Xmx is 32GB).
so would you give me some guidance on reducing the memory usage? (like you 
mentioned "tuned a few caches down to a minimum")

@ Erick
I gave details above, please check.

@ Shawn
thanks for your info, that's bad news...
I hope SolrCloud can handle more collections in the future.



From: Shawn Heisey 
Sent: Thursday, August 29, 2019 21:58
To: solr-user@lucene.apache.org 
Subject: Re: Question: Solr perform well with thousands of replicas?

On 8/28/2019 9:27 PM, Hongxu Ma wrote:
> I have a solr-cloud cluster, but it's unstable when collection number is big: 
> 1000 replica/core per solr node.
>
> To solve this issue, I have read the performance guide:
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>
> I noted there is a sentence on solr-cloud section:
> "Recent Solr versions perform well with thousands of replicas."

The SolrPerformanceProblems wiki page is my work.  I only wrote that
sentence because other devs working in SolrCloud code told me that was
the case.  Based on things said by people (including your comments on
this thread), I think newer versions probably aren't any better, and
that sentence needs to be removed from the wiki page.

See this issue that I created a few years ago:

https://issues.apache.org/jira/browse/SOLR-7191

This issue was closed with a 6.3 fix version ... but nothing was
committed with a tag for the issue, so I have no idea why it was closed.
  I think the problems described there are still there in recent Solr
versions, and MIGHT be even worse than they were in 4.x and 5.x.

> I want to know does it mean a single solr node can handle thousands of 
> replicas? or a solr cluster can (if so, what's the size of the cluster?)

A single standalone Solr instance can handle lots of indexes, but Solr
startup is probably going to be slow.

No matter how many nodes there are, SolrCloud has problems with
thousands of collections or replicas due to issues with the overseer
queue getting enormous.  When I created SOLR-7191, I found that
restarting a node in a cloud with thousands of replicas (cores) can
result in a performance death spiral.

I haven't ever administered a production setup with thousands of
indexes, I've only done some single machine testing for the issue I
created.  I need to repeat it with 8.x and see what happens.  But I have
very little free time these days.

Thanks,
Shawn


Re: Question: Solr perform well with thousands of replicas?

2019-08-29 Thread Shawn Heisey

On 8/28/2019 9:27 PM, Hongxu Ma wrote:

I have a solr-cloud cluster, but it's unstable when collection number is big: 
1000 replica/core per solr node.

To solve this issue, I have read the performance guide:
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems

I noted there is a sentence on solr-cloud section:
"Recent Solr versions perform well with thousands of replicas."


The SolrPerformanceProblems wiki page is my work.  I only wrote that 
sentence because other devs working in SolrCloud code told me that was 
the case.  Based on things said by people (including your comments on 
this thread), I think newer versions probably aren't any better, and 
that sentence needs to be removed from the wiki page.


See this issue that I created a few years ago:

https://issues.apache.org/jira/browse/SOLR-7191

This issue was closed with a 6.3 fix version ... but nothing was 
committed with a tag for the issue, so I have no idea why it was closed. 
 I think the problems described there are still there in recent Solr 
versions, and MIGHT be even worse than they were in 4.x and 5.x.



I want to know does it mean a single solr node can handle thousands of 
replicas? or a solr cluster can (if so, what's the size of the cluster?)


A single standalone Solr instance can handle lots of indexes, but Solr 
startup is probably going to be slow.


No matter how many nodes there are, SolrCloud has problems with 
thousands of collections or replicas due to issues with the overseer 
queue getting enormous.  When I created SOLR-7191, I found that 
restarting a node in a cloud with thousands of replicas (cores) can 
result in a performance death spiral.


I haven't ever administered a production setup with thousands of 
indexes, I've only done some single machine testing for the issue I 
created.  I need to repeat it with 8.x and see what happens.  But I have 
very little free time these days.


Thanks,
Shawn


Re: Question: Solr perform well with thousands of replicas?

2019-08-29 Thread Erick Erickson
There are two factors:
1> the raw number of replicas on a Solr node.
2> total resources Solr needs.

You say “..it’s unstable…”. _How_ is it unstable? What symptoms are you seeing?

You might want to review: 
https://cwiki.apache.org/confluence/display/solr/UsingMailingLists

And note, as you add more cores, you put more pressure on memory, I/O, etc. So
whether it’s the raw number of cores or you’re just exhausting memory, 
overloading
your CPU, etc. is hard to say without more information.

Best,
Erick

> On Aug 29, 2019, at 1:31 AM, Hendrik Haddorp  wrote:
> 
> Hi,
> 
> we are usually using Solr Clouds with 5 nodes and up to 2000 collections
> and a replication factor of 2. So we have close to 1000 cores per node.
> That is on Solr 7.6 but I believe 7.3 worked as well. We tuned a few
> caches down to a minimum as otherwise the memory usage goes up a lot.
> The Solr UI is having some problems with a high number of collections,
> like lots of timeouts when loading the status.
> 
> Older Solr versions had problem with the overseer queue in ZooKeeper. If
> you restarted too many nodes at once then the queue got too long and
> Solr died and required some help and cleanup to start at all again.
> 
> regards,
> Hendrik
> 
> On 29.08.19 05:27, Hongxu Ma wrote:
>> Hi
>> I have a solr-cloud cluster, but it's unstable when collection number is 
>> big: 1000 replica/core per solr node.
>> 
>> To solve this issue, I have read the performance guide:
>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems
>> 
>> I noted there is a sentence on solr-cloud section:
>> "Recent Solr versions perform well with thousands of replicas."
>> 
>> I want to know does it mean a single solr node can handle thousands of 
>> replicas? or a solr cluster can (if so, what's the size of the cluster?)
>> 
>> My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)
>> 
>> Thanks for you help.
>> 
>> 
> 



Re: Question: Solr perform well with thousands of replicas?

2019-08-28 Thread Hendrik Haddorp

Hi,

we are usually using Solr Clouds with 5 nodes and up to 2000 collections
and a replication factor of 2. So we have close to 1000 cores per node.
That is on Solr 7.6 but I believe 7.3 worked as well. We tuned a few
caches down to a minimum as otherwise the memory usage goes up a lot.
The Solr UI is having some problems with a high number of collections,
like lots of timeouts when loading the status.

Older Solr versions had problems with the overseer queue in ZooKeeper. If
you restarted too many nodes at once then the queue got too long and
Solr died and required some help and cleanup to start at all again.

regards,
Hendrik

On 29.08.19 05:27, Hongxu Ma wrote:

Hi
I have a solr-cloud cluster, but it's unstable when collection number is big: 
1000 replica/core per solr node.

To solve this issue, I have read the performance guide:
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems

I noted there is a sentence on solr-cloud section:
"Recent Solr versions perform well with thousands of replicas."

I want to know does it mean a single solr node can handle thousands of 
replicas? or a solr cluster can (if so, what's the size of the cluster?)

My solr version is 7.3.1 and 6.6.2 (looks they are the same in performance)

Thanks for your help.






Re: question about solrCloud joining

2019-08-23 Thread Mikhail Khludnev
Raised  https://issues.apache.org/jira/browse/SOLR-13716

On Wed, Aug 21, 2019 at 10:37 AM Lisheng Wang 
wrote:

> Hi  Mikhail,
>
> okay.
>
> below is 2 requests:
>
> both are select from "movieDirectors" collection join "movies" collection
> which has 2 shards.
>
>
> http://localhost:8983/solr/movieDirectors/select?fq=%7B!join%20from%3Ddirector_id%20fromIndex%3Dmovies%20to%3Did%7Dtitle%3A%22Dunkirk%22=*%3A*
>
> http://localhost:8984/solr/movieDirectors/select?fq=%7B!join%20from%3Ddirector_id%20fromIndex%3Dmovies%20to%3Did%7Dtitle%3A%22Dunkirk%22=*%3A*
>
> first request can get result without Exception, response is following
>
> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":3, "params":{
> "
> q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> to=id}title:\"Dunkirk\"", "_":"1566261450613"}}, "response":{"numFound":1,"
> start":0,"docs":[ { "id":"1", "name":"Christopher Nolan", "has_oscar":true,
> "_version_":1642343436642156544}] }}
>
> second request will get Exception
> { "responseHeader":{ "zkConnected":true, "status":400, "QTime":29,
> "params":{
> "q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> to=id}title:\"Dunkirk\"", "_":"1566261620152"}}, "error":{ "metadata":[
> "error-class","org.apache.solr.common.SolrException", "root-error-class",
> "org.apache.solr.common.SolrException"], "msg":"SolrCloud join: multiple
> shards not yet supported movies", "code":400}}
>
> i don't know why get 2 different result when you request from different
> node, i think both should get Exception with "SolrCloud join: multiple
> shards not yet supported movies".
>
> Best,
> Lisheng
>
>
> Mikhail Khludnev  于2019年8月21日周三 下午3:19写道:
>
> > Ok. Still hard to follow. Can you clarify which collection you run these
> > queries on?
> > Collection name (url segment before /select) is more significant than any
> > port (jvm) identity.
> >
> > On Wed, Aug 21, 2019 at 5:14 AM Lisheng Wang 
> > wrote:
> >
> > > Hi Mikhail
> > >
> > > Thanks for your response,  but question is not related to "title:Get
> > Out",
> > > maybe i did not describe clearly.
> > >
> > > I knew solrCloud joining is not working in index which is splited to
> > > multiple shards.
> > >
> > > but why i run "*{!join from=director_id fromIndex=movies
> > > to=id}title:"Dunkirk"*" on 8984 (fromIndex=movies, movies has 2 shards)
> > i
> > > got exception "SolrCloud join: multiple shards not yet supported
> movies"
> > >
> > > but when run on 8983, i got result but it is incorrect without above
> > > exception. i think should get same exception no matter run joining on
> > 8983
> > > or 8984.
> > >
> > > Not sure my explanation is clear?
> > >
> > > Please kindly let me know if you have any question.
> > >
> > > Thanks!
> > >
> > > Lisheng
> > >
> > >
> > >
> > > Mikhail Khludnev  于2019年8月21日周三 上午4:41写道:
> > >
> > > > Hello, Lisheng.
> > > > I barely follow, but couldn't the space symbol in "title:Get Out"
> > > > cause the problem
> > > > ?
> > > > Check debugQuery and nested query in local param.
> > > >
> > > >
> > > > On Tue, Aug 20, 2019 at 6:35 PM Lisheng Wang <
> wanglishen...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Erick
> > > > >
> > > > > Thanks for your quick response and remaining me about attachment
> > issue.
> > > > >
> > > > > Yes, i run on 2 different jvms that not related to if they are on
> > same
> > > > > machine or not.
> > > > >
> > > > > let me describe my scenario, i have two collection:
> > > > >
> > > > > i start 2 nodes on my laptop on 2 different JVM, ports are 8983 and
> > > 8984.
> > > > >
> > > > > 1. movieDirectors: 1 shard, 2 replica, master is on 8984, slave is
> on
> > > > 8983
> > > > > 2. movies: 2 shard, 1 replica/shardshard1 is on 8983, shard2 is
> > on
> > > > > 8984.
> > > > >
> > > > > collection movieDirectors has 2 docs:
> > > > > {
> > > > > "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > > > 1642343781358370816
> > > > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > > > 1642343828930166784
> > > > > }
> > > > > collection movies has 2 docs too:
> > > > > { "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > > > 1642343781358370816
> > > > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > > > 1642343828930166784
> > > > > }
> > > > > everything is ok when i run query with "{!join from=id
> > > > > fromIndex=movieDirectors to=director_id}has_oscar:true" on both
> 8983
> > > and
> > > > > 8984, i can got expected result:
> > > > > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":79,
> > > > "params":{
> > > > > "q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
> > > > > to=director_id}has_oscar:true", "_":"1566313944099"}},
> > > > > "response":{"numFound
> > > > > ":2,"start":0,"maxScore":1.0,"docs":[ { "id":"1",
> "title":"Dunkirk",
> > "
> > > > > director_id":"1", "_version_":1642343781358370816}, { "id":"2",
> > > > > "title":"Get
> > > > > Out", "director_id":"2", 

Re: question about solrCloud joining

2019-08-21 Thread Mikhail Khludnev
I'm not sure, but it might be an issue. It makes sense to add a negative test
and assert the exception at
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/cloud/DistribJoinFromCollectionTest.java
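A rough sketch of what such a negative test could look like (the method name and
scaffolding are illustrative assumptions, not actual DistribJoinFromCollectionTest
code; it assumes a SolrCloudTestCase-style cluster where "movies" has two shards
and "movieDirectors" is the "to" collection):

// Hypothetical negative test: a cross-collection join whose "from" collection is
// sharded should be rejected with a clear error, regardless of which node is queried.
@Test
public void testJoinFromMultiShardCollectionIsRejected() throws Exception {
  SolrQuery q = new SolrQuery("*:*");
  q.addFilterQuery("{!join from=director_id fromIndex=movies to=id}title:\"Dunkirk\"");
  SolrException e = expectThrows(SolrException.class,
      () -> cluster.getSolrClient().query("movieDirectors", q));
  assertTrue(e.getMessage().contains("multiple shards not yet supported"));
}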

On Wed, Aug 21, 2019 at 10:37 AM Lisheng Wang 
wrote:

> Hi  Mikhail,
>
> okay.
>
> below is 2 requests:
>
> both are select from "movieDirectors" collection join "movies" collection
> which has 2 shards.
>
>
> http://localhost:8983/solr/movieDirectors/select?fq=%7B!join%20from%3Ddirector_id%20fromIndex%3Dmovies%20to%3Did%7Dtitle%3A%22Dunkirk%22=*%3A*
>
> http://localhost:8984/solr/movieDirectors/select?fq=%7B!join%20from%3Ddirector_id%20fromIndex%3Dmovies%20to%3Did%7Dtitle%3A%22Dunkirk%22=*%3A*
>
> first request can get result without Exception, response is following
>
> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":3, "params":{
> "
> q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> to=id}title:\"Dunkirk\"", "_":"1566261450613"}}, "response":{"numFound":1,"
> start":0,"docs":[ { "id":"1", "name":"Christopher Nolan", "has_oscar":true,
> "_version_":1642343436642156544}] }}
>
> second request will get Exception
> { "responseHeader":{ "zkConnected":true, "status":400, "QTime":29,
> "params":{
> "q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> to=id}title:\"Dunkirk\"", "_":"1566261620152"}}, "error":{ "metadata":[
> "error-class","org.apache.solr.common.SolrException", "root-error-class",
> "org.apache.solr.common.SolrException"], "msg":"SolrCloud join: multiple
> shards not yet supported movies", "code":400}}
>
> i don't know why get 2 different result when you request from different
> node, i think both should get Exception with "SolrCloud join: multiple
> shards not yet supported movies".
>
> Best,
> Lisheng
>
>
> Mikhail Khludnev  于2019年8月21日周三 下午3:19写道:
>
> > Ok. Still hard to follow. Can you clarify which collection you run these
> > queries on?
> > Collection name (url segment before /select) is more significant than any
> > port (jvm) identity.
> >
> > On Wed, Aug 21, 2019 at 5:14 AM Lisheng Wang 
> > wrote:
> >
> > > Hi Mikhail
> > >
> > > Thanks for your response,  but question is not related to "title:Get
> > Out",
> > > maybe i did not describe clearly.
> > >
> > > I knew solrCloud joining is not working in index which is splited to
> > > multiple shards.
> > >
> > > but why i run "*{!join from=director_id fromIndex=movies
> > > to=id}title:"Dunkirk"*" on 8984 (fromIndex=movies, movies has 2 shards)
> > i
> > > got exception "SolrCloud join: multiple shards not yet supported
> movies"
> > >
> > > but when run on 8983, i got result but it is incorrect without above
> > > exception. i think should get same exception no matter run joining on
> > 8983
> > > or 8984.
> > >
> > > Not sure my explanation is clear?
> > >
> > > Please kindly let me know if you have any question.
> > >
> > > Thanks!
> > >
> > > Lisheng
> > >
> > >
> > >
> > > Mikhail Khludnev  于2019年8月21日周三 上午4:41写道:
> > >
> > > > Hello, Lisheng.
> > > > I barely follow, but couldn't the space symbol in "title:Get Out"
> > > > cause the problem
> > > > ?
> > > > Check debugQuery and nested query in local param.
> > > >
> > > >
> > > > On Tue, Aug 20, 2019 at 6:35 PM Lisheng Wang <
> wanglishen...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Erick
> > > > >
> > > > > Thanks for your quick response and remaining me about attachment
> > issue.
> > > > >
> > > > > Yes, i run on 2 different jvms that not related to if they are on
> > same
> > > > > machine or not.
> > > > >
> > > > > let me describe my scenario, i have two collection:
> > > > >
> > > > > i start 2 nodes on my laptop on 2 different JVM, ports are 8983 and
> > > 8984.
> > > > >
> > > > > 1. movieDirectors: 1 shard, 2 replica, master is on 8984, slave is
> on
> > > > 8983
> > > > > 2. movies: 2 shard, 1 replica/shardshard1 is on 8983, shard2 is
> > on
> > > > > 8984.
> > > > >
> > > > > collection movieDirectors has 2 docs:
> > > > > {
> > > > > "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > > > 1642343781358370816
> > > > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > > > 1642343828930166784
> > > > > }
> > > > > collection movies has 2 docs too:
> > > > > { "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > > > 1642343781358370816
> > > > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > > > 1642343828930166784
> > > > > }
> > > > > everything is ok when i run query with "{!join from=id
> > > > > fromIndex=movieDirectors to=director_id}has_oscar:true" on both
> 8983
> > > and
> > > > > 8984, i can got expected result:
> > > > > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":79,
> > > > "params":{
> > > > > "q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
> > > > > to=director_id}has_oscar:true", "_":"1566313944099"}},
> > > > > "response":{"numFound
> > > > > ":2,"start":0,"maxScore":1.0,"docs":[ { 

Re: question about solrCloud joining

2019-08-21 Thread Lisheng Wang
Hi  Mikhail,

okay.

below are 2 requests:

both select from the "movieDirectors" collection, joining the "movies" collection,
which has 2 shards.

http://localhost:8983/solr/movieDirectors/select?fq=%7B!join%20from%3Ddirector_id%20fromIndex%3Dmovies%20to%3Did%7Dtitle%3A%22Dunkirk%22&q=*%3A*
http://localhost:8984/solr/movieDirectors/select?fq=%7B!join%20from%3Ddirector_id%20fromIndex%3Dmovies%20to%3Did%7Dtitle%3A%22Dunkirk%22&q=*%3A*

the first request gets a result without an exception; the response is the following:

{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":3, "params":{ "
q":"*:*", "fq":"{!join from=director_id fromIndex=movies
to=id}title:\"Dunkirk\"", "_":"1566261450613"}}, "response":{"numFound":1,"
start":0,"docs":[ { "id":"1", "name":"Christopher Nolan", "has_oscar":true,
"_version_":1642343436642156544}] }}

the second request gets an exception:
{ "responseHeader":{ "zkConnected":true, "status":400, "QTime":29, "params":{
"q":"*:*", "fq":"{!join from=director_id fromIndex=movies
to=id}title:\"Dunkirk\"", "_":"1566261620152"}}, "error":{ "metadata":[
"error-class","org.apache.solr.common.SolrException", "root-error-class",
"org.apache.solr.common.SolrException"], "msg":"SolrCloud join: multiple
shards not yet supported movies", "code":400}}

I don't know why I get 2 different results when requesting from different
nodes; I think both should get the exception "SolrCloud join: multiple
shards not yet supported movies".

Best,
Lisheng


Mikhail Khludnev wrote on Wednesday, August 21, 2019 at 3:19 PM:

> Ok. Still hard to follow. Can you clarify which collection you run these
> queries on?
> Collection name (url segment before /select) is more significant than any
> port (jvm) identity.
>
> On Wed, Aug 21, 2019 at 5:14 AM Lisheng Wang 
> wrote:
>
> > Hi Mikhail
> >
> > Thanks for your response,  but question is not related to "title:Get
> Out",
> > maybe i did not describe clearly.
> >
> > I knew solrCloud joining is not working in index which is splited to
> > multiple shards.
> >
> > but why i run "*{!join from=director_id fromIndex=movies
> > to=id}title:"Dunkirk"*" on 8984 (fromIndex=movies, movies has 2 shards)
> i
> > got exception "SolrCloud join: multiple shards not yet supported movies"
> >
> > but when run on 8983, i got result but it is incorrect without above
> > exception. i think should get same exception no matter run joining on
> 8983
> > or 8984.
> >
> > Not sure my explanation is clear?
> >
> > Please kindly let me know if you have any question.
> >
> > Thanks!
> >
> > Lisheng
> >
> >
> >
> > Mikhail Khludnev  于2019年8月21日周三 上午4:41写道:
> >
> > > Hello, Lisheng.
> > > I barely follow, but couldn't the space symbol in "title:Get Out"
> > > cause the problem
> > > ?
> > > Check debugQuery and nested query in local param.
> > >
> > >
> > > On Tue, Aug 20, 2019 at 6:35 PM Lisheng Wang 
> > > wrote:
> > >
> > > > Hi Erick
> > > >
> > > > Thanks for your quick response and remaining me about attachment
> issue.
> > > >
> > > > Yes, i run on 2 different jvms that not related to if they are on
> same
> > > > machine or not.
> > > >
> > > > let me describe my scenario, i have two collection:
> > > >
> > > > i start 2 nodes on my laptop on 2 different JVM, ports are 8983 and
> > 8984.
> > > >
> > > > 1. movieDirectors: 1 shard, 2 replica, master is on 8984, slave is on
> > > 8983
> > > > 2. movies: 2 shard, 1 replica/shardshard1 is on 8983, shard2 is
> on
> > > > 8984.
> > > >
> > > > collection movieDirectors has 2 docs:
> > > > {
> > > > "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > > 1642343781358370816
> > > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > > 1642343828930166784
> > > > }
> > > > collection movies has 2 docs too:
> > > > { "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > > 1642343781358370816
> > > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > > 1642343828930166784
> > > > }
> > > > everything is ok when i run query with "{!join from=id
> > > > fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983
> > and
> > > > 8984, i can got expected result:
> > > > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":79,
> > > "params":{
> > > > "q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
> > > > to=director_id}has_oscar:true", "_":"1566313944099"}},
> > > > "response":{"numFound
> > > > ":2,"start":0,"maxScore":1.0,"docs":[ { "id":"1", "title":"Dunkirk",
> "
> > > > director_id":"1", "_version_":1642343781358370816}, { "id":"2",
> > > > "title":"Get
> > > > Out", "director_id":"2", "_version_":1642343828930166784}] }}
> > > > but when i run "{!join from=director_id fromIndex=movies
> > > > to=id}title:"Dunkirk"" on 8983 got 1 doc,
> > > >  if i filter by "title:Get Out", i got nothing.  i understood "Get
> Out"
> > > is
> > > > not exist in 8983.
> > > > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":3,
> > > "params":{
> > > > "
> > > > q":"*:*", "fq":"{!join from=director_id 

Re: question about solrCloud joining

2019-08-21 Thread Mikhail Khludnev
Ok. Still hard to follow. Can you clarify which collection you run these
queries on?
Collection name (url segment before /select) is more significant than any
port (jvm) identity.

On Wed, Aug 21, 2019 at 5:14 AM Lisheng Wang 
wrote:

> Hi Mikhail
>
> Thanks for your response,  but question is not related to "title:Get Out",
> maybe i did not describe clearly.
>
> I knew solrCloud joining is not working in index which is splited to
> multiple shards.
>
> but why i run "*{!join from=director_id fromIndex=movies
> to=id}title:"Dunkirk"*" on 8984 (fromIndex=movies, movies has 2 shards)  i
> got exception "SolrCloud join: multiple shards not yet supported movies"
>
> but when run on 8983, i got result but it is incorrect without above
> exception. i think should get same exception no matter run joining on 8983
> or 8984.
>
> Not sure my explanation is clear?
>
> Please kindly let me know if you have any question.
>
> Thanks!
>
> Lisheng
>
>
>
> Mikhail Khludnev  于2019年8月21日周三 上午4:41写道:
>
> > Hello, Lisheng.
> > I barely follow, but couldn't the space symbol in "title:Get Out"
> > cause the problem
> > ?
> > Check debugQuery and nested query in local param.
> >
> >
> > On Tue, Aug 20, 2019 at 6:35 PM Lisheng Wang 
> > wrote:
> >
> > > Hi Erick
> > >
> > > Thanks for your quick response and remaining me about attachment issue.
> > >
> > > Yes, i run on 2 different jvms that not related to if they are on same
> > > machine or not.
> > >
> > > let me describe my scenario, i have two collection:
> > >
> > > i start 2 nodes on my laptop on 2 different JVM, ports are 8983 and
> 8984.
> > >
> > > 1. movieDirectors: 1 shard, 2 replica, master is on 8984, slave is on
> > 8983
> > > 2. movies: 2 shard, 1 replica/shardshard1 is on 8983, shard2 is on
> > > 8984.
> > >
> > > collection movieDirectors has 2 docs:
> > > {
> > > "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > 1642343781358370816
> > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > 1642343828930166784
> > > }
> > > collection movies has 2 docs too:
> > > { "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > > 1642343781358370816
> > > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > > 1642343828930166784
> > > }
> > > everything is ok when i run query with "{!join from=id
> > > fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983
> and
> > > 8984, i can got expected result:
> > > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":79,
> > "params":{
> > > "q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
> > > to=director_id}has_oscar:true", "_":"1566313944099"}},
> > > "response":{"numFound
> > > ":2,"start":0,"maxScore":1.0,"docs":[ { "id":"1", "title":"Dunkirk", "
> > > director_id":"1", "_version_":1642343781358370816}, { "id":"2",
> > > "title":"Get
> > > Out", "director_id":"2", "_version_":1642343828930166784}] }}
> > > but when i run "{!join from=director_id fromIndex=movies
> > > to=id}title:"Dunkirk"" on 8983 got 1 doc,
> > >  if i filter by "title:Get Out", i got nothing.  i understood "Get Out"
> > is
> > > not exist in 8983.
> > > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":3,
> > "params":{
> > > "
> > > q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> > > to=id}title:\"Dunkirk\"", "_":"1566261450613"}},
> > "response":{"numFound":1,"
> > > start":0,"docs":[ { "id":"1", "name":"Christopher Nolan",
> > "has_oscar":true,
> > > "_version_":1642343436642156544}] }}
> > >
> > > but question is coming, when i run "{!join from=director_id
> > > fromIndex=movies to=id}title:"Dunkirk"" on 8984, i got "SolrCloud join:
> > > multiple shards not yet supported movies"
> > > no matter what filter value is.
> > >
> > > i found following code:
> > >
> > > private static String findLocalReplicaForFromIndex(ZkController
> > > zkController, String fromIndex) {
> > >   String fromReplica = null;
> > >
> > >   String nodeName = zkController.getNodeName();
> > >   for (Slice slice :
> > >
> > >
> >
> zkController.getClusterState().getCollection(fromIndex).getActiveSlicesArr())
> > > {
> > > if (fromReplica != null)
> > >   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
> > >   "SolrCloud join: multiple shards not yet supported " +
> > > fromIndex);
> > > for (Replica replica : slice.getReplicas()) {
> > >   if (replica.getNodeName().equals(nodeName)) {
> > > fromReplica = replica.getStr(ZkStateReader.CORE_NAME_PROP);
> > > // found local replica, but is it Active?
> > > if (replica.getState() != Replica.State.ACTIVE)
> > >   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
> > >   "SolrCloud join: "+fromIndex+" has a local replica
> > > ("+fromReplica+
> > >   ") on "+nodeName+", but it is "+replica.getState());
> > >
> > > break;
> > >   }
> > > }
> > >   }
> > >
> > >   if (fromReplica == null)
> > > 

Re: question about solrCloud joining

2019-08-20 Thread Lisheng Wang
Hi Mikhail

Thanks for your response, but the question is not related to "title:Get Out";
maybe I did not describe it clearly.

I know SolrCloud joining does not work when the "from" index is split into
multiple shards.

But when I run "{!join from=director_id fromIndex=movies
to=id}title:"Dunkirk"" on 8984 (fromIndex=movies, and movies has 2 shards) I
get the exception "SolrCloud join: multiple shards not yet supported movies".

But when I run it on 8983, I get a result, but it is incorrect and there is no
exception. I think I should get the same exception no matter whether I run the
join on 8983 or 8984.

I am not sure if my explanation is clear.

Please kindly let me know if you have any questions.

Thanks!

Lisheng



Mikhail Khludnev wrote on Wednesday, August 21, 2019 at 4:41 AM:

> Hello, Lisheng.
> I barely follow, but couldn't the space symbol in "title:Get Out"
> cause the problem
> ?
> Check debugQuery and nested query in local param.
>
>
> On Tue, Aug 20, 2019 at 6:35 PM Lisheng Wang 
> wrote:
>
> > Hi Erick
> >
> > Thanks for your quick response and remaining me about attachment issue.
> >
> > Yes, i run on 2 different jvms that not related to if they are on same
> > machine or not.
> >
> > let me describe my scenario, i have two collection:
> >
> > i start 2 nodes on my laptop on 2 different JVM, ports are 8983 and 8984.
> >
> > 1. movieDirectors: 1 shard, 2 replica, master is on 8984, slave is on
> 8983
> > 2. movies: 2 shard, 1 replica/shardshard1 is on 8983, shard2 is on
> > 8984.
> >
> > collection movieDirectors has 2 docs:
> > {
> > "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > 1642343781358370816
> > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > 1642343828930166784
> > }
> > collection movies has 2 docs too:
> > { "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> > 1642343781358370816
> > }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> > 1642343828930166784
> > }
> > everything is ok when i run query with "{!join from=id
> > fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983 and
> > 8984, i can got expected result:
> > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":79,
> "params":{
> > "q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
> > to=director_id}has_oscar:true", "_":"1566313944099"}},
> > "response":{"numFound
> > ":2,"start":0,"maxScore":1.0,"docs":[ { "id":"1", "title":"Dunkirk", "
> > director_id":"1", "_version_":1642343781358370816}, { "id":"2",
> > "title":"Get
> > Out", "director_id":"2", "_version_":1642343828930166784}] }}
> > but when i run "{!join from=director_id fromIndex=movies
> > to=id}title:"Dunkirk"" on 8983 got 1 doc,
> >  if i filter by "title:Get Out", i got nothing.  i understood "Get Out"
> is
> > not exist in 8983.
> > { "responseHeader":{ "zkConnected":true, "status":0, "QTime":3,
> "params":{
> > "
> > q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> > to=id}title:\"Dunkirk\"", "_":"1566261450613"}},
> "response":{"numFound":1,"
> > start":0,"docs":[ { "id":"1", "name":"Christopher Nolan",
> "has_oscar":true,
> > "_version_":1642343436642156544}] }}
> >
> > but question is coming, when i run "{!join from=director_id
> > fromIndex=movies to=id}title:"Dunkirk"" on 8984, i got "SolrCloud join:
> > multiple shards not yet supported movies"
> > no matter what filter value is.
> >
> > i found following code:
> >
> > private static String findLocalReplicaForFromIndex(ZkController
> > zkController, String fromIndex) {
> >   String fromReplica = null;
> >
> >   String nodeName = zkController.getNodeName();
> >   for (Slice slice :
> >
> >
> zkController.getClusterState().getCollection(fromIndex).getActiveSlicesArr())
> > {
> > if (fromReplica != null)
> >   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
> >   "SolrCloud join: multiple shards not yet supported " +
> > fromIndex);
> > for (Replica replica : slice.getReplicas()) {
> >   if (replica.getNodeName().equals(nodeName)) {
> > fromReplica = replica.getStr(ZkStateReader.CORE_NAME_PROP);
> > // found local replica, but is it Active?
> > if (replica.getState() != Replica.State.ACTIVE)
> >   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
> >   "SolrCloud join: "+fromIndex+" has a local replica
> > ("+fromReplica+
> >   ") on "+nodeName+", but it is "+replica.getState());
> >
> > break;
> >   }
> > }
> >   }
> >
> >   if (fromReplica == null)
> > throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
> > "SolrCloud join: No active replicas for "+fromIndex+
> > " found in node " + nodeName);
> >
> >   return fromReplica;
> > }
> >
> >
> > when i run joining from movies on 8983, slice length is 2 as movies have
> 2
> > shards. "fromReplica " was assigned in second cycle,  because
> zkController
> > name is 8983 and replica name is 8984 in first cycle.
> >
> > but when run on 8984, "fromReplica" was 

Re: question about solrCloud joining

2019-08-20 Thread Mikhail Khludnev
Hello, Lisheng.
I barely follow, but couldn't the space symbol in "title:Get Out"
cause the problem?
Check debugQuery and nested query in local param.
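For example, something along these lines (illustrative only; the extra parameter name
joinq is arbitrary, and the values would be URL-encoded in a real request):

q=*:*
fq={!join from=director_id fromIndex=movies to=id v=$joinq}
joinq=title:"Get Out"
debugQuery=true

The v=$joinq indirection keeps the quoted phrase out of the local-params syntax, and
debugQuery=true shows how the filter query was actually parsed.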


On Tue, Aug 20, 2019 at 6:35 PM Lisheng Wang 
wrote:

> Hi Erick
>
> Thanks for your quick response and remaining me about attachment issue.
>
> Yes, i run on 2 different jvms that not related to if they are on same
> machine or not.
>
> let me describe my scenario, i have two collection:
>
> i start 2 nodes on my laptop on 2 different JVM, ports are 8983 and 8984.
>
> 1. movieDirectors: 1 shard, 2 replica, master is on 8984, slave is on 8983
> 2. movies: 2 shard, 1 replica/shardshard1 is on 8983, shard2 is on
> 8984.
>
> collection movieDirectors has 2 docs:
> {
> "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> 1642343781358370816
> }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> 1642343828930166784
> }
> collection movies has 2 docs too:
> { "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
> 1642343781358370816
> }, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
> 1642343828930166784
> }
> everything is ok when i run query with "{!join from=id
> fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983 and
> 8984, i can got expected result:
> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":79, "params":{
> "q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
> to=director_id}has_oscar:true", "_":"1566313944099"}},
> "response":{"numFound
> ":2,"start":0,"maxScore":1.0,"docs":[ { "id":"1", "title":"Dunkirk", "
> director_id":"1", "_version_":1642343781358370816}, { "id":"2",
> "title":"Get
> Out", "director_id":"2", "_version_":1642343828930166784}] }}
> but when i run "{!join from=director_id fromIndex=movies
> to=id}title:"Dunkirk"" on 8983 got 1 doc,
>  if i filter by "title:Get Out", i got nothing.  i understood "Get Out" is
> not exist in 8983.
> { "responseHeader":{ "zkConnected":true, "status":0, "QTime":3, "params":{
> "
> q":"*:*", "fq":"{!join from=director_id fromIndex=movies
> to=id}title:\"Dunkirk\"", "_":"1566261450613"}}, "response":{"numFound":1,"
> start":0,"docs":[ { "id":"1", "name":"Christopher Nolan", "has_oscar":true,
> "_version_":1642343436642156544}] }}
>
> but question is coming, when i run "{!join from=director_id
> fromIndex=movies to=id}title:"Dunkirk"" on 8984, i got "SolrCloud join:
> multiple shards not yet supported movies"
> no matter what filter value is.
>
> i found following code:
>
> private static String findLocalReplicaForFromIndex(ZkController
> zkController, String fromIndex) {
>   String fromReplica = null;
>
>   String nodeName = zkController.getNodeName();
>   for (Slice slice :
>
> zkController.getClusterState().getCollection(fromIndex).getActiveSlicesArr())
> {
> if (fromReplica != null)
>   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
>   "SolrCloud join: multiple shards not yet supported " +
> fromIndex);
> for (Replica replica : slice.getReplicas()) {
>   if (replica.getNodeName().equals(nodeName)) {
> fromReplica = replica.getStr(ZkStateReader.CORE_NAME_PROP);
> // found local replica, but is it Active?
> if (replica.getState() != Replica.State.ACTIVE)
>   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
>   "SolrCloud join: "+fromIndex+" has a local replica
> ("+fromReplica+
>   ") on "+nodeName+", but it is "+replica.getState());
>
> break;
>   }
> }
>   }
>
>   if (fromReplica == null)
> throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
> "SolrCloud join: No active replicas for "+fromIndex+
> " found in node " + nodeName);
>
>   return fromReplica;
> }
>
>
> when i run joining from movies on 8983, slice length is 2 as movies have 2
> shards. "fromReplica " was assigned in second cycle,  because zkController
> name is 8983 and replica name is 8984 in first cycle.
>
> but when run on 8984, "fromReplica" was assigned in first cycle, because
> zkController name isand replica name both are 8984 in first cycle, so throw
> "SolrCloud join: multiple shards not yet supported" in second cycle.
>
> Thanks for your patience, it's too long. i'm confused about why use this
> way to judge "multiple shards", because the result is also wrong running on
> 8983 even if didnt throw exception. why dont use  slice length>1 to judge
> "multiple shards" ? or maybe have other better way?
>
> Please advise.
>
> Thanks in advance!
>
> Erick Erickson  于2019年8月20日周二 下午7:39写道:
>
> > None of your images came through, the mail server aggressively strips
> > attachments. You’ll have to put them somewhere and provide a link.
> >
> > Given that, I’m guessing without much data so this may be totally
> > misguided. You mention ports 8984 and 8984. Assuming those are two
> > different Solr JVMs, the fact that they’re running on the same machine is
> > irrelevant; As far as SolrCloud 

Re: question about solrCloud joining

2019-08-20 Thread Lisheng Wang
Hi Erick

Thanks for your quick response and for reminding me about the attachment issue.

Yes, I run on 2 different JVMs; that is not related to whether they are on the same
machine or not.

let me describe my scenario, I have two collections:

I start 2 nodes on my laptop in 2 different JVMs; the ports are 8983 and 8984.

1. movieDirectors: 1 shard, 2 replicas; master is on 8984, slave is on 8983
2. movies: 2 shards, 1 replica/shard; shard1 is on 8983, shard2 is on 8984.

collection movieDirectors has 2 docs:
{
"id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
1642343781358370816
}, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
1642343828930166784
}
collection movies has 2 docs too:
{ "id":"1", "title":"Dunkirk", "director_id":"1", "_version_":
1642343781358370816
}, { "id":"2", "title":"Get Out", "director_id":"2", "_version_":
1642343828930166784
}
everything is OK when I run the query with "{!join from=id
fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983 and
8984; I get the expected result:
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":79, "params":{
"q":"*:*", "fq":"{!join from=id fromIndex=movieDirectors
to=director_id}has_oscar:true", "_":"1566313944099"}}, "response":{"numFound
":2,"start":0,"maxScore":1.0,"docs":[ { "id":"1", "title":"Dunkirk", "
director_id":"1", "_version_":1642343781358370816}, { "id":"2", "title":"Get
Out", "director_id":"2", "_version_":1642343828930166784}] }}
but when i run "{!join from=director_id fromIndex=movies
to=id}title:"Dunkirk"" on 8983 got 1 doc,
 if i filter by "title:Get Out", i got nothing.  i understood "Get Out" is
not exist in 8983.
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":3, "params":{ "
q":"*:*", "fq":"{!join from=director_id fromIndex=movies
to=id}title:\"Dunkirk\"", "_":"1566261450613"}}, "response":{"numFound":1,"
start":0,"docs":[ { "id":"1", "name":"Christopher Nolan", "has_oscar":true,
"_version_":1642343436642156544}] }}

but here the question comes: when I run "{!join from=director_id
fromIndex=movies to=id}title:"Dunkirk"" on 8984, I get "SolrCloud join:
multiple shards not yet supported movies",
no matter what the filter value is.

I found the following code:

private static String findLocalReplicaForFromIndex(ZkController
zkController, String fromIndex) {
  String fromReplica = null;

  String nodeName = zkController.getNodeName();
  for (Slice slice :
zkController.getClusterState().getCollection(fromIndex).getActiveSlicesArr())
{
if (fromReplica != null)
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
  "SolrCloud join: multiple shards not yet supported " + fromIndex);
for (Replica replica : slice.getReplicas()) {
  if (replica.getNodeName().equals(nodeName)) {
fromReplica = replica.getStr(ZkStateReader.CORE_NAME_PROP);
// found local replica, but is it Active?
if (replica.getState() != Replica.State.ACTIVE)
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
  "SolrCloud join: "+fromIndex+" has a local replica ("+fromReplica+
  ") on "+nodeName+", but it is "+replica.getState());

break;
  }
}
  }

  if (fromReplica == null)
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"SolrCloud join: No active replicas for "+fromIndex+
" found in node " + nodeName);

  return fromReplica;
}


when I run the join from movies on 8983, the slice length is 2 because movies has 2
shards. "fromReplica" is assigned in the second iteration, because the zkController
name is 8983 and the replica's node name is 8984 in the first iteration.

but when run on 8984, "fromReplica" is assigned in the first iteration, because the
zkController name and the replica's node name are both 8984 in the first iteration,
so it throws "SolrCloud join: multiple shards not yet supported" in the second iteration.

Thanks for your patience, this got long. I'm confused about why this approach is
used to judge "multiple shards", because the result is also wrong when running on
8983 even though it doesn't throw an exception. Why not use slice length > 1 to judge
"multiple shards"? Or maybe there is a better way?

Please advise.

Thanks in advance!

Erick Erickson wrote on Tuesday, August 20, 2019 at 7:39 PM:

> None of your images came through, the mail server aggressively strips
> attachments. You’ll have to put them somewhere and provide a link.
>
> Given that, I’m guessing without much data so this may be totally
> misguided. You mention ports 8984 and 8984. Assuming those are two
> different Solr JVMs, the fact that they’re running on the same machine is
> irrelevant; As far as SolrCloud is concerned, they are two separate
> machines. Your directors collection must be completely resident on both
> Solr instances for cross-collection join to work.
>
> Best,
> Erick
>
> > On Aug 19, 2019, at 9:39 PM, 王立生  wrote:
> >
> > Hello,
> >
> > I have a question about solrCloud joining. i knew solrCloud joining can
> do join only when index is  not splited to shards, but when i test it, i
> found a problem which make me 

Re: question about solrCloud joining

2019-08-20 Thread Erick Erickson
None of your images came through, the mail server aggressively strips 
attachments. You’ll have to put them somewhere and provide a link.

Given that, I’m guessing without much data so this may be totally misguided. 
You mention ports 8984 and 8984. Assuming those are two different Solr JVMs, 
the fact that they’re running on the same machine is irrelevant; As far as 
SolrCloud is concerned, they are two separate machines. Your directors 
collection must be completely resident on both Solr instances for 
cross-collection join to work.

Best,
Erick

> On Aug 19, 2019, at 9:39 PM, 王立生  wrote:
> 
> Hello,
> 
> I have a question about solrCloud joining. i knew solrCloud joining can do 
> join only when index is  not splited to shards, but when i test it, i found a 
> problem which make me confused. 
> 
> i tested it on version 8.2
> 
> assuming i have 2 collections like sample about "joining" on solr offcial 
> website,
> 
> one collection called "movies", another called "movieDirectors".
> 
> movies's fields: id, title, director_id
> movieDirectors's fields: id, name, has_oscar
> 
> the information of shards and replicas as below, i started two nodes on my 
> laptop:
> 
>  moviesDirectors have 2 docs:
> 
> movies also have 2 docs:
> 
> everything is ok when i run query with "{!join from=id 
> fromIndex=movieDirectors to=director_id}has_oscar:true" on both 8983 and 
> 8984, i can got expacted result:
> 
> but when i run "{!join from=director_id fromIndex=movies 
> to=id}title:"Dunkirk"" on 8983
> got 1 doc and if i filter by "title:Get Out", i got nothing.  i understood 
> "Get Out" is not exist in 8983.
> 
> 
> but question is coming, when i run "{!join from=director_id fromIndex=movies 
> to=id}title:"Dunkirk"" on 8984, i got "SolrCloud join: multiple shards not 
> yet supported movies"
> no matter what filter value is.
> 
> i found following code:
> 
> 
> when i run joining from movies on 8983, slice length is 2 as movies have 2 
> shards. "fromReplica " was assigned in second cycle,  because zkController 
> name is 8983 and replica name is 8984 in first cycle.
> 
> but when run on 8984, "fromReplica" was assigned in first cycle, because 
> zkController name isand replica name both are 8984 in first cycle, so throw 
> "SolrCloud join: multiple shards not yet supported" in second cycle.
> 
> Thanks for your patience, it's too long. i'm confused about why use this way 
> to judge "multiple shards", because the result is also wrong running on 8983 
> even if didnt throw exception. why dont use  slice length>1 to judge 
> "multiple shards" ? or maybe have other better way?
> 
> Please advise.
> 
> Thanks in advance!
> 
> 
> 



Re: Question regarding Solr fq query

2019-06-28 Thread Saurabh Sharma
Hi,

Images are not visible. Please upload on some image sharing platform and
share the link.

Thanks

On Fri, 28 Jun, 2019, 11:00 PM Krishna Kammadanam, 
wrote:

> Hello,
>
>
>
> I am a back-end developer working with Solr 4.0 version.
>
>
>
> I am running into so many issues, but trying to understand at the same
> time.
>
>
>
> But I have a question for anyone who can help me.
>
>
>
>
>
>
>
> A list exists within the JournalId 0036-8075, but I can't search with
> the dash in between.
>
>
>
> I can't escape the dash in between.
>
>
>
> Any suggestions?
>
>
>
>
>
> Best regards
>
>
>
> *Krist Kammadanam*
>
> Back-end Developer
>
>
>
>
> *V*:
>
> *E:* *k...@chronos-oa.com *
>
>
>


Re: Question regarding negated block join queries

2019-06-17 Thread Erick Erickson
Bram:

Here’s a fuller explanation that you might be interested in:

https://lucidworks.com/2011/12/28/why-not-and-or-and-not/

Best,
Erick

> On Jun 17, 2019, at 11:32 AM, Bram Biesbrouck 
>  wrote:
> 
> On Mon, Jun 17, 2019 at 7:11 PM Shawn Heisey  wrote:
> 
>> On 6/17/2019 4:46 AM, Bram Biesbrouck wrote:
>>> q={!parent which=-(parentUri:*)}*:*
>> 
>> Pure negative queries do not work in Lucene.  Sometimes, when you do a
>> single-clause negative query, Solr is able to detect the problem and
>> automatically make an adjustment so the query works.  This happens
>> transparently so you never notice.
>> 
>> In essence, what your negative query tells Lucene is "start with
>> nothing, and then subtract docs that match this query."  Since you
>> started with nothing and then subtracted, you get nothing.
>> 
>> Also, that's a wilcard query.  Which could be very slow if the possible
>> number of values in parentUri is more than a few.  If that field can
>> only contain a very small number of values, then a wildcard query might
>> be fast.
>> 
>> The following query solves both problems -- starting with all docs and
>> then subtracting things that match the query clause after that:
>> 
>> *:* -parentUri:[* TO *]
>> 
>> This will return all documents that do not have the parentUri field
>> defined.  The [* TO *] syntax is an all-inclusive range query.
>> 
> 
> Hi Shawn,
> 
> Awesome elaborate explanation, thank you. Also thanks for the optimization
> hint. I found both approaches online, but didn't realize there was a
> performance difference .
> Digging deeper, I've found this SO post, basically explaining why it worked
> some of the time, but not in all cases:
> https://stackoverflow.com/questions/10651548/negation-in-solr-query
> 
> best,
> 
> b.



Re: Question regarding negated block join queries

2019-06-17 Thread Bram Biesbrouck
On Mon, Jun 17, 2019 at 7:11 PM Shawn Heisey  wrote:

> On 6/17/2019 4:46 AM, Bram Biesbrouck wrote:
> > q={!parent which=-(parentUri:*)}*:*
>
> Pure negative queries do not work in Lucene.  Sometimes, when you do a
> single-clause negative query, Solr is able to detect the problem and
> automatically make an adjustment so the query works.  This happens
> transparently so you never notice.
>
> In essence, what your negative query tells Lucene is "start with
> nothing, and then subtract docs that match this query."  Since you
> started with nothing and then subtracted, you get nothing.
>
> Also, that's a wilcard query.  Which could be very slow if the possible
> number of values in parentUri is more than a few.  If that field can
> only contain a very small number of values, then a wildcard query might
> be fast.
>
> The following query solves both problems -- starting with all docs and
> then subtracting things that match the query clause after that:
>
> *:* -parentUri:[* TO *]
>
> This will return all documents that do not have the parentUri field
> defined.  The [* TO *] syntax is an all-inclusive range query.
>

Hi Shawn,

Awesome elaborate explanation, thank you. Also thanks for the optimization
hint. I found both approaches online, but didn't realize there was a
performance difference.
Digging deeper, I've found this SO post, basically explaining why it worked
some of the time, but not in all cases:
https://stackoverflow.com/questions/10651548/negation-in-solr-query

best,

b.


Re: Question regarding negated block join queries

2019-06-17 Thread Shawn Heisey

On 6/17/2019 4:46 AM, Bram Biesbrouck wrote:

q={!parent which=-(parentUri:*)}*:*


Pure negative queries do not work in Lucene.  Sometimes, when you do a 
single-clause negative query, Solr is able to detect the problem and 
automatically make an adjustment so the query works.  This happens 
transparently so you never notice.


In essence, what your negative query tells Lucene is "start with 
nothing, and then subtract docs that match this query."  Since you 
started with nothing and then subtracted, you get nothing.


Also, that's a wildcard query.  Which could be very slow if the possible 
number of values in parentUri is more than a few.  If that field can 
only contain a very small number of values, then a wildcard query might 
be fast.


The following query solves both problems -- starting with all docs and 
then subtracting things that match the query clause after that:


*:* -parentUri:[* TO *]

This will return all documents that do not have the parentUri field 
defined.  The [* TO *] syntax is an all-inclusive range query.
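Applied back to the original block-join query, that would look something like the
following (assuming the intent is that parent documents are exactly those with no
parentUri set):

q={!parent which="*:* -parentUri:[* TO *]"}*:*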


Thanks,
Shawn


Re: Question RE: Contents of Field Value Cache

2019-05-06 Thread benrollinger
Mikhail Khludnev-2 wrote
> Hello,
> Every FVC entry corresponds to to a field, but capped by max size. So,
> it's
> really odd that its' numbers peaked as some point of time. Note that some
> caches support showItems parameter, check the doc.
> 
> On Sat, May 4, 2019 at 11:04 AM benrollinger 

> rollinger.benjamin.c@

> 
> wrote:
> 
>> Good Evening,
>>
>> Running into a puzzle with my SOLR instance (bundled with WebSphere
>> Commerce).  I understand that FieldValueCache(FVC) roughly corresponds to
>> facets on the storefront.  Under normal processing we fill the FVC up to
>> 137
>> and everything runs happy.  This roughly corresponds to the number of
>> facetable attributes on the front end.
>>
>> But every so often (seems like it might correlate to indexprop timing),
>> we
>> see the FVC climb up over 200.
>>  When it happens, it drives a bunch of extra CPU as the FVC cache hit
>> ratio
>> decreases drastically (at 137 happy mode its right about 100% hit ratio).
>>
>> So far have been unable to reproduce on demand, but users manage it a
>> couple
>> times a week.  Since Im not finding how to reproduce, my next thought is
>> how
>> can I log these entries, or more info about the contents?  So far Google
>> search hasnt helped with this much.  Nor did my PMR/IBM support case
>> about
>> it.  If anyone has an idea how I can find & log the cache keys as they
>> are
>> loaded, I'd very much appreciate it.
>>
>> Thanks in advance.
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev

Ah ha!  showItems should do the trick, I appreciate it!

And in case it helps any others in the future, for those running WCS, the
"keys" from the cache will correspond back to srchattrprop.propertyvalue
(table.col name).



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Question RE: Contents of Field Value Cache

2019-05-04 Thread Mikhail Khludnev
Hello,
Every FVC entry corresponds to a field, but the cache is capped by its max size. So, it's
really odd that its numbers peaked at some point in time. Note that some
caches support the showItems parameter; check the doc.
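For reference, showItems is set on the cache definition in each core's solrconfig.xml,
for example (the sizes here are just illustrative):

<fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="128" showItems="32"/>

With that in place, the top cache entries (for the fieldValueCache the keys are field
names) show up in the cache statistics on the admin UI's Plugins / Stats page.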

On Sat, May 4, 2019 at 11:04 AM benrollinger 
wrote:

> Good Evening,
>
> Running into a puzzle with my SOLR instance (bundled with WebSphere
> Commerce).  I understand that FieldValueCache(FVC) roughly corresponds to
> facets on the storefront.  Under normal processing we fill the FVC up to
> 137
> and everything runs happy.  This roughly corresponds to the number of
> facetable attributes on the front end.
>
> But every so often (seems like it might correlate to indexprop timing), we
> see the FVC climb up over 200.
>  When it happens, it drives a bunch of extra CPU as the FVC cache hit ratio
> decreases drastically (at 137 happy mode its right about 100% hit ratio).
>
> So far have been unable to reproduce on demand, but users manage it a
> couple
> times a week.  Since Im not finding how to reproduce, my next thought is
> how
> can I log these entries, or more info about the contents?  So far Google
> search hasnt helped with this much.  Nor did my PMR/IBM support case about
> it.  If anyone has an idea how I can find & log the cache keys as they are
> loaded, I'd very much appreciate it.
>
> Thanks in advance.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Question on Solr/WordPress Integration

2019-03-01 Thread markus kalkbrenner
If you’re more familiar with PHP you can do the same using the Solarium library 
instead of SolrJ for Java.

Once the PDFs are extracted and indexed, Drupal is an alternative to Wordpress 
as frontend. Using the Search API Solr module you can access and "present" any 
existing Solr index without a single line of custom code.

Markus

> Am 02.03.2019 um 01:30 schrieb Erick Erickson :
> 
> Writing a Java (SolrJ) program that traverses a filesystem and extracts the 
> contents of PDF is actually quite simple, see: 
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ (you can ignore the 
> RDBMS stuff). That code is a little out of date so may need some very minor 
> tweaks.
> 
> Tika (the library Solr uses to parse PDFs and most other files) may have 
> something that makes the job even easier, I’d ask on their user’s list. 
> Putting WordPress in the middle of it all seems unnecessarily complicated.
> 
> Best,
> Erick
> 
>> On Mar 1, 2019, at 11:18 AM, Paul Buiocchi  wrote:
>> 
>> Thank you Shawn !
>> 
>> Sent from Yahoo Mail on Android 
>> 
>> On Fri, Mar 1, 2019 at 12:25 PM, Paul Buiocchi 
>> wrote:   Greetings, 
>> 
>> I have a couple of questions about Solr /Wordpress integration - 
>> 
>> First , I am not "committed to using WordPress as a front end. If there is a 
>> better front end option , I would be willing to convert. For functionality , 
>> all I am looking for is the ability to full txt search , highlight the 
>> search terms in the search results  It should be pretty simple , maybe I 
>> am overanalyzing it  ...Looking for as much "out of the box" as possible 
>> 
>> My scenario is this: 
>> 
>> I am putting together an old newspaper archive site . about 25k pdf files 
>> that are full txt searchable. 
>> 
>> Questions on architecture: 
>> 1) Is there a way for Solr to index from a local file structure i.e local 
>> drive:/newpaper_name/date/page# ? . From the experimenting I have done with 
>> Wordpress/Solr integration , I found that I had to upload the documents in 
>> Wordpress to get Solr to recognize them . 
>> 
>> I'm sure I will have more questions , any help/suggestions would be greatly 
>> appreciated - thank you  
>> 
>> Sent from Yahoo Mail on Android  
> 


Re: Question on Solr/WordPress Integration

2019-03-01 Thread Erick Erickson
Writing a Java (SolrJ) program that traverses a filesystem and extracts the 
contents of PDF is actually quite simple, see: 
https://lucidworks.com/2012/02/14/indexing-with-solrj/ (you can ignore the 
RDBMS stuff). That code is a little out of date so may need some very minor 
tweaks.

Tika (the library Solr uses to parse PDFs and most other files) may have 
something that makes the job even easier, I’d ask on their user’s list. Putting 
WordPress in the middle of it all seems unnecessarily complicated.
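
For a rough idea of the shape of such a program, here is a minimal sketch (core name,
paths and field names are made up; error handling is stripped down):

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class IndexPdfs {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();                                   // Tika auto-detects and parses PDFs
    try (HttpSolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/newspapers").build()) {   // hypothetical core name
      Files.walk(Paths.get("/data/pdfs"))                     // traverse the filesystem
           .filter(p -> p.toString().endsWith(".pdf"))
           .forEach(p -> {
             try {
               String text = tika.parseToString(p.toFile());  // extracted plain text
               SolrInputDocument doc = new SolrInputDocument();
               doc.addField("id", p.toString());
               doc.addField("content", text);                 // assumes a "content" text field
               solr.add(doc);
             } catch (Exception e) {
               e.printStackTrace();                           // skip files Tika cannot parse
             }
           });
      solr.commit();
    }
  }
}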

Best,
Erick

> On Mar 1, 2019, at 11:18 AM, Paul Buiocchi  wrote:
> 
> Thank you Shawn !
> 
> Sent from Yahoo Mail on Android 
> 
>  On Fri, Mar 1, 2019 at 12:25 PM, Paul Buiocchi 
> wrote:   Greetings, 
> 
> I have a couple of questions about Solr /Wordpress integration - 
> 
> First , I am not "committed to using WordPress as a front end. If there is a 
> better front end option , I would be willing to convert. For functionality , 
> all I am looking for is the ability to full txt search , highlight the search 
> terms in the search results  It should be pretty simple , maybe I am 
> overanalyzing it  ...Looking for as much "out of the box" as possible 
> 
> My scenario is this: 
> 
> I am putting together an old newspaper archive site . about 25k pdf files 
> that are full txt searchable. 
> 
> Questions on architecture: 
> 1) Is there a way for Solr to index from a local file structure i.e local 
> drive:/newpaper_name/date/page# ? . From the experimenting I have done with 
> Wordpress/Solr integration , I found that I had to upload the documents in 
> Wordpress to get Solr to recognize them . 
> 
> I'm sure I will have more questions , any help/suggestions would be greatly 
> appreciated - thank you  
> 
> Sent from Yahoo Mail on Android  



Re: Question on Solr/WordPress Integration

2019-03-01 Thread Paul Buiocchi
Thank you Shawn !

Sent from Yahoo Mail on Android 
 
  On Fri, Mar 1, 2019 at 12:25 PM, Paul Buiocchi 
wrote:   Greetings, 

I have a couple of questions about Solr /Wordpress integration - 

First, I am not committed to using WordPress as a front end. If there is a 
better front end option, I would be willing to convert. For functionality, 
all I am looking for is the ability to do full-text search and highlight the 
search terms in the search results. It should be pretty simple, maybe I 
am overanalyzing it ... Looking for as much "out of the box" as possible.

My scenario is this: 

I am putting together an old newspaper archive site . about 25k pdf files that 
are full txt searchable. 

Questions on architecture: 
1) Is there a way for Solr to index from a local file structure i.e local 
drive:/newpaper_name/date/page# ? . From the experimenting I have done with 
Wordpress/Solr integration , I found that I had to upload the documents in 
Wordpress to get Solr to recognize them . 

I'm sure I will have more questions , any help/suggestions would be greatly 
appreciated - thank you  

Sent from Yahoo Mail on Android  


Re: Question on Solr/WordPress Integration

2019-03-01 Thread Shawn Heisey

On 3/1/2019 10:25 AM, Paul Buiocchi wrote:

I have a couple of questions about Solr /Wordpress integration -


You would need to talk to the person who wrote the plugin for Wordpress 
that integrates with Solr.  If they indicate that a question can only be 
answered by the Solr project, then bring that to us.



I am putting together an old newspaper archive site . about 25k pdf files that 
are full txt searchable.


If you want Solr to index your PDF documents, you would have to use 
SolrCell, also known as the Extracting Request Handler.


We strongly recommend that this functionality should never be used in 
production.  The reason is that the underlying technology, Apache Tika, 
can crash when given certain input.  PDF documents are more likely than 
other kinds to cause this problem.  If Tika crashes when it is being run 
inside Solr, then Solr will also crash.



Questions on architecture:
1) Is there a way for Solr to index from a local file structure i.e local 
drive:/newpaper_name/date/page# ? . From the experimenting I have done with 
Wordpress/Solr integration , I found that I had to upload the documents in 
Wordpress to get Solr to recognize them .


Yes, you can index just about anything you like if you are willing to 
create the configuration and the software to do it.  But in order for 
Wordpress to understand that data, it most likely would have to be done 
through Wordpress.


Thanks,
Shawn


Re: Question about IndexSearcher.search()

2019-01-25 Thread Shawn Heisey

On 1/24/2019 11:11 PM, NDelt wrote:

Hello.
I'm trying to make sample search application using Lucene.


You're on the solr-user mailing list.  If you want help with Lucene, 
you'll need to ask your question on the java-user mailing list instead.


https://lucene.apache.org/core/discussion.html

Thanks,
Shawn


Re: Question about Solr concept

2019-01-03 Thread Alexandre Rafalovitch
I believe the answer is yes, but the specifics depend on whether you mean
online or offline index creation (as in when the content appears)
and also why you want to do so.

Couple of ideas:
1) If you just want to make sure all updates are visible at once, you
can control that with commit strategies even in the same collection:
https://lucene.apache.org/solr/guide/7_6/updatehandlers-in-solrconfig.html#commits
2) If you are doing full re-indexing, you can do that on a separate
(identical) instance and bring it into the active one via a core swap
and/or aliases (see the example request after this list):
https://lucene.apache.org/solr/guide/7_6/coreadmin-api.html#coreadmin-api
(for non SolrCloud),
https://lucene.apache.org/solr/guide/7_6/collections-api.html#createalias
(for Cloud)
3) If you are looking at primary/read-only secondary options, latest
Solr has new replication strategies in SolrCloud mode:
https://lucene.apache.org/solr/guide/7_6/shards-and-indexing-data-in-solrcloud.html
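
As an example of the alias variant in 2) (collection names are made up): index into a
fresh collection, then atomically repoint the alias the application queries:

  /solr/admin/collections?action=CREATEALIAS&name=myindex&collections=myindex_v2

Searches move from myindex_v1 to myindex_v2 with no downtime; for non-SolrCloud cores
the equivalent is the CoreAdmin SWAP action.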

Regards,
   Alex.

On Thu, 3 Jan 2019 at 09:35, KrishnaKumar Satyamurthy
 wrote:
>
> Hi Solr Community Help,
>
> We are new to Solr and have have a basic question about Solr functioning.
> Is it possible to configure solr to perform searching only but not perform
> any indexing by reading the indexes created by a second solr instance?
>
> We really appreciate your kind response in this matter
>
> Thanks,
> Krishna


Re: Question about elevations

2018-11-19 Thread Ray Niu
one more thing to add, if there are fqs, they will be evaluated as well.

Edward Ribeiro  于2018年11月19日周一 下午1:24写道:

> Just complementing Alessandro's answer:
> 1. the elevateIds are inserted into the query, server side (a query
> expansion indeed);
> 2. the query is executed;
> 3. elevatedIds (if found) are popped up to the top of the search results
> via boosting;
>
> Edward
>
> On Mon, Nov 19, 2018 at 3:41 PM Alessandro Benedetti  >
> wrote:
> >
> > As far as I remember the answer is no.
> > You could take a deep look into the code, but as far as I remember the
> > elevated doc Ids must be in the index to be elevated.
> > Those ids will be added to the query built, a sort of query expansion
> server
> > side.
> > And then the search executed.
> >
> > Cheers
> >
> >
> >
> >
> >
> > -
> > ---
> > Alessandro Benedetti
> > Search Consultant, R Software Engineer, Director
> > Sease Ltd. - www.sease.io
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Question about elevations

2018-11-19 Thread Edward Ribeiro
Just complementing Alessandro's answer:
1. the elevateIds are inserted into the query, server side (a query
expansion indeed);
2. the query is executed;
3. elevatedIds (if found) are popped up to the top of the search results
via boosting (see the example request below);
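
For example (ids are made up), an explicit-elevation request could look like:

  q=ipod&enableElevation=true&elevateIds=DOC-123,DOC-456

Only elevateIds that actually exist in the index end up pinned to the top; ids that
are not in the index simply match nothing, which is Alessandro's point below.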

Edward

On Mon, Nov 19, 2018 at 3:41 PM Alessandro Benedetti 
wrote:
>
> As far as I remember the answer is no.
> You could take a deep look into the code, but as far as I remember the
> elevated doc Ids must be in the index to be elevated.
> Those ids will be added to the query built, a sort of query expansion
server
> side.
> And then the search executed.
>
> Cheers
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Question about elevations

2018-11-19 Thread Alessandro Benedetti
As far as I remember the answer is no.
You could take a deep look into the code, but as far as I remember the
elevated doc Ids must be in the index to be elevated.
Those ids will be added to the query built, a sort of query expansion server
side.
And then the search executed.

Cheers





-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: question for rule based replica placement

2018-09-02 Thread Wei
Thanks Erick. Suppose I have 5 hosts h1,h2,h3,h4,h5  and want to create a
5X2 solr cloud of 5 shards, 2 replicas per shard. On each host I will run
two solr JVMs, each hosts a single solr core. Solr's default 'snitch'
provide a 'host' tag, so I wonder if I can use it to prevent any host from
having two replicas of the same shard when creating the collection:

/solr/admin/collections?action=CREATE&name=mycollection&numShards=5&replicationFactor=2&maxShardsPerNode=1&rule=shard:*,replica<2,host:*

Is this the correct way to use 'snitch'? I cannot find more relevant
documentation on how to configure and customize 'snitch'.

Thanks,
Wei

On Sun, Sep 2, 2018 at 9:30 PM Erick Erickson 
wrote:

> You need to provide a "snitch" and define a rule appropriately. This
> is a variant of "rack awareness".
>
> Solr considers two JVMs running on the same physical host as
> completely separate Solr instances, so to get replicas on different
> hosts you need a snitch etc.
>
> Best,
> Erick
> On Sun, Sep 2, 2018 at 4:39 PM Wei  wrote:
> >
> > Hi,
> >
> > In rule based replica placement,  how to ensure there are no more than
> one
> > replica for any shard on the same host?   In the documentation there is
> an
> > example rule
> >
> > shard:*,replica:<2,node:*
> >
> > Does 'node' refer to solr instance or actual physical host?  Is there an
> > example for defining the physical host?
> >
> > Thanks,
> > Wei
>


Re: question for rule based replica placement

2018-09-02 Thread Erick Erickson
You need to provide a "snitch" and define a rule appropriately. This
is a variant of "rack awareness".

Solr considers two JVMs running on the same physical host as
completely separate Solr instances, so to get replicas on different
hosts you need a snitch etc.

Best,
Erick
On Sun, Sep 2, 2018 at 4:39 PM Wei  wrote:
>
> Hi,
>
> In rule based replica placement,  how to ensure there are no more than one
> replica for any shard on the same host?   In the documentation there is an
> example rule
>
> shard:*,replica:<2,node:*
>
> Does 'node' refer to solr instance or actual physical host?  Is there an
> example for defining the physical host?
>
> Thanks,
> Wei


Re: Question on query time boosting

2018-08-23 Thread Kydryavtsev Andrey
Hi, Pratik 

I believe that your observations are correct. 

The score for each individual query (in your example it's a wildcard query like 
'concept_name:(*semantic*)^200') is calculated by a complex formula (one of the 
possible implementations, with a good explanation, is described here 
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html),
 but it could be simplified as follows:

score(doc, query) = query_boost * raw_score(doc, query)

where raw_score is the boost-free score of that individual clause. The score for the 
full disjunction (by default) would be calculated as the sum of every individual 
query that matched.

So the score for case1 would be:

score_for_case1(doc, query) =
    200 * raw_score(concept_name:*semantic*) + 400 * raw_score(concept_name:*machine*)
  + 20 * raw_score(Abstract_note:*semantic*) + 40 * raw_score(Abstract_note:*machine*)
  = 10 * (20 * raw_score(concept_name:*semantic*) + 40 * raw_score(concept_name:*machine*)
        + 2 * raw_score(Abstract_note:*semantic*) + 4 * raw_score(Abstract_note:*machine*))
  = 10 * score_for_case2(doc, query)



Thank you,

Andrey Kudryavtsev

23.08.2018, 18:53, "Pratik Patel" :
> Hello All,
>
> I am trying to understand how exactly query time boosting works in solr.
> Primarily, I want to understand if absolute boost values matter or is it
> just the relative difference between various boost values which decides
> scoring. Let's take following two queries for example.
>
> // case1: q parameter
>
>>  concept_name:(*semantic*)^200 OR
>>  concept_name:(*machine*)^400 OR
>>  Abstract_note:(*semantic*)^20 OR
>>  Abstract_note:(*machine*)^40
>
> //case2: q parameter
>
>>  concept_name:(*semantic*)^20 OR
>>  concept_name:(*machine*)^40 OR
>>  Abstract_note:(*semantic*)^2 OR
>>  Abstract_note:(*machine*)^4
>
> Are these two queries any different?
>
> Relative boosting is same in both of them.
> I can see that they produce same results and ordering. Only difference is
> that the score in case1 is 10 times the score in case2.
>
> Thanks,
> Pratik


Re: Question about updating indexes on solrcloud with single instance solr

2018-08-20 Thread Erick Erickson
There are two choices:

1> shut down all three replicas and copy the index to each one then
start them up.
2> DELETEREPLICA on two of them, update the remaining one, then issue
an ADDREPLICA to get the other two back.

Of the two, I'd go with <2>. When you ADDREPLICA Solr will take care
of copying down the index and putting the new replicas into service.
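
A sketch of option <2> with the Collections API (collection, shard, replica and node
names are placeholders -- take the real ones from CLUSTERSTATUS):

  /admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3
  /admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node5
  (update the remaining replica from the standalone index)
  /admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=host2:8983_solr
  /admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=host3:8983_solr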

Best,
Erick



On Mon, Aug 20, 2018 at 2:51 PM, Sushant Vengurlekar
 wrote:
> thanks for the reply Eric
>
> I have one shard per replica but I have 3 replicas on the solrcloud. So how
> do I update from the standalone solr core to these 3 replicas
>
> On Mon, Aug 20, 2018 at 2:43 PM Erick Erickson 
> wrote:
>
>> Assuming that your stand-alone indexes are a single core (i.e. not
>> sharded), then just create a single-shard collection with the
>> appropriate schema. From there I'd shut my Solr instance down, copy
>> the index files "to the right place" and fire it all back up. I'd do
>> this with a single-replica SolrCloud collection, then ADDREPLICA to
>> build out the collection.
>>
>> There's no way to say "reconcile this arbitrary index I built with
>> stand-alone with my SolrCloud collection", so it'a all manual.
>>
>> Best,
>> Erick
>>
>> On Mon, Aug 20, 2018 at 12:38 PM, Sushant Vengurlekar
>>  wrote:
>> > I have a question regarding updating the indexes on solrcloud with
>> indexes
>> > from a standalone solr server. We have a solrcloud which is running. We
>> > have couple of cores on that standalone solr instance which are also
>> > present on the solrcloud as collections. I need to bring in updated
>> indexes
>> > from this standalone solr instance to cloud.
>> >
>> > Does anyone have an idea on what steps need to be taken.
>> >
>> > Thank you
>>


Re: Question about updating indexes on solrcloud with single instance solr

2018-08-20 Thread Sushant Vengurlekar
Thanks for the reply, Erick.

I have one shard per replica but I have 3 replicas on the solrcloud. So how
do I update from the standalone solr core to these 3 replicas

On Mon, Aug 20, 2018 at 2:43 PM Erick Erickson 
wrote:

> Assuming that your stand-alone indexes are a single core (i.e. not
> sharded), then just create a single-shard collection with the
> appropriate schema. From there I'd shut my Solr instance down, copy
> the index files "to the right place" and fire it all back up. I'd do
> this with a single-replica SolrCloud collection, then ADDREPLICA to
> build out the collection.
>
> There's no way to say "reconcile this arbitrary index I built with
> stand-alone with my SolrCloud collection", so it'a all manual.
>
> Best,
> Erick
>
> On Mon, Aug 20, 2018 at 12:38 PM, Sushant Vengurlekar
>  wrote:
> > I have a question regarding updating the indexes on solrcloud with
> indexes
> > from a standalone solr server. We have a solrcloud which is running. We
> > have couple of cores on that standalone solr instance which are also
> > present on the solrcloud as collections. I need to bring in updated
> indexes
> > from this standalone solr instance to cloud.
> >
> > Does anyone have an idea on what steps need to be taken.
> >
> > Thank you
>


Re: Question about updating indexes on solrcloud with single instance solr

2018-08-20 Thread Erick Erickson
Assuming that your stand-alone indexes are a single core (i.e. not
sharded), then just create a single-shard collection with the
appropriate schema. From there I'd shut my Solr instance down, copy
the index files "to the right place" and fire it all back up. I'd do
this with a single-replica SolrCloud collection, then ADDREPLICA to
build out the collection.

There's no way to say "reconcile this arbitrary index I built with
stand-alone with my SolrCloud collection", so it's all manual.

Best,
Erick

On Mon, Aug 20, 2018 at 12:38 PM, Sushant Vengurlekar
 wrote:
> I have a question regarding updating the indexes on solrcloud with indexes
> from a standalone solr server. We have a solrcloud which is running. We
> have couple of cores on that standalone solr instance which are also
> present on the solrcloud as collections. I need to bring in updated indexes
> from this standalone solr instance to cloud.
>
> Does anyone have an idea on what steps need to be taken.
>
> Thank you


Re: Question regarding searching Chinese characters

2018-08-14 Thread Christopher Beer
Hi all,

Thanks for this enlightening thread. As it happens, at Stanford Libraries we’re 
currently working on upgrading from Solr 4 to 7 and we’re looking forward to 
using the new dictionary-based word splitting in the ICUTokenizer.

We have many of the same challenges as Amanda mentioned, and thanks to the 
advice on this thread, we’ve taken a stab at a CharFilter to do the traditional 
-> simplified transformation [1] and it seems to be promising and we've sent it 
out for testing by our subject matter experts for evaluation.

Thanks,
Chris

[1] 
https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java

On 2018/07/24 12:54:35, Tomoko Uchida  wrote:
Hi Amanda,>

do all I need to do is modify the settings from smartChinese to the ones>
you posted here>

Yes, the settings I posted should work for you, at least partially.>
If you are happy with the results, it's OK!>
But please take this as a starting point because it's not perfect.>

Or do I need to still do something with the SmartChineseAnalyzer?>

Try the settings, then if you notice something strange and want to know why>
and how to solve it, that may be the time to dive deep into. ;)>

I cannot explain how analyzers works here... but you should start off with>
the Solr documentation.>
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html>

Regards,>
Tomoko>



2018年7月24日(火) 21:08 Amanda Shuman :>

Hi Tomoko,>

Thanks so much for this explanation - I did not even know this was>
possible! I will try it out but I have one question: do all I need to do is>
modify the settings from smartChinese to the ones you posted here:>

>
>
>

id="Traditional-Simplified"/>>
>

Or do I need to still do something with the SmartChineseAnalyzer? I did not>
quite understand this in your first message:>

" I think you need two steps if you want to use HMMChineseTokenizer>
correctly.>

1. transform all traditional characters to simplified ones and save to>
temporary files.>
I do not have clear idea for doing this, but you can create a Java>
program that calls Lucene's ICUTransformFilter>
2. then, index to Solr using SmartChineseAnalyzer.">

My understanding is that with the new settings you posted, I don't need to>
do these steps. Is that correct? Otherwise, I don't really know how to do>
step 1 with the java program>

Thanks!>
Amanda>


-->
Dr. Amanda Shuman>
Post-doc researcher, University of Freiburg, The Maoist Legacy Project>
>
PhD, University of California, Santa Cruz>
http://www.amandashuman.net/>
http://www.prchistoryresources.org/>
Office: +49 (0) 761 203 4925>



Re: Question regarding searching Chinese characters

2018-07-24 Thread Tomoko Uchida
Hi Amanda,

> do all I need to do is modify the settings from smartChinese to the ones
you posted here

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, it's OK!
But please take this as a starting point because it's not perfect.

> Or do I need to still do something with the SmartChineseAnalyzer?

Try the settings, then if you notice something strange and want to know why
and how to solve it, that may be the time to dive deep into. ;)

I cannot explain how analyzers work here... but you should start off with
the Solr documentation.
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html
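
(For the offline "step 1" Java program mentioned earlier in the thread, a bare-bones
sketch with ICU4J's Transliterator might look like the following -- the class name and
file handling are purely illustrative:)

import com.ibm.icu.text.Transliterator;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ToSimplified {
  public static void main(String[] args) throws Exception {
    // "Traditional-Simplified" is a built-in ICU transliterator id
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");
    Path in = Paths.get(args[0]);     // original (possibly mixed-script) text file
    Path out = Paths.get(args[1]);    // simplified copy to hand to the indexer
    String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
    Files.write(out, t.transliterate(text).getBytes(StandardCharsets.UTF_8));
  }
}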

Regards,
Tomoko



2018年7月24日(火) 21:08 Amanda Shuman :

> Hi Tomoko,
>
> Thanks so much for this explanation - I did not even know this was
> possible! I will try it out but I have one question: do all I need to do is
> modify the settings from smartChinese to the ones you posted here:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Or do I need to still do something with the SmartChineseAnalyzer? I did not
> quite understand this in your first message:
>
> " I think you need two steps if you want to use HMMChineseTokenizer
> correctly.
>
> 1. transform all traditional characters to simplified ones and save to
> temporary files.
> I do not have clear idea for doing this, but you can create a Java
> program that calls Lucene's ICUTransformFilter
> 2. then, index to Solr using SmartChineseAnalyzer."
>
> My understanding is that with the new settings you posted, I don't need to
> do these steps. Is that correct? Otherwise, I don't really know how to do
> step 1 with the java program
>
> Thanks!
> Amanda
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com
> > wrote:
>
> > Yes, while traditional - simplified transformation would be out of the
> > scope of Unicode normalization,
> > you would like to add ICUNormalizer2CharFilterFactory anyway :)
> >
> > Let me refine my example settings:
> >
> > 
> >   
> >   
> >> id="Traditional-Simplified"/>
> > 
> >
> > Regards,
> > Tomoko
> >
> >
> > 2018年7月21日(土) 2:54 Alexandre Rafalovitch :
> >
> > > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > > template of what needs to be done.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On 20 July 2018 at 12:40, Walter Underwood 
> > wrote:
> > > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > > >
> > > > I’ve never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > > tomoko.uchida.1...@gmail.com> wrote:
> > > >>
> > > >> Exactly. More concretely, the starting point is: replacing your
> > analyzer
> > > >>
> > > >>  > > class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > > >>
> > > >> to
> > > >>
> > > >> 
> > > >>  
> > > >>   > > >> id="Traditional-Simplified"/>
> > > >> 
> > > >>
> > > >> and see if the results are as expected. Then research another
> filters
> > if
> > > >> your requirements is not met.
> > > >>
> > > >> Just a reminder: HMMChineseTokenizerFactory do not handle
> traditional
> > > >> characters as I noted previous in post, so ICUTransformFilterFactory
> > is
> > > an
> > > >> incomplete workaround.
> > > >>
> > > >> 2018年7月21日(土) 0:05 Walter Underwood :
> > > >>
> > > >>> I expect that this is the line that does the transformation:
> > > >>>
> > > >>>> > >>> id="Traditional-Simplified"/>
> > > >>>
> > > >>> This mapping is a standard feature of ICU. More info on ICU
> > transforms
> > > is
> > > >>> in this doc, though not much detail on this particular transform.
> > > >>>
> > > >>> http://userguide.icu-project.org/transforms/general
> > > >>>
> > > >>> wunder
> > > >>> Walter Underwood
> > > >>> wun...@wunderwood.org
> > > >>> http://observer.wunderwood.org/  (my blog)
> > > >>>
> > >  On Jul 20, 2018, at 7:43 AM, Susheel Kumar  >
> > > >>> wrote:
> > > 
> > >  I think so.  I used the exact as in github
> > > 
> > >   > >  positionIncrementGap="1" autoGeneratePhraseQueries="false">
> > >  
> > >    
> > >    
> > > > > class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > > > > >>> id="Traditional-Simplified"/>
> > > > > >>> id="Katakana-Hiragana"/>
> > >    
> > > > >  hiragana="true" katakana="true" hangul="true"
> 

Re: Question regarding searching Chinese characters

2018-07-24 Thread Amanda Shuman
Hi Tomoko,

Thanks so much for this explanation - I did not even know this was
possible! I will try it out but I have one question: do all I need to do is
modify the settings from smartChinese to the ones you posted here:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Or do I need to still do something with the SmartChineseAnalyzer? I did not
quite understand this in your first message:

" I think you need two steps if you want to use HMMChineseTokenizer
correctly.

1. transform all traditional characters to simplified ones and save to
temporary files.
I do not have clear idea for doing this, but you can create a Java
program that calls Lucene's ICUTransformFilter
2. then, index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to
do these steps. Is that correct? Otherwise, I don't really know how to do
step 1 with the java program

Thanks!
Amanda


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida  wrote:

> Yes, while traditional - simplified transformation would be out of the
> scope of Unicode normalization,
> you would like to add ICUNormalizer2CharFilterFactory anyway :)
>
> Let me refine my example settings:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Regards,
> Tomoko
>
>
> 2018年7月21日(土) 2:54 Alexandre Rafalovitch :
>
> > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > template of what needs to be done.
> >
> > Regards,
> >Alex.
> >
> > On 20 July 2018 at 12:40, Walter Underwood 
> wrote:
> > > Looks like we need a charfilter version of the ICU transforms. That
> > could run before the tokenizer.
> > >
> > > I’ve never built a charfilter, but it seems like this would be a good
> > first project for someone who wants to contribute.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > tomoko.uchida.1...@gmail.com> wrote:
> > >>
> > >> Exactly. More concretely, the starting point is: replacing your
> analyzer
> > >>
> > >>  > class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > >>
> > >> to
> > >>
> > >> 
> > >>  
> > >>   > >> id="Traditional-Simplified"/>
> > >> 
> > >>
> > >> and see if the results are as expected. Then research another filters
> if
> > >> your requirements is not met.
> > >>
> > >> Just a reminder: HMMChineseTokenizerFactory do not handle traditional
> > >> characters as I noted previous in post, so ICUTransformFilterFactory
> is
> > an
> > >> incomplete workaround.
> > >>
> > >> 2018年7月21日(土) 0:05 Walter Underwood :
> > >>
> > >>> I expect that this is the line that does the transformation:
> > >>>
> > >>>> >>> id="Traditional-Simplified"/>
> > >>>
> > >>> This mapping is a standard feature of ICU. More info on ICU
> transforms
> > is
> > >>> in this doc, though not much detail on this particular transform.
> > >>>
> > >>> http://userguide.icu-project.org/transforms/general
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> >  On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> > >>> wrote:
> > 
> >  I think so.  I used the exact as in github
> > 
> >   >  positionIncrementGap="1" autoGeneratePhraseQueries="false">
> >  
> >    
> >    
> > > class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > > >>> id="Traditional-Simplified"/>
> > > >>> id="Katakana-Hiragana"/>
> >    
> > >  hiragana="true" katakana="true" hangul="true" outputUnigrams="true"
> />
> >  
> >  
> > 
> > 
> > 
> >  On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> > amanda.shu...@gmail.com
> > 
> >  wrote:
> > 
> > > Thanks! That does indeed look promising... This can be added on top
> > of
> > > Smart Chinese, right? Or is it an alternative?
> > >
> > >
> > > --
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy
> > Project
> > > 
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> > >
> > >
> > > On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> > susheel2...@gmail.com>
> > > wrote:
> > >
> > >> I think CJKFoldingFilter will work for you.  I put 舊小說 in index
> and
> > >>> then
> > >> each of A, B or C or D in query and they seems to be matching and
> > CJKFF
> > > is
> > >> transforming the 舊 to 旧
> > >>
> > >> On Fri, Jul 20, 

Re: Question

2018-07-23 Thread Alexandre Rafalovitch
That depends on what you mean by "unstructured" and "handle".

If by "unstructured" you mean things like PDFs and MSWord - which are
structured under the covers, then yes. Solr ships with Apache Tika to
injest such documents (see shipped examples as well as Data Import
Handler example). E.g.
http://lucene.apache.org/solr/guide/7_4/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika
and 
http://lucene.apache.org/solr/guide/7_4/uploading-structured-data-store-data-with-the-data-import-handler.html
 You do have to map what you extract to what you mean by "handle".
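
(As a concrete illustration -- core name and id are made up -- the shipped extracting
handler can be exercised by streaming a PDF to a request like

  /solr/mycore/update/extract?literal.id=doc1&commit=true

with the extracted text mapped into your fields via the fmap.* and uprefix parameters.)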

If you mean just long blob of text (e.g. whole book as a plain text
file), then we go straight to "handle" - what you want to find and how
you want to search for it.

So, think backwards from the search. What you need to find and then
what you have. Then come back with the question on how to connect the
dots in the middle.

Regards,
   Alex.

On 23 July 2018 at 07:02, Driss Khalil  wrote:
> Hi,
> I'm new to Solr and I just want to know if it's possible to handle
> Unstructured data in Solr. If yes, how can we do it? Do we need to
> combine it with something else?
>
>
>
>
>
> *Driss KHALIL*
>
> Responsable prospection & sponsoring, Forum GENI Entreprises.
>
> Elève ingénieur en Génie Logiciel, ENSIAS.
> GSM: (+212) 06 62 52 83 26
>
> [image: https://www.linkedin.com/in/driss-khalil-b3aab4151/]
> 


Re: Question

2018-07-23 Thread Andrea Gazzarini
Hi Driss,
I think the answer to the first question is yes, but I guess it doesn't
help you so much.
Second and third questions: "It depends". You should describe your context
better and narrow the questions as much as possible ("how can we do it" is
definitely too generic).

Best,
Andrea


Il lun 23 lug 2018, 15:18 Driss Khalil  ha scritto:

> Hi,
> I'm new to Solr and I just want to know if it's possible to handle
> Unstructured data in Solr. If yes, how can we do it? Do we need to
> combine it with something else?
>
>
>
>
>
> *Driss KHALIL*
>
> Responsable prospection & sponsoring, Forum GENI Entreprises.
>
> Elève ingénieur en Génie Logiciel, ENSIAS.
> GSM: (+212) 06 62 52 83 26
>
> [image: https://www.linkedin.com/in/driss-khalil-b3aab4151/]
> 
>


Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Yes, while traditional - simplified transformation would be out of the
scope of Unicode normalization,
you would like to add ICUNormalizer2CharFilterFactory anyway :)

Let me refine my example settings:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Regards,
Tomoko


2018年7月21日(土) 2:54 Alexandre Rafalovitch :

> Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> template of what needs to be done.
>
> Regards,
>Alex.
>
> On 20 July 2018 at 12:40, Walter Underwood  wrote:
> > Looks like we need a charfilter version of the ICU transforms. That
> could run before the tokenizer.
> >
> > I’ve never built a charfilter, but it seems like this would be a good
> first project for someone who wants to contribute.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com> wrote:
> >>
> >> Exactly. More concretely, the starting point is: replacing your analyzer
> >>
> >>  class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> >>
> >> to
> >>
> >> 
> >>  
> >>   >> id="Traditional-Simplified"/>
> >> 
> >>
> >> and see if the results are as expected. Then research another filters if
> >> your requirements is not met.
> >>
> >> Just a reminder: HMMChineseTokenizerFactory do not handle traditional
> >> characters as I noted previous in post, so ICUTransformFilterFactory is
> an
> >> incomplete workaround.
> >>
> >> 2018年7月21日(土) 0:05 Walter Underwood :
> >>
> >>> I expect that this is the line that does the transformation:
> >>>
> >>>>>> id="Traditional-Simplified"/>
> >>>
> >>> This mapping is a standard feature of ICU. More info on ICU transforms
> is
> >>> in this doc, though not much detail on this particular transform.
> >>>
> >>> http://userguide.icu-project.org/transforms/general
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
>  On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> >>> wrote:
> 
>  I think so.  I used the exact as in github
> 
>    positionIncrementGap="1" autoGeneratePhraseQueries="false">
>  
>    
>    
> class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> >>> id="Traditional-Simplified"/>
> >>> id="Katakana-Hiragana"/>
>    
>  hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>  
>  
> 
> 
> 
>  On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> amanda.shu...@gmail.com
> 
>  wrote:
> 
> > Thanks! That does indeed look promising... This can be added on top
> of
> > Smart Chinese, right? Or is it an alternative?
> >
> >
> > --
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> > 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
> >
> >
> > On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> susheel2...@gmail.com>
> > wrote:
> >
> >> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
> >>> then
> >> each of A, B or C or D in query and they seems to be matching and
> CJKFF
> > is
> >> transforming the 舊 to 旧
> >>
> >> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <
> susheel2...@gmail.com>
> >> wrote:
> >>
> >>> Lack of my chinese language knowledge but if you want, I can do
> quick
> >> test
> >>> for you in Analysis tab if you can give me what to put in index and
> > query
> >>> window...
> >>>
> >>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <
> susheel2...@gmail.com
> 
> >>> wrote:
> >>>
>  Have you tried to use CJKFoldingFilter https://g
>  ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> > cover
>  your use case but I am using this filter and so far no issues.
> 
>  Thnx
> 
>  On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> > amanda.shu...@gmail.com
> >>>
>  wrote:
> 
> > Thanks, Alex - I have seen a few of those links but never
> considered
> > transliteration! We use lucene's Smart Chinese analyzer. The
> issue
> >>> is
> > basically what is laid out in the old blogspot post, namely this
> > point:
> >
> >
> > "Why approach CJK resource discovery differently?
> >
> > 2.  Search results must be as script agnostic as possible.
> >
> > There is more than one way to write each word. "Simplified"
> > characters
> > were
> > emphasized for printed materials in mainland China starting in
> the
> >> 1950s;
> > "Traditional" characters were used in printed materials prior 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
template of what needs to be done.

Regards,
   Alex.

On 20 July 2018 at 12:40, Walter Underwood  wrote:
> Looks like we need a charfilter version of the ICU transforms. That could run 
> before the tokenizer.
>
> I’ve never built a charfilter, but it seems like this would be a good first 
> project for someone who wants to contribute.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida  
>> wrote:
>>
>> Exactly. More concretely, the starting point is: replacing your analyzer
>>
>> 
>>
>> to
>>
>> 
>>  
>>  > id="Traditional-Simplified"/>
>> 
>>
>> and see if the results are as expected. Then research another filters if
>> your requirements is not met.
>>
>> Just a reminder: HMMChineseTokenizerFactory do not handle traditional
>> characters as I noted previous in post, so ICUTransformFilterFactory is an
>> incomplete workaround.
>>
>> 2018年7月21日(土) 0:05 Walter Underwood :
>>
>>> I expect that this is the line that does the transformation:
>>>
>>>   >> id="Traditional-Simplified"/>
>>>
>>> This mapping is a standard feature of ICU. More info on ICU transforms is
>>> in this doc, though not much detail on this particular transform.
>>>
>>> http://userguide.icu-project.org/transforms/general
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
 On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
>>> wrote:

 I think so.  I used the exact as in github

 >>> positionIncrementGap="1" autoGeneratePhraseQueries="false">
 
   
   
   
   >> id="Traditional-Simplified"/>
   >> id="Katakana-Hiragana"/>
   
   >>> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
 
 



 On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman >>>
 wrote:

> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> wrote:
>
>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
>>> then
>> each of A, B or C or D in query and they seems to be matching and CJKFF
> is
>> transforming the 舊 to 旧
>>
>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
>> wrote:
>>
>>> Lack of my chinese language knowledge but if you want, I can do quick
>> test
>>> for you in Analysis tab if you can give me what to put in index and
> query
>>> window...
>>>
>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar >>>
>>> wrote:
>>>
 Have you tried to use CJKFoldingFilter https://g
 ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover
 your use case but I am using this filter and so far no issues.

 Thnx

 On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> amanda.shu...@gmail.com
>>>
 wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue
>>> is
> basically what is laid out in the old blogspot post, namely this
> point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified"
> characters
> were
> emphasized for printed materials in mainland China starting in the
>> 1950s;
> "Traditional" characters were used in printed materials prior to the
> 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are
> written
> in two scripts.
> Another way to think about it:  every written Chinese word has at
> least
> two
> completely different spellings.  And it can be mix-n-match:  a word
> can
> be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction),
>>> the
> results should include matches for 舊小說 (traditional) and 旧小说
>> (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
Looks like we need a charfilter version of the ICU transforms. That could run 
before the tokenizer.

I’ve never built a charfilter, but it seems like this would be a good first 
project for someone who wants to contribute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida  
> wrote:
> 
> Exactly. More concretely, the starting point is: replacing your analyzer
> 
> 
> 
> to
> 
> 
>  
>   id="Traditional-Simplified"/>
> 
> 
> and see if the results are as expected. Then research another filters if
> your requirements is not met.
> 
> Just a reminder: HMMChineseTokenizerFactory do not handle traditional
> characters as I noted previous in post, so ICUTransformFilterFactory is an
> incomplete workaround.
> 
> 2018年7月21日(土) 0:05 Walter Underwood :
> 
>> I expect that this is the line that does the transformation:
>> 
>>   > id="Traditional-Simplified"/>
>> 
>> This mapping is a standard feature of ICU. More info on ICU transforms is
>> in this doc, though not much detail on this particular transform.
>> 
>> http://userguide.icu-project.org/transforms/general
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
>> wrote:
>>> 
>>> I think so.  I used the exact as in github
>>> 
>>> >> positionIncrementGap="1" autoGeneratePhraseQueries="false">
>>> 
>>>   
>>>   
>>>   
>>>   > id="Traditional-Simplified"/>
>>>   > id="Katakana-Hiragana"/>
>>>   
>>>   >> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman >> 
>>> wrote:
>>> 
 Thanks! That does indeed look promising... This can be added on top of
 Smart Chinese, right? Or is it an alternative?
 
 
 --
 Dr. Amanda Shuman
 Post-doc researcher, University of Freiburg, The Maoist Legacy Project
 
 PhD, University of California, Santa Cruz
 http://www.amandashuman.net/
 http://www.prchistoryresources.org/
 Office: +49 (0) 761 203 4925
 
 
 On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
 wrote:
 
> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
>> then
> each of A, B or C or D in query and they seems to be matching and CJKFF
 is
> transforming the 舊 to 旧
> 
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
> 
>> Lack of my chinese language knowledge but if you want, I can do quick
> test
>> for you in Analysis tab if you can give me what to put in index and
 query
>> window...
>> 
>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar >> 
>> wrote:
>> 
>>> Have you tried to use CJKFoldingFilter https://g
>>> ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
 cover
>>> your use case but I am using this filter and so far no issues.
>>> 
>>> Thnx
>>> 
>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
 amanda.shu...@gmail.com
>> 
>>> wrote:
>>> 
 Thanks, Alex - I have seen a few of those links but never considered
 transliteration! We use lucene's Smart Chinese analyzer. The issue
>> is
 basically what is laid out in the old blogspot post, namely this
 point:
 
 
 "Why approach CJK resource discovery differently?
 
 2.  Search results must be as script agnostic as possible.
 
 There is more than one way to write each word. "Simplified"
 characters
 were
 emphasized for printed materials in mainland China starting in the
> 1950s;
 "Traditional" characters were used in printed materials prior to the
 1950s,
 and are still used in Taiwan, Hong Kong and Macau today.
 Since the characters are distinct, it's as if Chinese materials are
 written
 in two scripts.
 Another way to think about it:  every written Chinese word has at
 least
 two
 completely different spellings.  And it can be mix-n-match:  a word
 can
 be
 written with one traditional  and one simplified character.
 Example:   Given a user query 舊小說  (traditional for old fiction),
>> the
 results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
 characters for old fiction)"
 
 So, using the example provided above, we are dealing with materials
 produced in the 1950s-1970s that do even weirder things like:
 
 A. 舊小說
 
 can also be
 
 B. 旧小说 (all simplified)
 or
 C. 旧小說 (first character simplified, last character traditional)
 or
 D. 舊小 说 (first character traditional, last character simplified)
 
 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Exactly. More concretely, the starting point is: replacing your analyzer

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>

to

<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

and see if the results are as expected. Then research other filters if
your requirements are not met.

Just a reminder: HMMChineseTokenizerFactory does not handle traditional
characters, as I noted in a previous post, so ICUTransformFilterFactory is an
incomplete workaround.

2018年7月21日(土) 0:05 Walter Underwood :

> I expect that this is the line that does the transformation:
>
> id="Traditional-Simplified"/>
>
> This mapping is a standard feature of ICU. More info on ICU transforms is
> in this doc, though not much detail on this particular transform.
>
> http://userguide.icu-project.org/transforms/general
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> wrote:
> >
> > I think so.  I used the exact as in github
> >
> >  > positionIncrementGap="1" autoGeneratePhraseQueries="false">
> >  
> >
> >
> >
> > id="Traditional-Simplified"/>
> > id="Katakana-Hiragana"/>
> >
> > > hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
> >  
> > 
> >
> >
> >
> > On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman  >
> > wrote:
> >
> >> Thanks! That does indeed look promising... This can be added on top of
> >> Smart Chinese, right? Or is it an alternative?
> >>
> >>
> >> --
> >> Dr. Amanda Shuman
> >> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >> 
> >> PhD, University of California, Santa Cruz
> >> http://www.amandashuman.net/
> >> http://www.prchistoryresources.org/
> >> Office: +49 (0) 761 203 4925
> >>
> >>
> >> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> >> wrote:
> >>
> >>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and
> then
> >>> each of A, B or C or D in query and they seems to be matching and CJKFF
> >> is
> >>> transforming the 舊 to 旧
> >>>
> >>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> >>> wrote:
> >>>
>  Lack of my chinese language knowledge but if you want, I can do quick
> >>> test
>  for you in Analysis tab if you can give me what to put in index and
> >> query
>  window...
> 
>  On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar  >
>  wrote:
> 
> > Have you tried to use CJKFoldingFilter https://g
> > ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> >> cover
> > your use case but I am using this filter and so far no issues.
> >
> > Thnx
> >
> > On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> >> amanda.shu...@gmail.com
> 
> > wrote:
> >
> >> Thanks, Alex - I have seen a few of those links but never considered
> >> transliteration! We use lucene's Smart Chinese analyzer. The issue
> is
> >> basically what is laid out in the old blogspot post, namely this
> >> point:
> >>
> >>
> >> "Why approach CJK resource discovery differently?
> >>
> >> 2.  Search results must be as script agnostic as possible.
> >>
> >> There is more than one way to write each word. "Simplified"
> >> characters
> >> were
> >> emphasized for printed materials in mainland China starting in the
> >>> 1950s;
> >> "Traditional" characters were used in printed materials prior to the
> >> 1950s,
> >> and are still used in Taiwan, Hong Kong and Macau today.
> >> Since the characters are distinct, it's as if Chinese materials are
> >> written
> >> in two scripts.
> >> Another way to think about it:  every written Chinese word has at
> >> least
> >> two
> >> completely different spellings.  And it can be mix-n-match:  a word
> >> can
> >> be
> >> written with one traditional  and one simplified character.
> >> Example:   Given a user query 舊小說  (traditional for old fiction),
> the
> >> results should include matches for 舊小說 (traditional) and 旧小说
> >>> (simplified
> >> characters for old fiction)"
> >>
> >> So, using the example provided above, we are dealing with materials
> >> produced in the 1950s-1970s that do even weirder things like:
> >>
> >> A. 舊小說
> >>
> >> can also be
> >>
> >> B. 旧小说 (all simplified)
> >> or
> >> C. 旧小說 (first character simplified, last character traditional)
> >> or
> >> D. 舊小 说 (first character traditional, last character simplified)
> >>
> >> Thankfully the middle character was never simplified in recent
> times.
> >>
> >> From a historical standpoint, the mixed nature of the characters in
> >> the
> >> same word/phrase is because not all simplified characters were
> >> adopted
> >>> at
> >> the same time by everyone uniformly (good times...).
> >>
> >> The problem seems to be that Solr can easily handle A or B above,
> but
> >> NOT C
> >> or D using the Smart Chinese analyzer. I'm not really 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
I expect that this is the line that does the transformation:

   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

This mapping is a standard feature of ICU. More info on ICU transforms is in 
this doc, though not much detail on this particular transform. 

http://userguide.icu-project.org/transforms/general

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 20, 2018, at 7:43 AM, Susheel Kumar  wrote:
> 
> I think so.  I used the exact as in github
> 
>  positionIncrementGap="1" autoGeneratePhraseQueries="false">
>  
>
>
>
> id="Traditional-Simplified"/>
>
>
> hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>  
> 
> 
> 
> 
> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman 
> wrote:
> 
>> Thanks! That does indeed look promising... This can be added on top of
>> Smart Chinese, right? Or is it an alternative?
>> 
>> 
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
>> wrote:
>> 
>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and then
>>> each of A, B or C or D in query and they seems to be matching and CJKFF
>> is
>>> transforming the 舊 to 旧
>>> 
>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
>>> wrote:
>>> 
 Lack of my chinese language knowledge but if you want, I can do quick
>>> test
 for you in Analysis tab if you can give me what to put in index and
>> query
 window...
 
 On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
 wrote:
 
> Have you tried to use CJKFoldingFilter https://g
> ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
>> cover
> your use case but I am using this filter and so far no issues.
> 
> Thnx
> 
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
>> amanda.shu...@gmail.com
 
> wrote:
> 
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this
>> point:
>> 
>> 
>> "Why approach CJK resource discovery differently?
>> 
>> 2.  Search results must be as script agnostic as possible.
>> 
>> There is more than one way to write each word. "Simplified"
>> characters
>> were
>> emphasized for printed materials in mainland China starting in the
>>> 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at
>> least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word
>> can
>> be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说
>>> (simplified
>> characters for old fiction)"
>> 
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>> 
>> A. 舊小說
>> 
>> can also be
>> 
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>> 
>> Thankfully the middle character was never simplified in recent times.
>> 
>> From a historical standpoint, the mixed nature of the characters in
>> the
>> same word/phrase is because not all simplified characters were
>> adopted
>>> at
>> the same time by everyone uniformly (good times...).
>> 
>> The problem seems to be that Solr can easily handle A or B above, but
>> NOT C
>> or D using the Smart Chinese analyzer. I'm not really sure how to
>>> change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>> 
>> Amanda
>> 
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>> Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>> 
>> 
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>> 
>>> This is probably your start, if not read already:
>>> 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think so.  I used the exact fieldType definition from the GitHub README:

[The XML was stripped when this message was archived; a hedged reconstruction
is sketched below.]

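A sketch of what that definition most likely looked like, based on the example
in the sul-dlss/CJKFoldingFilter README; the exact tokenizer, filter order, and
the CJKFoldingFilterFactory class name are assumptions and should be checked
against the project, since only a few attributes survived in the archive:

<fieldType name="text_cjk" class="solr.TextField"
           positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- script-aware tokenization (requires the ICU analysis jars) -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- normalize full-width/half-width forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- variant folding from github.com/sul-dlss/CJKFoldingFilter (assumed class name) -->
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <!-- fold Traditional Han characters to Simplified -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- overlapping bigrams plus unigrams for CJK tokens -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="true" hangul="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>
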
On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman 
wrote:

> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
> wrote:
>
> > I think CJKFoldingFilter will work for you.  I put 舊小說 in index and then
> > each of A, B or C or D in query and they seems to be matching and CJKFF
> is
> > transforming the 舊 to 旧
> >
> > On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> > wrote:
> >
> > > Lack of my chinese language knowledge but if you want, I can do quick
> > test
> > > for you in Analysis tab if you can give me what to put in index and
> query
> > > window...
> > >
> > > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > > wrote:
> > >
> > >> Have you tried to use CJKFoldingFilter https://g
> > >> ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
> cover
> > >> your use case but I am using this filter and so far no issues.
> > >>
> > >> Thnx
> > >>
> > >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> amanda.shu...@gmail.com
> > >
> > >> wrote:
> > >>
> > >>> Thanks, Alex - I have seen a few of those links but never considered
> > >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> > >>> basically what is laid out in the old blogspot post, namely this
> point:
> > >>>
> > >>>
> > >>> "Why approach CJK resource discovery differently?
> > >>>
> > >>> 2.  Search results must be as script agnostic as possible.
> > >>>
> > >>> There is more than one way to write each word. "Simplified"
> characters
> > >>> were
> > >>> emphasized for printed materials in mainland China starting in the
> > 1950s;
> > >>> "Traditional" characters were used in printed materials prior to the
> > >>> 1950s,
> > >>> and are still used in Taiwan, Hong Kong and Macau today.
> > >>> Since the characters are distinct, it's as if Chinese materials are
> > >>> written
> > >>> in two scripts.
> > >>> Another way to think about it:  every written Chinese word has at
> least
> > >>> two
> > >>> completely different spellings.  And it can be mix-n-match:  a word
> can
> > >>> be
> > >>> written with one traditional  and one simplified character.
> > >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> > >>> results should include matches for 舊小說 (traditional) and 旧小说
> > (simplified
> > >>> characters for old fiction)"
> > >>>
> > >>> So, using the example provided above, we are dealing with materials
> > >>> produced in the 1950s-1970s that do even weirder things like:
> > >>>
> > >>> A. 舊小說
> > >>>
> > >>> can also be
> > >>>
> > >>> B. 旧小说 (all simplified)
> > >>> or
> > >>> C. 旧小說 (first character simplified, last character traditional)
> > >>> or
> > >>> D. 舊小 说 (first character traditional, last character simplified)
> > >>>
> > >>> Thankfully the middle character was never simplified in recent times.
> > >>>
> > >>> From a historical standpoint, the mixed nature of the characters in
> the
> > >>> same word/phrase is because not all simplified characters were
> adopted
> > at
> > >>> the same time by everyone uniformly (good times...).
> > >>>
> > >>> The problem seems to be that Solr can easily handle A or B above, but
> > >>> NOT C
> > >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> > change
> > >>> that at this point... maybe I should figure out how to contact the
> > >>> creators
> > >>> of the analyzer and ask them?
> > >>>
> > >>> Amanda
> > >>>
> > >>> --
> > >>> Dr. Amanda Shuman
> > >>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> Project
> > >>> 
> > >>> PhD, University of California, Santa Cruz
> > >>> http://www.amandashuman.net/
> > >>> http://www.prchistoryresources.org/
> > >>> Office: +49 (0) 761 203 4925
> > >>>
> > >>>
> > >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> > >>> arafa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > This is probably your start, if not read already:
> > >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > >>> >
> > >>> > Otherwise, I think your answer would be somewhere around using
> ICU4J,
> > >>> > IBM's library for dealing with Unicode:
> http://site.icu-project.org/
> > >>> > (mentioned on the same page above)
> > >>> > Specifically, transformations:
> > >>> > http://userguide.icu-project.org/transforms/general
> > >>> >
> > >>> > With that, maybe you map both alphabets into latin. I did that once
> > >>> > for Thai for a demo:
> > >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> > >>> > 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Hi,

There is ICUTransformFilter (included in the Solr distribution) which should
also work for you.
See the example settings:
https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#icu-transform-filter

Combine it with HMMChineseTokenizer.
https://lucene.apache.org/solr/guide/7_4/language-analysis.html#hmm-chinese-tokenizer

In other words, replace your SmartChineseAnalyzer settings with an
HMMChineseTokenizer & ICUTransformFilter pipeline (a sketch follows below).

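A minimal sketch of that pipeline, assuming the ICU and smartcn analysis jars
from the analysis-extras contrib are on the classpath; the field type name is
made up, and note the caveat below about the transform running only after
tokenization:

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- HMM-based word segmentation; the model expects Simplified Chinese -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- fold Traditional forms to Simplified, token by token -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>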

Here is a bit complicated explanation, so you can skip if you do not want
to go into analyzer details.

I do not understand Chinese, but it seems there are no easy, one-stop
solutions in my view. (As Japanese speakers, we have similar problems with Chinese.)

HMMChineseTokenizer expects Simplified Chinese text.
See:
https://lucene.apache.org/core/7_4_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizer.html

So you should transform all traditional Chinese characters **before**
applying HMMChineseTokenizer, using CharFilters; otherwise the Tokenizer does
not work correctly.

Unfortunately, there is no such CharFilter as far as I know.
ICUNormalizer2CharFilter does not handle such a transformation, so it is no
help. CJKFoldingFilter and ICUTransformFilter do the
traditional-simplified transformation; however, they are TokenFilters that
work after a Tokenizer has been applied.

I think you need two steps if you want to use HMMChineseTokenizer correctly.

1. Transform all traditional characters to simplified ones and save them to
temporary files.
I do not have a clear idea of the best way to do this, but you can create a
Java program that calls Lucene's ICUTransformFilter (a sketch follows below).
2. Then index to Solr using SmartChineseAnalyzer.

Regards,
Tomoko
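
For step 1, a minimal sketch (not from the thread) using the ICU4J
Transliterator that Lucene's ICUTransformFilter wraps; the class name and file
handling are placeholders, and only the "Traditional-Simplified" rule id is the
real ICU transform:

import com.ibm.icu.text.Transliterator;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TraditionalToSimplified {
    public static void main(String[] args) throws IOException {
        // same rule set ICUTransformFilterFactory uses for id="Traditional-Simplified"
        Transliterator t2s = Transliterator.getInstance("Traditional-Simplified");

        Path in = Paths.get(args[0]);   // original, possibly mixed-script file
        Path out = Paths.get(args[1]);  // simplified-only copy to index

        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, t2s.transliterate(text).getBytes(StandardCharsets.UTF_8));
    }
}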

2018年7月20日(金) 22:12 Susheel Kumar :

> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and then
> each of A, B or C or D in query and they seems to be matching and CJKFF is
> transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
>
> > Lack of my chinese language knowledge but if you want, I can do quick
> test
> > for you in Analysis tab if you can give me what to put in index and query
> > window...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter https://g
> >> ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman  >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> --
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> 
> >>> PhD, University of California, Santa Cruz
> >>> 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks! That does indeed look promising... This can be added on top of
Smart Chinese, right? Or is it an alternative?


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar 
wrote:

> I think CJKFoldingFilter will work for you.  I put 舊小說 in index and then
> each of A, B or C or D in query and they seems to be matching and CJKFF is
> transforming the 舊 to 旧
>
> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
> wrote:
>
> > Lack of my chinese language knowledge but if you want, I can do quick
> test
> > for you in Analysis tab if you can give me what to put in index and query
> > window...
> >
> > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> > wrote:
> >
> >> Have you tried to use CJKFoldingFilter https://g
> >> ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
> >> your use case but I am using this filter and so far no issues.
> >>
> >> Thnx
> >>
> >> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman  >
> >> wrote:
> >>
> >>> Thanks, Alex - I have seen a few of those links but never considered
> >>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> >>> basically what is laid out in the old blogspot post, namely this point:
> >>>
> >>>
> >>> "Why approach CJK resource discovery differently?
> >>>
> >>> 2.  Search results must be as script agnostic as possible.
> >>>
> >>> There is more than one way to write each word. "Simplified" characters
> >>> were
> >>> emphasized for printed materials in mainland China starting in the
> 1950s;
> >>> "Traditional" characters were used in printed materials prior to the
> >>> 1950s,
> >>> and are still used in Taiwan, Hong Kong and Macau today.
> >>> Since the characters are distinct, it's as if Chinese materials are
> >>> written
> >>> in two scripts.
> >>> Another way to think about it:  every written Chinese word has at least
> >>> two
> >>> completely different spellings.  And it can be mix-n-match:  a word can
> >>> be
> >>> written with one traditional  and one simplified character.
> >>> Example:   Given a user query 舊小說  (traditional for old fiction), the
> >>> results should include matches for 舊小說 (traditional) and 旧小说
> (simplified
> >>> characters for old fiction)"
> >>>
> >>> So, using the example provided above, we are dealing with materials
> >>> produced in the 1950s-1970s that do even weirder things like:
> >>>
> >>> A. 舊小說
> >>>
> >>> can also be
> >>>
> >>> B. 旧小说 (all simplified)
> >>> or
> >>> C. 旧小說 (first character simplified, last character traditional)
> >>> or
> >>> D. 舊小 说 (first character traditional, last character simplified)
> >>>
> >>> Thankfully the middle character was never simplified in recent times.
> >>>
> >>> From a historical standpoint, the mixed nature of the characters in the
> >>> same word/phrase is because not all simplified characters were adopted
> at
> >>> the same time by everyone uniformly (good times...).
> >>>
> >>> The problem seems to be that Solr can easily handle A or B above, but
> >>> NOT C
> >>> or D using the Smart Chinese analyzer. I'm not really sure how to
> change
> >>> that at this point... maybe I should figure out how to contact the
> >>> creators
> >>> of the analyzer and ask them?
> >>>
> >>> Amanda
> >>>
> >>> --
> >>> Dr. Amanda Shuman
> >>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> >>> 
> >>> PhD, University of California, Santa Cruz
> >>> http://www.amandashuman.net/
> >>> http://www.prchistoryresources.org/
> >>> Office: +49 (0) 761 203 4925
> >>>
> >>>
> >>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> >>> arafa...@gmail.com>
> >>> wrote:
> >>>
> >>> > This is probably your start, if not read already:
> >>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >>> >
> >>> > Otherwise, I think your answer would be somewhere around using ICU4J,
> >>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> >>> > (mentioned on the same page above)
> >>> > Specifically, transformations:
> >>> > http://userguide.icu-project.org/transforms/general
> >>> >
> >>> > With that, maybe you map both alphabets into latin. I did that once
> >>> > for Thai for a demo:
> >>> > https://github.com/arafalov/solr-thai-test/blob/master/
> >>> > collection1/conf/schema.xml#L34
> >>> >
> >>> > The challenge is to figure out all the magic rules for that. You'd
> >>> > have to dig through the ICU documentation and other web pages. I
> found
> >>> > this one for example:
> >>> > http://avajava.com/tutorials/lessons/what-are-the-system-
> >>> > transliterators-available-with-icu4j.html;jsessionid=
> >>> > BEAB0AF05A588B97B8A2393054D908C0
> >>> >
> >>> > There is also 12 part series on Solr and 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think CJKFoldingFilter will work for you.  I put 舊小說 in the index and then
each of A, B, C, or D in the query, and they seem to match; CJKFF is
transforming the 舊 to 旧.

On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar 
wrote:

> Lack of my chinese language knowledge but if you want, I can do quick test
> for you in Analysis tab if you can give me what to put in index and query
> window...
>
> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
> wrote:
>
>> Have you tried to use CJKFoldingFilter https://g
>> ithub.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would cover
>> your use case but I am using this filter and so far no issues.
>>
>> Thnx
>>
>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
>> wrote:
>>
>>> Thanks, Alex - I have seen a few of those links but never considered
>>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>>> basically what is laid out in the old blogspot post, namely this point:
>>>
>>>
>>> "Why approach CJK resource discovery differently?
>>>
>>> 2.  Search results must be as script agnostic as possible.
>>>
>>> There is more than one way to write each word. "Simplified" characters
>>> were
>>> emphasized for printed materials in mainland China starting in the 1950s;
>>> "Traditional" characters were used in printed materials prior to the
>>> 1950s,
>>> and are still used in Taiwan, Hong Kong and Macau today.
>>> Since the characters are distinct, it's as if Chinese materials are
>>> written
>>> in two scripts.
>>> Another way to think about it:  every written Chinese word has at least
>>> two
>>> completely different spellings.  And it can be mix-n-match:  a word can
>>> be
>>> written with one traditional  and one simplified character.
>>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>>> characters for old fiction)"
>>>
>>> So, using the example provided above, we are dealing with materials
>>> produced in the 1950s-1970s that do even weirder things like:
>>>
>>> A. 舊小說
>>>
>>> can also be
>>>
>>> B. 旧小说 (all simplified)
>>> or
>>> C. 旧小說 (first character simplified, last character traditional)
>>> or
>>> D. 舊小 说 (first character traditional, last character simplified)
>>>
>>> Thankfully the middle character was never simplified in recent times.
>>>
>>> From a historical standpoint, the mixed nature of the characters in the
>>> same word/phrase is because not all simplified characters were adopted at
>>> the same time by everyone uniformly (good times...).
>>>
>>> The problem seems to be that Solr can easily handle A or B above, but
>>> NOT C
>>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>>> that at this point... maybe I should figure out how to contact the
>>> creators
>>> of the analyzer and ask them?
>>>
>>> Amanda
>>>
>>> --
>>> Dr. Amanda Shuman
>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>> 
>>> PhD, University of California, Santa Cruz
>>> http://www.amandashuman.net/
>>> http://www.prchistoryresources.org/
>>> Office: +49 (0) 761 203 4925
>>>
>>>
>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>>> arafa...@gmail.com>
>>> wrote:
>>>
>>> > This is probably your start, if not read already:
>>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>> >
>>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>>> > (mentioned on the same page above)
>>> > Specifically, transformations:
>>> > http://userguide.icu-project.org/transforms/general
>>> >
>>> > With that, maybe you map both alphabets into latin. I did that once
>>> > for Thai for a demo:
>>> > https://github.com/arafalov/solr-thai-test/blob/master/
>>> > collection1/conf/schema.xml#L34
>>> >
>>> > The challenge is to figure out all the magic rules for that. You'd
>>> > have to dig through the ICU documentation and other web pages. I found
>>> > this one for example:
>>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>>> > transliterators-available-with-icu4j.html;jsessionid=
>>> > BEAB0AF05A588B97B8A2393054D908C0
>>> >
>>> > There is also 12 part series on Solr and Asian text processing, though
>>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>>> >
>>> > Hope one of these things help.
>>> >
>>> > Regards,
>>> >Alex.
>>> >
>>> >
>>> > On 20 July 2018 at 03:54, Amanda Shuman 
>>> wrote:
>>> > > Hi all,
>>> > >
>>> > > We have a problem. Some of our historical documents have mixed
>>> together
>>> > > simplified and Chinese characters. There seems to be no problem when
>>> > > searching either traditional or simplified separately - that is, if a
>>> > > particular string/phrase is all in traditional or simplified, it
>>> finds
>>> > it -
>>> > > but it does not find the string/phrase if the two different
>>> characters
>>> 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I lack Chinese language knowledge, but if you want, I can do a quick test
for you in the Analysis tab if you give me what to put in the index and query
windows...

On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar 
wrote:

> Have you tried to use CJKFoldingFilter https://github.com/sul-dlss/
> CJKFoldingFilter.  I am not sure if this would cover your use case but I
> am using this filter and so far no issues.
>
> Thnx
>
> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
> wrote:
>
>> Thanks, Alex - I have seen a few of those links but never considered
>> transliteration! We use lucene's Smart Chinese analyzer. The issue is
>> basically what is laid out in the old blogspot post, namely this point:
>>
>>
>> "Why approach CJK resource discovery differently?
>>
>> 2.  Search results must be as script agnostic as possible.
>>
>> There is more than one way to write each word. "Simplified" characters
>> were
>> emphasized for printed materials in mainland China starting in the 1950s;
>> "Traditional" characters were used in printed materials prior to the
>> 1950s,
>> and are still used in Taiwan, Hong Kong and Macau today.
>> Since the characters are distinct, it's as if Chinese materials are
>> written
>> in two scripts.
>> Another way to think about it:  every written Chinese word has at least
>> two
>> completely different spellings.  And it can be mix-n-match:  a word can be
>> written with one traditional  and one simplified character.
>> Example:   Given a user query 舊小說  (traditional for old fiction), the
>> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
>> characters for old fiction)"
>>
>> So, using the example provided above, we are dealing with materials
>> produced in the 1950s-1970s that do even weirder things like:
>>
>> A. 舊小說
>>
>> can also be
>>
>> B. 旧小说 (all simplified)
>> or
>> C. 旧小說 (first character simplified, last character traditional)
>> or
>> D. 舊小 说 (first character traditional, last character simplified)
>>
>> Thankfully the middle character was never simplified in recent times.
>>
>> From a historical standpoint, the mixed nature of the characters in the
>> same word/phrase is because not all simplified characters were adopted at
>> the same time by everyone uniformly (good times...).
>>
>> The problem seems to be that Solr can easily handle A or B above, but NOT
>> C
>> or D using the Smart Chinese analyzer. I'm not really sure how to change
>> that at this point... maybe I should figure out how to contact the
>> creators
>> of the analyzer and ask them?
>>
>> Amanda
>>
>> --
>> Dr. Amanda Shuman
>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>> 
>> PhD, University of California, Santa Cruz
>> http://www.amandashuman.net/
>> http://www.prchistoryresources.org/
>> Office: +49 (0) 761 203 4925
>>
>>
>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> > This is probably your start, if not read already:
>> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>> >
>> > Otherwise, I think your answer would be somewhere around using ICU4J,
>> > IBM's library for dealing with Unicode: http://site.icu-project.org/
>> > (mentioned on the same page above)
>> > Specifically, transformations:
>> > http://userguide.icu-project.org/transforms/general
>> >
>> > With that, maybe you map both alphabets into latin. I did that once
>> > for Thai for a demo:
>> > https://github.com/arafalov/solr-thai-test/blob/master/
>> > collection1/conf/schema.xml#L34
>> >
>> > The challenge is to figure out all the magic rules for that. You'd
>> > have to dig through the ICU documentation and other web pages. I found
>> > this one for example:
>> > http://avajava.com/tutorials/lessons/what-are-the-system-
>> > transliterators-available-with-icu4j.html;jsessionid=
>> > BEAB0AF05A588B97B8A2393054D908C0
>> >
>> > There is also 12 part series on Solr and Asian text processing, though
>> > it is a bit old now: http://discovery-grindstone.blogspot.com/
>> >
>> > Hope one of these things help.
>> >
>> > Regards,
>> >Alex.
>> >
>> >
>> > On 20 July 2018 at 03:54, Amanda Shuman 
>> wrote:
>> > > Hi all,
>> > >
>> > > We have a problem. Some of our historical documents have mixed
>> together
>> > > simplified and Chinese characters. There seems to be no problem when
>> > > searching either traditional or simplified separately - that is, if a
>> > > particular string/phrase is all in traditional or simplified, it finds
>> > it -
>> > > but it does not find the string/phrase if the two different characters
>> > (one
>> > > traditional, one simplified) are mixed together in the SAME
>> > string/phrase.
>> > >
>> > > Has anyone ever handled this problem before? I know some libraries
>> seem
>> > to
>> > > have implemented something that seems to be able to handle this, but
>> I'm
>> > > not sure how they did so!
>> > >
>> > > Amanda
>> > > --
>> > > Dr. Amanda Shuman
>> > 

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
Have you tried to use CJKFoldingFilter
https://github.com/sul-dlss/CJKFoldingFilter.  I am not sure if this would
cover your use case but I am using this filter and so far no issues.

Thnx

On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman 
wrote:

> Thanks, Alex - I have seen a few of those links but never considered
> transliteration! We use lucene's Smart Chinese analyzer. The issue is
> basically what is laid out in the old blogspot post, namely this point:
>
>
> "Why approach CJK resource discovery differently?
>
> 2.  Search results must be as script agnostic as possible.
>
> There is more than one way to write each word. "Simplified" characters were
> emphasized for printed materials in mainland China starting in the 1950s;
> "Traditional" characters were used in printed materials prior to the 1950s,
> and are still used in Taiwan, Hong Kong and Macau today.
> Since the characters are distinct, it's as if Chinese materials are written
> in two scripts.
> Another way to think about it:  every written Chinese word has at least two
> completely different spellings.  And it can be mix-n-match:  a word can be
> written with one traditional  and one simplified character.
> Example:   Given a user query 舊小說  (traditional for old fiction), the
> results should include matches for 舊小說 (traditional) and 旧小说 (simplified
> characters for old fiction)"
>
> So, using the example provided above, we are dealing with materials
> produced in the 1950s-1970s that do even weirder things like:
>
> A. 舊小說
>
> can also be
>
> B. 旧小说 (all simplified)
> or
> C. 旧小說 (first character simplified, last character traditional)
> or
> D. 舊小 说 (first character traditional, last character simplified)
>
> Thankfully the middle character was never simplified in recent times.
>
> From a historical standpoint, the mixed nature of the characters in the
> same word/phrase is because not all simplified characters were adopted at
> the same time by everyone uniformly (good times...).
>
> The problem seems to be that Solr can easily handle A or B above, but NOT C
> or D using the Smart Chinese analyzer. I'm not really sure how to change
> that at this point... maybe I should figure out how to contact the creators
> of the analyzer and ask them?
>
> Amanda
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch  >
> wrote:
>
> > This is probably your start, if not read already:
> > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> >
> > Otherwise, I think your answer would be somewhere around using ICU4J,
> > IBM's library for dealing with Unicode: http://site.icu-project.org/
> > (mentioned on the same page above)
> > Specifically, transformations:
> > http://userguide.icu-project.org/transforms/general
> >
> > With that, maybe you map both alphabets into latin. I did that once
> > for Thai for a demo:
> > https://github.com/arafalov/solr-thai-test/blob/master/
> > collection1/conf/schema.xml#L34
> >
> > The challenge is to figure out all the magic rules for that. You'd
> > have to dig through the ICU documentation and other web pages. I found
> > this one for example:
> > http://avajava.com/tutorials/lessons/what-are-the-system-
> > transliterators-available-with-icu4j.html;jsessionid=
> > BEAB0AF05A588B97B8A2393054D908C0
> >
> > There is also 12 part series on Solr and Asian text processing, though
> > it is a bit old now: http://discovery-grindstone.blogspot.com/
> >
> > Hope one of these things help.
> >
> > Regards,
> >Alex.
> >
> >
> > On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> > > Hi all,
> > >
> > > We have a problem. Some of our historical documents have mixed together
> > > simplified and Chinese characters. There seems to be no problem when
> > > searching either traditional or simplified separately - that is, if a
> > > particular string/phrase is all in traditional or simplified, it finds
> > it -
> > > but it does not find the string/phrase if the two different characters
> > (one
> > > traditional, one simplified) are mixed together in the SAME
> > string/phrase.
> > >
> > > Has anyone ever handled this problem before? I know some libraries seem
> > to
> > > have implemented something that seems to be able to handle this, but
> I'm
> > > not sure how they did so!
> > >
> > > Amanda
> > > --
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > 
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> >
>


Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks, Alex - I have seen a few of those links but never considered
transliteration! We use lucene's Smart Chinese analyzer. The issue is
basically what is laid out in the old blogspot post, namely this point:


"Why approach CJK resource discovery differently?

2.  Search results must be as script agnostic as possible.

There is more than one way to write each word. "Simplified" characters were
emphasized for printed materials in mainland China starting in the 1950s;
"Traditional" characters were used in printed materials prior to the 1950s,
and are still used in Taiwan, Hong Kong and Macau today.
Since the characters are distinct, it's as if Chinese materials are written
in two scripts.
Another way to think about it:  every written Chinese word has at least two
completely different spellings.  And it can be mix-n-match:  a word can be
written with one traditional  and one simplified character.
Example:   Given a user query 舊小說  (traditional for old fiction), the
results should include matches for 舊小說 (traditional) and 旧小说 (simplified
characters for old fiction)"

So, using the example provided above, we are dealing with materials
produced in the 1950s-1970s that do even weirder things like:

A. 舊小說

can also be

B. 旧小说 (all simplified)
or
C. 旧小說 (first character simplified, last character traditional)
or
D. 舊小 说 (first character traditional, last character simplified)

Thankfully the middle character was never simplified in recent times.

From a historical standpoint, the mixed nature of the characters in the
same word/phrase is because not all simplified characters were adopted at
the same time by everyone uniformly (good times...).

The problem seems to be that Solr can easily handle A or B above, but NOT C
or D using the Smart Chinese analyzer. I'm not really sure how to change
that at this point... maybe I should figure out how to contact the creators
of the analyzer and ask them?

Amanda

--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch 
wrote:

> This is probably your start, if not read already:
> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>
> Otherwise, I think your answer would be somewhere around using ICU4J,
> IBM's library for dealing with Unicode: http://site.icu-project.org/
> (mentioned on the same page above)
> Specifically, transformations:
> http://userguide.icu-project.org/transforms/general
>
> With that, maybe you map both alphabets into latin. I did that once
> for Thai for a demo:
> https://github.com/arafalov/solr-thai-test/blob/master/
> collection1/conf/schema.xml#L34
>
> The challenge is to figure out all the magic rules for that. You'd
> have to dig through the ICU documentation and other web pages. I found
> this one for example:
> http://avajava.com/tutorials/lessons/what-are-the-system-
> transliterators-available-with-icu4j.html;jsessionid=
> BEAB0AF05A588B97B8A2393054D908C0
>
> There is also 12 part series on Solr and Asian text processing, though
> it is a bit old now: http://discovery-grindstone.blogspot.com/
>
> Hope one of these things help.
>
> Regards,
>Alex.
>
>
> On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> > Hi all,
> >
> > We have a problem. Some of our historical documents have mixed together
> > simplified and Chinese characters. There seems to be no problem when
> > searching either traditional or simplified separately - that is, if a
> > particular string/phrase is all in traditional or simplified, it finds
> it -
> > but it does not find the string/phrase if the two different characters
> (one
> > traditional, one simplified) are mixed together in the SAME
> string/phrase.
> >
> > Has anyone ever handled this problem before? I know some libraries seem
> to
> > have implemented something that seems to be able to handle this, but I'm
> > not sure how they did so!
> >
> > Amanda
> > --
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > 
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
>


Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
This is probably your start, if not read already:
https://lucene.apache.org/solr/guide/7_4/language-analysis.html

Otherwise, I think your answer would be somewhere around using ICU4J,
IBM's library for dealing with Unicode: http://site.icu-project.org/
(mentioned on the same page above)
Specifically, transformations:
http://userguide.icu-project.org/transforms/general

With that, maybe you map both alphabets into latin. I did that once
for Thai for a demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34

The challenge is to figure out all the magic rules for that. You'd
have to dig through the ICU documentation and other web pages. I found
this one for example:
http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0

There is also 12 part series on Solr and Asian text processing, though
it is a bit old now: http://discovery-grindstone.blogspot.com/

Hope one of these things helps.

Regards,
   Alex.
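
One way to realize the "map both scripts to Latin" idea with the ICU transform
filter is sketched here; this is not from the thread, the type name is made
up, and "Han-Latin" is the standard ICU transliteration rule (pinyin-style
romanization):

<fieldType name="text_zh_latin" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- fold Traditional to Simplified first, then romanize, so both scripts
         end up in the same Latin form -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Han-Latin"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>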


On 20 July 2018 at 03:54, Amanda Shuman  wrote:
> Hi all,
>
> We have a problem. Some of our historical documents have mixed together
> simplified and Chinese characters. There seems to be no problem when
> searching either traditional or simplified separately - that is, if a
> particular string/phrase is all in traditional or simplified, it finds it -
> but it does not find the string/phrase if the two different characters (one
> traditional, one simplified) are mixed together in the SAME string/phrase.
>
> Has anyone ever handled this problem before? I know some libraries seem to
> have implemented something that seems to be able to handle this, but I'm
> not sure how they did so!
>
> Amanda
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925


Re: Question regarding TLS version for solr

2018-05-24 Thread Christopher Schultz
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Anchal,

On 5/24/18 6:02 AM, Anchal Sharma2 wrote:
> Thanks a lot for sharing the steps . I tried few of them .Actually
> we already have been using solr in our application since an year or
> so  .We just want to encrypt it to use secure solr now .So ,I
> followed the steps where you have created the certificates ,etc
> .But when I go to start the solr back ,it doesnt start . We are
> using zookeeper .Following is the error I get ,on running solr
> start command.
> 
> Command:./solr -c -m 1g -p 8984 -z :2181 -s  folder containing data>
> 
> Error:
> 
> lsof 4.55 (latest revision at
> ftp://vic.cc.purdue.edu/pub/tools/unix/lsof) usage:
> [-?abhlnNoOPRstUvVX] [-c c] [+|-d s] [+|-D D] [+|-f[cfgGn]] [-F
> [f]] [-g [s]] [-i [i]] [+|-L [l]] [-m m] [+|-M] [-o [o]] [-p s] 
> [+|-r [t]] [-S [t]] [-T [t]] [-u s] [+|-w] [--] [names] Use the
> ``-h'' option to get more help information. Still not seeing Solr
> listening on 8984 after 30 seconds! at
> java.security.KeyStore.load(KeyStore.java:1456) at
> org.eclipse.jetty.util.security.CertificateUtils.getKeyStore(Certifica
teUtils.java:55)
>
> 
at
org.eclipse.jetty.util.ssl.SslContextFactory.loadKeyStore(SslContextFact
ory.java:871)
> at
> org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory
.java:273)
>
> 
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCyc
le.java:68)
> at
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLif
eCycle.java:132)
>
> 
at
org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLif
eCycle.java:114)
> at
> org.eclipse.jetty.server.SslConnectionFactory.doStart(SslConnectionFac
tory.java:64)
>
> 
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCyc
le.java:68)
> at
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLif
eCycle.java:132)
>
> 
at
org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLif
eCycle.java:114)
> at
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.j
ava:256)
>
> 
at
org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetwor
kConnector.java:81)
> at
> org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:
236)
>
> 
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCyc
le.java:68)
> at org.eclipse.jetty.server.Server.doStart(Server.java:366) at
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeC
ycle.java:68)
>
> 
at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:12
55)
> at
> java.security.AccessController.doPrivileged(AccessController.java:594)
>
> 
at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:117
4)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.j
ava:90)
>
> 
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
Impl.java:55)
> at java.lang.reflect.Method.invoke(Method.java:508) at
> org.eclipse.jetty.start.Main.invokeMain(Main.java:321) at
> org.eclipse.jetty.start.Main.start(Main.java:817) at
> org.eclipse.jetty.start.Main.main(Main.java:112) 2018-05-24
> 09:05:16.714 INFO
> (zkCallback-3-thread-1-processing-n:9.109.122.113:8984_solr) [   ]
> o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent
> state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
> has occurred - updating... (live nodes size: 1) 2018-05-24
> 09:05:17.018 INFO
> (zkCallback-3-thread-1-processing-n:9.109.122.113:8984_solr) [   ]
> o.a.s.c.c.ZkStateReader Updated cluster state version to 9702 
> 2018-05-24 09:05:17.153 INFO
> (coreLoadExecutor-7-thread-2-processing-n:9.109.122.113:8984_solr)
> [c:document  r:core_node1 x:document] o.a.s.u.SolrIndexConfig
> IndexWriter infoStream solr logging is enabled [\]  sleep: bad
> character in argument


What does the solr.log file say? The above stack trace isn't terribly
helpful, and it's incomplete.

- -chris
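
A sanity check worth running here (not from the thread): the truncated trace
fails inside KeyStore.load(), which usually means the path, password, or store
type configured in solr.in.sh does not match the keystore on disk. Using the
example paths from the deployment notes (adjust to your install):

  # can the keystore named in SOLR_SSL_KEY_STORE be opened with the
  # configured SOLR_SSL_KEY_STORE_PASSWORD and SOLR_SSL_KEY_STORE_TYPE?
  keytool -list -keystore /etc/solr/solr.p12 -storetype PKCS12 -storepass 'whatever'

  # and can the user running Solr read the file at all?
  ls -l /etc/solr/solr.p12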

> -Christopher Schultz <ch...@christopherschultz.net> wrote:
> - To: solr-user@lucene.apache.org From: Christopher Schultz
> <ch...@christopherschultz.net> Date: 05/23/2018 07:29PM Subject:
> Re: Question regarding TLS version for solr
> 
> Anchal,
> 
> On 5/23/18 2:38 AM, Anchal Sharma2 wrote:
>> Thank you for replying .But ,I checked the java version solr
>> using ,and it is already  version 1.8.
> 
>> @Christopher ,can you let me know what steps you followed for
>> TLS authentication on solr version 7.3.0.
> 
> Sure. Here are my deployment notes. You may have to adjust them 
> slightly for your environment. Note that we are using standalone

Re: Question regarding TLS version for solr

2018-05-24 Thread Anchal Sharma2
Hi Chris,

Thanks a lot for sharing the steps.
I tried a few of them. We have already been using Solr in our application for a
year or so; we just want to enable security on it now. So I followed the steps
where you created the certificates, etc., but when I go to start Solr back up,
it doesn't start.
We are using ZooKeeper. The following is the error I get on running the solr
start command.

Command:./solr -c -m 1g -p 8984 -z :2181 -s 

Error:

lsof 4.55 (latest revision at ftp://vic.cc.purdue.edu/pub/tools/unix/lsof)
 usage: [-?abhlnNoOPRstUvVX] [-c c] [+|-d s] [+|-D D] [+|-f[cfgGn]]
 [-F [f]] [-g [s]] [-i [i]] [+|-L [l]] [-m m] [+|-M] [-o [o]] [-p s]
 [+|-r [t]] [-S [t]] [-T [t]] [-u s] [+|-w] [--] [names]
Use the ``-h'' option to get more help information.
Still not seeing Solr listening on 8984 after 30 seconds!
at java.security.KeyStore.load(KeyStore.java:1456)
at 
org.eclipse.jetty.util.security.CertificateUtils.getKeyStore(CertificateUtils.java:55)
at 
org.eclipse.jetty.util.ssl.SslContextFactory.loadKeyStore(SslContextFactory.java:871)
at 
org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory.java:273)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at 
org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
at 
org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
at 
org.eclipse.jetty.server.SslConnectionFactory.doStart(SslConnectionFactory.java:64)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at 
org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
at 
org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
at 
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:256)
at 
org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:81)
at 
org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.server.Server.doStart(Server.java:366)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at 
org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1255)
at 
java.security.AccessController.doPrivileged(AccessController.java:594)
at 
org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:508)
at org.eclipse.jetty.start.Main.invokeMain(Main.java:321)
at org.eclipse.jetty.start.Main.start(Main.java:817)
at org.eclipse.jetty.start.Main.main(Main.java:112)
2018-05-24 09:05:16.714 INFO  
(zkCallback-3-thread-1-processing-n:9.109.122.113:8984_solr) [   ] 
o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent 
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred 
- updating... (live nodes size: 1)
2018-05-24 09:05:17.018 INFO  
(zkCallback-3-thread-1-processing-n:9.109.122.113:8984_solr) [   ] 
o.a.s.c.c.ZkStateReader Updated cluster state version to 9702
2018-05-24 09:05:17.153 INFO  
(coreLoadExecutor-7-thread-2-processing-n:9.109.122.113:8984_solr) [c:document  
r:core_node1 x:document] o.a.s.u.SolrIndexConfig IndexWriter infoStream solr 
logging is enabled
 [\]  sleep: bad character in argument   
 
Thanks & Regards,
-
Anchal Sharma
e-Pricer Development
ES Team
Mobile: +9871290248

-Christopher Schultz <ch...@christopherschultz.net> wrote: -
To: solr-user@lucene.apache.org
From: Christopher Schultz <ch...@christopherschultz.net>
Date: 05/23/2018 07:29PM
Subject: Re: Question regarding TLS version for solr

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Anchal,

On 5/23/18 2:38 AM, Anchal Sharma2 wrote:
> Thank you for replying .But ,I checked the java version solr using
> ,and it is already  version 1.8.
> 
> @Christopher ,can you let me know what steps you followed for TLS
> authentication on solr version 7.3.0.

Sure. Here are my deployment notes. You may have to adjust them
slightly for your environment. Note that we are using standalone Solr
without any Zookeeper, clustering, etc. This is just about configuring
a single instance. Also, this guide says 7.3.0, but 7.3.1 would be
better as it contains a fix for a CVE.

=== CUT ===

===

Re: Question regarding TLS version for solr

2018-05-23 Thread Christopher Schultz
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Anchal,

On 5/23/18 2:38 AM, Anchal Sharma2 wrote:
> Thank you for replying .But ,I checked the java version solr using
> ,and it is already  version 1.8.
> 
> @Christopher ,can you let me know what steps you followed for TLS
> authentication on solr version 7.3.0.

Sure. Here are my deployment notes. You may have to adjust them
slightly for your environment. Note that we are using standalone Solr
without any Zookeeper, clustering, etc. This is just about configuring
a single instance. Also, this guide says 7.3.0, but 7.3.1 would be
better as it contains a fix for a CVE.

=== CUT ===


 Instructions for installing Solr and working with Cores


Installation
- 

Installing Solr is fairly simple. One can simply untar the distribution
tarball and work from that directory, but it is better to install it
in a somewhat more centralized place with a separate data directory
to facilitate upgrades, etc.

1. Obtain the distribution tarball
   Go to https://lucene.apache.org/solr/mirrors-solr-latest-redir.html
   and obtain the latest supported version of Solr.
   (7.3.0 as of this writing).

2. Untar the archive
   $ tar xzf solr-x.y.x.tgz

3. Install Solr
   $ cd solr-x.y.z
   $ sudo bin/install_solr_service.sh ../solr-x.y.z.tgz \
 -i /usr/local \
 -d /mnt/securefs/solr \
 -n
   (that last -n says "don't start Solr")

4. Configure Solr Settings
   Edit the file /etc/default/solr.in.sh

   Settings you may want to explicitly set:

   SOLR_JAVA_HOME=(java home)
   SOLR_HEAP="1024M"

5. Configure Solr for TLS
   Create a server key and certificate:
   $ sudo mkdir /etc/solr
   $ sudo keytool -genkey -keyalg EC -sigalg SHA256withECDSA -keysize 256 \
          -validity 730 -alias 'solr-ssl' -keystore /etc/solr/solr.p12 \
          -storetype PKCS12 -ext san=dns:localhost,ip:192.168.10.20
 Use the following information for the certificate:
 First and Last name: 192.168.10.20 (or "localhost", or your
IP address)
 Org unit:  [whatever]
 Everything else should be obvious

   Now, export the public key from the keystore.

   $ sudo /usr/local/java-8/bin/keytool -list -rfc -keystore /etc/solr/solr.p12 \
          -storetype PKCS12 -alias solr-ssl

   Copy that certificate and paste it into this command's stdin:

   $ sudo keytool -importcert -keystore /etc/solr/solr-server.p12 \
          -storetype PKCS12 -alias 'solr-ssl'

   Now, fix the ownership and permissions on these files:

   $ sudo chown root:solr /etc/solr/solr.p12 /etc/solr/solr-server.p12
   $ sudo chmod 0640 /etc/solr/solr.p12

   Edit the file /etc/default/solr.in.sh

   Set the following settings:

   SOLR_SSL_KEY_STORE=/etc/solr/solr.p12
   SOLR_SSL_KEY_STORE_TYPE=PKCS12
   SOLR_SSL_KEY_STORE_PASSWORD=whatever

   # You MUST set the trust store for some reason.
   SOLR_SSL_TRUST_STORE=/etc/solr/solr-server.p12
   SOLR_SSL_TRUST_STORE_TYPE=PKCS12
   SOLR_SSL_TRUST_STORE_PASSWORD=whatever

   Then, patch the file bin/post; you are going to need this, later.

--- bin/post    2017-09-03 13:29:15.0 -0400
+++ /usr/local/solr/bin/post    2018-04-11 20:08:17.0 -0400
@@ -231,8 +231,8 @@
   PROPS+=('-Drecursive=yes')
 fi
 
-echo "$JAVA" -classpath "${TOOL_JAR[0]}" "${PROPS[@]}" org.apache.solr.util.SimplePostTool "${PARAMS[@]}"
-"$JAVA" -classpath "${TOOL_JAR[0]}" "${PROPS[@]}" org.apache.solr.util.SimplePostTool "${PARAMS[@]}"
+echo "$JAVA" -classpath "${TOOL_JAR[0]}" "${PROPS[@]}" ${SOLR_POST_OPTS} org.apache.solr.util.SimplePostTool "${PARAMS[@]}"
+"$JAVA" -classpath "${TOOL_JAR[0]}" "${PROPS[@]}" ${SOLR_POST_OPTS} org.apache.solr.util.SimplePostTool "${PARAMS[@]}"

6. Configure Solr to Require Client TLS Certificates

  On each client, create a client key and certificate:

  $ keytool -genkey -keyalg EC -sigalg SHA256withECDSA -keysize 256 \
-validity 730 -alias 'solr-client-ssl'

  Now dump the certificate for the next step:

  $ keytool -exportcert -keystore [client-key-store] -storetype PKCS12 \
-alias 'solr-client-ssl'

  Don't forget that you might want to generate your own client certificate
  to use from your own web browser if you want to be able to connect to the
  server's dashboard.

  Use the output of that command on each client to put the cert(s)
into this
  trust store on the server:

  $ sudo keytool -importcert -keystore /etc/solr/solr-trusted-clients.p12 \
         -storetype PKCS12 -alias '[client key alias]'

Edit /etc/default/solr.in.sh and add the following entries:

  SOLR_SSL_NEED_CLIENT_AUTH=true
  SOLR_SSL_TRUST_STORE=/etc/solr/solr-trusted-clients.p12
  SOLR_SSL_TRUST_STORE_TYPE=PKCS12
  SOLR_SSL_TRUST_STORE_PASSWORD=whatever

Summary of Files in /etc/solr
- -

solr-client.p12   Client keystore. Contains client key and certificate.
  Used by clients to 

Re: Question regarding TLS version for solr

2018-05-23 Thread Anchal Sharma2
 Hi Christopher/Shawn,

Thank you for replying. But I checked the Java version Solr is using, and it is
already version 1.8.

@Christopher, can you let me know what steps you followed for TLS
authentication on Solr version 7.3.0?

Thanks & Regards,
-
Anchal Sharma
e-Pricer Development
ES Team
Mobile: +9871290248

-Christopher Schultz <ch...@christopherschultz.net> wrote: -
To: solr-user@lucene.apache.org
From: Christopher Schultz <ch...@christopherschultz.net>
Date: 05/17/2018 06:29PM
Subject: Re: Question regarding TLS version for solr

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Shawn,

On 5/17/18 4:23 AM, Shawn Heisey wrote:
> On 5/17/2018 1:53 AM, Anchal Sharma2 wrote:
>> We are using solr version 5.3.0 and  have been  trying to enable 
>> security on our solr .We followed steps mentioned on site 
>> -https://lucene.apache.org/solr/guide/6_6/enabling-ssl.html .But
>> by default it picks ,TLS version  1.0,which is causing an issue
>> as our application uses TLSv 1.2.We tried using online resources
>> ,but could not find anything regarding TLS enablement for solr .
>> 
>> It will be a huge help if anyone can provide some suggestions as
>> to how we can enable TLS v 1.2 for solr.
> 
> The choice of ciphers and encryption protocols is mostly made by
> Java. The servlet container might influence it as well. The only
> servlet container that is supported since Solr 5.0 is the Jetty
> that is bundled in the Solr download.
> 
> TLS 1.2 was added in Java 7, and it became default in Java 8. If
> you can install the latest version of Java 8 and make sure that it
> has the policy files for unlimited crypto strength installed,
> support for TLS 1.2 might happen automatically.

There is no "default" TLS version for either the client or the server:
the two endpoints always negotiate the highest mutual version they
both support. The key agreement, authentication, and cipher suites are
the items that are negotiated during the handshake.

> Solr 5.3.0 is running a fairly old version of Jetty -- 9.2.11. 
> Information for 9.2.x versions is hard to find, so although I think
> it probably CAN do TLS 1.2 if the Java version supports it, I can't
> be absolutely sure.  You'll need to upgrade Solr to get an upgraded
> Jetty.

I would be shocked if Jetty ships with its own crypto libraries; it
should be using JSSE.

Anchal,

Java 1.7 or later is an absolute requirement if you want to use
TLSv1.2 (and you SHOULD want to use it).

I have recently spent a lot of time getting Solr 7.3.0 running with
TLS mutual-authentication, but I haven't worked with the 5.3.x line. I
can tell you how I've done things for my version, but they may need
some adjustments for yours.

- -chris
-BEGIN PGP SIGNATURE-
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlr9fKYACgkQHPApP6U8
pFh8lRAAmmvBMUSk35keW0OG0/SHpUy/ExJK69JGIKGwi96ddbz2yH8MG+OjjE3G
GNq/o5+EMT7tP/nW6XuPQou5UQvA2nlA9jsskox3A+CqOH7e6cbSxfxIkTqf9YDl
Kxr4J6mYjvTIjJAqLXGF+ghJfswS6RjZezDgo1PdSUox+gUOvmY61tlSjuYTaAYw
vH1i1DRzb8PkkR4ULePF48Y4r5+ZYz/4ZwSvnJTTkyl97KCw93rZ/kI5v9p3cCHK
Ycuwi/ZirO/VNf/9ruAOtgET3aojNfuNCX/A+vrSbJfiY7mXo05lYKN+eT80elQr
X8OKQaqHP6haF2aNPHrqXGtY2YoiGrdyaGtrXkUHFDfXgQeOmlk/eSVWemcSsatk
eEHSWW9NALMaalRAM7NuXQtgqq1badJhKysiJwSqFgcdgVKcSt8SsQ/09qTPjaNE
Ce1/EHdR6j1hM0Bnv5Hzf85cZjM7PfLmh7P8fnUD5d8eSbBpeWYVBDsS+fXp8WWv
FO5axbnSYIScOIz33i0UZyxpJgcsAkABLGghL6WWQSkfBf4ANgdTumS7K9Pn7Thz
Uq+lD9QPEPWJ91Fc0gnCWtDAEIRjOyLLbYzgI4ebV5qo41GO1WDDHfQZEcqA0Vod
+K8oAMD8nnwU+TprTFkjlQwbDnW1q1efTD6IrpEL5H7h6Xw2cgg=
=RpO6
-END PGP SIGNATURE-




Re: Question regarding TLS version for solr

2018-05-17 Thread Christopher Schultz
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Shawn,

On 5/17/18 4:23 AM, Shawn Heisey wrote:
> On 5/17/2018 1:53 AM, Anchal Sharma2 wrote:
>> We are using solr version 5.3.0 and  have been  trying to enable 
>> security on our solr .We followed steps mentioned on site 
>> -https://lucene.apache.org/solr/guide/6_6/enabling-ssl.html .But
>> by default it picks ,TLS version  1.0,which is causing an issue
>> as our application uses TLSv 1.2.We tried using online resources
>> ,but could not find anything regarding TLS enablement for solr .
>> 
>> It will be a huge help if anyone can provide some suggestions as
>> to how we can enable TLS v 1.2 for solr.
> 
> The choice of ciphers and encryption protocols is mostly made by
> Java. The servlet container might influence it as well. The only
> servlet container that is supported since Solr 5.0 is the Jetty
> that is bundled in the Solr download.
> 
> TLS 1.2 was added in Java 7, and it became default in Java 8. If
> you can install the latest version of Java 8 and make sure that it
> has the policy files for unlimited crypto strength installed,
> support for TLS 1.2 might happen automatically.

There is no "default" TLS version for either the client or the server:
the two endpoints always negotiate the highest mutual version they
both support. The key agreement, authentication, and cipher suites are
the items that are negotiated during the handshake.
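
A quick way to see which protocol versions a running Solr/Jetty endpoint
actually accepts is to probe it with openssl (a sketch, assuming Solr is
listening with TLS on port 8984):

  # succeeds only if the server accepts TLSv1.2
  openssl s_client -connect localhost:8984 -tls1_2 < /dev/null

  # should fail once TLSv1.0 is disabled
  openssl s_client -connect localhost:8984 -tls1 < /dev/null

If old protocols are still offered, they can be disabled JVM-wide through the
standard JSSE property jdk.tls.disabledAlgorithms in the JRE's java.security
file; that is a Java setting, not something Solr-specific.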

> Solr 5.3.0 is running a fairly old version of Jetty -- 9.2.11. 
> Information for 9.2.x versions is hard to find, so although I think
> it probably CAN do TLS 1.2 if the Java version supports it, I can't
> be absolutely sure.  You'll need to upgrade Solr to get an upgraded
> Jetty.

I would be shocked if Jetty ships with its own crypto libraries; it
should be using JSSE.

Anchal,

Java 1.7 or later is an absolute requirement if you want to use
TLSv1.2 (and you SHOULD want to use it).

I have recently spent a lot of time getting Solr 7.3.0 running with
TLS mutual-authentication, but I haven't worked with the 5.3.x line. I
can tell you have I've done things for my version, but they may need
some adjustments for yours.

- -chris
-BEGIN PGP SIGNATURE-
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlr9fKYACgkQHPApP6U8
pFh8lRAAmmvBMUSk35keW0OG0/SHpUy/ExJK69JGIKGwi96ddbz2yH8MG+OjjE3G
GNq/o5+EMT7tP/nW6XuPQou5UQvA2nlA9jsskox3A+CqOH7e6cbSxfxIkTqf9YDl
Kxr4J6mYjvTIjJAqLXGF+ghJfswS6RjZezDgo1PdSUox+gUOvmY61tlSjuYTaAYw
vH1i1DRzb8PkkR4ULePF48Y4r5+ZYz/4ZwSvnJTTkyl97KCw93rZ/kI5v9p3cCHK
Ycuwi/ZirO/VNf/9ruAOtgET3aojNfuNCX/A+vrSbJfiY7mXo05lYKN+eT80elQr
X8OKQaqHP6haF2aNPHrqXGtY2YoiGrdyaGtrXkUHFDfXgQeOmlk/eSVWemcSsatk
eEHSWW9NALMaalRAM7NuXQtgqq1badJhKysiJwSqFgcdgVKcSt8SsQ/09qTPjaNE
Ce1/EHdR6j1hM0Bnv5Hzf85cZjM7PfLmh7P8fnUD5d8eSbBpeWYVBDsS+fXp8WWv
FO5axbnSYIScOIz33i0UZyxpJgcsAkABLGghL6WWQSkfBf4ANgdTumS7K9Pn7Thz
Uq+lD9QPEPWJ91Fc0gnCWtDAEIRjOyLLbYzgI4ebV5qo41GO1WDDHfQZEcqA0Vod
+K8oAMD8nnwU+TprTFkjlQwbDnW1q1efTD6IrpEL5H7h6Xw2cgg=
=RpO6
-END PGP SIGNATURE-


Re: Question regarding TLS version for solr

2018-05-17 Thread Shawn Heisey

On 5/17/2018 1:53 AM, Anchal Sharma2 wrote:

We are using solr version 5.3.0 and have been trying to enable security on
our solr. We followed steps mentioned on site
https://lucene.apache.org/solr/guide/6_6/enabling-ssl.html. But by default it
picks TLS version 1.0, which is causing an issue as our application uses TLSv
1.2. We tried using online resources, but could not find anything regarding TLS
enablement for solr.

It will be a huge help if anyone can provide some suggestions as to how we can 
enable TLS v 1.2 for solr.


The choice of ciphers and encryption protocols is mostly made by Java.  
The servlet container might influence it as well. The only servlet 
container that is supported since Solr 5.0 is the Jetty that is bundled 
in the Solr download.


TLS 1.2 was added in Java 7, and it became default in Java 8.  If you 
can install the latest version of Java 8 and make sure that it has the 
policy files for unlimited crypto strength installed, support for TLS 
1.2 might happen automatically.


Solr 5.3.0 is running a fairly old version of Jetty -- 9.2.11.  
Information for 9.2.x versions is hard to find, so although I think it 
probably CAN do TLS 1.2 if the Java version supports it, I can't be 
absolutely sure.  You'll need to upgrade Solr to get an upgraded Jetty.


Thanks,
Shawn



Re: question about updates to shard leaders only

2018-05-15 Thread Mark Miller
Yeah, basically ConcurrentUpdateSolrClient is a shortcut to getting
multi-threaded bulk API updates out of the single-threaded, single-update API.
The downsides to this are: it is not cloud aware - you have to point it at
a server, you have to add special code to see if there are any errors, you
don't get any fine-grained error information back, and you still basically have
to break up updates into batches of success/fail units but with fewer
guard rails.

If you want to bulk load, it usually makes much more sense to use the bulk
API on CloudSolrClient and treat the whole group of updates as a single
success/fail unit.

- Mark

On Tue, May 15, 2018 at 9:25 AM Erick Erickson 
wrote:

> bq. But don't forget a final client.add(list) after the while-loop ;-)
>
> Ha! But only "if (list.size() > 0)"
>
> And then there was the memorable time I forgot the "list.clear()" when
> I sent the batch and wondered why my indexing progress got slower and
> slower...
>
> Not to mention the time I re-used the same SolrInputDocument that got
> bigger and bigger and bigger.
>
> Not to mention the other zillion screw-ups I've managed to perpetrate
> in my career "Who wrote this stupid code? Oh, wait, it was me.
> DON'T LOOK!!!"...
>
> Astronomy anecdote
>
> Dale Vrabeck...was at a party with [Rudolph] Minkowski and Dale said
> he’d heard about the astronomer who had exposed a plate all night and
> then put it in the hypo first. Minkowski said, “It was three nights,
> and it was me.”
>
> On Tue, May 15, 2018 at 10:10 AM, Shawn Heisey 
> wrote:
> > On 5/15/2018 12:12 AM, Bernd Fehling wrote:
> >>
> >> OK, I have the CloudSolrClient with SolrJ now running but it seams
> >> a bit slower compared to ConcurrentUpdateSolrClient.
> >> This was not expected.
> >> The logs show that CloudSolrClient send the docs only to the leaders.
> >>
> >> So the only advantage of CloudSolrClient is that it is "Cloud aware"?
> >>
> >> With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
> >> With CloudSolrClient I get only about 1200 docs/sec.
> >
> >
> > ConcurrentUpdateSolrClient internally puts all indexing requests on a
> queue
> > and then can use multiple threads to do parallel indexing in the
> backround.
> > The design of the client has one big disadvantage -- it returns control
> to
> > your program immediately (before indexing actually begins) and always
> > indicates success.  All indexing errors are swallowed.  They are logged,
> but
> > the calling program is never informed that any errors have occurred.
> >
> > Like all other SolrClient implementations, CloudSolrClient is
> thread-safe,
> > but it is not multi-threaded unless YOU create multiple threads that all
> use
> > the same client object.  Full error handling is possible with this
> client.
> > It is also fully cloud aware, adding and removing Solr servers as the
> > SolrCloud changes, without needing to be reconfigured or recreated.
> >
> > Thanks,
> > Shawn
> >
>
-- 
- Mark
about.me/markrmiller
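
In practice "a single success/fail unit" just means wrapping each batch in its own
try/catch, so a failure is tied to a known group of documents. A rough sketch (what
to do with a failed batch - retry, dead-letter, abort - is up to the application):

import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

// Returns true on success so the caller can react to a failed batch instead of
// never hearing about it, as would happen with ConcurrentUpdateSolrClient.
boolean sendBatch(SolrClient client, String collection, List<SolrInputDocument> batch) {
    try {
        client.add(collection, batch);
        return true;
    } catch (Exception e) {
        System.err.println("Batch of " + batch.size() + " docs failed: " + e);
        return false;
    }
}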


Re: question about updates to shard leaders only

2018-05-15 Thread Erick Erickson
bq. But don't forget a final client.add(list) after the while-loop ;-)

Ha! But only "if (list.size() > 0)"

And then there was the memorable time I forgot the "list.clear()" when
I sent the batch and wondered why my indexing progress got slower and
slower...

Not to mention the time I re-used the same SolrInputDocument that got
bigger and bigger and bigger.

Not to mention the other zillion screw-ups I've managed to perpetrate
in my career "Who wrote this stupid code? Oh, wait, it was me.
DON'T LOOK!!!"...

Astronomy anecdote

Dale Vrabeck...was at a party with [Rudolph] Minkowski and Dale said
he’d heard about the astronomer who had exposed a plate all night and
then put it in the hypo first. Minkowski said, “It was three nights,
and it was me.”

On Tue, May 15, 2018 at 10:10 AM, Shawn Heisey  wrote:
> On 5/15/2018 12:12 AM, Bernd Fehling wrote:
>>
>> OK, I have the CloudSolrClient with SolrJ now running but it seams
>> a bit slower compared to ConcurrentUpdateSolrClient.
>> This was not expected.
>> The logs show that CloudSolrClient send the docs only to the leaders.
>>
>> So the only advantage of CloudSolrClient is that it is "Cloud aware"?
>>
>> With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
>> With CloudSolrClient I get only about 1200 docs/sec.
>
>
> ConcurrentUpdateSolrClient internally puts all indexing requests on a queue
> and then can use multiple threads to do parallel indexing in the backround.
> The design of the client has one big disadvantage -- it returns control to
> your program immediately (before indexing actually begins) and always
> indicates success.  All indexing errors are swallowed.  They are logged, but
> the calling program is never informed that any errors have occurred.
>
> Like all other SolrClient implementations, CloudSolrClient is thread-safe,
> but it is not multi-threaded unless YOU create multiple threads that all use
> the same client object.  Full error handling is possible with this client.
> It is also fully cloud aware, adding and removing Solr servers as the
> SolrCloud changes, without needing to be reconfigured or recreated.
>
> Thanks,
> Shawn
>
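
Putting the two halves of that exchange together, a minimal SolrJ batching loop
might look like the sketch below (the Iterator stands in for whatever produces your
documents, and client is any SolrClient, e.g. a CloudSolrClient):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

void indexAll(SolrClient client, Iterator<SolrInputDocument> docs, int batchSize)
        throws Exception {
    List<SolrInputDocument> batch = new ArrayList<>();
    while (docs.hasNext()) {
        batch.add(docs.next());          // always a fresh SolrInputDocument
        if (batch.size() >= batchSize) {
            client.add(batch);           // one request per full batch
            batch.clear();               // ...and clear it, as noted above
        }
    }
    if (!batch.isEmpty()) {              // the final, partial batch
        client.add(batch);
    }
}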


Re: question about updates to shard leaders only

2018-05-15 Thread Shawn Heisey

On 5/15/2018 12:12 AM, Bernd Fehling wrote:

OK, I have the CloudSolrClient with SolrJ now running but it seams
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient send the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec. 


ConcurrentUpdateSolrClient internally puts all indexing requests on a 
queue and then can use multiple threads to do parallel indexing in the 
background.  The design of the client has one big disadvantage -- it 
returns control to your program immediately (before indexing actually 
begins) and always indicates success.  All indexing errors are 
swallowed.  They are logged, but the calling program is never informed 
that any errors have occurred.


Like all other SolrClient implementations, CloudSolrClient is 
thread-safe, but it is not multi-threaded unless YOU create multiple 
threads that all use the same client object.  Full error handling is 
possible with this client.  It is also fully cloud aware, adding and 
removing Solr servers as the SolrCloud changes, without needing to be 
reconfigured or recreated.


Thanks,
Shawn
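
Since the client is thread-safe, getting back the parallelism that
ConcurrentUpdateSolrClient provided internally is mostly a matter of sharing one
CloudSolrClient across a small thread pool. A sketch, assuming Java 8 and leaving
the batch splitting to the caller:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

void indexInParallel(CloudSolrClient client, List<List<SolrInputDocument>> batches,
                     int threads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (List<SolrInputDocument> batch : batches) {
        pool.submit(() -> {
            try {
                client.add(batch);                         // same client object from every thread
            } catch (Exception e) {
                System.err.println("Batch failed: " + e);  // errors stay visible per batch
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}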



Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling



On 15.05.2018 at 14:33, Erick Erickson wrote:

You might find this useful:

https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/


I have seen that already and can confirm it.
From my observations of a 3x3 cluster with 3 servers and my hardware:
- have at least 6 CPUs on each server to keep search performance during NRT
indexing
- I tried with batch-/queue-size between 100 and 1.
--- with a batch size of 100 and nearly even distribution across 3
shards I get about 33 docs per update per shard.
--- with a batch size of 1000 I get about 333 docs per update per shard
--- with a batch size of 1 it can go up to  docs per shard

Yes, the last is "it can go up to" because the size is obviously too high
and I get lots of smaller updates "FROMLEADER". So somewhere between
1000 and 1 is the best size for my 3x3 cluster with my hardware.

Another observation in a 3x3 cluster: a multi-node setup (3 JVM 4G instances per
server [3 nodes]) outperforms a multi-core setup (1 JVM 12G instance per
server [3 cores]) due to the Java GC impact on the multi-core setup.
A multi-node at 60qps has nearly the same performance as a multi-core at 30qps.




One tricky bit: Assuming docs have a random distribution amongst
shards, you should batch so at least 100 docs go to each _shard_. You
can see from the link that the speedup is mostly going from 1 to 100.
So if you have 5 shards, I'd create batches of at least 500. That was
a fairly simple test with stupid-simple docs. Large complicated
documents wouldn't show the same curve.

Setup for PULL and TLOG isn't hard, just specify the number of TLOG or
PULL replicas you want at collection creation time. NOTE: this is only
on Solr 7x. See:
https://lucene.apache.org/solr/guide/7_3/shards-and-indexing-data-in-solrcloud.html#types-of-replicas


Unfortunately I'm still at solr 6.4.2 and therefore have to stay with NRT.



About creating your own queue, mine usually look like
List list...
while (more docs) {
   list.add(new_doc);
   if (list.size > X) {
   client.add(list);
   list.clear();
   }
}


Yes, mine looks similar, a recursive file traverser with for-loop over files.
But don't forget a final client.add(list) after the while-loop ;-)




Not exactly a sophisticated queue ;).

On Tue, May 15, 2018 at 8:15 AM, Bernd Fehling
 wrote:

Hi Erik,

yes indeed, batching solved it.
I used ConcurrentUpdateSolrClient with queue size of 1 but
CloudSolrClient doesn't have this feature.
I build my own queue now.

Ah!!! So I obviously use default NRT but actually don't need it because
I don't have any NRT data to index. A latency of several hours is OK for me.
Currently I'm testing with a 3x3 core-cluster (3 server, 3 cores per
server).

I also tested with 3x3 node-cluster (3 server, 3 nodes per server) which
performed
better, less influence of GarbageCollection.

I have to read more about PULL or TLOG replicas, how to set this up and so
on.
If it is to complex I will go with NRT and indexing is anyway during the
night.
Thanks for pointing this out.

Regards,
Bernd


On 15.05.2018 at 13:28, Erick Erickson wrote:


What did you do to solve your performance problem?

Batching updates is one thing that helps performance.

bq.  I thought that only the leaders are under load
until any commit and then replicate to the other replicas.

True if (and only if) you're using PULL or TLOG replicas.
When using the default NRT replicas, every replica indexes
the docs, it doesn't matter whether they are the leader or replica.
That's required for NRT. Using CloudSolrClient has no bearing
on that functionality.

Best,
Erick

On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
 wrote:


Thanks, solved, performance is good now.

Regards,
Bernd


On 15.05.2018 at 08:12, Bernd Fehling wrote:



OK, I have the CloudSolrClient with SolrJ now running but it seams
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient send the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and
cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



On 09.05.2018 at 19:15, Erick Erickson wrote:



You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists 

Re: question about updates to shard leaders only

2018-05-15 Thread Erick Erickson
You might find this useful:

https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

One tricky bit: Assuming docs have a random distribution amongst
shards, you should batch so at least 100 docs go to each _shard_. You
can see from the link that the speedup is mostly going from 1 to 100.
So if you have 5 shards, I'd create batches of at least 500. That was
a fairly simple test with stupid-simple docs. Large complicated
documents wouldn't show the same curve.

Setup for PULL and TLOG isn't hard, just specify the number of TLOG or
PULL replicas you want at collection creation time. NOTE: this is only
on Solr 7.x. See:
https://lucene.apache.org/solr/guide/7_3/shards-and-indexing-data-in-solrcloud.html#types-of-replicas

About creating your own queue, mine usually looks like
List<SolrInputDocument> list = new ArrayList<>();
while (moreDocs()) {
  list.add(nextDoc());
  if (list.size() > X) {
    client.add(list);
    list.clear();
  }
}

Not exactly a sophisticated queue ;).

On Tue, May 15, 2018 at 8:15 AM, Bernd Fehling
 wrote:
> Hi Erik,
>
> yes indeed, batching solved it.
> I used ConcurrentUpdateSolrClient with queue size of 1 but
> CloudSolrClient doesn't have this feature.
> I build my own queue now.
>
> Ah!!! So I obviously use default NRT but actually don't need it because
> I don't have any NRT data to index. A latency of several hours is OK for me.
> Currently I'm testing with a 3x3 core-cluster (3 server, 3 cores per
> server).
>
> I also tested with 3x3 node-cluster (3 server, 3 nodes per server) which
> performed
> better, less influence of GarbageCollection.
>
> I have to read more about PULL or TLOG replicas, how to set this up and so
> on.
> If it is to complex I will go with NRT and indexing is anyway during the
> night.
> Thanks for pointing this out.
>
> Regards,
> Bernd
>
>
> On 15.05.2018 at 13:28, Erick Erickson wrote:
>>
>> What did you do to solve your performance problem?
>>
>> Batching updates is one thing that helps performance.
>>
>> bq.  I thought that only the leaders are under load
>> until any commit and then replicate to the other replicas.
>>
>> True if (and only if) you're using PULL or TLOG replicas.
>> When using the default NRT replicas, every replica indexes
>> the docs, it doesn't matter whether they are the leader or replica.
>> That's required for NRT. Using CloudSolrClient has no bearing
>> on that functionality.
>>
>> Best,
>> Erick
>>
>> On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
>>  wrote:
>>>
>>> Thanks, solved, performance is good now.
>>>
>>> Regards,
>>> Bernd
>>>
>>>
>>> On 15.05.2018 at 08:12, Bernd Fehling wrote:


 OK, I have the CloudSolrClient with SolrJ now running but it seams
 a bit slower compared to ConcurrentUpdateSolrClient.
 This was not expected.
 The logs show that CloudSolrClient send the docs only to the leaders.

 So the only advantage of CloudSolrClient is that it is "Cloud aware"?

 With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
 With CloudSolrClient I get only about 1200 docs/sec.

 The system monitoring shows that with CloudSolrClient all nodes and
 cores
 are under heavy load. I thought that only the leaders are under load
 until any commit and then replicate to the other replicas.
 And that the replicas which are no leader have capacity to answer search
 requests.

 I think I still don't get the advantage of CloudSolrClient?

 Regards,
 Bernd



 On 09.05.2018 at 19:15, Erick Erickson wrote:
>
>
> You may not need to deal with any of this.
>
> The default CloudSolrClient call creates a new LBHttpSolrClient for
> you. So unless you're doing something custom with any LBHttpSolrClient
> you create, you don't need to create one yourself.
>
> Second, the default for CloudSolrClient.add() is to take the list of
> documents you provide into sub-lists that consist of the docs destined
> for a particular shard and sends those to the leader.
>
> Do the default not work for you?
>
> Best,
> Erick
>
> On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
>  wrote:
>>
>>
>> Hi list,
>>
>> while going from single core master/slave to cloud multi core/node
>> with leader/replica I want to change my SolrJ loading, because
>> ConcurrentUpdateSolrClient isn't cloud aware and has performance
>> impacts.
>> I want to use CloudSolrClient with LBHttpSolrClient and updates
>> should only go to shard leaders.
>>
>> Question, what is the difference between sendUpdatesOnlyToShardLeaders
>> and sendDirectUpdatesToShardLeadersOnly?
>>
>> Regards,
>> Bernd
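
For the collection-creation step mentioned at the top of that message (Solr 7.x
only), a hedged SolrJ sketch - the collection and config names are made up, and the
exact CollectionAdminRequest factory signature should be checked against your SolrJ
version:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

// 3 shards, each with 1 NRT, 1 TLOG and 1 PULL replica (PULL replicas never become
// leaders). Roughly equivalent HTTP call:
//   /admin/collections?action=CREATE&name=test&numShards=3
//        &nrtReplicas=1&tlogReplicas=1&pullReplicas=1&collection.configName=myconf
void createMixedCollection(CloudSolrClient client) throws Exception {
    CollectionAdminRequest
        .createCollection("test", "myconf", 3, 1, 1, 1)
        .process(client);
}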


Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling

Hi Erik,

yes indeed, batching solved it.
I used ConcurrentUpdateSolrClient with a queue size of 1 but
CloudSolrClient doesn't have this feature.
I now build my own queue.

Ah!!! So I obviously use default NRT but actually don't need it because
I don't have any NRT data to index. A latency of several hours is OK for me.
Currently I'm testing with a 3x3 core-cluster (3 servers, 3 cores per server).

I also tested with a 3x3 node-cluster (3 servers, 3 nodes per server), which
performed better with less influence from garbage collection.

I have to read more about PULL or TLOG replicas, how to set this up and so on.
If it is too complex I will go with NRT; indexing happens during the night anyway.
Thanks for pointing this out.

Regards,
Bernd


On 15.05.2018 at 13:28, Erick Erickson wrote:

What did you do to solve your performance problem?

Batching updates is one thing that helps performance.

bq.  I thought that only the leaders are under load
until any commit and then replicate to the other replicas.

True if (and only if) you're using PULL or TLOG replicas.
When using the default NRT replicas, every replica indexes
the docs, it doesn't matter whether they are the leader or replica.
That's required for NRT. Using CloudSolrClient has no bearing
on that functionality.

Best,
Erick

On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
 wrote:

Thanks, solved, performance is good now.

Regards,
Bernd


On 15.05.2018 at 08:12, Bernd Fehling wrote:


OK, I have the CloudSolrClient with SolrJ now running but it seams
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient send the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



On 09.05.2018 at 19:15, Erick Erickson wrote:


You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and sends those to the leader.

Do the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
 wrote:


Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd


Re: question about updates to shard leaders only

2018-05-15 Thread Erick Erickson
What did you do to solve your performance problem?

Batching updates is one thing that helps performance.

bq.  I thought that only the leaders are under load
until any commit and then replicate to the other replicas.

True if (and only if) you're using PULL or TLOG replicas.
When using the default NRT replicas, every replica indexes
the docs; it doesn't matter whether they are the leader or a replica.
That's required for NRT. Using CloudSolrClient has no bearing
on that functionality.

Best,
Erick

On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
 wrote:
> Thanks, solved, performance is good now.
>
> Regards,
> Bernd
>
>
> On 15.05.2018 at 08:12, Bernd Fehling wrote:
>>
>> OK, I have the CloudSolrClient with SolrJ now running but it seams
>> a bit slower compared to ConcurrentUpdateSolrClient.
>> This was not expected.
>> The logs show that CloudSolrClient send the docs only to the leaders.
>>
>> So the only advantage of CloudSolrClient is that it is "Cloud aware"?
>>
>> With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
>> With CloudSolrClient I get only about 1200 docs/sec.
>>
>> The system monitoring shows that with CloudSolrClient all nodes and cores
>> are under heavy load. I thought that only the leaders are under load
>> until any commit and then replicate to the other replicas.
>> And that the replicas which are no leader have capacity to answer search
>> requests.
>>
>> I think I still don't get the advantage of CloudSolrClient?
>>
>> Regards,
>> Bernd
>>
>>
>>
>> On 09.05.2018 at 19:15, Erick Erickson wrote:
>>>
>>> You may not need to deal with any of this.
>>>
>>> The default CloudSolrClient call creates a new LBHttpSolrClient for
>>> you. So unless you're doing something custom with any LBHttpSolrClient
>>> you create, you don't need to create one yourself.
>>>
>>> Second, the default for CloudSolrClient.add() is to take the list of
>>> documents you provide into sub-lists that consist of the docs destined
>>> for a particular shard and sends those to the leader.
>>>
>>> Do the default not work for you?
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
>>>  wrote:

 Hi list,

 while going from single core master/slave to cloud multi core/node
 with leader/replica I want to change my SolrJ loading, because
 ConcurrentUpdateSolrClient isn't cloud aware and has performance
 impacts.
 I want to use CloudSolrClient with LBHttpSolrClient and updates
 should only go to shard leaders.

 Question, what is the difference between sendUpdatesOnlyToShardLeaders
 and sendDirectUpdatesToShardLeadersOnly?

 Regards,
 Bernd


Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling

Thanks, solved, performance is good now.

Regards,
Bernd

On 15.05.2018 at 08:12, Bernd Fehling wrote:

OK, I have the CloudSolrClient with SolrJ now running but it seams
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient send the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search 
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



On 09.05.2018 at 19:15, Erick Erickson wrote:

You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and sends those to the leader.

Do the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
 wrote:

Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd


Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling

OK, I have the CloudSolrClient with SolrJ now running but it seems
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders were under load
until a commit, which then replicates to the other replicas,
and that the replicas which are not leaders have capacity to answer search
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



On 09.05.2018 at 19:15, Erick Erickson wrote:

You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and sends those to the leader.

Do the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
 wrote:

Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd


Re: question about updates to shard leaders only

2018-05-09 Thread Mark Miller
It's been a while since I've been in this deeply, but it should be
something like:

sendUpdatesOnlyToShardLeaders will select the leaders for each shard as the
load-balanced targets for updates. The updates may not go to the *right*
leader, but only the leaders will be chosen; followers (non-leader
replicas) will not be part of the load-balanced server list.

sendDirectUpdatesToShardLeadersOnly is the same - followers are not part of
the mix - but also, updates are sent directly to the right leader as long as
the right hashing field is specified (id by default). We hash the id
client-side and know where it should end up.

Optimally, you want sendDirectUpdatesToShardLeadersOnly set to true and
configured with the correct id field.

- Mark

On Wed, May 9, 2018 at 4:54 AM Bernd Fehling 
wrote:

> Hi list,
>
> while going from single core master/slave to cloud multi core/node
> with leader/replica I want to change my SolrJ loading, because
> ConcurrentUpdateSolrClient isn't cloud aware and has performance
> impacts.
> I want to use CloudSolrClient with LBHttpSolrClient and updates
> should only go to shard leaders.
>
> Question, what is the difference between sendUpdatesOnlyToShardLeaders
> and sendDirectUpdatesToShardLeadersOnly?
>
> Regards,
> Bernd
>
-- 
- Mark
about.me/markrmiller
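
A rough sketch of wiring that up through the SolrJ builder (the ZooKeeper ensemble
and collection name are placeholders, and the Builder method names are as in the
7.x SolrJ API, so check them against your version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;

CloudSolrClient buildClient() {
    CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZooKeeper ensemble
            .sendDirectUpdatesToShardLeadersOnly()      // hash the id client-side, go straight to the right leader
            .build();
    client.setDefaultCollection("mycollection");        // placeholder collection
    client.setIdField("id");                            // the field used for routing, "id" by default
    return client;
}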


Re: question about updates to shard leaders only

2018-05-09 Thread Erick Erickson
You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to break the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and send those to the leaders.

Does the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
 wrote:
> Hi list,
>
> while going from single core master/slave to cloud multi core/node
> with leader/replica I want to change my SolrJ loading, because
> ConcurrentUpdateSolrClient isn't cloud aware and has performance
> impacts.
> I want to use CloudSolrClient with LBHttpSolrClient and updates
> should only go to shard leaders.
>
> Question, what is the difference between sendUpdatesOnlyToShardLeaders
> and sendDirectUpdatesToShardLeadersOnly?
>
> Regards,
> Bernd

