in-place updates
Hi, in http://lucene.472066.n3.nabble.com/In-Place-Updates-not-working-as-expected-tp4375621p4380035.html some restrictions on the supported fields are given. I could however not find whether in-place updates are supported for all field types or if they only work for, say, numeric fields. Thanks, Hendrik
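[For reference, the Solr guide answers this: in-place updates are restricted to single-valued, non-indexed, non-stored numeric docValues fields (and the `_version_` field and any copyField targets must meet the same conditions); they do not work for arbitrary field types. A qualifying schema field would look like `<field name="popularity" type="pint" indexed="false" stored="false" docValues="true"/>` — the field name here is made up — and an atomic update against it can then be applied in place:

```json
{ "id": "doc1", "popularity": { "inc": 1 } }
```
]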
Re: Solr 7.3.0 loading OpenNLPExtractNamedEntitiesUpdateProcessorFactory
Hi, I found the problem: there was an additional jar file in the /dist folder that needed to be loaded as well (dist/solr-analysis-extras-7.3.0.jar). I didn't see this one. Thanks, Ryan On Mon, 9 Apr 2018 at 14:58 Ryan Yacyshyn wrote: > Hi Shawn, > > I'm pretty sure the paths to load the jars in analysis-extras are correct, > the jars in /contrib/analysis-extras/lib load fine. I verified this by > changing the name of solr.OpenNLPTokenizerFactory to > solr.OpenNLPTokenizerFactory2 > and saw the new error. Changing it back to solr.OpenNLPTokenizerFactory > (without the "2") doesn't throw any errors, so I'm assuming these two > jar files (opennlp-maxent-3.0.3.jar and opennlp-tools-1.8.3.jar) must be > loading. > > I tried swapping the order in which these jars are loaded as well, but no > luck there. > > I have attached my solr.log file after a restart. Also included is my > solrconfig.xml and managed-schema. The path to my config > is /Users/ryan/solr-7.3.0/server/solr/nlp/conf and this is where I have the > OpenNLP bin files (en-ner-person.bin, en-sent.bin, and en-token.bin). > Configs are derived from the _default configset. > > On a Mac, and my Java version is: > > java version "1.8.0_45" > Java(TM) SE Runtime Environment (build 1.8.0_45-b14) > Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) > > Thanks, > Ryan > > > > On Sun, 8 Apr 2018 at 21:34 Shawn Heisey wrote: > >> On 4/8/2018 2:36 AM, Ryan Yacyshyn wrote: >> > I'm running into a small problem loading >> > the OpenNLPExtractNamedEntitiesUpdateProcessorFactory class, getting an >> > error saying it's not found. I'm loading all the required jar files, >> > according to the readme: >> >> You've got a <lib> element to load analysis-extras jars, but are you >> certain it's actually loading anything? >> >> Can you share a solr.log file created just after a Solr restart? Not >> just a reload -- I'm asking for a restart so the log is more complete. 
>> With that, I can see what's happening and then ask more questions that >> may pinpoint something. >> >> Thanks, >> Shawn >> >>
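[For readers hitting the same ClassNotFoundException: a minimal set of <lib> directives in solrconfig.xml covering both the contrib jars and the easily-missed dist jar might look like the sketch below. The relative paths are assumptions based on the default install layout; adjust them to your own tree.

```xml
<!-- contrib jars: opennlp-tools, opennlp-maxent, etc. -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
<!-- the jar that was missed above: OpenNLPExtractNamedEntitiesUpdateProcessorFactory lives in /dist -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analysis-extras-\d.*\.jar" />
```

As Shawn suggests, verify in solr.log after a restart that each <lib> line actually reports jars being added to the classloader.]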
Re: Text in images are not extracted and indexed to content
Thanks for the reply. It was due to the Tesseract OCR problem, as I have tried out the new Tesseract 4 version on my system, and it does not set the path in the Environment Variables, unlike the older Tesseract 3, which set the path automatically during installation. Regards, Edwin On 10 April 2018 at 18:58, Shamik Sinha wrote: > To index text in images the image needs to be searchable, i.e. text needs > to be overlaid on the image like a searchable pdf. You can do this using > OCR but it is a bit unreliable if the images are scanned copies of written > text. > > On 10-Apr-2018 4:12 PM, "Rahul Singh" > wrote: > > May need to extract outside Solr and index pure text with an external > ingestion process. You have much more control over the Tika attributes and > behaviors. > > -- > Rahul Singh > rahul.si...@anant.us > > Anant Corporation > > > On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo >, > wrote: > > Hi, > > > > Currently I am facing an issue whereby the text in image files like jpg, bmp > > is not being extracted out and indexed. After the indexing, Tika did > > extract all the metadata out and index it under the fields attr_*. > > However, the content field is always empty for image files. For other types > > of document files like .doc, the content is extracted correctly. > > > > I have already updated the tika-parsers-1.17.jar, under > > \prg\apache\tika\parser\pdf\ for extractInlineImages to true. > > > > > > What could be the reason? > > > > I have just upgraded to Solr 7.3.0. > > > > Regards, > > Edwin >
Using Solr to search website and external Oracle ServiceCloud
Hello, I have a drupal website that uses SOLR. When a user searches our website, I would like to now return results from 2 sources: (1) our website (2) our external Oracle ServiceCloud knowledge base Does anyone have any suggestions on how to do this? -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Using Solr to search website and external Oracle ServiceCloud
Hello, I would like to use Solr to search 2 different sources: (1) my website (2) my external knowledge base created in Oracle Service Cloud. Right now Solr works great against my website. I want to integrate my FAQs from the knowledge base into one search on the website to make it easier for the users. Any thoughts?
Solr7.1.0 - deleting collections when using HDFS
Hi All - I've noticed that if I delete a collection that is stored in HDFS, the files/directory in HDFS remain. If I then try to re-create the collection with the same name, I get an error about being unable to open a searcher. If I then remove the directory from HDFS, the error remains due to files stored in /etc/solr. Once those are also removed on all the nodes, then I can re-create the collection. -Joe
RE: Backup a solr cloud collection - timeout in 180s?
Erick: Good to know! Thx Robi
-----Original Message----- From: Erick Erickson Sent: Tuesday, April 10, 2018 12:42 PM To: solr-user Subject: Re: Backup a solr cloud collection - timeout in 180s?
Robi: Yeah, the ref guide has lots and lots and lots of info, but at 1,100 pages and growing things can be "interesting" to find. Do be aware of one thing. The async ID should be unique and before 7.3 there was a bug that if you used the same ID twice (without waiting for completion and deleting it first) it led to bewildering results. See: https://issues.apache.org/jira/browse/SOLR-11739. The operations would succeed, but you might not be getting the status of the task you think you are. Best, Erick
On Tue, Apr 10, 2018 at 9:25 AM, Petersen, Robert (Contr) wrote: > HI Erick, > > I *just* found that parameter in the guide... it was waaay down at the bottom > of the page (in proverbial small print)! > > So for other readers the steps are this: > > # start the backup async enabled
> /admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups&async=1234
> > # check on the status of the async job
> /admin/collections?action=REQUESTSTATUS&requestid=1234
> > # clear out the status when done
> /admin/collections?action=DELETESTATUS&requestid=1234
> > Thx > Robi
> > From: Erick Erickson > Sent: Tuesday, April 10, 2018 8:24:20 AM > To: solr-user > Subject: Re: Backup a solr cloud collection - timeout in 180s? > > WARNING: External email. Please verify sender before opening attachments or > clicking on links. > > Specify the "async" property, see: > https://lucene.apache.org/solr/guide/6_6/collections-api.html > > There's also a way to check the status of the backup running in the > background. > > Best, > Erick
> On Mon, Apr 9, 2018 at 11:05 AM, Petersen, Robert (Contr) > wrote: >> Shouldn't this just create the backup file(s) asynchronously? Can the >> timeout be adjusted?
>> Solr 7.2.1 with five nodes and the addrsearch collection is five >> shards x five replicas and "numFound":38837970 docs >> >> Thx >> Robi
>> http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups
>> { "responseHeader": { "status": 500, "QTime": 180211 }, "error": { "metadata": [ "error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException" ], "msg": "backup the collection time out:180s", ... } }
>> From the logs:
>> 2018-04-09 17:47:32.667 INFO (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch} status=500 QTime=180211
>> 2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: backup the collection time out:180s
>> at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
>> at org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
>> at org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
>> at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
>> at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
>> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
>> This communication is confidential. Frontier only sends and receives email on the basis of the terms set out at http://www.frontier.com/email_disclaimer.
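[The three-step async workflow from this thread can be sketched as plain URL construction. This is a sketch only — the host, collection, backup name, and async id below are the (hypothetical) ones from the original post; the parameter names are the Collections API's.

```python
from urllib.parse import urlencode

BASE = "http://myServer.corp.pvt:8983/solr/admin/collections"

def collections_url(**params):
    """Build a Collections API URL from keyword parameters."""
    return BASE + "?" + urlencode(params)

# 1. start the backup asynchronously (async=<unique id>)
start = collections_url(action="BACKUP", name="addrsearchBackup",
                        collection="addrsearch",
                        location="/apps/logs/backups", **{"async": "1234"})

# 2. poll the status of the async job
status = collections_url(action="REQUESTSTATUS", requestid="1234")

# 3. clear the stored status once it reports completed
clear = collections_url(action="DELETESTATUS", requestid="1234")

print(start)
print(status)
print(clear)
```

Per Erick's caveat above, before 7.3 reusing the same async id without deleting the old status first gave bewildering results, so generate a fresh id for every backup.]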
Re: Backup a solr cloud collection - timeout in 180s?
Robi: Yeah, the ref guide has lots and lots and lots of info, but at 1,100 pages and growing things can be "interesting" to find. Do be aware of one thing. The async ID should be unique and before 7.3 there was a bug that if you used the same ID twice (without waiting for completion and deleting it first) it led to bewildering results. See: https://issues.apache.org/jira/browse/SOLR-11739. The operations would succeed, but you might not be getting the status of the task you think you are. Best, Erick
On Tue, Apr 10, 2018 at 9:25 AM, Petersen, Robert (Contr) wrote: > HI Erick, > > I *just* found that parameter in the guide... it was waaay down at the bottom > of the page (in proverbial small print)! > > So for other readers the steps are this: > > # start the backup async enabled
> /admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups&async=1234
> > # check on the status of the async job
> /admin/collections?action=REQUESTSTATUS&requestid=1234
> > # clear out the status when done
> /admin/collections?action=DELETESTATUS&requestid=1234
> > Thx > Robi
> > From: Erick Erickson > Sent: Tuesday, April 10, 2018 8:24:20 AM > To: solr-user > Subject: Re: Backup a solr cloud collection - timeout in 180s? > > WARNING: External email. Please verify sender before opening attachments or > clicking on links. > > Specify the "async" property, see: > https://lucene.apache.org/solr/guide/6_6/collections-api.html > > There's also a way to check the status of the backup running in the > background. > > Best, > Erick
> On Mon, Apr 9, 2018 at 11:05 AM, Petersen, Robert (Contr) > wrote: >> Shouldn't this just create the backup file(s) asynchronously? Can the >> timeout be adjusted?
>> Solr 7.2.1 with five nodes and the addrsearch collection is five shards x >> five replicas and "numFound":38837970 docs >> >> Thx >> Robi
>> http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups
>> { "responseHeader": { "status": 500, "QTime": 180211 }, "error": { "metadata": [ "error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException" ], "msg": "backup the collection time out:180s", ... } }
>> From the logs:
>> 2018-04-09 17:47:32.667 INFO (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch} status=500 QTime=180211
>> 2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: backup the collection time out:180s
>> at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
>> at org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
>> at org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
>> at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
>> at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
>> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
Re: Recover a Solr Node
There's actually not much that's _required_ in core.properties, so if you want to try this just create a core directory by hand under SOLR_HOME and name it by the pattern that other cores use. Then use a core.properties file from another replica in the same collection and substitute every property in the file with the values from the deleted node in ZK. If you put the index in the data dir and restart Solr you should be good to go. But as Shawn says, there are other issues. WARNING: even a small mis-step can have "interesting" results. No, I don't think this is something I think would be good to add to the collections API. Having replicas distributed among the cluster and backup/restore are the supported ways of not winding up in this situation. Having the collections command would be too easy to mis-use. Any reconstructed node would NOT have any data associated with it, so it requires being in a non-standard setup to be useful IMO. Frankly, to cover your situation, why not copy from one level up? I.e. save off "collection1_core1_replica1n"? Assuming you have a leader, ADDREPLICA is also a way to build out your cluster after a failure. Best, Erick On Tue, Apr 10, 2018 at 11:16 AM, Shawn Heisey wrote: > On 4/9/2018 2:28 PM, Karthik Ramachandran wrote: >> We are using Solr cloud with 3 nodes, no replication, with 8 shards per node >> per collection. We have multiple collections on that node. >> >> We have a backup of the data folder, so we can recover it; is there a >> way to reconstruct core.properties for all the replicas for that node? > > If the router in use for a multi-shard collection is compositeId, then > the core.properties file does NOT contain all the information necessary > to restore the shard in the collection. Part of the information will > ONLY be in zookeeper. I am thinking specifically of the hash range for the > shard. > > I think it might be a good idea to write all information necessary to > fully recreate a shard in a collection to core.properties. 
Make it > possible to reconstruct a SolrCloud if the zookeeper database is lost. > Store information currently only in ZK, like the router name, shard hash > range, etc. > > Thanks, > Shawn >
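[To illustrate how little core.properties actually contains — and hence why, as Shawn notes, the shard hash range has to come from ZK — a SolrCloud core's file typically looks like the sketch below. Every value here is hypothetical and would be copied from the replica's entry in the collection's state.json:

```properties
name=collection1_shard1_replica_n1
collection=collection1
shard=shard1
coreNodeName=core_node3
replicaType=NRT
```
]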
Re: Recover a Solr Node
On 4/9/2018 2:28 PM, Karthik Ramachandran wrote: > We are using Solr cloud with 3 nodes, no replication with 8 shard per node > per collection. We have multiple collection on that node. > > We have backup of data the data folder, so we can recover it, is there a > way to reconstruct core.properties for all the replica's for that node? If the router in use for a multi-shard collection is compositeId, then the core.properties file does NOT contain all the information necessary to restore the shard in the collection. Part of the information will ONLY be in zookeeper. I am thinking specifically the hash range for the shard. I think it might be a good idea to write all information necessary to fully recreate a shard in a collection to core.properties. Make it possible to reconstruct a SolrCloud if the zookeeper database is lost. Store information currently only in ZK, like the router name, shard hash range, etc. Thanks, Shawn
Re: Recover a Solr Node
Erick, Just throwing out what's on my mind. I see that the collection cluster state has all the information needed to create the core.properties. If I create the core.properties files from the cluster state and then reload the collection, will that bring the collection up? I did try the above steps, but instead of a reload I restarted the Solr instance, and the collection was up and running. Do you think it would be worth putting this support in the Solr collections API? With Thanks & Regards Karthik
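[Karthik's approach can be sketched as a small script: read the collection's cluster state and emit one core.properties body per replica. The JSON below mirrors the shape of state.json, but all names are made up and the real file carries more fields (base_url, state, leader, etc.):

```python
import json

# Minimal slice of a collection's state.json (all names are made up).
state_json = """
{
  "collection1": {
    "shards": {
      "shard1": {
        "range": "80000000-ffffffff",
        "replicas": {
          "core_node3": {
            "core": "collection1_shard1_replica_n1",
            "node_name": "host1:8983_solr",
            "type": "NRT"
          }
        }
      }
    }
  }
}
"""

def core_properties(collection, state):
    """Yield (core_name, core.properties text) for every replica."""
    for shard, shard_data in state["shards"].items():
        for core_node, replica in shard_data["replicas"].items():
            lines = [
                "name=" + replica["core"],
                "collection=" + collection,
                "shard=" + shard,
                "coreNodeName=" + core_node,
                "replicaType=" + replica["type"],
            ]
            yield replica["core"], "\n".join(lines) + "\n"

state = json.loads(state_json)
for core, text in core_properties("collection1", state["collection1"]):
    print(core)
    print(text)
```

In practice you would filter replicas by node_name for the lost node before writing the files, and this only works while ZK is intact — the shard hash range still lives only there.]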
Re: Backup a solr cloud collection - timeout in 180s?
HI Erick, I *just* found that parameter in the guide... it was waaay down at the bottom of the page (in proverbial small print)! So for other readers the steps are this:
# start the backup async enabled
/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups&async=1234
# check on the status of the async job
/admin/collections?action=REQUESTSTATUS&requestid=1234
# clear out the status when done
/admin/collections?action=DELETESTATUS&requestid=1234
Thx Robi
From: Erick Erickson Sent: Tuesday, April 10, 2018 8:24:20 AM To: solr-user Subject: Re: Backup a solr cloud collection - timeout in 180s? WARNING: External email. Please verify sender before opening attachments or clicking on links. Specify the "async" property, see: https://lucene.apache.org/solr/guide/6_6/collections-api.html There's also a way to check the status of the backup running in the background. Best, Erick
On Mon, Apr 9, 2018 at 11:05 AM, Petersen, Robert (Contr) wrote: > Shouldn't this just create the backup file(s) asynchronously? Can the timeout > be adjusted?
> Solr 7.2.1 with five nodes and the addrsearch collection is five shards x > five replicas and "numFound":38837970 docs > > Thx > Robi
> http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups
> { "responseHeader": { "status": 500, "QTime": 180211 }, "error": { "metadata": [ "error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException" ], "msg": "backup the collection time out:180s", ... } }
> From the logs:
> 2018-04-09 17:47:32.667 INFO (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch} status=500 QTime=180211
> 2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: backup the collection time out:180s
> at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
> at org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
> at org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
> at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
> at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
Re: Backup a solr cloud collection - timeout in 180s?
Specify the "async" property, see: https://lucene.apache.org/solr/guide/6_6/collections-api.html There's also a way to check the status of the backup running in the background. Best, Erick
On Mon, Apr 9, 2018 at 11:05 AM, Petersen, Robert (Contr) wrote: > Shouldn't this just create the backup file(s) asynchronously? Can the timeout > be adjusted?
> Solr 7.2.1 with five nodes and the addrsearch collection is five shards x > five replicas and "numFound":38837970 docs > > Thx > Robi
> http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups
> { "responseHeader": { "status": 500, "QTime": 180211 }, "error": { "metadata": [ "error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException" ], "msg": "backup the collection time out:180s", ... } }
> From the logs:
> 2018-04-09 17:47:32.667 INFO (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch} status=500 QTime=180211
> 2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: backup the collection time out:180s
> at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
> at org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
> at org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
> at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
> at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
Re: Recover a Solr Node
Not that I know of. You might be able to do an ADDREPLICA for each one. This is a risk when running without replicas. Best, Erick On Mon, Apr 9, 2018 at 1:28 PM, Karthik Ramachandran wrote: > We are using Solr cloud with 3 nodes, no replication, with 8 shards per node > per collection. We have multiple collections on that node. > > We have a backup of the data folder, so we can recover it; is there a > way to reconstruct core.properties for all the replicas for that node? > > -- > With Thanks & Regards > Karthik
Re: replication
bq: should we try to bite the solrcloud bullet and be done w it that's what I'd do. As of 7.0 there are different "flavors", TLOG, PULL and NRT so that's also a possibility, although you can't (yet) direct queries to one or the other. So just making them all NRT and forgetting about it is reasonable. bq: is there some more config work we could put in place to avoid ... commit issue and the ultra large merge dangers No. The very nature of merging is such that you will _always_ get large merges until you have 5G segments (by default). The max segment size (outside "optimize/forceMerge/expungeDeletes" which you shouldn't do) is 5G so the steady-state worst-case segment pull is limited to that. bq: maybe for our initial need we use Master for writing and user access in NRT events, but slaves for the heavier backend Quite possible, but you have to route things yourself. But in that case you're limited to one machine to handle all your NRT traffic. I skimmed your post so don't know whether your NRT traffic load is high enough to worry about. The very first thing I'd do is set up a simple SolrCloud setup and give it a spin. Unless your indexing load is quite heavy, the added work the NRT replicas have in SolrCloud isn't a problem so worrying about that is premature optimization unless you have a heavy load. Best, Erick On Mon, Apr 9, 2018 at 4:36 PM, John Blythe wrote: > Thanks a bunch for the thorough reply, Shawn. > > Phew. We’d chosen to go w Master-slave replication instead of SolrCloud per > the sudden need we had encountered and the desire to avoid the nuances and > changes related to moving to SolrCloud. But so much for this being a more > straightforward solution, huh? > > Few questions: > - should we try to bite the solrcloud bullet and be done w it? > - is there some more config work we could put in place to avoid the soft > commit issue and the ultra large merge dangers, keeping the replications > happening quickly? 
> - maybe for our initial need we use Master for writing and user access in > NRT events, but slaves for the heavier backend processing. Thoughts? > - anyone do consulting on this that would be interested in chatting? > > Thanks again! > > On Mon, Apr 9, 2018 at 18:18 Shawn Heisey wrote: > >> On 4/9/2018 12:15 PM, John Blythe wrote: >> > we're starting to dive into master/slave replication architecture. we'll >> > have 1 master w 4 slaves behind it. our app is NRT. if user performs an >> > action in section A's data they may choose to jump to section B which >> will >> > be dependent on having the updates from their action in section A. as >> such, >> > we're thinking that the replication time should be set to 1-2s (the >> chances >> > of them arriving at section B quickly enough to catch the 2s gap is >> highly >> > unlikely at best). >> >> Once you start talking about master-slave replication, my assumption is >> that you're not running SolrCloud. You would NOT want to try and mix >> SolrCloud with replication. The features do not play well together. >> SolrCloud with NRT replicas (this is the only replica type that exists >> in 6.x and earlier) may be a better option than master-slave replication. >> >> > since the replicas will simply be looking for new files it seems like >> this >> > would be a lightweight operation even every couple seconds for 4 >> replicas. >> > that said, i'm going *entirely* off of assumption at this point and >> wanted >> > to check in w you all to see any nuances, gotchas, hidden landmines, etc. >> > that we should be considering before rolling things out. >> >> Most of the time, you'd be correct to think that indexing is going to >> create a new small segment and replication will have little work to do. >> But as you create more and more segments, eventually Lucene is going to >> start merging those segments. 
For discussion purposes, I'm going to >> describe a situation where each new segment during indexing is about >> 100KB in size, and the merge policy is left at the default settings. >> I'm also going to assume that no documents are getting deleted or >> reindexed (which will delete the old version). Deleted documents can >> have an impact on merging, but it will usually only be a dramatic impact >> if there are a LOT of deleted documents. >> >> The first ten segments created will be this 100KB size. Then Lucene is >> going to see that there are enough segments to trigger the merge policy >> - it's going to combine ten of those segments into one that's >> approximately one megabyte. Repeat this ten times, and ten of those 1 >> megabyte segments will be combined into one ten megabyte segment. >> Repeat all of THAT ten times, and there will be a 100 megabyte segment. >> And there will eventually be another level creating 1 gigabyte >> segments. If the index is below 5GB in size, the entire thing *could* >> be merged into one segment by this process. >> >> The end result of all this:
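[Shawn's arithmetic above can be turned into a toy model: every flush adds a 100 KB segment, and any ten equal-sized segments merge into one segment ten times the size. Both numbers are his illustration-only assumptions, and real TieredMergePolicy behavior is considerably more nuanced, but the model shows how steadily larger merges emerge:

```python
# Toy model of tiered merging as described in the email above.
FLUSH_KB = 100      # size of each newly flushed segment (assumed)
MERGE_FACTOR = 10   # segments of equal size that merge into one

def simulate(flushes):
    """Return segment sizes (KB), largest first, after `flushes` flushes."""
    segments = []
    for _ in range(flushes):
        segments.append(FLUSH_KB)
        merged = True
        while merged:            # cascade merges level by level
            merged = False
            for size in sorted(set(segments)):
                if segments.count(size) >= MERGE_FACTOR:
                    for _ in range(MERGE_FACTOR):
                        segments.remove(size)
                    segments.append(size * MERGE_FACTOR)
                    merged = True
    return sorted(segments, reverse=True)

# 1000 flushes of 100 KB cascade into a single 100 MB segment
print(simulate(1000))  # -> [100000]
```

This is why replication is usually cheap (one small new segment) but periodically has to ship a segment orders of magnitude larger than a single flush.]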
Query regarding LTR plugin in solr
Hi, I'm working on ltr feature in solr. I have a feature like : ''' { "store" : "my_feature_store", "name" : "in_aggregated_terms", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" : "{!func}scale(query({!payload_score f=aggregated_terms func=max v=${query}}),0,100)" } } ''' Here the scaling function is taking a lot more time than expected. Is there a way I could implement a customized class or any other way by which I can reduce this time. So basically I just want to scale the value which looks at the whole result set instead of just the current document. Can I have/implement something during normalization?? Thanks in advance Regards, Prateek
Ignore Field from indexing
Hi, I have documents indexed. Email-Id is the unique key in the document. On update I need to ignore a few fields if they already exist. Please let me know if more information is required.
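[Solr's atomic updates have no built-in "set only if absent" per field, so one client-side approach is: fetch the stored document by its unique key first, then send atomic 'set' operations only for fields that are not already populated. The field names below are made up; a custom UpdateRequestProcessor could do the same server-side.

```python
def build_update(existing, incoming, key="email_id"):
    """Return an atomic-update doc that leaves already-populated fields alone."""
    update = {key: incoming[key]}
    for field, value in incoming.items():
        if field == key:
            continue
        # only 'set' a field the stored document does not already have
        if field not in existing or existing[field] in (None, "", []):
            update[field] = {"set": value}
    return update

existing = {"email_id": "a@b.com", "name": "Alice"}
incoming = {"email_id": "a@b.com", "name": "Overwrite?", "city": "Pune"}
print(build_update(existing, incoming))
```

Note the fetch-then-update pair is not atomic; under concurrent writers you would also pass `_version_` for optimistic concurrency.]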
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I actually used solr 5.x, the more like this features, and a subset of human tagged data (about 10%) to apply subject coding with around a 95% accuracy rate to over 2 million documents, so it is definitely doable On Tue, Apr 10, 2018 at 10:40 AM, Alexandre Rafalovitch wrote: > I know it was a joke, but I've been thinking of something like that. > Not a chatbot per se, but perhaps something that uses Machine > Learning/topic clustering on the past discussions and match them to > the new questions. Still would need to be rechecked by a human for > final response, but could be very helpful. I certainly wished for that > many times as I was answering newbie's questions (or my own). > > And, I feel, current version of Solr actually has all the pieces to > make such thing happen. Could be a fun project/demo/service for > the next LuceneSolrRevolution for somebody with time on their hands > :-) > > Regards, >Alex. > > On 9 April 2018 at 13:24, Allison, Timothy B. wrote: > > +1 > > > > https://lucidworks.com/2012/02/14/indexing-with-solrj/ > > > > We should add a chatbot to the list that includes Charlie's advice and > the link to Erick's blog post whenever Tika is used. > > > > > > -----Original Message----- > > From: Charlie Hull [mailto:char...@flax.co.uk] > > Sent: Monday, April 9, 2018 12:44 PM > > To: solr-user@lucene.apache.org > > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML > document instead of Solr's MostlyPassthroughHtmlMapper ? > > > > I'd recommend you run Tika externally to Solr, which will allow you to > catch this kind of problem and prevent it bringing down your Solr > installation. > > > > Cheers > > > > Charlie > > > > On 9 April 2018 at 16:59, Hanjan, Harinder > > wrote: > > > >> Hello! > >> > >> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents > >> we have in our Sharepoint system. 
I have used the tika-app.jar > >> directly to extract the document in question and it does _not_ throw > >> an exception and extracts the contents just fine. So it would seem Solr > >> is doing something different than a Tika standalone installation. > >> > >> After some Googling, I found out that Solr uses its custom HtmlMapper > >> (MostlyPassthroughHtmlMapper) which passes through all elements in the > >> HTML document to Tika. As Tika limits nested elements to 100, this > >> causes Tika to throw an exception: Suspected zip bomb: 100 levels of > >> XML element nesting. This is mentioned in TIKA-2091 > >> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The > >> "solution" is to use Tika's default parsing/mapping mechanism but no > >> details have been provided on how to configure this in Solr. > >> > >> I'm hoping some folks here have the knowledge on how to configure Solr > >> to effectively by-pass its built-in MostlyPassthroughHtmlMapper and > >> use Tika's implementation. > >> > >> Thank you! > >> Harinder > >> > >> > >> NOTICE - > >> This communication is intended ONLY for the use of the person or > >> entity named above and may contain information that is confidential or > >> legally privileged. If you are not the intended recipient named above > >> or a person responsible for delivering messages or communications to > >> the intended recipient, YOU ARE HEREBY NOTIFIED that any use, > >> distribution, or copying of this communication or any of the > >> information contained in it is strictly prohibited. If you have > >> received this communication in error, please notify us immediately by > >> telephone and then destroy or delete this communication, or return it > >> to us by mail if requested by us. The City of Calgary thanks you for > your attention and co-operation. > >> >
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I know it was a joke, but I've been thinking of something like that. Not a chatbot per se, but perhaps something that uses Machine Learning/topic clustering on the past discussions and matches them to the new questions. It would still need to be rechecked by a human for the final response, but it could be very helpful. I certainly wished for that many times as I was answering newbies' questions (or my own).

And, I feel, the current version of Solr actually has all the pieces to make such a thing happen. Could be a fun project/demo/service for the next Lucene/Solr Revolution for somebody with time on their hands :-)

Regards,
Alex.

On 9 April 2018 at 13:24, Allison, Timothy B. wrote:
> +1
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> We should add a chatbot to the list that includes Charlie's advice and the
> link to Erick's blog post whenever Tika is used.
>
> -----Original Message-----
> From: Charlie Hull [mailto:char...@flax.co.uk]
> Sent: Monday, April 9, 2018 12:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> I'd recommend you run Tika externally to Solr, which will allow you to catch
> this kind of problem and prevent it from bringing down your Solr installation.
>
> Cheers
>
> Charlie
>
> On 9 April 2018 at 16:59, Hanjan, Harinder wrote:
>
>> Hello!
>>
>> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
>> we have in our SharePoint system. I have used the tika-app.jar
>> directly to extract the document in question and it does _not_ throw
>> an exception; it extracts the contents just fine. So it would seem Solr
>> is doing something different than a standalone Tika installation.
>>
>> After some Googling, I found out that Solr uses its custom HtmlMapper
>> (MostlyPassthroughHtmlMapper), which passes through all elements in the
>> HTML document to Tika. As Tika limits nested elements to 100, this
>> causes Tika to throw an exception: "Suspected zip bomb: 100 levels of
>> XML element nesting". This is mentioned in TIKA-2091
>> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
>> The "solution" is to use Tika's default parsing/mapping mechanism, but no
>> details have been provided on how to configure this in Solr.
>>
>> I'm hoping some folks here know how to configure Solr
>> to effectively bypass its built-in MostlyPassthroughHtmlMapper and
>> use Tika's implementation.
>>
>> Thank you!
>> Harinder
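The "Suspected zip bomb: 100 levels of XML element nesting" error fires once element nesting exceeds Tika's limit of 100. If you are running extraction outside Solr (as Charlie recommends), a small stdlib-only pre-flight check can flag such documents before they reach Tika. This is a sketch; Tika's actual counter may differ in detail, e.g. in how it treats malformed or implicitly closed markup:

```python
from html.parser import HTMLParser

class DepthChecker(HTMLParser):
    """Track the deepest element nesting in an HTML document."""
    # Void elements never nest, so they don't change the depth.
    VOID = {"br", "img", "hr", "meta", "link", "input", "area",
            "base", "col", "embed", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth = max(0, self.depth - 1)

def max_nesting(html):
    parser = DepthChecker()
    parser.feed(html)
    return parser.max_depth

deep = "<div>" * 120 + "x" + "</div>" * 120
print(max_nesting("<html><body><p>hi</p></body></html>"))  # 3
print(max_nesting(deep) > 100)  # True -- Tika would reject this one
```

Documents that exceed the threshold can then be flattened or skipped before indexing, instead of aborting the whole Solr Cell request.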
Re: Confusing error when creating a new core with TLS, service enabled
On 4/10/2018 7:32 AM, Christopher Schultz wrote:
>> What happened is that the new core directory was created as root, owned by root.
>
> Was it? If my server is running as solr, how can it create directories as root?

Unless you run Solr in cloud mode (which means using ZooKeeper), the server cannot create the core directories itself. When running in standalone mode, the core directory is created by the bin/solr program doing the "create" -- which was running as root. I know that because you needed the "-force" option. So the core directory and its "conf" subdirectory (with the config) are created by the script, then Solr is asked (using the CoreAdmin API via HTTP) to add that core. It can't, because the new directory was created by root, and Solr can't write the core.properties file that defines the core for Solr.

When running Solr in cloud mode, the configs are in ZooKeeper, so the create command in the script doesn't have to make the core directory in order for Solr to find the configuration. It can simply upload the config to ZooKeeper and then tell Solr to create the collection, and Solr will do so, locating the configuration in ZooKeeper.

See the big warning in the CREATE section of the CoreAdmin API documentation about the CREATE action needing to be able to find a configuration:

https://lucene.apache.org/solr/guide/7_3/coreadmin-api.html

You might be wondering why Solr can't create the core directories itself using the CoreAdmin API except in cloud mode. This is because the CoreAdmin API is *OLD* and its functionality has not really changed since it was created. Historically, it was only designed to add a core that had already been created. We probably need to "fix" this ... but it has never been a priority. There are bigger problems and features to work on. Cloud mode is much newer, and although the Collections API does utilize the CoreAdmin API behind the scenes, the user typically doesn't use CoreAdmin directly in cloud mode.

> The client may be running as root, but the server is running as 'solr'. And
> the error occurs on the server, not the client. So, what's really going on here?

I hope I've explained that clearly above.

Thanks,
Shawn
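Shawn's diagnosis -- the script (running as root) creates the core directory, and the server (running as solr) then cannot write core.properties into it -- can be confirmed with a quick ownership/mode check. A rough diagnostic sketch (not from the thread): it looks only at the classic owner/group/other permission bits and ignores ACLs, capabilities, and supplementary groups:

```python
import os
import pwd
import stat
import tempfile

def user_can_write(path, username):
    """Roughly check whether `username` could create a file
    (e.g. core.properties) inside directory `path`, based on the
    owner/group/other write bits. A diagnostic, not a security check."""
    st = os.stat(path)
    user = pwd.getpwnam(username)
    if st.st_uid == user.pw_uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid == user.pw_gid:  # primary group only
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)

# Demo: a fresh temp dir is owned by us with mode 0700, so writable.
d = tempfile.mkdtemp()
me = pwd.getpwuid(os.getuid()).pw_name
print(user_can_write(d, me))  # True
```

A directory created by root with the usual 0755 mode makes this return False for user "solr" -- which is exactly the situation where Solr cannot write core.properties.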
Re: Confusing error when creating a new core with TLS, service enabled
Shawn,

On 4/9/18 8:04 PM, Shawn Heisey wrote:
> On 4/9/2018 12:58 PM, Christopher Schultz wrote:
>> After playing around with a Solr 7.2.1 instance launched from the
>> extracted tarball, I decided to go ahead and create a "real service" on
>> my Debian-based server.
>>
>> I've run the 7.3.0 install script, configured Solr for TLS, and moved my
>> existing configuration into the data directory, here:
>
> What was the *precise* command you used to install Solr?

$ sudo bin/install_solr_service.sh ../solr-7.3.0.tgz -i /usr/local/

> Looking for all the options you used, so I know where things are. There
> shouldn't be anything sensitive in that command, so I don't think you
> need to redact it at all. Also, what exactly did you add to
> /etc/default/solr.in.sh? Redact any passwords you put there if you need to.

# Set by installer
SOLR_PID_DIR="/var/solr"
SOLR_HOME="/var/solr/data"
LOG4J_PROPS="/var/solr/log4j.properties"
SOLR_LOGS_DIR="/var/solr/logs"
SOLR_PORT="8983"

# Set by me
SOLR_JAVA_HOME=/usr/local/java-8
SOLR_SSL_KEY_STORE=/etc/solr/solr.p12
SOLR_SSL_KEY_STORE_PASSWORD=xxx
SOLR_SSL_KEY_STORE_TYPE=PKCS12
SOLR_SSL_TRUST_STORE=/etc/solr/solr-client.p12
SOLR_SSL_TRUST_STORE_PASSWORD=xxx
SOLR_SSL_TRUST_STORE_TYPE=PKCS12

>> When trying to create a new core, I get an NPE running:
>>
>> $ /usr/local/solr/bin/solr create -V -c new_core
>>
>> WARNING: Using _default configset with data driven schema functionality.
>> NOT RECOMMENDED for production use.
>> To turn off: bin/solr config -c new_core -p 8983 -property
>> update.autoCreateFields -value false
>> Exception in thread "main" java.lang.NullPointerException
>> at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:731)
>> at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:642)
>> at org.apache.solr.util.SolrCLI$CreateTool.runImpl(SolrCLI.java:1773)
>> at org.apache.solr.util.SolrCLI$ToolBase.runTool(SolrCLI.java:176)
>> at org.apache.solr.util.SolrCLI.main(SolrCLI.java:282)
>
> Due to the way the code is written there in version 7.3, the exact
> nature of the problem is lost and it's not possible to see it without a
> change to the source code. If you want to build a patched version of
> 7.3, you could re-run it to see exactly what happened. Here's an issue
> for the NPE problem:
>
> https://issues.apache.org/jira/browse/SOLR-12206

Thanks.

> Best guess about the error that it got: When you ran the create
> command, I think that Java was not able to validate the SSL certificate
> from the Solr server. This would be consistent with what I saw in the
> source code.

This particular scenario was that the Solr client was trying to use HTTP on port 8983 (because solr.in.sh could not be read for the TLS hints) and getting a (broken) TLS handshake response. So it wasn't even an HTTP response, which is probably why the client was (very) confused.

> For the problem you had later with "-force" ... this is *exactly* why
> you shouldn't run bin/solr as root.

Not running as root. I'm on the Tomcat security team. I'm obviously not wanting to run the server as root.

$ ps aux | grep -e 'PID\|solr'
USER   PID %CPU %MEM    VSZ    RSS TTY STAT START TIME COMMAND
solr 18309  0.0  3.3 2148524 257164 ?  Sl   Apr09 0:22 [cmd]

File permissions make sense, too:

$ sudo ls -ld /var/solr/data
drwxr-x--- 3 solr solr 4096 Apr 9 15:06 /var/solr/data
$ sudo ls -l /var/solr/data
total 12
drwxr-xr-x 4 solr solr 4096 Mar 5 15:12 test_core
-rw-r----- 1 solr solr 2117 Apr 9 09:49 solr.xml
-rw-r----- 1 solr solr  975 Apr 9 09:49 zoo.cfg

> What happened is that the new core directory was created as root,
> owned by root.

Was it? If my server is running as solr, how can it create directories as root?

> But then when Solr tried to add the core, it needed to write a
> core.properties file to that directory, but was not able to do so,
> probably because it's running as "solr" and has no write permission
> in a directory owned by root.

That makes absolutely no sense whatsoever. The server is running under a single egid, and it's 'solr', not 'root'. Also, there is no new directory in /var/solr/data (owned by either solr OR root), and if Solr was able to create that directory, it should be able to write to it.

The client may be running as root, but the server is running as 'solr'. And the error occurs on the server, not the client. So, what's really going on here?

> The error in the message from the command with "-force" seems to have
> schizophrenia.

I absolutely edited the log and failed to do so completely.

-chris
Re: Collapse in facet
It sounds like the JSON Facet API could do what you are describing. I haven't tried the exclusion of the collapse filter with the JSON Facet API, but I suspect it will work.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Apr 10, 2018 at 3:40 AM, Carl-Johan Syrén wrote:
> Hi
>
> I use fq='{!collapse field=id_publ}' in a query to get distinct
> publication IDs, and it works fine. Then I have three facets, and in one of
> them I want to collapse on another level.
> For each organisation ID I want to count distinct publication IDs. To do
> that I start by suppressing the collapse in fq
> (facet.field={!ex=collapse}id_publ), but I don't know how to count the
> distinct publication IDs for each organisation ID.
> Does anyone have an idea how to do it?
> I use Solr 7.2.1.
>
> Regards
>
> Carl-Johan Syrén
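A sketch of the JSON Facet request Joel is suggesting: tag the collapse filter, exclude that tag in the facet domain, and count distinct publication IDs per organisation with the unique() aggregation. The `org_id` field name is an assumption (the thread doesn't name the organisation field), and the exact tag placement on the collapse filter may need adjusting; this only builds the body you would POST to Solr:

```python
import json

# JSON Facet API request body: collapse on id_publ for the result list,
# but exclude that filter inside the facet (excludeTags) and count
# distinct publication IDs per organisation with unique().
request = {
    "query": "*:*",
    "filter": ["{!collapse tag=collapse field=id_publ}"],
    "facet": {
        "organisations": {
            "type": "terms",
            "field": "org_id",  # assumed organisation-id field name
            "domain": {"excludeTags": "collapse"},
            "facet": {"distinct_pubs": "unique(id_publ)"},
        }
    },
}
print(json.dumps(request, indent=2))
```

Each bucket under "organisations" should then carry a distinct_pubs count computed over the un-collapsed domain, which is the per-organisation distinct publication count Carl-Johan is after.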
Re: Text in images are not extracted and indexed to content
To index text in images, the image needs to be searchable, i.e. the text needs to be overlaid on the image, as in a searchable PDF. You can do this using OCR, but it is a bit unreliable if the images are scanned copies of written text.

On 10-Apr-2018 4:12 PM, "Rahul Singh" wrote:

May need to extract outside Solr and index pure text with an external ingestion process. You have much more control over the Tika attributes and behaviors.

--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> Currently I am facing an issue whereby the text in image files like JPG and
> BMP is not being extracted and indexed. After the indexing, Tika did
> extract all the metadata and index it under the attr_* fields.
> However, the content field is always empty for image files. For other types
> of document files, like .doc, the content is extracted correctly.
>
> I have already updated tika-parsers-1.17.jar, under
> \prg\apache\tika\parser\pdf\, to set extractInlineImages to true.
>
> What could be the reason?
>
> I have just upgraded to Solr 7.3.0.
>
> Regards,
> Edwin
Re: Text in images are not extracted and indexed to content
May need to extract outside Solr and index pure text with an external ingestion process. You have much more control over the Tika attributes and behaviors.

--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> Currently I am facing an issue whereby the text in image files like JPG and
> BMP is not being extracted and indexed. After the indexing, Tika did
> extract all the metadata and index it under the attr_* fields.
> However, the content field is always empty for image files. For other types
> of document files, like .doc, the content is extracted correctly.
>
> I have already updated tika-parsers-1.17.jar, under
> \prg\apache\tika\parser\pdf\, to set extractInlineImages to true.
>
> What could be the reason?
>
> I have just upgraded to Solr 7.3.0.
>
> Regards,
> Edwin
Re: Score certain documents higher based on a weight field
Hi,

If you are using the (e)dismax query parser, you can use bf (additive) or boost (multiplicative) to boost results. You have the field function to access the field value (you can also just use the field name in most places).

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 10 Apr 2018, at 01:26, OTH wrote:
>
> Hello,
>
> Is there a way to assign a higher score to certain documents based on a
> 'weight' field?
>
> E.g., if I have the following two documents:
> { "name":"United Kingdom", "weight":2730 }
> { "name":"United States of America", "weight":11246 }
>
> Currently, if I issue the following query:
> q=name:united
>
> These are the scores I get:
> { "name":"United Kingdom", "weight":2730, "score":9.464103 }
> { "name":"United States of America", "weight":11246, "score":7.766276 }
>
> However, I'd like the score to somehow factor in the number in the "weight"
> field. (And hence increase the score assigned to "United States of
> America" in this case.)
>
> Much thanks
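Emir's suggestion as a concrete request: a multiplicative boost on the weight field via the edismax boost parameter. The log() wrapper is a common damping choice (an assumption here, not from the thread) so very large weights don't completely drown out text relevance. This only assembles the query string you would append to /select:

```python
from urllib.parse import urlencode

# edismax request whose final score = text-relevance score multiplied
# by log(weight + 1); swap "boost" for "bf" to make it additive instead.
params = {
    "defType": "edismax",
    "q": "united",
    "qf": "name",
    "boost": "log(sum(field(weight),1))",  # multiplicative boost
    "fl": "name,weight,score",
}
print(urlencode(params))
```

With this boost, "United States of America" (weight 11246) should overtake "United Kingdom" (weight 2730) despite its slightly lower raw text score, which is the behaviour OTH asked for.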
Collapse in facet
Hi

I use fq='{!collapse field=id_publ}' in a query to get distinct publication IDs, and it works fine. Then I have three facets, and in one of them I want to collapse on another level.

For each organisation ID I want to count distinct publication IDs. To do that I start by suppressing the collapse in fq (facet.field={!ex=collapse}id_publ), but I don't know how to count the distinct publication IDs for each organisation ID.

Does anyone have an idea how to do it? I use Solr 7.2.1.

Regards

Carl-Johan Syrén
Re: SOLR with Sitecore SXA
You should be subscribed to the list [1], then just mail in -- if you're not subscribed you don't get any follow-up mails from the list (besides all other mails that happen).

[1] http://lucene.apache.org/solr/community.html#mailing-lists-irc

-Stefan

On Mon, Apr 9, 2018, 5:03 PM Saul Nachman wrote:
> Do I ask for a subscription here first and then mail the main thread?
>
> Regards
>
> Saul