Thank you Walter.

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Friday, June 10, 2016 3:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source
Those are brand new features that I have not used, so I can’t comment on them. But I know they do not make Solr into a database. If you need a transactional database that can support search, you probably want MarkLogic. I worked at MarkLogic for a couple of years. In some ways, MarkLogic is like Solr, but the support for transactions goes very deep. It is not something you can put on top of a search engine.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jun 10, 2016, at 12:39 PM, Hui Liu <h...@opentext.com> wrote:
>
> What if we plan to use Solr version 6.x? This page says it supports two different update modes, atomic update and optimistic concurrency:
>
> https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
>
> I tested optimistic concurrency and it appears to work, i.e. if a document I am updating has been changed by someone else, I get an error when I supply a _version_ value. So maybe you are referring to an older version of Solr?
>
> Regards,
> Hui
>
> -----Original Message-----
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Friday, June 10, 2016 11:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
>
> Solr does not have transactions at all. The “commit” is really “submit batch”.
>
> Solr does not have update. You can add, delete, or replace an entire document.
>
> There is no optimistic concurrency control because there is no concurrency control. Clients can concurrently add documents to a batch, then any client can submit the entire batch.
>
> Replication is not transactional. Replication is a file copy of the underlying indexes (classic) or copying the documents in a batch (SolrCloud).
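The two update modes discussed above (atomic update and optimistic concurrency) come down to the shape of the JSON body POSTed to Solr's /update handler. A minimal sketch follows; the document id, field names, and version number are illustrative, not from this thread:

```python
import json

# Atomic update: modify one field of a stored document in place.
atomic = [{"id": "doc1", "status": {"set": "shipped"}}]

# Optimistic concurrency: include the _version_ value returned by an
# earlier read. If another client has changed doc1 since that read,
# Solr rejects the update (HTTP 409) instead of silently overwriting.
guarded = [{"id": "doc1",
            "_version_": 1565678901234567890,  # illustrative version value
            "status": {"set": "shipped"}}]

# Special _version_ values Solr understands:
#   >1  must match the document's current version exactly
#    1  document must exist (any version)
#   <0  document must NOT exist
#    0  no version check (default behavior)

body = json.dumps(guarded)
# POST body to http://localhost:8983/solr/<collection>/update?commit=true
# with Content-Type: application/json (e.g. via requests.post).
```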
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Jun 10, 2016, at 7:41 AM, Hui Liu <h...@opentext.com> wrote:
>>
>> Walter,
>>
>> Thank you for your advice. We are new to Solr, and having used Oracle for the past 10+ years we are used to a tool that serves as both a data store and a search index. The reason we are considering Solr as a data store is that it has some database features our application requires: 1) it can detect duplicate records via a unique key field; 2) it allows concurrent updates through its optimistic concurrency control feature; 3) its replication feature lets us keep multiple copies of the data. If we used a file system instead, we would lose those features (at least 1 and 2) and have to implement them ourselves. The other option is to pick another database such as MySQL or Cassandra, but then we would need to learn and support an additional tool besides Solr. You brought up several very good points about the operational factors we should consider if we pick Solr as a data store. Also, our application is more OLTP than OLAP. I will update our colleagues and stakeholders about these concerns. Thanks again!
>>
>> Regards,
>> Hui
>>
>> -----Original Message-----
>> From: Walter Underwood [mailto:wun...@wunderwood.org]
>> Sent: Thursday, June 09, 2016 1:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Questions regarding re-index when using Solr as a data source
>>
>> In the HowToReindex page, under “Using Solr as a Data Store”, it says this: “Don't do this unless you have no other option. Solr is not really designed for this role.” So don’t start by planning to do this.
>>
>> Using a second copy of Solr is still using Solr as a repository. That doesn’t satisfy any sort of requirements for disaster recovery.
>> How do you know that data is good? How do you make a third copy? How do you roll back to a previous version? How do you deal with a security breach that affects all your systems? Are the systems in the same data center? How do you deal with ransomware (U. of Calgary paid $20K yesterday)?
>>
>> If a consultant suggested this to me, I’d probably just give up and get a different consultant.
>>
>> Here is what we do for batch loading:
>>
>> 1. For each Solr collection, we define a JSONL feed format, with a JSON Schema.
>> 2. The owners of the data write an extractor to pull the data out of wherever it lives, then generate the JSON feed.
>> 3. We validate the JSON feed against the JSON Schema.
>> 4. If the feed is valid, we save it to Amazon S3 along with a manifest which lists the version of the JSON Schema.
>> 5. Then a multi-threaded loader reads the feed and sends it to Solr.
>>
>> Reloading is safe and easy, because all the feeds in S3 are valid.
>>
>> Storing backups in S3 instead of running a second Solr is massively cheaper, easier, and safer.
>>
>> We also have a clear contract between the content owners and the search team. That contract is enforced by the JSON Schema on every single batch.
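The validation step of the batch-loading pipeline described above can be sketched as follows. This is a stdlib-only stand-in for a real JSON Schema validator (in practice you would use something like the jsonschema package against the collection's schema file); the field names and type contract are illustrative:

```python
import json

# Hypothetical per-collection contract: field name -> expected Python type.
DOC_SCHEMA = {"id": str, "title": str, "price": float}
REQUIRED = {"id", "title"}

def validate_feed(lines):
    """Return (valid_docs, errors) for an iterable of JSONL lines.

    Only feeds with zero errors would be archived to S3 and loaded."""
    valid, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"bad JSON: {exc}"))
            continue
        missing = REQUIRED - doc.keys()
        badtype = [k for k, v in doc.items()
                   if k in DOC_SCHEMA and not isinstance(v, DOC_SCHEMA[k])]
        if missing or badtype:
            errors.append((lineno, f"missing={missing} badtype={badtype}"))
        else:
            valid.append(doc)
    return valid, errors

feed = [
    '{"id": "1", "title": "first doc", "price": 9.99}',
    '{"id": "2"}',  # missing required "title" -> rejected
]
docs, errs = validate_feed(feed)
```

Because every archived feed has already passed this check, reloading into Solr never has to re-run or re-debug the extraction step.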
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>
>>> On Jun 9, 2016, at 9:51 AM, Hui Liu <h...@opentext.com> wrote:
>>>
>>> Hi Walter,
>>>
>>> Thank you for the reply. Sorry, I should clarify what I mean by 'migrate tables' from Oracle to Solr. We are not literally moving existing records from Oracle to Solr. Instead, we are building a new application that feeds data directly into Solr as documents and fields, in parallel with an existing application that feeds the same data into Oracle tables/columns; of course, the Solr schema will be somewhat different from the Oracle one. We also keep the data for only 90 days for users to search on, so once the two systems have run in parallel for some time (> 90 days), we will have built up enough new data in Solr that we no longer need the old data in Oracle; by then we will be able to use Solr as our only data store.
>>>
>>> It sounds like we may need to consider saving the data into either a file system or another database, in case we need to rebuild the indexes. The reason I mentioned saving the data into another Solr system is the passage below from https://wiki.apache.org/solr/HowToReindex; I am just trying to get feedback on whether there is any update on this approach, and whether there is a better way to minimize the downtime caused by a schema change and re-index. For example, in Oracle we can add a new column or a new index online without any impact on existing queries, because the existing indexes stay intact.
>>>
>>> Alternatives when a traditional reindex isn't possible
>>>
>>> Sometimes the option of "do your indexing again" is difficult. Perhaps the original data is very slow to access, or it may be difficult to get in the first place.
>>>
>>> Here's where we go against our own advice that we just gave you. Above we said "don't use Solr itself as a datasource" ...
>>> but one way to deal with data availability problems is to set up a completely separate Solr instance (not distributed, which for SolrCloud means numShards=1) whose only job is to store the data, then use the SolrEntityProcessor in the DataImportHandler to index from that instance to your real Solr install. If you need to reindex, just run the import again on your real installation. Your schema for the intermediate Solr install would have stored="true" and indexed="false" for all fields, and would only use basic types like int, long, and string. It would not have any copyFields.
>>>
>>> This is the approach used by the Smithsonian for their Solr installation, because getting access to the source databases for the individual entities within the organization is very difficult. This way they can reindex the online Solr at any time without having to get special permission from all those entities. When they index new content, it goes into a copy of Solr configured for storage only, not for in-depth searching. Their main Solr instance uses SolrEntityProcessor to import from the intermediate Solr servers, so they can always reindex.
>>>
>>> Regards,
>>> Hui
>>>
>>> -----Original Message-----
>>> From: Walter Underwood [mailto:wun...@wunderwood.org]
>>> Sent: Thursday, June 09, 2016 12:19 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Questions regarding re-index when using Solr as a data source
>>>
>>> First, using Solr as a repository is pretty risky. I would keep the official copy of the data in a database, not in Solr.
>>>
>>> Second, you can’t “migrate tables” because Solr doesn’t have tables. You need to turn the tables into documents, then index the documents. It can take a lot of joins to flatten a relational schema into Solr documents.
>>>
>>> Solr does not support schema migration, so yes, you will need to save off all the documents, then reload them. I would save them to files.
>>> It makes no sense to put them in another copy of Solr.
>>>
>>> Changing the schema will be difficult and time-consuming, but you’ll probably run into much worse problems trying to use Solr as a repository.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>
>>>> On Jun 9, 2016, at 8:50 AM, Hui Liu <h...@opentext.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are porting an application currently hosted in Oracle 11g to Solr Cloud 6.x, i.e. we plan to migrate all the tables in Oracle to collections in Solr, index them, and build search tools on top of this; the goal is that we won't be using Oracle at all once this has been implemented. Every field in Solr will have 'stored=true', and a selected subset of searchable fields will have 'indexed=true'. The question is what steps we should follow if we need to re-index a collection after making schema changes - mostly we only add new fields to store, or make a non-indexed field indexed; we normally do not delete or rename existing fields. According to https://wiki.apache.org/solr/HowToReindex it seems we need to set up an 'intermediate' Solr1 that only stores the data without any indexing, then have another Solr2 that stores the indexed data; in case of a re-index, we would just delete all the documents for the collection in Solr2 and re-import the data from Solr1 into Solr2 using SolrEntityProcessor (from the dataimport handler). Is this still the recommended approach?
>>>> I can see the downside of this approach: if we have a tremendous amount of data in a collection (some of our collections could have several billion documents), re-importing it from Solr1 to Solr2 may take hours or even days, and during this time users cannot query the data. Is there any better way to do this and avoid this type of downtime? Any feedback is appreciated!
>>>>
>>>> Regards,
>>>> Hui Liu
>>>> Opentext, Inc.
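The intermediate-Solr reindex approach discussed throughout this thread boils down to a DataImportHandler configuration on the "real" (Solr2) install that pulls every stored document from the storage-only (Solr1) instance. A sketch of such a data-config file follows; the host name, port, collection name, and entity name are placeholders, not values from the thread:

```xml
<!-- data-config.xml on the indexing (Solr2) side, referenced by the
     /dataimport request handler registered in solrconfig.xml.
     The url points at the storage-only (Solr1) instance. -->
<dataConfig>
  <document>
    <entity name="fromStorage"
            processor="SolrEntityProcessor"
            url="http://solr1-host:8983/solr/mycollection"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>
```

Re-running the /dataimport full-import on Solr2 after a schema change then re-reads and re-indexes every document from Solr1, which is exactly the reindex path the HowToReindex wiki page describes.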