Thank you Walter.

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Friday, June 10, 2016 3:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source
Those are brand new features that I have not used, so I can’t comment on them. But I know they do not make Solr into a database. If you need a transactional database that can support search, you probably want MarkLogic. I worked at MarkLogic for a couple of years. In some ways, MarkLogic is like Solr, but the support for transactions goes very deep. It is not something you can put on top of a search engine.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jun 10, 2016, at 12:39 PM, Hui Liu <h...@opentext.com> wrote:
>
> What if we plan to use Solr version 6.x? This page says it supports two different update modes, atomic update and optimistic concurrency:
>
> https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
>
> I tested optimistic concurrency and it appears to work, i.e. if a document I am updating has been changed by someone else, I get an error when I supply a _version_ value. So maybe you are referring to an older version of Solr?
>
> Regards,
> Hui
>
> -----Original Message-----
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Friday, June 10, 2016 11:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Questions regarding re-index when using Solr as a data source
>
> Solr does not have transactions at all. The “commit” is really “submit batch”.
>
> Solr does not have update. You can add, delete, or replace an entire document.
>
> There is no optimistic concurrency control because there is no concurrency control. Clients can concurrently add documents to a batch, then any client can submit the entire batch.
>
> Replication is not transactional. Replication is a file copy of the underlying indexes (classic) or copying the documents in a batch (SolrCloud).
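The two update modes discussed above (atomic update and optimistic concurrency) come down to the shape of the JSON body POSTed to Solr's /update handler. A minimal sketch follows; the document id, field names, and version number are illustrative, not from this thread:

```python
import json

# Atomic update: modify one field of a stored document in place.
atomic = [{"id": "doc1", "status": {"set": "shipped"}}]

# Optimistic concurrency: include the _version_ value returned by an
# earlier read. If another client has changed doc1 since that read,
# Solr rejects the update (HTTP 409) instead of silently overwriting.
guarded = [{"id": "doc1",
            "_version_": 1565678901234567890,  # illustrative version value
            "status": {"set": "shipped"}}]

# Special _version_ values Solr understands:
#   >1  must match the document's current version exactly
#    1  document must exist (any version)
#   <0  document must NOT exist
#    0  no version check (default behavior)

body = json.dumps(guarded)
# POST body to http://localhost:8983/solr/<collection>/update?commit=true
# with Content-Type: application/json (e.g. via requests.post).
```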
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Jun 10, 2016, at 7:41 AM, Hui Liu <h...@opentext.com> wrote:
>>
>> Walter,
>>
>> Thank you for your advice. We are new to Solr, and having used Oracle for the past 10+ years we are used to a tool that serves as both a data store and a search index. The reason we are considering Solr as a data store is that it has some database features our application requires: 1) it can detect duplicate records via a unique key field; 2) it allows concurrent updates through its optimistic concurrency control feature; 3) its replication feature lets us keep multiple copies of the data. If we used a file system instead, we would lose those features (at least 1 and 2) and have to implement them ourselves. The other option is to pick another database such as MySQL or Cassandra, but then we would need to learn and support an additional tool besides Solr. You brought up several very good points about the operational factors we should consider if we pick Solr as a data store. Also, our application is more OLTP than OLAP. I will update our colleagues and stakeholders about these concerns. Thanks again!
>>
>> Regards,
>> Hui
>>
>> -----Original Message-----
>> From: Walter Underwood [mailto:wun...@wunderwood.org]
>> Sent: Thursday, June 09, 2016 1:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Questions regarding re-index when using Solr as a data source
>>
>> In the HowToReindex page, under “Using Solr as a Data Store”, it says this: “Don't do this unless you have no other option. Solr is not really designed for this role.” So don’t start by planning to do this.
>>
>> Using a second copy of Solr is still using Solr as a repository. That doesn’t satisfy any sort of requirements for disaster recovery.
>> How do you know that data is good? How do you make a third copy? How do you roll back to a previous version? How do you deal with a security breach that affects all your systems? Are the systems in the same data center? How do you deal with ransomware (U. of Calgary paid $20K yesterday)?
>>
>> If a consultant suggested this to me, I’d probably just give up and get a different consultant.
>>
>> Here is what we do for batch loading:
>>
>> 1. For each Solr collection, we define a JSONL feed format, with a JSON Schema.
>> 2. The owners of the data write an extractor to pull the data out of wherever it lives, then generate the JSON feed.
>> 3. We validate the JSON feed against the JSON Schema.
>> 4. If the feed is valid, we save it to Amazon S3 along with a manifest which lists the version of the JSON Schema.
>> 5. Then a multi-threaded loader reads the feed and sends it to Solr.
>>
>> Reloading is safe and easy, because all the feeds in S3 are valid.
>>
>> Storing backups in S3 instead of running a second Solr is massively cheaper, easier, and safer.
>>
>> We also have a clear contract between the content owners and the search team. That contract is enforced by the JSON Schema on every single batch.
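The validation step of the batch-loading pipeline described above can be sketched as follows. This is a stdlib-only stand-in for a real JSON Schema validator (in practice you would use something like the jsonschema package against the collection's schema file); the field names and type contract are illustrative:

```python
import json

# Hypothetical per-collection contract: field name -> expected Python type.
DOC_SCHEMA = {"id": str, "title": str, "price": float}
REQUIRED = {"id", "title"}

def validate_feed(lines):
    """Return (valid_docs, errors) for an iterable of JSONL lines.

    Only feeds with zero errors would be archived to S3 and loaded."""
    valid, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"bad JSON: {exc}"))
            continue
        missing = REQUIRED - doc.keys()
        badtype = [k for k, v in doc.items()
                   if k in DOC_SCHEMA and not isinstance(v, DOC_SCHEMA[k])]
        if missing or badtype:
            errors.append((lineno, f"missing={missing} badtype={badtype}"))
        else:
            valid.append(doc)
    return valid, errors

feed = [
    '{"id": "1", "title": "first doc", "price": 9.99}',
    '{"id": "2"}',  # missing required "title" -> rejected
]
docs, errs = validate_feed(feed)
```

Because every archived feed has already passed this check, reloading into Solr never has to re-run or re-debug the extraction step.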
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>
>>> On Jun 9, 2016, at 9:51 AM, Hui Liu <h...@opentext.com> wrote:
>>>
>>> Hi Walter,
>>>
>>> Thank you for the reply. Sorry, I should clarify what I mean by 'migrate tables' from Oracle to Solr. We are not literally moving existing records from Oracle to Solr. Instead, we are building a new application that feeds data directly into Solr as documents and fields, in parallel with an existing application that feeds the same data into Oracle tables/columns; of course, the Solr schema will be somewhat different from the Oracle one. We also keep the data for only 90 days for users to search on, so once the two systems have run in parallel for some time (> 90 days), we will have built up enough new data in Solr that we no longer need the old data in Oracle; by then we will be able to use Solr as our only data store.
>>>
>>> It sounds like we may need to consider saving the data into either a file system or another database, in case we need to rebuild the indexes. The reason I mentioned saving the data into another Solr system is the passage below from https://wiki.apache.org/solr/HowToReindex; I am just trying to get feedback on whether there is any update on this approach, and whether there is a better way to minimize the downtime caused by a schema change and re-index. For example, in Oracle we can add a new column or a new index online without any impact on existing queries, because the existing indexes stay intact.
>>>
>>> Alternatives when a traditional reindex isn't possible
>>>
>>> Sometimes the option of "do your indexing again" is difficult. Perhaps the original data is very slow to access, or it may be difficult to get in the first place.
>>>
>>> Here's where we go against our own advice that we just gave you. Above we said "don't use Solr itself as a datasource" ...
>>> but one way to deal with data availability problems is to set up a completely separate Solr instance (not distributed, which for SolrCloud means numShards=1) whose only job is to store the data, then use the SolrEntityProcessor in the DataImportHandler to index from that instance to your real Solr install. If you need to reindex, just run the import again on your real installation. Your schema for the intermediate Solr install would have stored="true" and indexed="false" for all fields, and would only use basic types like int, long, and string. It would not have any copyFields.
>>>
>>> This is the approach used by the Smithsonian for their Solr installation, because getting access to the source databases for the individual entities within the organization is very difficult. This way they can reindex the online Solr at any time without having to get special permission from all those entities. When they index new content, it goes into a copy of Solr configured for storage only, not for in-depth searching. Their main Solr instance uses SolrEntityProcessor to import from the intermediate Solr servers, so they can always reindex.
>>>
>>> Regards,
>>> Hui
>>>
>>> -----Original Message-----
>>> From: Walter Underwood [mailto:wun...@wunderwood.org]
>>> Sent: Thursday, June 09, 2016 12:19 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Questions regarding re-index when using Solr as a data source
>>>
>>> First, using Solr as a repository is pretty risky. I would keep the official copy of the data in a database, not in Solr.
>>>
>>> Second, you can’t “migrate tables” because Solr doesn’t have tables. You need to turn the tables into documents, then index the documents. It can take a lot of joins to flatten a relational schema into Solr documents.
>>>
>>> Solr does not support schema migration, so yes, you will need to save off all the documents, then reload them. I would save them to files.
>>> It makes no sense to put them in another copy of Solr.
>>>
>>> Changing the schema will be difficult and time-consuming, but you’ll probably run into much worse problems trying to use Solr as a repository.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>
>>>> On Jun 9, 2016, at 8:50 AM, Hui Liu <h...@opentext.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are porting an application currently hosted in Oracle 11g to Solr Cloud 6.x, i.e. we plan to migrate all the tables in Oracle to collections in Solr, index them, and build search tools on top of this; the goal is that we won't be using Oracle at all once this has been implemented. Every field in Solr will have 'stored=true', and a selected subset of searchable fields will have 'indexed=true'. The question is what steps we should follow if we need to re-index a collection after making schema changes - mostly we only add new fields to store, or make a non-indexed field indexed; we normally do not delete or rename existing fields. According to https://wiki.apache.org/solr/HowToReindex it seems we need to set up an 'intermediate' Solr1 that only stores the data without any indexing, then have another Solr2 that stores the indexed data; in case of a re-index, we would just delete all the documents for the collection in Solr2 and re-import the data from Solr1 into Solr2 using SolrEntityProcessor (from the dataimport handler). Is this still the recommended approach?
>>>> I can see the downside of this approach: if we have a tremendous amount of data in a collection (some of our collections could have several billion documents), re-importing it from Solr1 to Solr2 may take hours or even days, and during this time users cannot query the data. Is there any better way to do this and avoid this type of downtime? Any feedback is appreciated!
>>>>
>>>> Regards,
>>>> Hui Liu
>>>> Opentext, Inc.
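The intermediate-Solr reindex approach discussed throughout this thread boils down to a DataImportHandler configuration on the "real" (Solr2) install that pulls every stored document from the storage-only (Solr1) instance. A sketch of such a data-config file follows; the host name, port, collection name, and entity name are placeholders, not values from the thread:

```xml
<!-- data-config.xml on the indexing (Solr2) side, referenced by the
     /dataimport request handler registered in solrconfig.xml.
     The url points at the storage-only (Solr1) instance. -->
<dataConfig>
  <document>
    <entity name="fromStorage"
            processor="SolrEntityProcessor"
            url="http://solr1-host:8983/solr/mycollection"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>
```

Re-running the /dataimport full-import on Solr2 after a schema change then re-reads and re-indexes every document from Solr1, which is exactly the reindex path the HowToReindex wiki page describes.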