Re: Question about best way to architect a Solr application with many data sources

Walter Underwood Tue, 21 Feb 2017 20:25:00 -0800

Reindexing is exactly why you want the Single Source of Truth to be in a 
repository outside of Solr.


For our slowly-changing data sets, we have an intermediate JSONL batch. That is 
created from the source repositories and saved in Amazon S3. Then we load it 
into Solr nightly. That allows us to reload whenever we need to, like loading 
prod data in test or moving search to a different Amazon region.
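
A minimal sketch of what such a nightly load could look like, assuming the 
JSONL batch has already been copied from S3 to a local file and that each line 
is one Solr document. The collection name "listings", the localhost URL, and 
all function names are hypothetical and not from our actual setup; the only 
Solr behavior relied on is that the JSON update handler accepts an array of 
documents and a {"commit": {}} command.

```python
import json
import urllib.request
from typing import Iterable, Iterator, List


def read_jsonl_batches(lines: Iterable[str], batch_size: int = 500) -> Iterator[List[dict]]:
    """Parse JSONL lines and yield lists of at most batch_size documents."""
    batch: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the batch file
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def build_update_request(solr_update_url: str, payload) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to Solr's update handler."""
    return urllib.request.Request(
        solr_update_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_into_solr(lines: Iterable[str], solr_update_url: str, batch_size: int = 500) -> int:
    """Send every batch to Solr, then commit once. Returns the document count."""
    total = 0
    for docs in read_jsonl_batches(lines, batch_size):
        with urllib.request.urlopen(build_update_request(solr_update_url, docs)) as resp:
            resp.read()
        total += len(docs)
    # One explicit commit at the end instead of one per batch.
    with urllib.request.urlopen(build_update_request(solr_update_url, {"commit": {}})) as resp:
        resp.read()
    return total
```

Usage would be something like 
load_into_solr(open("listings.jsonl"), "http://localhost:8983/solr/listings/update"). 
Because the batch file is the input, the same call reloads a test cluster or a 
cluster in another region without touching the source repositories.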

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 21, 2017, at 7:34 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Dave:
> 
> Oh, I agree that a DB is a perfectly valid place to store the data and
> you're absolutely right that it allows better interaction than flat
> files; you can ask questions of an RDBMS that you can't easily ask the
> disk ;). Storing to disk is an alternative if you're unwilling to deal
> with a DB is all.
> 
> But the main point is you'll change your schema sometime and have to
> re-index. Having the data you're indexing stored locally in whatever
> form will allow much faster turn-around than re-crawling. Of
> course it'll result in out-of-date data, so you'll have to refresh
> somehow sometime.
> 
> Erick
> 
> On Tue, Feb 21, 2017 at 6:07 PM, Dave <hastings.recurs...@gmail.com> wrote:
>> Ha, I think I went to one of your training seminars in NYC maybe 4 years ago,
>> Erick. I'm going to have to respectfully disagree about the RDBMS. It's such
>> a well-known data format that you could hire a high school programmer to help
>> with the DB end if you knew how to flatten it to Solr. Besides, it's easy to
>> visualize and interact with the data before it goes to Solr. A JSON/NoSQL
>> format would work just as well, but I really think a database has its place
>> in a scenario like this.
>> 
>>> On Feb 21, 2017, at 8:20 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> I'll add that I _guarantee_ you'll want to re-index the data as you
>>> change your schema and the like. You'll be able to do that much more
>>> quickly if the data is stored locally somehow.
>>> 
>>> An RDBMS is not necessary, however. You could simply store the data on
>>> disk in some format you could re-read and send to Solr.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Tue, Feb 21, 2017 at 5:17 PM, Dave <hastings.recurs...@gmail.com> wrote:
>>>> B is the better option long term. Solr is meant for retrieving flat data
>>>> fast, not hierarchical data. That's what a database is for, and trust me,
>>>> you would rather have a real database on the end point. Each tool has a
>>>> purpose: Solr can never replace a relational database, and a relational
>>>> database could not replace Solr. Start with the slow model (database) for
>>>> control/display and enhance with the fast model (Solr) for retrieval/search.
>>>> 
>>>> 
>>>> 
>>>>> On Feb 21, 2017, at 7:57 PM, Robert Hume <rhum...@gmail.com> wrote:
>>>>> 
>>>>> To learn how to properly use Solr, I'm building a little experimental
>>>>> project with it to search for used car listings.
>>>>> 
>>>>> Car listings appear in a variety of different places ... central places
>>>>> like Craigslist and also many individual used car dealership websites.
>>>>> 
>>>>> I am wondering, should I:
>>>>> 
>>>>> (a) deploy a Solr search engine and build individual indexers for every
>>>>> type of web site I want to find listings on?
>>>>> 
>>>>> or
>>>>> 
>>>>> (b) build my own database to store car listings, and then build services
>>>>> that scrape data from different sites and feed entries into the database;
>>>>> then point my Solr search to my database, one simple source of listings?
>>>>> 
>>>>> My concerns are:
>>>>> 
>>>>> With (a) ... I have to be smart enough to understand all those different
>>>>> data sources and remove/update listings when they change; will this be
>>>>> harder to do with custom Solr indexers than writing something from
>>>>> scratch?
>>>>> 
>>>>> With (b) ... I'm maintaining a huge database of all my listings, which
>>>>> seems redundant; Google doesn't make a *copy* of everything on the
>>>>> internet, it just knows it's there.  Is maintaining my own database a
>>>>> bad design?
>>>>> 
>>>>> Thanks for reading!
