I've been working on a less complex project along the same lines: taking all the data from our corporate database and pumping it into Kafka for long-term storage, with the ability to "play back" all the Kafka messages any time we need to re-index.
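In case it's useful, the "play back" piece really is just a consumer that starts from the earliest retained offset and feeds whatever was stored back into Solr. Here's a rough sketch of what ours boils down to, using the plain Kafka and SolrJ Java clients; the topic name, collection name, field names, and the assumption that we key messages by document id are all just placeholders for illustration, and the "caught up" check is deliberately crude:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class KafkaReplayToSolr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            // Fresh group id per replay so we always start from the earliest offset
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "solr-replay-" + System.currentTimeMillis());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/mycollection").build()) {
                consumer.subscribe(Collections.singletonList("crawled-documents"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                    if (records.isEmpty()) break;   // crude "caught up" check, fine for a one-off replay
                    for (ConsumerRecord<String, String> record : records) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", record.key());        // key = document id in our setup
                        doc.addField("content", record.value()); // value = whatever was captured upstream
                        solr.add(doc);
                    }
                    solr.commit();
                }
            }
        }
    }

Since Solr overwrites on the uniqueKey, replaying the whole topic just rewrites documents in place, which is what makes the blow-away-and-rebuild approach painless.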
That simpler scenario has worked like a charm. I don't need to massage the data much once it's at rest in Kafka, so it was a straightforward solution, although I could have gone with a DB and just stored the Solr documents with their IDs one per row in an RDBMS...

The rest sounds like a good approach for your situation: Solr isn't the best candidate for the kind of data manipulation you're proposing, and a database excels at that. It's more work, but you get a lot more flexibility, and you de-couple Solr from the data crawling, as you say. It all sounds pretty good to me, but I've only been on the list here a short time, so I'll leave it to others to add their comments.

On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov> wrote:

> Question:
> Do any of you have your crawlers write to a database rather than directly to Solr and then use a connector to index to Solr from the database? If so, have you encountered any issues with this approach? If not, why not?
>
> I have searched forums and the Solr/Lucene email archives (including browsing of http://www.apache.org/foundation/public-archives.html) but have not found any discussions of this idea. I am certain that I am not the first person to think of it. I suspect that I have just not figured out the proper queries to find what I am looking for. Please forgive me if this idea has been discussed before and I just couldn't find the discussions.
>
> Background:
> I am new to Solr and have been asked to make improvements to our Solr configurations and crawlers. I have read that the Solr index should not be considered a source of record data. It is in essence a highly optimized index to be used for generating search results rather than a retainer for record copies of data. The better approach is to rely on corporate data sources for record data and retain the ability to completely blow away a Solr index and repopulate it as needed for changing search requirements.
>
> This made me think that perhaps it would be a good idea for us to create a database of crawled data for our Solr index. The idea is that the crawlers would write their findings to a corporate-supported database of our own design for our own purposes, and then we would populate our Solr index from this database using a connector that writes from the database to the Solr index.
>
> The only disadvantage that I can think of for this approach is that we will need to write a simple interface to the database that allows our admin personnel to "Delete" a record from the Solr index. Of course, it won't be deleted from the database but simply flagged as not to be indexed to Solr. It will then send a delete command to Solr for any successfully "deleted" records from the database. I suspect this admin interface will grow over time, but we really only need to be able to delete records from the database for now. All of the rest of our admin work is query related, which can still be done through the Solr Console.
>
> I can think of the following advantages:
>
> * We have a corporate-sponsored and backed-up repository for our crawled data which would buffer us from any inadvertent losses of our Solr index.
> * We would divorce the time it takes to crawl web pages from the time it takes to populate our Solr index with data from the crawlers. I have found that my Solr Connector takes minutes to populate the entire Solr index from the current Solr prod to the new Solr instances. Compare that to hours and even days to actually crawl the web pages.
> * We use URLs for our unique IDs in our Solr index. We can resolve the problem of retaining the shortest URL when duplicate content is detected in Solr simply by sorting the query used to populate Solr from the database by id length descending - this will ensure the last URL encountered for any duplicate is always the shortest.
> * We can easily ensure that certain classes of crawled content are always added last (or first if you prefer) whenever the data is indexed to Solr - rather than having to rely on the timing of crawlers.
> * We could quickly and easily rebuild our Solr index from scratch at any time. This would be very valuable when changes to our Solr configurations require re-indexing our data.
> * We can assign unique boost values to individual "documents" at index time by assigning a boost value for that document in the database and then applying that boost at index time.
> * We can continuously run a batch program against this database that removes broken links, with no impact to Solr, and then refresh Solr on a more frequent basis than we do now because the connector will take minutes rather than hours/days to refresh the content.
> * We can store additional information for the crawler to populate to Solr when available - such as:
>   * actual document last updated dates
>   * boost value for that document in the database
> * This database could be used for other purposes such as:
>   * Identifying a subset of representative data to use for evaluation of configuration changes.
>   * Easy access to "indexed" data for analysis work done by those not familiar with Solr.
>
> Thanks in advance for your feedback.
>
> Sincerely,
> Clay Pryor
> R&D SE Computer Science
> 9537 - Knowledge Systems
> Sandia National Laboratories
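To put something concrete behind the connector you describe: the database-to-Solr pass can be little more than one SELECT plus SolrJ, and that is also where the "shortest URL wins" ordering and the per-document boost column would live. A rough sketch, with a made-up table crawl_docs(id, title, content, boost, deleted) and placeholder connection details, column names, and Solr field names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DbToSolrConnector {
        public static void main(String[] args) throws Exception {
            // Skip rows the admin tool has flagged as deleted, and order by id length
            // descending so the shortest URL is indexed last and therefore wins when
            // Solr collapses duplicate content onto one document.
            String query = "SELECT id, title, content, boost FROM crawl_docs "
                         + "WHERE deleted = 0 ORDER BY LENGTH(id) DESC";

            try (Connection db = DriverManager.getConnection(
                         "jdbc:postgresql://dbhost/crawl", "user", "password");
                 Statement stmt = db.createStatement();
                 ResultSet rs = stmt.executeQuery(query);
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/mycollection").build()) {

                int count = 0;
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("title", rs.getString("title"));
                    doc.addField("content", rs.getString("content"));
                    // Store the per-document boost; it can be applied at query time
                    // (e.g. a boost function on this field) or at index time on Solr
                    // versions that still support document boosts.
                    doc.addField("boost_f", rs.getFloat("boost"));
                    solr.add(doc);
                    if (++count % 1000 == 0) solr.commit();
                }
                solr.commit();
            }
        }
    }

The admin "delete" you describe then just flips the deleted flag and issues a solr.deleteById(id) for the flagged rows on the next connector run, so Solr and the database stay in step without ever losing the record copy.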