I've been working on a less complex project along the same lines: taking all the data from our corporate database and pumping it into Kafka for long-term storage, with the ability to "play back" all the Kafka messages any time we need to re-index.
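In case it's useful, the "play back" piece really is just a consumer that starts from the earliest retained offset and feeds whatever was stored back into Solr. Here's a rough sketch of what ours boils down to, using the plain Kafka and SolrJ Java clients; the topic name, collection name, field names, and the assumption that we key messages by document id are all just placeholders for illustration, and the "caught up" check is deliberately crude:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class KafkaReplayToSolr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            // Fresh group id per replay so we always start from the earliest offset
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "solr-replay-" + System.currentTimeMillis());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/mycollection").build()) {
                consumer.subscribe(Collections.singletonList("crawled-documents"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                    if (records.isEmpty()) break;   // crude "caught up" check, fine for a one-off replay
                    for (ConsumerRecord<String, String> record : records) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", record.key());        // key = document id in our setup
                        doc.addField("content", record.value()); // value = whatever was captured upstream
                        solr.add(doc);
                    }
                    solr.commit();
                }
            }
        }
    }

Since Solr overwrites on the uniqueKey, replaying the whole topic just rewrites documents in place, which is what makes the blow-away-and-rebuild approach painless.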
That simpler scenario has worked like a charm. I don't need to massage the data much once it's at rest in Kafka, so it was a straightforward solution, although I could have gone with a DB and just stored the Solr documents with their IDs one per row in an RDBMS...

The rest sounds like a good approach for your situation: Solr isn't the best candidate for the kind of data manipulation you're proposing, and a database excels at that. It's more work, but you get a lot more flexibility, and you de-couple Solr from the data crawling, as you say. It all sounds pretty good to me, but I've only been on the list here a short time, so I'll leave it to others to add their comments.

On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov> wrote:

> Question:
> Do any of you have your crawlers write to a database rather than directly to Solr and then use a connector to index to Solr from the database? If so, have you encountered any issues with this approach? If not, why not?
>
> I have searched forums and the Solr/Lucene email archives (including browsing of http://www.apache.org/foundation/public-archives.html) but have not found any discussions of this idea. I am certain that I am not the first person to think of it. I suspect that I have just not figured out the proper queries to find what I am looking for. Please forgive me if this idea has been discussed before and I just couldn't find the discussions.
>
> Background:
> I am new to Solr and have been asked to make improvements to our Solr configurations and crawlers. I have read that the Solr index should not be considered a source of record data. It is in essence a highly optimized index to be used for generating search results rather than a retainer for record copies of data. The better approach is to rely on corporate data sources for record data and retain the ability to completely blow away a Solr index and repopulate it as needed for changing search requirements.
>
> This made me think that perhaps it would be a good idea for us to create a database of crawled data for our Solr index. The idea is that the crawlers would write their findings to a corporate-supported database of our own design for our own purposes, and then we would populate our Solr index from this database using a connector that writes from the database to the Solr index.
>
> The only disadvantage that I can think of for this approach is that we will need to write a simple interface to the database that allows our admin personnel to "Delete" a record from the Solr index. Of course, it won't be deleted from the database but simply flagged as not to be indexed to Solr. It will then send a delete command to Solr for any successfully "deleted" records from the database. I suspect this admin interface will grow over time, but we really only need to be able to delete records from the database for now. All of the rest of our admin work is query related, which can still be done through the Solr Console.
>
> I can think of the following advantages:
>
> * We have a corporate-sponsored and backed-up repository for our crawled data which would buffer us from any inadvertent losses of our Solr index.
> * We would divorce the time it takes to crawl web pages from the time it takes to populate our Solr index with data from the crawlers. I have found that my Solr Connector takes minutes to populate the entire Solr index from the current Solr prod to the new Solr instances. Compare that to hours and even days to actually crawl the web pages.
> * We use URLs for our unique IDs in our Solr index. We can resolve the problem of retaining the shortest URL when duplicate content is detected in Solr simply by sorting the query used to populate Solr from the database by id length descending - this will ensure the last URL encountered for any duplicate is always the shortest.
> * We can easily ensure that certain classes of crawled content are always added last (or first if you prefer) whenever the data is indexed to Solr - rather than having to rely on the timing of crawlers.
> * We could quickly and easily rebuild our Solr index from scratch at any time. This would be very valuable when changes to our Solr configurations require re-indexing our data.
> * We can assign unique boost values to individual "documents" at index time by assigning a boost value for that document in the database and then applying that boost at index time.
> * We can continuously run a batch program against this database that removes broken links, with no impact to Solr, and then refresh Solr on a more frequent basis than we do now because the connector will take minutes rather than hours/days to refresh the content.
> * We can store additional information for the crawler to populate to Solr when available - such as:
>   * actual document last updated dates
>   * boost value for that document in the database
> * This database could be used for other purposes such as:
>   * Identifying a subset of representative data to use for evaluation of configuration changes.
>   * Easy access to "indexed" data for analysis work done by those not familiar with Solr.
>
> Thanks in advance for your feedback.
>
> Sincerely,
> Clay Pryor
> R&D SE Computer Science
> 9537 - Knowledge Systems
> Sandia National Laboratories
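To put something concrete behind the connector you describe: the database-to-Solr pass can be little more than one SELECT plus SolrJ, and that is also where the "shortest URL wins" ordering and the per-document boost column would live. A rough sketch, with a made-up table crawl_docs(id, title, content, boost, deleted) and placeholder connection details, column names, and Solr field names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DbToSolrConnector {
        public static void main(String[] args) throws Exception {
            // Skip rows the admin tool has flagged as deleted, and order by id length
            // descending so the shortest URL is indexed last and therefore wins when
            // Solr collapses duplicate content onto one document.
            String query = "SELECT id, title, content, boost FROM crawl_docs "
                         + "WHERE deleted = 0 ORDER BY LENGTH(id) DESC";

            try (Connection db = DriverManager.getConnection(
                         "jdbc:postgresql://dbhost/crawl", "user", "password");
                 Statement stmt = db.createStatement();
                 ResultSet rs = stmt.executeQuery(query);
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/mycollection").build()) {

                int count = 0;
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("title", rs.getString("title"));
                    doc.addField("content", rs.getString("content"));
                    // Store the per-document boost; it can be applied at query time
                    // (e.g. a boost function on this field) or at index time on Solr
                    // versions that still support document boosts.
                    doc.addField("boost_f", rs.getFloat("boost"));
                    solr.add(doc);
                    if (++count % 1000 == 0) solr.commit();
                }
                solr.commit();
            }
        }
    }

The admin "delete" you describe then just flips the deleted flag and issues a solr.deleteById(id) for the flagged rows on the next connector run, so Solr and the database stay in step without ever losing the record copy.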