Chip:

Another thing that might work for you is the streaming/export
capability. It can efficiently return fields (docValues only) for very
large result sets. You'd have to have some automated way to feed it
what to look for.

But that's a fallback; I'd look at Nutch first, as I bet someone has had
a similar problem before in Nutch-land ;)
On Wed, Sep 19, 2018 at 11:18 AM Chip Calhoun <ccalh...@aip.org> wrote:
>
> I do use Nutch as my crawler, but just as my crawler, so I hadn't thought to 
> look for an answer there. I will do so. Thank you.
>
>
> Chip
>
> ________________________________
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> Sent: Wednesday, September 19, 2018 2:05:41 PM
> To: solr-user
> Subject: Re: Seeking a simple way to test my index.
>
> Have you looked at Apache Nutch? Seems like the direct match for your
> - growing - requirements and it does integrate with Solr. Or one of
> the other solutions, like http://stormcrawler.net/
> http://www.norconex.com/collectors/
>
> Otherwise, this does not really feel like a Solr question.
>
> Regards,
>    Alex.
>
> On 19 September 2018 at 14:01, Chip Calhoun <ccalh...@aip.org> wrote:
> > I've got a Solr instance which crawls roughly 3,500 seed pages, depth of 1, 
> > at 240 institutions, only one of which I control. I recrawl once a 
> > month or so. Naturally, if one of the sites I crawl changes, I need to 
> > know to update my seed URLs. I've been checking this by hand, which was 
> > tenable when my site was smaller, but is now completely unreasonable.
> >
> >
> > Is there a way to test my index without actually having to run a lot of 
> > manual searches? Perhaps an output I could skim? Any suggestions would be 
> > helpful.
> >
> >
> > Thanks,
> >
> > Chip
