I would suggest you define a custom index plugin. The index plugin could
evaluate the Nutch document based on your parameters. You can add, modify,
or remove fields as needed. You should be able to set the Nutch document
to null if it does not meet your criteria. This would prevent the
document from being sent to Solr.

Nutch has the ability to handle documents set to null by the indexer.

I am not aware of a method to call to set a Nutch document to null. Is
anyone aware of this option?

If no one responds, here are a couple of options to pursue in testing:
1. Test what happens when you set NutchDocument doc = null;
2. In NutchIndexAction.java there is a reference to a constant declared as
byte DELETE = 1. You can try changing the action variable to byte action = DELETE;
3. Test what would happen in your filter if you called NutchDocument doc =
new NutchDocument(); This gives you an empty document with no fields rather
than an actual null, so check whether it is still sent to Solr.
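
For option 1, here is a rough sketch of the check itself. It is not real Nutch
code: SketchDocument and EmptyFieldFilterSketch are made-up stand-ins so the
example compiles on its own. My understanding is that in Nutch 1.x the
filter(...) method of an IndexingFilter returns a NutchDocument and that
returning null means the document is discarded, but please verify that
against your version before relying on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument so this sketch compiles without Nutch.
class SketchDocument {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Adds one value to a (possibly multi-valued) field.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns all values of a field, or an empty list if the field is absent.
    List<String> getValues(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }
}

class EmptyFieldFilterSketch {
    // Returns the document unchanged if the required field has at least one
    // non-blank value; otherwise returns null to signal "do not index".
    // In a real plugin this check would sit inside the filter(...) method.
    static SketchDocument filter(SketchDocument doc, String requiredField) {
        if (doc == null) {
            return null;
        }
        for (String value : doc.getValues(requiredField)) {
            if (value != null && !value.trim().isEmpty()) {
                return doc;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SketchDocument withTitle = new SketchDocument();
        withTitle.add("title", "Apache Nutch");

        SketchDocument blankTitle = new SketchDocument();
        blankTitle.add("title", "   ");

        System.out.println("kept with title:  " + (filter(withTitle, "title") != null));
        System.out.println("kept blank title: " + (filter(blankTitle, "title") != null));
    }
}
```

Passing the field name as a parameter keeps the same check usable for fields
other than title.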

The index-basic plugin is a good one to copy and make a custom indexer from.

Let us know the results of what you find.

jeff

On Wed, Apr 29, 2015 at 2:50 PM, Eyeris RodrIguez Rueda <[email protected]>
wrote:

> Hi all.
> I'm using Nutch 1.9 and Solr 4.10 in my environment.
> I want to skip, during the indexing process, all documents that have an
> empty title field (or another field), and of course keep them out of Solr.
>
> My first solution was to clean all documents with an empty title in Solr.
> This is not a good idea for me because I need to execute the clean query
> after all indexing.
>
> The second solution that I thought of was to mark the fields as required in
> schema.xml
>
> <field name="title" type="text" stored="true" indexed="true"
> multiValued="true" required="true"/>
>
> After doing that, I found that when Nutch tries to send a batch of 250
> documents, if there is one document with an empty title, Solr fails and
> Nutch throws a Job Failed exception, because Solr doesn't permit indexing a
> document without a title value; therefore Solr indexes nothing.
>
> Is there any way for Nutch to honor the required option in schema.xml and
> remove such documents from the collection before indexing to Solr?
>
> Can anybody please give me advice or comments about this, or tell me the
> best way to exclude documents with an empty field before indexing?
>
> Eyeris.
>
>
