Hi folks,
Sorry for asking another question on the same topic. I am getting a bit
lost among the various fragments of the source code.
I am extending the IndexingFilter to check the content for certain words
and, if they are present, return null. When I test this IndexingFilter (the
test class is modeled on TestMoreIndexingFilter), I see that I have to get
the content from the Parse object via parse.getText(). The content I get
there is "foo bar", which is what the test class passes in.
Now, if I want the content of the URL passed in as a Text
(new Text("http://nutch.apache.org/index.html")), which of the arguments
should I be using?
The filter method in the implementation is as follows,
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  String content = parse.getText();
  System.out.println("Content : " + content);
  System.out.println("Contains : " + content.contains("nutch"));
  if (content.contains("nutch")) {
    System.out.println("Nutch keyword found! Hence not indexing the doc :)");
    return null;
  }
  return doc;
}
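Stripped of the Nutch types, the check I am after boils down to the sketch
below. The names KeywordSkipSketch and shouldSkip are made up for
illustration and are not part of Nutch; the null handling is my own
assumption about what a sensible default would be.

```java
// Illustrative sketch only: KeywordSkipSketch and shouldSkip are hypothetical
// names, not Nutch classes. It isolates the keyword test from the filter.
public class KeywordSkipSketch {

    // Returns true when the parsed text contains the keyword, i.e. when the
    // indexing filter would return null so that the document is not indexed.
    public static boolean shouldSkip(String parsedText, String keyword) {
        if (parsedText == null || keyword == null) {
            return false; // nothing to match against, so keep the document
        }
        return parsedText.contains(keyword); // case-sensitive, as in my filter
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip("foo bar", "nutch"));
        System.out.println(shouldSkip("about nutch plugins", "nutch"));
    }
}
```

With the "foo bar" text that my test passes in, this check is false, so the
document is kept.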
The test method is as follows,
public void testRemoveIndex() {
  Configuration conf = NutchConfiguration.create();
  RemoveIndexingPlugin filter = new RemoveIndexingPlugin();
  filter.setConf(conf);
  assertNotNull(filter);
  NutchDocument doc = new NutchDocument();
  ParseImpl parse = new ParseImpl("foo bar", new ParseData());
  try {
    filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"),
        new CrawlDatum(), new Inlinks());
  } catch (Exception e) {
    e.printStackTrace();
    fail(e.getMessage());
  }
}
Am I doing something wrong, or do I have to write my own utility method to
get the content from the Text (the URL)? This looks quite different from the
HTMLParseFilter implementation. Should I be chaining the indexing filters or
something similar?
I think I lack the in-depth knowledge of Nutch needed for this. Your
guidance would be much appreciated. Thanks!
./Abi
On Tue, Feb 8, 2011 at 9:23 AM, .: Abhishek :. <[email protected]> wrote:
> Thanks Arkadi. Thanks all for your patience and guidance.
>
>
> On Tue, Feb 8, 2011 at 8:48 AM, <[email protected]> wrote:
>
>> You can exclude documents by returning NULL from an index filter.
>>
>> Regards,
>>
>> Arkadi
>>
>> >-----Original Message-----
>> >From: .: Abhishek :. [mailto:[email protected]]
>> >Sent: Tuesday, February 08, 2011 11:44 AM
>> >To: [email protected]; [email protected]
>> >Subject: Re: Indexing question - Setting low boost
>> >
>> >Hi folks,
>> >
>> > Some help would be appreciated. Thanks a bunch..
>> >
>> >Cheers,
>> >Abi
>> >
>> >
>> >On Mon, Feb 7, 2011 at 10:46 AM, .: Abhishek :. <[email protected]>
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks again for your time and patience.
>> >>
>> >> The boost makes sense now. I am kind of not sure how to exclude the
>> >entire
>> >> document because there are only two methods,
>> >>
>> >> - public NutchDocument filter(NutchDocument doc, Parse parse, Text
>> >url,
>> >> CrawlDatum datum, Inlinks inlinks)
>> >> throws IndexingException
>> >> - public void addIndexBackendOptions(Configuration conf)
>> >>
>> >>
>> >> May be should I add nothing in the document and/or return a null??
>> >>
>> >> ./Abi
>> >>
>> >>
>> >> On Mon, Feb 7, 2011 at 10:07 AM, Markus Jelsma
>> ><[email protected]
>> >> > wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> A high boost depends on the index and query time boosts on other
>> >fields.
>> >>> If the
>> >>> highest boost on a field is N, then N*100 will certainly do the
>> >trick.
>> >>>
>> >>> I haven't studied the LuceneWriter but storing and indexing
>> >parameters are
>> >>> very familiar. Storing a field means it can be retrieved along with
>> >the
>> >>> document if it's queried. Having it indexed just means it can be
>> >queried.
>> >>> But
>> >>> this is about fields, not on the entire document itself.
>> >>>
>> >>> In an indexing filter you want to exclude the entire document.
>> >>>
>> >>> Cheers,
>> >>>
>> >>> > Hi Markus,
>> >>> >
>> >>> > Thanks for the quick reply.
>> >>> >
>> >>> > Could you tell me a possible a value for the high boost such that
>> >its
>> >>> to
>> >>> > be negated? or Is there a way I can calculate or find that out.
>> >>> >
>> >>> > Also, for the other approach on using indexing filter does the
>> >("...",
>> >>> > LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf); does the
>> >work?
>> >>> >
>> >>> > Thanks,
>> >>> > Abi
>> >>> >
>> >>> > On Mon, Feb 7, 2011 at 9:34 AM, Markus Jelsma
>> >>> <[email protected]>wrote:
>> >>> > > Hi,
>> >>> > >
>> >>> > > A negative boost does not exist and a very low boost is still a
>> >boost.
>> >>> In
>> >>> > > queries, you can work around the problem by giving a very high
>> >boost
>> >>> do
>> >>> > > documents that do not match; the negation parameter with a high
>> >boost
>> >>> > > will do
>> >>> > > the trick.
>> >>> > >
>> >>> > > If you don't want to index certain documents then you'll need an
>> >>> indexing
>> >>> > > filter. That's a different approach.
>> >>> > >
>> >>> > > Cheers,
>> >>> > >
>> >>> > > > Hi all,
>> >>> > > >
>> >>> > > > I was looking at the following example,
>> >>> > > >
>> >>> > > > http://wiki.apache.org/nutch/WritingPluginExample
>> >>> > > >
>> >>> > > > In the example, the author sets a boost of 5.0f for the
>> >recommended
>> >>> > > > tag.
>> >>> > > >
>> >>> > > > In this same way, can I also set a boost value such that a tag
>> >or
>> >>> > >
>> >>> > > content
>> >>> > >
>> >>> > > > is never indexed at all? If so, what would be the boost value?
>> >On a
>> >>> > >
>> >>> > > related
>> >>> > >
>> >>> > > > note, what are the default content that are usually(by default)
>> >>> indexed
>> >>> > >
>> >>> > > by
>> >>> > >
>> >>> > > > Lucene?
>> >>> > > >
>> >>> > > > Thanks a bunch for all your time and patience. Have a good
>> >day.
>> >>> > > >
>> >>> > > > Cheers,
>> >>> > > > Abi
>> >>>
>> >>
>> >>
>>
>
>