Note that getAnchors() in Inlinks returns only a single text per
domain. (This is not the case in Nutch 2.) Tricky.

So, to set up your test correctly, give every test inlink a
different domain:

Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://test1.com/", "text1"));
inlinks.add(new Inlink("http://test2.com/", "text2"));
inlinks.add(new Inlink("http://test3.com/", "text2"));

This should yield the expected 2 texts with deduplication, and 3 texts
without.
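
To see why the domains matter, here is a rough stand-in sketch. It is
not Nutch code and uses no Nutch classes: AnchorDedupSketch,
anchorsPerHost and deduplicate are hypothetical names, and it assumes
the per-domain rule means the same anchor text coming from the same
host is kept only once. Under that assumption, placeholder strings
such as "$inlink1" are not parsable URLs, so they would all land in
one bucket and the repeated text would be lost before the filter's own
deduplication is ever exercised.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in only: this models the behaviour described above, it is NOT Nutch code.
final class AnchorDedupSketch {

    // Assumed per-host rule: keep a given anchor text at most once per source host.
    // Each inlink is a {fromUrl, anchorText} pair.
    static List<String> anchorsPerHost(String[][] inlinks) {
        Map<String, Set<String>> seenByHost = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String[] inlink : inlinks) {
            String host;
            try {
                host = new URL(inlink[0]).getHost();
            } catch (MalformedURLException e) {
                host = null; // unparsable URLs all share the same (null) bucket
            }
            Set<String> seen = seenByHost.computeIfAbsent(host, h -> new HashSet<>());
            if (seen.add(inlink[1])) {
                result.add(inlink[1]);
            }
        }
        return result;
    }

    // Stand-in for the filter's deduplication: drop repeated texts, keep order.
    static List<String> deduplicate(List<String> anchors) {
        return new ArrayList<>(new LinkedHashSet<>(anchors));
    }

    public static void main(String[] args) {
        String[][] distinctHosts = {
            {"http://test1.com/", "text1"},
            {"http://test2.com/", "text2"},
            {"http://test3.com/", "text2"},
        };
        List<String> raw = anchorsPerHost(distinctHosts);
        System.out.println("without deduplication: " + raw.size());
        System.out.println("with deduplication:    " + deduplicate(raw).size());

        String[][] placeholders = {
            {"$inlink1", "$anchor_text1"},
            {"$inlink2", "$anchor_text1"},
            {"$inlink3", "$anchor_text2"},
        };
        System.out.println("placeholder sources:   " + anchorsPerHost(placeholders).size());
    }
}
```

With distinct hosts the sketch keeps all 3 texts before deduplication
and 2 after, which is what the test above relies on. With the
placeholder sources it already collapses to 2 before any
deduplication, so the two settings could not be told apart.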


On Mon, Nov 12, 2012 at 4:23 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
>
> I'm attempting to test the AnchorIndexingFilter by adding numerous
> inlinks and their anchor text then check whether the deduplication is
> working sufficiently.
>
> Can someone show me how I simulate the following using the trunk API
>
> // This is 2.x API
> WebPage page = new WebPage();
> page.putToInlinks(new Utf8("$inlink1"), new Utf8("$anchor_text1"));
> page.putToInlinks(new Utf8("$inlink2"), new Utf8("$anchor_text1"));
> page.putToInlinks(new Utf8("$inlink3"), new Utf8("$anchor_text2"));
>
> If anchor deduplication is set to boolean true value then we could
> only allow two anchor entries for the page inlinks. I wish therefore
> to simulate this in trunk API using Inlinks, Inlink or
> NutchDocument.add function however I am stuck...
>
> Thank you very much in advance for any help.
>
> Best
>
> Lewis
>
> --
> Lewis
>
