Hi all,

I'm using a custom parse filter (and indexer) plugin to index and store
all iframe sources found in a page. I'm running Nutch 1.6 with Solr 3.6.2.

Apart from deploying the plugin, I added:

<field name="iframe" type="string" stored="true" indexed="true"
multiValued="true"/> in schema.xml

and

<field dest="iframe" source="iframe"/> in solrindex-mapping.xml.

Since a page's HTML may contain more than one iframe tag, the field in
schema.xml is multiValued.

The strange thing is that all iframe sources are indexed, but one of them
is always duplicated. For example, if the HTML has 3 (different) iframe
tags, Solr stores 4 values, 2 of which are identical.

To clarify how my plugin works: the parse filter adds each parsed iframe
source with
Metadata.add("iframe", "parsed_source"),
and the indexer adds all the values stored under the "iframe" key to the
NutchDocument.

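For reference, here is a minimal sketch of that flow. It is not the actual
plugin code: plain Java collections stand in for Nutch's Metadata and
NutchDocument, and the names IframeIndexSketch, add and distinctValues are
hypothetical. It also shows one possible workaround, assuming the duplicate
is an exact string match: dropping repeated values on the indexer side
before they are added to the document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Stand-in for the plugin flow. Plain collections replace Nutch's
// Metadata and NutchDocument; all names here are hypothetical.
class IframeIndexSketch {

    // Parse-filter side: append one value under a key, keeping the key
    // multi-valued (the same idea as calling Metadata.add repeatedly).
    static void add(Map<String, List<String>> metadata, String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Indexer side: read every value stored under the key and drop exact
    // duplicates while preserving the original order.
    static List<String> distinctValues(Map<String, List<String>> metadata, String key) {
        List<String> values = metadata.getOrDefault(key, new ArrayList<>());
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        add(metadata, "iframe", "http://example.com/a");
        add(metadata, "iframe", "http://example.com/b");
        add(metadata, "iframe", "http://example.com/c");
        // Simulate the symptom: one source ends up stored twice.
        add(metadata, "iframe", "http://example.com/a");

        // Each distinct value would then be added to the document once.
        for (String src : distinctValues(metadata, "iframe")) {
            System.out.println(src);
        }
    }
}
```

This would only hide the symptom; I would still like to understand where
the extra value comes from.
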
Any ideas? Maybe something in my configuration?

Thanks,

Amit.
