Hi all,
I'm using a custom parse filter (and indexer) plugin to index and store
all iframe sources found in a page. I'm running Nutch 1.6 with Solr 3.6.2.
Apart from deploying the plugin, I added the following to schema.xml:
<field name="iframe" type="string" stored="true" indexed="true" multiValued="true"/>
and the following to solrindex-mapping.xml:
<field dest="iframe" source="iframe"/>
Since an HTML page may contain more than one iframe tag, the field in
schema.xml is multiValued.
The strange thing is that when I looked at the results, all iframe sources
had been indexed, but one of them is always duplicated. For example, if the
HTML has 3 (different) iframe tags, Solr has 4 stored values, 2 of which are
identical.
To clarify how my plugin works: the parse filter adds each parsed iframe
source with
Metadata.add("iframe", "parsed_source").
The indexer then adds to the NutchDocument all the values stored under the
"iframe" key.
Any ideas? Could it be something in my configuration?
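
For reference, the two-step flow described above (the parse filter adds one value per iframe, the indexer copies every value under the "iframe" key) can be sketched roughly as follows. This is not the actual plugin code: plain Java collections stand in for Nutch's Metadata and NutchDocument, and the class and method names (IframeFlowSketch, MultiValued, parseFilter, indexer, indexedValues) are hypothetical, made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the plugin flow; this is NOT the real Nutch API.
class IframeFlowSketch {

    // Stand-in for a multi-valued store (parse metadata or an index document).
    static final class MultiValued {
        private final Map<String, List<String>> values =
                new LinkedHashMap<String, List<String>>();

        // Appends a value under the key, keeping any earlier values.
        void add(String key, String value) {
            List<String> list = values.get(key);
            if (list == null) {
                list = new ArrayList<String>();
                values.put(key, list);
            }
            list.add(value);
        }

        // Returns a copy of all values stored under the key (empty if none).
        List<String> getValues(String key) {
            List<String> list = values.get(key);
            return list == null ? new ArrayList<String>() : new ArrayList<String>(list);
        }
    }

    // Parse filter step: one add("iframe", src) per iframe source found.
    static MultiValued parseFilter(List<String> iframeSources) {
        MultiValued metadata = new MultiValued();
        for (String src : iframeSources) {
            metadata.add("iframe", src);
        }
        return metadata;
    }

    // Indexer step: copy every "iframe" value into the document.
    static MultiValued indexer(MultiValued parseMetadata) {
        MultiValued doc = new MultiValued();
        for (String src : parseMetadata.getValues("iframe")) {
            doc.add("iframe", src);
        }
        return doc;
    }

    // Runs both steps and returns what would be sent for the "iframe" field.
    static List<String> indexedValues(List<String> iframeSources) {
        return indexer(parseFilter(iframeSources)).getValues("iframe");
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList(
                "http://example.com/a", "http://example.com/b", "http://example.com/c");
        System.out.println(indexedValues(sources));
    }
}
```

With three distinct sources the sketch yields three values, which is the result expected in Solr; the actual setup yields four.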
Thanks,
Amit.