Hi again,
I solved the problem. A subcollection must be part of the start URL.
The crawler only goes deeper into the URL tree and does not move to a URL on the same
level.

Start URL:     http://xyz.org/hans/
Subcollection: xyz.org/sepp/wiki
won't work, even if hans links to sepp.

Start URL:     http://xyz.org/
Subcollection: xyz.org/sepp/wiki
works.
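
To make the two cases concrete, here is a minimal Python sketch of the behaviour I observed. It is not Nutch code. It assumes that the crawl stays on the same host at or below the path of the start URL, and that a subcollection matches any URL containing the whitelist string. The function names and the page URL http://xyz.org/sepp/wiki/Main are made up for illustration.

```python
from urllib.parse import urlparse


def in_crawl_scope(start_url, url):
    """Assumed scope rule: same host, path at or below the start URL's path."""
    start, target = urlparse(start_url), urlparse(url)
    return target.netloc == start.netloc and target.path.startswith(start.path)


def in_subcollection(url, whitelist):
    """Assumed matching rule: the URL contains at least one whitelist entry."""
    return any(entry in url for entry in whitelist)


def subcollection_reachable(start_url, url, whitelist):
    """A page can only get the subcollection field if it is crawled at all."""
    return in_crawl_scope(start_url, url) and in_subcollection(url, whitelist)


whitelist = ["xyz.org/sepp/wiki"]
page = "http://xyz.org/sepp/wiki/Main"  # hypothetical wiki page

# Start URL on a sibling branch: the wiki page is never crawled.
print(subcollection_reachable("http://xyz.org/hans/", page, whitelist))

# Start URL at the root: the wiki page is crawled and matches the whitelist.
print(subcollection_reachable("http://xyz.org/", page, whitelist))
```

With the start URL on the hans branch the page falls outside the assumed crawl scope, so the whitelist never gets a chance to match; with the root as start URL both conditions hold.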

> Hi,
> my Nutch crawl job and the indexing with Solr work fine, except for the
> subcollection. I configured subcollections.xml:
> <subcollections>
>     <subcollection>
>         <name>wiki</name>
>         <id>wiki</id>
>         <whitelist>/plugins/mediawiki/wiki/</whitelist>
>         <blacklist />
>     </subcollection>
> </subcollections>
>
> and added the plugin in nutch-site.xml:
> <configuration>
>     <property>
>         <name>http.agent.name</name>
>         <value>mediawiki</value>
>     </property>
>
>     <property>
>         <name>plugin.includes</name>
>         <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|subcollection|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>     </property>
> </configuration>
>
> When I look at the index with Luke, there is no subcollection field.
>
> Does anybody have experience with this problem or an idea that may help?
> Thanks and greetings
>
> psimone
>

