Hi.
I just tested 1.1 and the problem now seems to be solved. The subcollection values are now added without a prefixed space, and I still use the same Solr configuration and subcollections.xml configuration. A very nice, unexpected change that's not in the changelog ;)

Cheers,

-----Original message-----
From: Markus Jelsma <[email protected]>
Sent: Sun 20-06-2010 00:36
To: [email protected];
Subject: RE: Re: prefixed space in subcollection field

Hello Chris!

I enable the plugin in my nutch-site.xml configuration:

<value>subcollection|protocol-http|urlfilter-regex|parse-html|index-(basic|more|anchor)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

As you can see, I have no query plugins defined because I don't use them. My Solr schema is based on the one shipped with Nutch; I just added a type and field for spell checking. Anyway, here's my subcollection field definition. The type is just the primitive string, so there is no transformation whatsoever:

<field name="subcollection" type="string" stored="true" indexed="true"/>

I hope to try Nutch 1.1 tomorrow. It may be a long shot but it's worth a try. Thanks so far :)

Cheers,

-----Original message-----
From: Chris Mattmann <[email protected]>
Sent: Sat 19-06-2010 23:58
To: [email protected];
Subject: Re: prefixed space in subcollection field

Hi Markus,

Thanks much. How are you activating the subcollections plugin in nutch-default.xml? Looking at its plugin.xml here:

http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/plugin.xml

It seems that it declares 2 plugins which are activated, an indexing plugin as well as a query filter plugin. Can I see the following 2 things?

* your solr schema.xml (I'm wondering if you declared a corresponding subcollection field there and if so, if the text is being transformed somehow)
* your nutch-default.xml so I can see how you turned on the subcollection plugin

Thanks!
Cheers,
Chris

On 6/19/10 10:28 AM, "Markus Jelsma" <[email protected]> wrote:

> Chris, thanks for your reply!
>
> The only additional information I can give is the Nutch subcollection
> configuration, the result I get from Solr's index, and that I'm using a
> nightly build that's not more than two weeks old. I'm testing Nutch/Solr by
> creating an index of some newspaper, so I define categories such as economy,
> sport, film etc. Here's one of my subcollection definitions:
>
> <subcollection>
> <name>buitenland</name>
> <id>buitenland</id>
> <whitelist>
> http://www.DOMAIN.nl/buitenland/
> </whitelist>
> <blacklist />
> </subcollection>
>
> There are about 10 definitions like this one for now. All specify some URL
> and the name and id field without the prefixed space, as you can see. Here is
> the subcollection field in some document in a resultset:
>
> <str name="subcollection"> binnenland</str>
>
> This problem is consistent throughout all resultsets and with all values for
> the subcollection field. All other fields in my Solr index are fine, it's just
> this field that's troublesome. There is no useful information in hadoop.log,
> nor in Solr's log as far as I can see. The plugin.includes property in my
> Nutch config just includes the subcollection plugin in the regex.
>
> Cheers,
>
> -----Original message-----
> From: Chris Mattmann <[email protected]>
> Sent: Sat 19-06-2010 19:08
> To: [email protected];
> Subject: Re: prefixed space in subcollection field
>
> Hi Markus,
>
> I read the documentation for the subcollection plugin here:
>
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/README.txt
>
> It didn't mention anything about prefixing your field names with a space.
> So, I went and checked:
>
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
>
> It seems like the only thing it does beyond your normal NutchDocument that's
> indexed is add the sub collection name to the indexed set of fields, so I'm
> wondering what you're seeing here. Do you have any further information?
>
> Cheers,
> Chris
>
> On 6/19/10 9:55 AM, "Markus Jelsma" <[email protected]> wrote:
>
>> I'm sorry, but I need to bump this one. Any suggestions?
>>
>> -----Original message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: Tue 15-06-2010 10:51
>> To: [email protected];
>> Subject: prefixed space in subcollection field
>>
>> Hi list,
>>
>> Fields created by the subcollection plugin end up with a prefixed space in
>> my Solr index, but the name and id fields in my subcollection.xml don't have
>> that same space prefixed. I checked it three times just to be certain I
>> didn't mess up the configuration. I am unsure where the space comes from and
>> where to fix it. Any ideas on this one?
>>
>> Cheers,
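
For anyone who wants to check whether their own index shows the same symptom, here is a minimal sketch in Python. It is not part of Nutch or Solr: the function name `find_padded_values` and the sample documents are hypothetical, and it assumes the Solr results have already been fetched and parsed into plain dictionaries. It only flags values of the subcollection field that carry leading or trailing whitespace.

```python
def find_padded_values(docs, field="subcollection"):
    """Return (index, raw value) pairs whose field value has surrounding whitespace.

    `docs` is assumed to be a list of dicts, one per Solr document.
    """
    padded = []
    for i, doc in enumerate(docs):
        value = doc.get(field)
        # Only string values can carry a stray space; skip missing or other types.
        if isinstance(value, str) and value != value.strip():
            padded.append((i, value))
    return padded


# Hypothetical sample data modelled on the values quoted in this thread.
docs = [
    {"subcollection": " binnenland"},  # prefixed space, as reported on the nightly build
    {"subcollection": "buitenland"},   # clean value, as expected from subcollections.xml
]

print(find_padded_values(docs))
```

An empty result would suggest the values are stored without the stray space, which is what was reported above after moving to 1.1.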

