Hi.

 

I just tested 1.1 and the problem now seems to be solved. The subcollection 
values are now added without a prefixed space, and I still use the same Solr 
configuration and subcollections.xml configuration. A very nice, unexpected 
change that's not in the changelog ;)
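
For anyone who wants to verify the same thing on their own index, here is a minimal Python sketch. It assumes the documents have already been fetched from Solr and parsed into a list of dicts; the function name `find_padded_values` and the sample ids are hypothetical, and only the ` binnenland` value comes from this thread.

```python
def find_padded_values(docs, field="subcollection"):
    """Return (doc index, value) pairs whose field value has
    leading or trailing whitespace."""
    padded = []
    for i, doc in enumerate(docs):
        value = doc.get(field)
        if value is None:
            continue  # document has no such field
        # The field may come back single-valued or multi-valued.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, str) and v != v.strip():
                padded.append((i, v))
    return padded


# Made-up sample data shaped like parsed Solr documents (assumption).
sample_docs = [
    {"id": "a", "subcollection": " binnenland"},  # the symptom from this thread
    {"id": "b", "subcollection": "buitenland"},   # a clean value
    {"id": "c"},                                  # no subcollection at all
]

print(find_padded_values(sample_docs))
```

An empty list from such a check would be consistent with the values being written without the extra space.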

 

Cheers,
 
-----Original message-----
From: Markus Jelsma <[email protected]>
Sent: Sun 20-06-2010 00:36
To: [email protected]; 
Subject: RE: Re: prefixed space in subcollection field

Hello Chris!

 

I enable the plugin in my nutch-site.xml configuration:

 

    
<value>subcollection|protocol-http|urlfilter-regex|parse-html|index-(basic|more|anchor)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

 

As you can see, I have no query plugins defined because I don't use them. My 
Solr schema is based on the one shipped with Nutch; I just added a type and 
field for spell checking. Anyway, here's my subcollection field definition. The 
type is just the primitive string, so there is no transformation whatsoever:

 

    <field name="subcollection" type="string" stored="true" indexed="true"/>
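
As a quick sanity check of the plugin.includes value quoted above, the following Python sketch tests which plugin ids the expression would enable. It assumes that each plugin id is matched against the whole expression; that assumption, the helper name `enabled_plugins`, and the list of candidate ids are illustrative only and not taken from the Nutch source.

```python
import re

# The plugin.includes value quoted earlier in this message.
PLUGIN_INCLUDES = (
    "subcollection|protocol-http|urlfilter-regex|parse-html|"
    "index-(basic|more|anchor)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def enabled_plugins(pattern, plugin_ids):
    """Return the ids that match the whole pattern (assumed matching rule)."""
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


# Candidate ids chosen for illustration (assumption).
candidates = [
    "subcollection",
    "index-basic",
    "index-more",
    "query-basic",
    "query-site",
    "urlnormalizer-regex",
]

print(enabled_plugins(PLUGIN_INCLUDES, candidates))
```

Under that assumption the subcollection and index plugins are selected, while anything starting with `query-` is not.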

 

I hope to try Nutch 1.1 tomorrow. It may be a long shot but it's worth a try. 
Thanks so far :)

 

Cheers, 
-----Original message-----
From: Chris Mattmann <[email protected]>
Sent: Sat 19-06-2010 23:58
To: [email protected]; 
Subject: Re: prefixed space in subcollection field

Hi Markus,

Thanks much. How are you activating the subcollections plugin in
nutch-default.xml? Looking at its plugin.xml here:

http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/plugin.xml

It seems that it declares 2 plugins which are activated, an indexing plugin
as well as a query filter plugin.

Can I see the following 2 things?

* your Solr schema.xml (I'm wondering if you declared a corresponding
subcollection field there and if so, if the text is being transformed
somehow) 
* your nutch-default.xml so I can see how you turned on the subcollection
plugin

Thanks!

Cheers,
Chris






On 6/19/10 10:28 AM, "Markus Jelsma" <[email protected]> wrote:

>  
> 
> Chris, thanks for your reply!
> 
>  
> 
> The only additional information I can give is the Nutch subcollection
> configuration, the result I get from Solr's index, and that I'm using a nightly
> build that's not more than two weeks old. I'm testing Nutch/Solr by creating
> an index of some newspaper, so I define categories such as economy, sport, film
> etc. Here's one of my subcollection definitions:
> 
>  
> 
>  <subcollection>
>   <name>buitenland</name>
>   <id>buitenland</id>
>    <whitelist>
>     http://www.DOMAIN.nl/buitenland/
>    </whitelist>
>   <blacklist />
>  </subcollection>
> 
>  
> 
> There are about 10 definitions like this one for now. All specify some URL
> and the name and id field without the prefixed space, as you can see. Here is
> the subcollection field in some document in a result set:
> 
>  
> 
> <str name="subcollection"> binnenland</str>
> 
>  
> 
> This problem is consistent throughout all result sets and with all values for
> the subcollection field. All other fields in my Solr index are fine; it's just
> this field that's troublesome. There is no useful information in hadoop.log,
> nor in Solr's log as far as I can see. The plugin.includes property in my
> Nutch config just includes the subcollection plugin in the regex.
> 
>  
> 
> Cheers,
> 
>  
> -----Original message-----
> From: Chris Mattmann <[email protected]>
> Sent: Sat 19-06-2010 19:08
> To: [email protected];
> Subject: Re: prefixed space in subcollection field
> 
> Hi Markus,
> 
> I read the documentation for the subcollection plugin here:
> 
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/README.txt
> 
> It didn't mention anything about prefixing your field names with a space.
> So, I went and checked:
> 
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
> 
> It seems like the only thing it does beyond your normal NutchDocument that's
> indexed is add the sub collection name to the indexed set of fields, so I'm
> wondering what you're seeing here. Do you have any further information?
> 
> Cheers,
> Chris
> 
> 
> On 6/19/10 9:55 AM, "Markus Jelsma" <[email protected]> wrote:
> 
>> > I'm sorry, but I need to bump this one. Any suggestions?
>> >  
>> > -----Original message-----
>> > From: Markus Jelsma <[email protected]>
>> > Sent: Tue 15-06-2010 10:51
>> > To: [email protected];
>> > Subject: prefixed space in subcollection field
>> >
>> > Hi list,
>> >
>> >  
>> >
>> > Fields created by the subcollection plugin end up with a prefixed space in my
>> > Solr index, but the name and id fields in my subcollections.xml don't have that
>> > same space prefixed. I checked it three times just to be certain I didn't mess
>> > up the configuration. I am unsure where the space comes from and where to fix
>> > it. Any ideas on this one?
>> >
>> >  
>> >
>> > Cheers,
>> >
> 
> 
> 



 
