Woot!

Thanks, Markus!

Cheers,
Chris



On 6/22/10 3:04 PM, "Markus Jelsma" <[email protected]> wrote:

Hi.



I just tested 1.1 and the problem now seems to be solved. The subcollection
values are now added without a prefixed space, and I still use the same Solr
configuration and subcollections.xml configuration. A very nice, unexpected
change that's not in the changelog ;)
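
For anyone who wants to check their own index for the same symptom, here is a minimal sketch of one way to spot padded values in a Solr XML response. It is only an illustration: the function name fields_with_padding and the trimmed-down sample response are assumptions made for this example, not anything that ships with Nutch or Solr. Only the two <str name="subcollection"> values come from this thread.

```python
import xml.etree.ElementTree as ET


def fields_with_padding(solr_xml, field="subcollection"):
    """Return the values of `field` that start or end with whitespace.

    Hypothetical helper: it only scans <str name="..."> elements in an
    XML response and does not talk to Solr itself.
    """
    root = ET.fromstring(solr_xml)
    padded = []
    for node in root.iter("str"):
        if node.get("name") == field:
            value = node.text or ""
            # A value that changes when stripped has leading/trailing whitespace.
            if value != value.strip():
                padded.append(value)
    return padded


# Simplified, assumed response shape; a real response has more elements.
sample = """<response><result>
<doc><str name="subcollection"> binnenland</str></doc>
<doc><str name="subcollection">buitenland</str></doc>
</result></response>"""

print(fields_with_padding(sample))  # prints [' binnenland']
```

An empty list would mean no subcollection value in the response carries a leading or trailing space.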



Cheers,

-----Original message-----
From: Markus Jelsma <[email protected]>
Sent: Sun 20-06-2010 00:36
To: [email protected];
Subject: RE: Re: prefixed space in subcollection field

Hello Chris!



I enable the plugin in my nutch-site.xml configuration:



    
<value>subcollection|protocol-http|urlfilter-regex|parse-html|index-(basic|more|anchor)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>



As you can see, I have no query plugins defined because I don't use them. My
Solr schema is based on the one shipped with Nutch; I just added a type and
field for spell checking. Anyway, here's my subcollection field definition. The
type is just the primitive string, so there is no transformation whatsoever:



    <field name="subcollection" type="string" stored="true" indexed="true"/>



I hope to try Nutch 1.1 tomorrow. It may be a long shot but it's worth a try. 
Thanks so far :)



Cheers,
-----Original message-----
From: Chris Mattmann <[email protected]>
Sent: Sat 19-06-2010 23:58
To: [email protected];
Subject: Re: prefixed space in subcollection field

Hi Markus,

Thanks much. How are you activating the subcollections plugin in
nutch-default.xml? Looking at its plugin.xml here:

http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/plugin.xml

It seems that it declares 2 plugins which are activated, an indexing plugin
as well as a query filter plugin.

Can I see the following 2 things?

* your solr schema.xml (I'm wondering if you declared a corresponding
subcollection field there and if so, if the text is being transformed
somehow)
* your nutch-default.xml so I can see how you turned on the subcollection
plugin

Thanks!

Cheers,
Chris






On 6/19/10 10:28 AM, "Markus Jelsma" <[email protected]> wrote:

>
>
> Chris, thanks for your reply!
>
>
>
> The only additional information I can give is the Nutch subcollection
> configuration, the result I get from Solr's index, and that I'm using a nightly
> build that's not more than two weeks old. I'm testing Nutch/Solr by creating
> an index of a newspaper, so I define categories such as economy, sport, film
> etc. Here's one of my subcollection definitions:
>
>
>
>  <subcollection>
>   <name>buitenland</name>
>   <id>buitenland</id>
>    <whitelist>
>     http://www.DOMAIN.nl/buitenland/
>    </whitelist>
>   <blacklist />
>  </subcollection>
>
>
>
> There are about 10 definitions like this one for now. All specify some URL
> and the name and id field without the prefixed space, as you can see. Here is
> the subcollection field in some document in a result set:
>
>
>
> <str name="subcollection"> binnenland</str>
>
>
>
> This problem is consistent throughout all resultsets and with all values for
> the subcollection field. All other fields in my Solr index are fine, it's just
> this field that's troublesome. There is no useful information in hadoop.log,
> nor in Solr's log as far as i can see. The plugin.includes property in my
> Nutch config just includes the subcollection plugin in the regex.
>
>
>
> Cheers,
>
>
> -----Original message-----
> From: Chris Mattmann <[email protected]>
> Sent: Sat 19-06-2010 19:08
> To: [email protected];
> Subject: Re: prefixed space in subcollection field
>
> Hi Markus,
>
> I read the documentation for the subcollection plugin here:
>
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/README.txt
>
> It didn't mention anything about prefixing your field names with a space.
> So, I went and checked:
>
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
>
> It seems like the only thing it does beyond your normal NutchDocument that's
> indexed is add the sub collection name to the indexed set of fields, so I'm
> wondering what you're seeing here. Do you have any further information?
>
> Cheers,
> Chris
>
>
> On 6/19/10 9:55 AM, "Markus Jelsma" <[email protected]> wrote:
>
>> > I'm sorry, but i need to bump this one. Any suggestions?
>> >
>> > -----Original message-----
>> > From: Markus Jelsma <[email protected]>
>> > Sent: Tue 15-06-2010 10:51
>> > To: [email protected];
>> > Subject: prefixed space in subcollection field
>> >
>> > Hi list,
>> >
>> >
>> >
>> > Fields created by the subcollection plugin end up with a prefixed space in
>> > my Solr index but the name and id fields in my subcollection.xml don't have
>> > that same space prefixed, i checked it three times just to be certain i
>> > didn't mess up the configuration. I am unsure where the space comes from and
>> > where to fix it. Any ideas on this one?
>> >
>> >
>> >
>> > Cheers,
>> >
>
>
>







++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
