We enhanced the index-extra plugin to take the meta tags defined in
index-extra.xml. If anyone is interested, I can upload the plugin.
Claus Daldorph Nielsen wrote:
It would be great if we could configure the indexer from an XML file, so
we won't need to edit the Java files to store other meta tags. But that is
just my wish for the next version.
It's working great now, storing all the meta tags I want, and our customer
is pleased.
Again, thanks for the help.
Claus Daldorph Nielsen
Julien Nioche <[email protected]>
25-05-2010 13:22
Please respond to
[email protected]
To
[email protected]
cc
Subject
Re: Parse and index meta tags in Nutch 1.0
I'm not sure what the problem was, but I made some changes to
MetaTagsIndexer.java, adding new doc.add calls. I have also updated
schema.xml and nutch-site.xml, so perhaps there was some mismatch between
field names.
By default the indexer handles only metatag.keywords and
metatag.description. In your case you probably had to add 'metatag.title' as
well.
Making it more generic would be doable but annoying, as it requires some
knowledge about the nature of the fields, e.g. splitting into multiple fields
as we do for keywords. It is definitely easier to do as you did and modify
the indexer if need be. At least the parsing side of things is already
sorted.
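The point about needing to know the nature of each field can be sketched in plain Java. This is only an illustration, not the code in MetaTagsIndexer.java: the class name, the `toIndexFields` helper and the `multiValued` set are hypothetical, a plain `Map` stands in for Nutch's parse metadata and index document, and splitting on commas is an assumption based on the keywords example in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MetaTagFieldSketch {

    // Hypothetical helper: copies every "metatag.*" entry into a map of
    // index fields. Fields named in multiValued are split on commas
    // (assumed delimiter); all other fields are stored as a single value.
    static Map<String, List<String>> toIndexFields(Map<String, String> parseMeta,
                                                   Set<String> multiValued) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : parseMeta.entrySet()) {
            String name = entry.getKey();
            String value = entry.getValue();
            if (!name.startsWith("metatag.") || value == null) {
                continue; // not a meta tag entry, skip it
            }
            List<String> values = new ArrayList<>();
            if (multiValued.contains(name)) {
                for (String part : value.split(",")) {
                    String trimmed = part.trim();
                    if (!trimmed.isEmpty()) {
                        values.add(trimmed);
                    }
                }
            } else {
                values.add(value.trim());
            }
            fields.put(name, values);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("metatag.title", "Some title");
        parseMeta.put("metatag.keywords", "Forum, help, build, stuff");
        // Only keywords is treated as multi-valued in this sketch.
        Set<String> multiValued = Collections.singleton("metatag.keywords");
        System.out.println(toIndexFields(parseMeta, multiValued));
    }
}
```

The `multiValued` set is exactly the per-field knowledge that a purely generic indexer would have to obtain from somewhere, for example from configuration.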
I had to do this to enable custom fields to be indexed. It would be great
if this could be done in configuration only.
Do you mean adding the corresponding field names in schema.xml? We could do
that as part of the commit for the default fields added by the indexer
(keywords + description).
J.
Claus Daldorph Nielsen
Theilgaard Mortensen a/s
Julien Nioche <[email protected]>
25-05-2010 11:45
Hi Claus,
Glad you got it to work. Do you know what the problem was?
BTW, you can vote for issues you like in Jira. If enough people find this
plugin useful, I'll commit it to the trunk.
J.
On 25 May 2010 08:57, Claus Daldorph Nielsen <[email protected]> wrote:
Julien,
Thank you so much, I really appreciate your help. I have now managed to get
Nutch to index meta tags in my Solr index (I am using Luke to verify that
the correct content is in my index). The only thing left now is to find out
how to search and get content from the new fields in Solr.
Claus Daldorph Nielsen
Theilgaard Mortensen a/s
Julien Nioche <[email protected]>
21-05-2010 17:18
You can:
- run bin/nutch org.apache.nutch.parse.ParserChecker and check that you
  are getting metatag.* in the parse metadata
- check in the log that the parse-metatags plugin is really loaded
- run 'ant test-plugins' and see the output in build/parse-metatags
- check that you've added the field definitions in the Solr schema
- index with Lucene and use Luke to check that the fields are created
On 21 May 2010 15:54, Claus Daldorph Nielsen <[email protected]> wrote:
I never got this to work, so if anybody has ideas for debugging, please
post them.
The problem is that the meta tags are never found or added to the Solr
index. I have no idea why.
Claus Daldorph Nielsen
Theilgaard Mortensen a/s
Niels Hemmingsens gade 9
1153 København K
Tlf: 33448555
Julien Nioche <[email protected]>
21-05-2010 13:33
Have you checked the discussion in
http://lucene.472066.n3.nabble.com/description-and-keywords-td690681.html?
What have you modified in nutch-site.xml?
j.
On 21 May 2010 12:15, Claus Daldorph Nielsen <[email protected]> wrote:
Julien,
Thanks, it looks much like what I need. I have applied the patch, added
the lines to nutch-site.xml and then rebuilt the Nutch project. But I still
don't see any metatags in my index. Do you have any suggestions as to what
I might be doing wrong? Perhaps some configuration that I missed?
Claus Daldorph Nielsen
Theilgaard Mortensen a/s
Julien Nioche <[email protected]>
21-05-2010 09:39
Claus,
See https://issues.apache.org/jira/browse/NUTCH-809 and a related discussion on
http://lucene.472066.n3.nabble.com/description-and-keywords-td690681.html
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
On 21 May 2010 08:26, Claus Daldorph Nielsen <[email protected]> wrote:
Hi,
I am new to Nutch and am trying to get it to index meta tags from HTML
pages and store them for searching in Solr. The tags are in this form:
<meta name="TITLE" content="Some title" />
<meta name="KEYWORDS" content="Forum, help, build, stuff" />
I would like to store the tags as two different fields in the index. I
have tried the example explaining how to create a plugin, but the example
is for Nutch 0.9 and only helps me get started.
I think that I should look at:
$NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
and find the line:
HTMLMetaProcessor.getMetaTags(metaTags, root, base);
But I'm not sure how to go on from here. Any help would be appreciated,
and you are welcome to let me know if you know of an existing plugin that
will index the meta tags.
Claus Daldorph Nielsen
Theilgaard Mortensen a/s
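For readers following the thread, the extraction step asked about above can be sketched with the JDK's own DOM parser. This is a minimal illustration under stated assumptions, not the parse-metatags plugin or Nutch's HTMLMetaProcessor: the class and method names are hypothetical, the input must be well-formed XHTML (real pages usually need a tolerant HTML parser), and the lower-cased "metatag." prefix is inferred from the field names mentioned in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MetaTagExtractSketch {

    // Hypothetical helper: collects <meta name="..." content="..."/> pairs
    // from a well-formed XHTML string, keyed as "metatag.<lower-cased name>".
    static Map<String, String> extractMetaTags(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Map<String, String> tags = new LinkedHashMap<>();
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            // getAttribute returns an empty string when the attribute is absent
            String name = meta.getAttribute("name");
            String content = meta.getAttribute("content");
            if (name.isEmpty() || content.isEmpty()) {
                continue;
            }
            tags.put("metatag." + name.toLowerCase(Locale.ROOT), content);
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"TITLE\" content=\"Some title\" />"
                + "<meta name=\"KEYWORDS\" content=\"Forum, help, build, stuff\" />"
                + "</head><body/></html>";
        System.out.println(extractMetaTags(page));
    }
}
```

In Nutch itself the equivalent work happens inside a parse plugin, and the resulting values still have to be copied into the index by an indexing filter and declared in the Solr schema, as discussed in the replies.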