Hi Lewis,

Thank you very much for your reply.

I will have a look at the two other patches. My own need for metadata extraction is as follows:

I am using Nutch to crawl our university network. This network has been growing for over ten years, and there has never been a central administration for the web pages. In fact, most of our departments built their own web sites. The only thing these pages have in common is that they share the same domain name. The task of my bachelor thesis will be to build a Solr-based search engine that offers faceted search over predefined classes. The main task is the automated classification of the pages.

I am making two assumptions about the university network:

1) It works like a small "internet", since there is no central administration for ALL web sites.
2) Methods that can't be used on the internet because of spamming could work well in my case, since I hope that no department will do that! ;-)

So, to cut a long story short:

For the classification I use our own meta tags, which are generated in a CMS (15 % of all pages) and tell me the class of a page directly (deduced from the hierarchical structure in the CMS).
It may also be of interest to use common keys like keywords and description.

The next thing is the classification itself. Perhaps it would make sense to do this in the Nutch environment, since all the data is available there. In that case, the ability to use metadata in the crawl db as a label for the class of a page could be very useful.
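
To make the labelling idea concrete, here is a minimal Python sketch of how pages could be split into "already labelled by the CMS" and "still to be classified". It is only an illustration under assumptions: the function name split_by_label, the example.org URLs and the plain-dict representation of page metadata are hypothetical and are not part of Nutch or of any of the patches; only the meta tag name uniks-fb and the value fb16 come from this thread.

```python
def split_by_label(pages, label_key="uniks-fb"):
    """Split pages into labelled and unlabelled ones.

    pages maps a URL to a dict of extracted meta tags (hypothetical format).
    Pages that carry the CMS-generated tag can serve as labelled examples;
    all others still need to be classified automatically.
    """
    labeled = {}
    unlabeled = []
    for url, metadata in pages.items():
        label = metadata.get(label_key)
        if label:
            labeled[url] = label
        else:
            unlabeled.append(url)
    return labeled, unlabeled


# Hypothetical example data; only the tag name and "fb16" appear in the thread.
pages = {
    "http://example.org/a": {"uniks-fb": "fb16", "keywords": "Forschung"},
    "http://example.org/b": {"keywords": "Forschung,Lehre"},
}

labeled, unlabeled = split_by_label(pages)
print(labeled)
print(unlabeled)
```

The labelled part (about 15 % of the pages, according to the figure above) could then be used as training data for whatever classifier is chosen for the rest.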

I'll start with all this in the first week of January. I'll keep a close eye on the two patches.

At the moment I am really wondering whether implementing the classification algorithms in Nutch is such a bad idea after all. But I think that will become clear once I start on it :)

And explicitly to everyone:

Perhaps someone has already tried something similar? Please let me know!

THANKS

On 22.12.2011 10:57, Lewis John Mcgibbney wrote:
Hey Marek,

Apologies for taking ages to get back. The patch you found was
originally intended for inclusion in 1.3, however as you will see it
has been closely linked to two other patches

NUTCH-422 & NUTCH-1005.

I wonder if it is possible for you to have a look at them both (if you
have time), as our plans were to do a merge of sorts. It would be
great to get some direct feedback from the community to see how this
would best work and how the best solution could be integrated into the
Nutch codebase.

Thanks for taking the time to look at the problem.

Lewis

On Wed, Dec 21, 2011 at 3:36 PM, Markus Jelsma
<[email protected]>  wrote:
thanks for sharing!

On Wednesday 21 December 2011 16:17:17 Marek Bachmann wrote:
I solved it myself and want to report it in case anyone else has the same
problem:

As far as I can see, meta tags are ignored in Nutch 1.4. But I found this
patch:

https://issues.apache.org/jira/browse/NUTCH-809

It worked "out of the box" for me.

With this plugin it is possible to define a set of meta-tag names that
should be parsed. They will be stored in Parse Metadata.
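
The behaviour described above (keep only a configured set of meta-tag names) can be illustrated with a small standalone Python sketch. This is not the plugin's code and does not use any Nutch API; it relies only on Python's standard html.parser module, and the names MetaTagCollector and extract_meta are made up for this example. The sample tags are taken from the page quoted further down in this thread.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs for a fixed set of names."""

    def __init__(self, wanted_names):
        super().__init__()
        # Compare names case-insensitively (e.g. "DC.Subject" vs "dc.subject").
        self.wanted = {name.lower() for name in wanted_names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in self.wanted and content is not None:
            self.found[name] = content


def extract_meta(html, wanted_names):
    """Return a dict of the wanted meta tags found in the given HTML string."""
    collector = MetaTagCollector(wanted_names)
    collector.feed(html)
    collector.close()
    return collector.found


sample = """<html><head>
<meta name="uniks-fb" content="default" />
<meta name="robots" content="index" />
<meta name="generator" content="TYPO3 4.2 CMS" />
</head><body></body></html>"""

print(extract_meta(sample, ["uniks-fb", "generator"]))
```

Tags that are not in the configured set (here "robots") are simply dropped, which mirrors the idea of defining the names up front.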

On 21.12.2011 01:15, Marek Bachmann wrote:
Anyone? :-)

-------- Original Message --------
Subject: Meta Tags
Date: Mon, 19 Dec 2011 15:30:12 +0100
From: Marek Bachmann<[email protected]>
Reply-To: [email protected]
To: [email protected]

Hello again,

I want to extract specific meta tags from HTML pages, like:

<meta name="uniks-fb" value="fb16" />

But it seems that they aren't extracted by the parser. I dumped the
segment of a page (since readseg doesn't work for me :-/ ) and
inspected the values for this example page:

http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0

This page contains these meta tags:
<meta name="uniks-fb" content="default" />
<meta name="keywords"
content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="robots" content="index" />
<meta name="DC.Description" content="Der Internetauftritt der
Universität Kassel" />
<meta name="DC.Subject"
content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="generator" content="TYPO3 4.2 CMS" />

But these tags don't appear in the segment dump shown below. I thought
I would find them in "Parse Metadata", but there are only these two values:
"CharEncodingForConversion=utf-8" "OriginalCharEncoding=utf-8"

I use the value parse-(html|tika) in my plugin.includes as well as
urlmeta.

Any suggestions as to what I am doing wrong?

THANK YOU!

Snippet from segment dump:

Recno:: 97
URL:: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:44:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.0
Signature: 7260839eaf4927f64b03dd86dcd0918a
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:51 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Dec 17 14:45:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: exception(16), lastModified=0:
java.net.SocketTimeoutException: Read timed out

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Mon Dec 19 12:25:59 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: success(1), lastModified=0

Content::
Version: -1
url: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
base: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
contentType: application/xhtml+xml
metadata: Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding
Content-Length=3886 Content-Encoding=gzip Via=1.0 cms.uni-kassel.de
_fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb;
path=/unicms/ nutch.segment.name=20111219111925
Content-Type=text/html;charset=utf-8 Connection=close
Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By=PHP/5.2.0-8+etch16
Content:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html

       PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="de">
(...)
</html>

ParseData::
Version: 5
Status: success(1,0)
Title: 2004 - Universität Kassel
Outlinks: 35

    outlink: toUrl:
http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0#navigation
anchor: Zur Hauptnavigation (Nutzergruppen-Navigation)
(...)
Content Metadata: Content-Length=3886 _fst_=33
Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/
nutch.segment.name=20111219111925 Connection=close Server=Apache/2.2.3
(Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16
nutch.content.digest=7260839eaf4927f64b03dd86dcd0918a Date=Mon, 19 Dec
2011 10:24:09 GMT Vary=Accept-Encoding Content-Encoding=gzip Via=1.0
cms.uni-kassel.de Content-Type=text/html;charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8
OriginalCharEncoding=utf-8

ParseText::
2004 - Universität Kassel Zur Hauptnavigation (Nutzergruppen-Navigation)
. Zur Unternavigation . Zum Inhalt . Zu verwandten Links und
Informationen . Infos für: Universität Studium Forschung Fachbereiche
Einrichtungen International students and scholars     Sie befinden sich
hier:  HFK>     Ehemalige Mitarbeiter>     Früchting>     Liste der
Veröffentlichungen>     2004 Veröffentlichungen im Fachgebiet
Hochfrequenztechnik/Kommunikationssysteme 2004: [113] Semmelrodt, S.;
Kattenbach, R.; Früchting, H.: Toolbox for Spectral Analysis and Linear
Prediction of Stationary and Non-Stationary Signals, COST 273 TD(04)019,
Athen, Greece, January 26-28, 2004 [114] Semmelrodt, S.: Maximum
Likelihood Based Parameter Estimation of Stationary and Non-Stationary
Multi-Component Signals, FREQUENZ 58 (2004) 1-2, S. 20-24. [115]
Semmelrodt, S.: Methoden zur prädiktiven Kanalschätzung für adaptive
Übertragungstechniken im Mobilfunk, Dissertation Universität Kassel,
Kassel: Kassel University Press 2004, ISBN 3-89958-041-9. [116] Henze,
N.: Efficiency Measurement of Planar Solar Cell Antennas using the
Wheeler Cap Method, 8th International Student Conference on Electrical
Engineering, Technical University Prague, Czech, May 20, 2004. [117]
Weitz, M.: A Planar Solar Cell Antenna for Vehicular Mobile
Communication Systems, 8th International Student Conference on
Electrical Engineering, Technical University Prague, Czech, May 20,
2004. [118] Schäfer, A.: Construction of a 200 MHz and 400 MHz
Clock-Oszillator for an Indoor Channel Sounder, 8th International
Student Conference on Electrical Engineering, Technical University
Prague, Czech, May 20, 2004. [119] Semmelrodt, S.: Spectral Analysis and
Linear Prediction Toolbox for Stationary and Non-Stationary Signals,
FREQUENZ 58 (2004) 7-8, S. 185-187. [120] Henze, N.; Weitz, M.; Hofmann,
P.; Bendel, C.; Kirchhof, J.; Früchting, H..: Investigation of Planar
Antennas with Photovoltaic Solar Cells for Mobile Communications, in
Proceedings of the 15th IEEE International Symposium on Personal, Indoor
and Mobile Radio Communications (PIMRC 2004), Barcelona, Spain,
September 5-8, 2004. Liste der Veröffentlichungen 2005 2004 2003 2002
2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 bis1990 Impressum
Google-Suche über Uni-Seiten   Softlink Letzte Änderung: 29.12.2009
ComLab

--
Markus Jelsma - CTO - Openindex



