Anyone? :-)
-------- Original Message --------
Subject: Meta Tags
Date: Mon, 19 Dec 2011 15:30:12 +0100
From: Marek Bachmann <[email protected]>
Reply-To: [email protected]
To: [email protected]
Hello again,
I want to extract specific meta tags from HTML pages, like:
<meta name="uniks-fb" value="fb16" />
But it seems that they aren't extracted by the parser. I dumped the
segment of a page (since readseg doesn't work for me :-/ ) and
inspected the values for this example page:
http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
This page contains these meta tags:
<meta name="uniks-fb" content="default" />
<meta name="keywords"
content="Universitt,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="robots" content="index" />
<meta name="DC.Description" content="Der Internetauftritt der
Universität Kassel" />
<meta name="DC.Subject"
content="Universitt,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="generator" content="TYPO3 4.2 CMS" />
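To make clear what I am after: a minimal standalone sketch of the name/content pairs I would expect to see for these tags. This is plain Python using the standard library's html.parser, not Nutch code. The class and function names (MetaTagCollector, extract_meta_tags) are made up for illustration, and the sample string is a shortened copy of the tags above.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name -> content pairs from <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> also end up here, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # The page uses "content"; fall back to "value" as an assumption,
        # since my first example tag was written with value="...".
        content = attributes.get("content", attributes.get("value"))
        if name is not None and content is not None:
            self.tags[name] = content


def extract_meta_tags(html):
    """Return a dict of meta tag names to their content strings."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Shortened copy of the meta tags from the example page.
sample = """<html><head>
<meta name="uniks-fb" content="default" />
<meta name="robots" content="index" />
<meta name="generator" content="TYPO3 4.2 CMS" />
</head><body></body></html>"""

for name, content in extract_meta_tags(sample).items():
    print(name, "=", content)
```

Something equivalent to this mapping is what I hoped to find in the parse output.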
But these tags don't appear in the segment dump shown below. I thought
I would find them in "Parse Metadata", but there are only these two values:
"CharEncodingForConversion=utf-8" "OriginalCharEncoding=utf-8"
I use parse-(html|tika) as well as urlmeta in my plugin.includes.
Any suggestions as to what I am doing wrong?
THANK YOU!
Snippet from segment dump:
Recno:: 97
URL:: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:44:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.0
Signature: 7260839eaf4927f64b03dd86dcd0918a
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:51 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Dec 17 14:45:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: exception(16), lastModified=0:
java.net.SocketTimeoutException: Read timed out
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Mon Dec 19 12:25:59 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: success(1), lastModified=0
Content::
Version: -1
url: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
base: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
contentType: application/xhtml+xml
metadata: Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding
Content-Length=3886 Content-Encoding=gzip Via=1.0 cms.uni-kassel.de
_fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb;
path=/unicms/ nutch.segment.name=20111219111925
Content-Type=text/html;charset=utf-8 Connection=close
Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By=PHP/5.2.0-8+etch16
Content:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="de">
(...)
</html>
ParseData::
Version: 5
Status: success(1,0)
Title: 2004 - Universität Kassel
Outlinks: 35
outlink: toUrl:
http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0#navigation
anchor: Zur Hauptnavigation (Nutzergruppen-Navigation)
(...)
Content Metadata: Content-Length=3886 _fst_=33
Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/
nutch.segment.name=20111219111925 Connection=close Server=Apache/2.2.3
(Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16
nutch.content.digest=7260839eaf4927f64b03dd86dcd0918a Date=Mon, 19 Dec
2011 10:24:09 GMT Vary=Accept-Encoding Content-Encoding=gzip Via=1.0
cms.uni-kassel.de Content-Type=text/html;charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
ParseText::
2004 - Universität Kassel Zur Hauptnavigation (Nutzergruppen-Navigation)
. Zur Unternavigation . Zum Inhalt . Zu verwandten Links und
Informationen . Infos für: Universität Studium Forschung Fachbereiche
Einrichtungen International students and scholars Sie befinden sich
hier: HFK> Ehemalige Mitarbeiter> Früchting> Liste der
Veröffentlichungen> 2004 Veröffentlichungen im Fachgebiet
Hochfrequenztechnik/Kommunikationssysteme 2004: [113] Semmelrodt, S.;
Kattenbach, R.; Früchting, H.: Toolbox for Spectral Analysis and Linear
Prediction of Stationary and Non-Stationary Signals, COST 273 TD(04)019,
Athen, Greece, January 26-28, 2004 [114] Semmelrodt, S.: Maximum
Likelihood Based Parameter Estimation of Stationary and Non-Stationary
Multi-Component Signals, FREQUENZ 58 (2004) 1-2, S. 20-24. [115]
Semmelrodt, S.: Methoden zur prädiktiven Kanalschätzung für adaptive
Übertragungstechniken im Mobilfunk, Dissertation Universität Kassel,
Kassel: Kassel University Press 2004, ISBN 3-89958-041-9. [116] Henze,
N.: Efficiency Measurement of Planar Solar Cell Antennas using the
Wheeler Cap Method, 8th International Student Conference on Electrical
Engineering, Technical University Prague, Czech, May 20, 2004. [117]
Weitz, M.: A Planar Solar Cell Antenna for Vehicular Mobile
Communication Systems, 8th International Student Conference on
Electrical Engineering, Technical University Prague, Czech, May 20,
2004. [118] Schäfer, A.: Construction of a 200 MHz and 400 MHz
Clock-Oszillator for an Indoor Channel Sounder, 8th International
Student Conference on Electrical Engineering, Technical University
Prague, Czech, May 20, 2004. [119] Semmelrodt, S.: Spectral Analysis and
Linear Prediction Toolbox for Stationary and Non-Stationary Signals,
FREQUENZ 58 (2004) 7-8, S. 185-187. [120] Henze, N.; Weitz, M.; Hofmann,
P.; Bendel, C.; Kirchhof, J.; Früchting, H..: Investigation of Planar
Antennas with Photovoltaic Solar Cells for Mobile Communications, in
Proceedings of the 15th IEEE International Symposium on Personal, Indoor
and Mobile Radio Communications (PIMRC 2004), Barcelona, Spain,
September 5-8, 2004. Liste der Veröffentlichungen 2005 2004 2003 2002
2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 bis1990 Impressum
Google-Suche über Uni-Seiten Softlink Letzte Änderung: 29.12.2009 ComLab