Anyone? :-)

-------- Original Message --------
Subject: Meta Tags
Date: Mon, 19 Dec 2011 15:30:12 +0100
From: Marek Bachmann <[email protected]>
Reply-To: [email protected]
To: [email protected]

Hello again,

I want to extract specific meta tags from HTML pages, for example:

<meta name="uniks-fb" value="fb16" />

However, the parser does not seem to extract them. I dumped the segment for a page (readseg doesn't work for me :-/ ) and inspected the values for this example page:

http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0

This page contains these meta tags:

<meta name="uniks-fb" content="default" />
<meta name="keywords" content="Universitt,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="robots" content="index" />
<meta name="DC.Description" content="Der Internetauftritt der Universität Kassel" />
<meta name="DC.Subject" content="Universitt,Kassel,Forschung,Lehre,Wissenschaft" />
<meta name="generator" content="TYPO3 4.2 CMS" />

But these tags don't appear in the segment dump shown below. I expected to find them under "Parse Metadata", but there are only these two values:

"CharEncodingForConversion=utf-8"
"OriginalCharEncoding=utf-8"

My plugin.includes contains parse-(html|tika) as well as urlmeta.

Any suggestions as to what I am doing wrong?

THANK YOU!

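To make the expectation concrete, here is a minimal sketch, independent of Nutch, of the name/content pairs I would hope to see under "Parse Metadata". It uses only Python's standard html.parser module. The names MetaTagCollector, extract_meta_tags and SAMPLE_HTML are hypothetical and exist only for this illustration, and the sketch assumes the values live in the content attribute, as they do on the example page.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here, because
        # the default handle_startendtag() delegates to handle_starttag().
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        if name is not None:
            # Assumption: the value is stored in the "content" attribute.
            self.tags[name] = attr_map.get("content")


def extract_meta_tags(html):
    """Return a dict mapping each meta tag name to its content (or None)."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# A few of the tags from the example page, copied from the post above.
SAMPLE_HTML = """
<html><head>
<meta name="uniks-fb" content="default" />
<meta name="robots" content="index" />
<meta name="generator" content="TYPO3 4.2 CMS" />
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, content in extract_meta_tags(SAMPLE_HTML).items():
        print(f"{name}={content}")
```

Note that a tag written with a value attribute instead of content, like the first example in this mail, would yield no content in this sketch. Whether the Nutch parsers treat such a tag the same way is something I have not verified.
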
Snippet from segment dump:

Recno:: 97
URL:: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:44:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.0
Signature: 7260839eaf4927f64b03dd86dcd0918a
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Dec 19 12:42:51 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Dec 17 14:45:49 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: exception(16), lastModified=0: java.net.SocketTimeoutException: Read timed out

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Mon Dec 19 12:25:59 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _ngt_: 1324289811219_pst_: success(1), lastModified=0

Content::
Version: -1
url: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
base: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
contentType: application/xhtml+xml
metadata: Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding Content-Length=3886 Content-Encoding=gzip Via=1.0 cms.uni-kassel.de _fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/ nutch.segment.name=20111219111925 Content-Type=text/html;charset=utf-8 Connection=close Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16
Content:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="de">
(...)
</html>

ParseData::
Version: 5
Status: success(1,0)
Title: 2004 - Universität Kassel
Outlinks: 35
  outlink: toUrl: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0#navigation anchor: Zur Hauptnavigation (Nutzergruppen-Navigation)
  (...)
Content Metadata: Content-Length=3886 _fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/ nutch.segment.name=20111219111925 Connection=close Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16 nutch.content.digest=7260839eaf4927f64b03dd86dcd0918a Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding Content-Encoding=gzip Via=1.0 cms.uni-kassel.de Content-Type=text/html;charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

ParseText::
2004 - Universität Kassel Zur Hauptnavigation (Nutzergruppen-Navigation) . Zur Unternavigation . Zum Inhalt . Zu verwandten Links und Informationen . Infos für: Universität Studium Forschung Fachbereiche Einrichtungen International students and scholars Sie befinden sich hier: HFK > Ehemalige Mitarbeiter > Früchting > Liste der Veröffentlichungen > 2004 Veröffentlichungen im Fachgebiet Hochfrequenztechnik/Kommunikationssysteme 2004: [113] Semmelrodt, S.; Kattenbach, R.; Früchting, H.: Toolbox for Spectral Analysis and Linear Prediction of Stationary and Non-Stationary Signals, COST 273 TD(04)019, Athen, Greece, January 26-28, 2004 [114] Semmelrodt, S.: Maximum Likelihood Based Parameter Estimation of Stationary and Non-Stationary Multi-Component Signals, FREQUENZ 58 (2004) 1-2, S. 20-24. [115] Semmelrodt, S.: Methoden zur prädiktiven Kanalschätzung für adaptive Übertragungstechniken im Mobilfunk, Dissertation Universität Kassel, Kassel: Kassel University Press 2004, ISBN 3-89958-041-9. [116] Henze, N.: Efficiency Measurement of Planar Solar Cell Antennas using the Wheeler Cap Method, 8th International Student Conference on Electrical Engineering, Technical University Prague, Czech, May 20, 2004. [117] Weitz, M.: A Planar Solar Cell Antenna for Vehicular Mobile Communication Systems, 8th International Student Conference on Electrical Engineering, Technical University Prague, Czech, May 20, 2004.
[118] Schäfer, A.: Construction of a 200 MHz and 400 MHz Clock-Oszillator for an Indoor Channel Sounder, 8th International Student Conference on Electrical Engineering, Technical University Prague, Czech, May 20, 2004. [119] Semmelrodt, S.: Spectral Analysis and Linear Prediction Toolbox for Stationary and Non-Stationary Signals, FREQUENZ 58 (2004) 7-8, S. 185-187. [120] Henze, N.; Weitz, M.; Hofmann, P.; Bendel, C.; Kirchhof, J.; Früchting, H..: Investigation of Planar Antennas with Photovoltaic Solar Cells for Mobile Communications, in Proceedings of the 15th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC 2004), Barcelona, Spain, September 5-8, 2004. Liste der Veröffentlichungen 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 bis1990 Impressum Google-Suche über Uni-Seiten Softlink Letzte Änderung: 29.12.2009 ComLab

