Hey Marek,

Apologies for taking ages to get back. The patch you found was originally
intended for inclusion in 1.3; however, as you will see, it is closely linked
to two other patches, NUTCH-422 and NUTCH-1005. If you have time, could you
have a look at both of them? Our plan was to merge them in some form. It would
be great to get some direct feedback from the community on how this would best
work and how the best solution could be integrated into the Nutch codebase.

Thanks for taking the time to look at the problem.

Lewis

On Wed, Dec 21, 2011 at 3:36 PM, Markus Jelsma <[email protected]> wrote:
> thanks for sharing!
>
> On Wednesday 21 December 2011 16:17:17 Marek Bachmann wrote:
>> I solved it by myself and want to report it in case anyone else has the
>> same problem:
>>
>> As far as I can see, in Nutch 1.4 the meta tags are ignored. But I found
>> this patch:
>>
>> https://issues.apache.org/jira/browse/NUTCH-809
>>
>> It worked "out of the box" for me.
>>
>> With this plugin it is possible to define a set of meta-tag names that
>> should be parsed. They will be stored in Parse Metadata.
>>
>> On 21.12.2011 01:15, Marek Bachmann wrote:
>> > Anyone? :-)
>> >
>> > -------- Original Message --------
>> > Subject: Meta Tags
>> > Date: Mon, 19 Dec 2011 15:30:12 +0100
>> > From: Marek Bachmann<[email protected]>
>> > Reply-To: [email protected]
>> > To: [email protected]
>> >
>> > Hello again,
>> >
>> > I want to extract specific meta tags from HTML pages, like:
>> >
>> > <meta name="uniks-fb" value="fb16" />
>> >
>> > But it seems that they aren't extracted by the parser.
>> > I dumped the segment of a page (since readseg doesn't work for me :-/ )
>> > and inspected the values for this example page:
>> >
>> > http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
>> >
>> > This page contains these meta tags:
>> > <meta name="uniks-fb" content="default" />
>> > <meta name="keywords" content="Universitt,Kassel,Forschung,Lehre,Wissenschaft" />
>> > <meta name="robots" content="index" />
>> > <meta name="DC.Description" content="Der Internetauftritt der Universität Kassel" />
>> > <meta name="DC.Subject" content="Universitt,Kassel,Forschung,Lehre,Wissenschaft" />
>> > <meta name="generator" content="TYPO3 4.2 CMS" />
>> >
>> > But these tags don't appear in the segment, as shown below. I thought
>> > I would find them in "Parse Metadata", but there are only these two
>> > values:
>> > "CharEncodingForConversion=utf-8" "OriginalCharEncoding=utf-8"
>> >
>> > I use the value parse-(html|tika) in my plugin.includes as well as
>> > urlmeta.
>> >
>> > Any suggestions as to what I am doing wrong?
>> >
>> > THANK YOU!
>> >
>> > Snippet from segment dump:
>> >
>> > Recno:: 97
>> > URL:: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 67 (linked)
>> > Fetch time: Mon Dec 19 12:44:49 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 67 (linked)
>> > Fetch time: Mon Dec 19 12:42:04 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 67 (linked)
>> > Fetch time: Mon Dec 19 12:42:04 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 67 (linked)
>> > Fetch time: Mon Dec 19 12:42:04 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 65 (signature)
>> > Fetch time: Mon Dec 19 12:42:04 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 0 seconds (0 days)
>> > Score: 0.0
>> > Signature: 7260839eaf4927f64b03dd86dcd0918a
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 67 (linked)
>> > Fetch time: Mon Dec 19 12:42:04 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 67 (linked)
>> > Fetch time: Mon Dec 19 12:42:51 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 0
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata:
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 1 (db_unfetched)
>> > Fetch time: Sat Dec 17 14:45:49 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 1
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata: _ngt_: 1324289811219_pst_: exception(16), lastModified=0:
>> > java.net.SocketTimeoutException: Read timed out
>> >
>> > CrawlDatum::
>> > Version: 7
>> > Status: 33 (fetch_success)
>> > Fetch time: Mon Dec 19 12:25:59 CET 2011
>> > Modified time: Thu Jan 01 01:00:00 CET 1970
>> > Retries since fetch: 1
>> > Retry interval: 603450 seconds (6 days)
>> > Score: 0.0
>> > Signature: null
>> > Metadata: _ngt_: 1324289811219_pst_: success(1), lastModified=0
>> >
>> > Content::
>> > Version: -1
>> > url: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
>> > base: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
>> > contentType: application/xhtml+xml
>> > metadata: Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding
>> > Content-Length=3886 Content-Encoding=gzip Via=1.0 cms.uni-kassel.de
>> > _fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb;
>> > path=/unicms/ nutch.segment.name=20111219111925
>> > Content-Type=text/html;charset=utf-8 Connection=close
>> > Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
>> > X-Powered-By=PHP/5.2.0-8+etch16
>> > Content:
>> > <?xml version="1.0" encoding="utf-8"?>
>> > <!DOCTYPE html
>> >
>> > PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> > "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> >
>> > <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="de">
>> > (...)
>> > </html>
>> >
>> > ParseData::
>> > Version: 5
>> > Status: success(1,0)
>> > Title: 2004 - Universität Kassel
>> > Outlinks: 35
>> >
>> > outlink: toUrl:
>> > http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0#navigation
>> > anchor: Zur Hauptnavigation (Nutzergruppen-Navigation)
>> > (...)
>> > Content Metadata: Content-Length=3886 _fst_=33
>> > Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/
>> > nutch.segment.name=20111219111925 Connection=close Server=Apache/2.2.3
>> > (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16
>> > nutch.content.digest=7260839eaf4927f64b03dd86dcd0918a Date=Mon, 19 Dec
>> > 2011 10:24:09 GMT Vary=Accept-Encoding Content-Encoding=gzip Via=1.0
>> > cms.uni-kassel.de Content-Type=text/html;charset=utf-8
>> > Parse Metadata: CharEncodingForConversion=utf-8
>> > OriginalCharEncoding=utf-8
>> >
>> > ParseText::
>> > 2004 - Universität Kassel Zur Hauptnavigation (Nutzergruppen-Navigation)
>> > . Zur Unternavigation . Zum Inhalt . Zu verwandten Links und
>> > Informationen . Infos für: Universität Studium Forschung Fachbereiche
>> > Einrichtungen International students and scholars Sie befinden sich
>> > hier: HFK> Ehemalige Mitarbeiter> Früchting> Liste der
>> > Veröffentlichungen> 2004 Veröffentlichungen im Fachgebiet
>> > Hochfrequenztechnik/Kommunikationssysteme 2004: [113] Semmelrodt, S.;
>> > Kattenbach, R.; Früchting, H.: Toolbox for Spectral Analysis and Linear
>> > Prediction of Stationary and Non-Stationary Signals, COST 273 TD(04)019,
>> > Athen, Greece, January 26-28, 2004 [114] Semmelrodt, S.: Maximum
>> > Likelihood Based Parameter Estimation of Stationary and Non-Stationary
>> > Multi-Component Signals, FREQUENZ 58 (2004) 1-2, S. 20-24. [115]
>> > Semmelrodt, S.: Methoden zur prädiktiven Kanalschätzung für adaptive
>> > Übertragungstechniken im Mobilfunk, Dissertation Universität Kassel,
>> > Kassel: Kassel University Press 2004, ISBN 3-89958-041-9. [116] Henze,
>> > N.: Efficiency Measurement of Planar Solar Cell Antennas using the
>> > Wheeler Cap Method, 8th International Student Conference on Electrical
>> > Engineering, Technical University Prague, Czech, May 20, 2004. [117]
>> > Weitz, M.: A Planar Solar Cell Antenna for Vehicular Mobile
>> > Communication Systems, 8th International Student Conference on
>> > Electrical Engineering, Technical University Prague, Czech, May 20,
>> > 2004. [118] Schäfer, A.: Construction of a 200 MHz and 400 MHz
>> > Clock-Oszillator for an Indoor Channel Sounder, 8th International
>> > Student Conference on Electrical Engineering, Technical University
>> > Prague, Czech, May 20, 2004. [119] Semmelrodt, S.: Spectral Analysis and
>> > Linear Prediction Toolbox for Stationary and Non-Stationary Signals,
>> > FREQUENZ 58 (2004) 7-8, S. 185-187. [120] Henze, N.; Weitz, M.; Hofmann,
>> > P.; Bendel, C.; Kirchhof, J.; Früchting, H..: Investigation of Planar
>> > Antennas with Photovoltaic Solar Cells for Mobile Communications, in
>> > Proceedings of the 15th IEEE International Symposium on Personal, Indoor
>> > and Mobile Radio Communications (PIMRC 2004), Barcelona, Spain,
>> > September 5-8, 2004. Liste der Veröffentlichungen 2005 2004 2003 2002
>> > 2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 bis1990 Impressum
>> > Google-Suche über Uni-Seiten Softlink Letzte Änderung: 29.12.2009
>> > ComLab
>
> --
> Markus Jelsma - CTO - Openindex

--
Lewis
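
For readers who want to see the idea from the thread in isolation (a configured set of meta-tag names whose values end up in a metadata map), here is a minimal sketch in plain Python. It is not the NUTCH-809 patch and does not use any Nutch API; the names `MetaTagCollector` and `extract_meta_tags` are made up for illustration, and the sketch assumes the value is carried in the `content` attribute, as in the tags of the example page.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the content of <meta> tags whose name is in a wanted set.

    Hypothetical helper for illustration only; not part of Nutch.
    """

    def __init__(self, wanted_names):
        super().__init__()
        # Compare names case-insensitively ("DC.Subject" vs "dc.subject").
        self.wanted = {name.lower() for name in wanted_names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags such as <meta ... />
        # through this method, with tag and attribute names lowercased.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in self.wanted and content is not None:
            self.found[name] = content


def extract_meta_tags(html, wanted_names):
    """Return a dict of {meta name: content} for the wanted names only."""
    collector = MetaTagCollector(wanted_names)
    collector.feed(html)
    collector.close()
    return collector.found


if __name__ == "__main__":
    page = """<html><head>
    <meta name="uniks-fb" content="default" />
    <meta name="robots" content="index" />
    <meta name="generator" content="TYPO3 4.2 CMS" />
    </head><body></body></html>"""
    # Only the configured names are kept; "robots" is ignored.
    print(extract_meta_tags(page, ["uniks-fb", "generator"]))
```

Note that, under this assumption, a tag written as `<meta name="uniks-fb" value="fb16" />` would be skipped, because it carries its value in `value` rather than `content`.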

