Thanks for sharing!
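
For anyone who finds this thread later: here is a minimal Python sketch of the
idea Marek describes, i.e. keeping only a configured set of meta-tag names and
storing them as key/value metadata. It is not the code from the NUTCH-809 patch
and does not use any Nutch API. The names MetaTagExtractor, extract_meta_tags
and SAMPLE_PAGE are made up for illustration, and the sample markup is a
shortened version of the tags quoted further down.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs for a configured set of names."""

    def __init__(self, wanted_names):
        super().__init__()
        # Meta-tag names are compared case-insensitively.
        self.wanted = {name.lower() for name in wanted_names}
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here, because
        # HTMLParser's default handle_startendtag calls handle_starttag.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is None or content is None:
            return
        if name.lower() in self.wanted:
            self.metadata[name.lower()] = content


def extract_meta_tags(html, wanted_names):
    """Return a dict of the wanted meta tags found in the given HTML string."""
    parser = MetaTagExtractor(wanted_names)
    parser.feed(html)
    parser.close()
    return parser.metadata


# Hypothetical sample input, shortened from the page discussed in this thread.
SAMPLE_PAGE = """<html><head>
<meta name="uniks-fb" content="default" />
<meta name="robots" content="index" />
<meta name="generator" content="TYPO3 4.2 CMS" />
</head><body></body></html>"""

if __name__ == "__main__":
    print(extract_meta_tags(SAMPLE_PAGE, ["uniks-fb", "generator"]))
```

Note that the sketch only reads the content attribute, so a tag written with
value= instead (as in the first example in the quoted mail) would not be picked
up by it.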

On Wednesday 21 December 2011 16:17:17 Marek Bachmann wrote:
> I solved it myself and want to report the fix in case anyone else has the
> same problem:
> 
> As far as I can see, Nutch 1.4 ignores meta tags. But I found this
> patch:
> 
> https://issues.apache.org/jira/browse/NUTCH-809
> 
> It worked "out of the box" for me.
> 
> With this plugin it is possible to define a set of meta-tag names that
> should be parsed. They will be stored in Parse Metadata.
> 
> On 21.12.2011 01:15, Marek Bachmann wrote:
> > Anyone? :-)
> > 
> > -------- Original Message --------
> > Subject: Meta Tags
> > Date: Mon, 19 Dec 2011 15:30:12 +0100
> > From: Marek Bachmann<[email protected]>
> > Reply-To: [email protected]
> > To: [email protected]
> > 
> > Hello again,
> > 
> > I want to extract specific meta tags from HTML pages, like:
> > 
> > <meta name="uniks-fb" value="fb16" />
> > 
> > But it seems that they aren't extracted by the parser. I dumped the
> > segment of a page (Since the readseg doesn't work for me :-/ ) and
> > inspected the values for this example page:
> > 
> > http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
> > 
> > This page contains these meta tags:
> > <meta name="uniks-fb" content="default" />
> > <meta name="keywords"
> > content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
> > <meta name="robots" content="index" />
> > <meta name="DC.Description" content="Der Internetauftritt der
> > Universität Kassel" />
> > <meta name="DC.Subject"
> > content="Universität,Kassel,Forschung,Lehre,Wissenschaft" />
> > <meta name="generator" content="TYPO3 4.2 CMS" />
> > 
> > But these tags don't appear in the segment dump shown below. I thought
> > I'd find them in "Parse Metadata", but there are only these two values:
> > "CharEncodingForConversion=utf-8" "OriginalCharEncoding=utf-8"
> > 
> > I use the value parse-(html|tika) in my plugin.includes as well as
> > urlmeta.
> > 
> > Any suggestions what I am doing wrong?
> > 
> > THANK YOU!
> > 
> > Snippet from segment dump:
> > 
> > Recno:: 97
> > URL:: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Dec 19 12:44:49 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Dec 19 12:42:04 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Dec 19 12:42:04 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Dec 19 12:42:04 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 65 (signature)
> > Fetch time: Mon Dec 19 12:42:04 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 0 seconds (0 days)
> > Score: 0.0
> > Signature: 7260839eaf4927f64b03dd86dcd0918a
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Dec 19 12:42:04 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Dec 19 12:42:51 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 0
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata:
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 1 (db_unfetched)
> > Fetch time: Sat Dec 17 14:45:49 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 1
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata: _ngt_: 1324289811219_pst_: exception(16), lastModified=0:
> > java.net.SocketTimeoutException: Read timed out
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 33 (fetch_success)
> > Fetch time: Mon Dec 19 12:25:59 CET 2011
> > Modified time: Thu Jan 01 01:00:00 CET 1970
> > Retries since fetch: 1
> > Retry interval: 603450 seconds (6 days)
> > Score: 0.0
> > Signature: null
> > Metadata: _ngt_: 1324289811219_pst_: success(1), lastModified=0
> > 
> > Content::
> > Version: -1
> > url: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
> > base: http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0
> > contentType: application/xhtml+xml
> > metadata: Date=Mon, 19 Dec 2011 10:24:09 GMT Vary=Accept-Encoding
> > Content-Length=3886 Content-Encoding=gzip Via=1.0 cms.uni-kassel.de
> > _fst_=33 Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb;
> > path=/unicms/ nutch.segment.name=20111219111925
> > Content-Type=text/html;charset=utf-8 Connection=close
> > Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
> > X-Powered-By=PHP/5.2.0-8+etch16
> > Content:
> > <?xml version="1.0" encoding="utf-8"?>
> > <!DOCTYPE html
> > 
> >       PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> > 
> > <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="de">
> > (...)
> > </html>
> > 
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: 2004 - Universität Kassel
> > Outlinks: 35
> > 
> >    outlink: toUrl:
> > http://cms.uni-kassel.de/unicms/index.php?id=29216&L=0#navigation
> > anchor: Zur Hauptnavigation (Nutzergruppen-Navigation)
> > (...)
> > Content Metadata: Content-Length=3886 _fst_=33
> > Set-Cookie=fe_typo_user=cb42ebddb40df7e8a04b0183f79c41cb; path=/unicms/
> > nutch.segment.name=20111219111925 Connection=close Server=Apache/2.2.3
> > (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c X-Powered-By=PHP/5.2.0-8+etch16
> > nutch.content.digest=7260839eaf4927f64b03dd86dcd0918a Date=Mon, 19 Dec
> > 2011 10:24:09 GMT Vary=Accept-Encoding Content-Encoding=gzip Via=1.0
> > cms.uni-kassel.de Content-Type=text/html;charset=utf-8
> > Parse Metadata: CharEncodingForConversion=utf-8
> > OriginalCharEncoding=utf-8
> > 
> > ParseText::
> > 2004 - Universität Kassel Zur Hauptnavigation (Nutzergruppen-Navigation)
> > . Zur Unternavigation . Zum Inhalt . Zu verwandten Links und
> > Informationen . Infos für: Universität Studium Forschung Fachbereiche
> > Einrichtungen International students and scholars     Sie befinden sich
> > hier:  HFK>   Ehemalige Mitarbeiter>   Früchting>   Liste der
> > Veröffentlichungen>   2004 Veröffentlichungen im Fachgebiet
> > Hochfrequenztechnik/Kommunikationssysteme 2004: [113] Semmelrodt, S.;
> > Kattenbach, R.; Früchting, H.: Toolbox for Spectral Analysis and Linear
> > Prediction of Stationary and Non-Stationary Signals, COST 273 TD(04)019,
> > Athen, Greece, January 26-28, 2004 [114] Semmelrodt, S.: Maximum
> > Likelihood Based Parameter Estimation of Stationary and Non-Stationary
> > Multi-Component Signals, FREQUENZ 58 (2004) 1-2, S. 20-24. [115]
> > Semmelrodt, S.: Methoden zur prädiktiven Kanalschätzung für adaptive
> > Übertragungstechniken im Mobilfunk, Dissertation Universität Kassel,
> > Kassel: Kassel University Press 2004, ISBN 3-89958-041-9. [116] Henze,
> > N.: Efficiency Measurement of Planar Solar Cell Antennas using the
> > Wheeler Cap Method, 8th International Student Conference on Electrical
> > Engineering, Technical University Prague, Czech, May 20, 2004. [117]
> > Weitz, M.: A Planar Solar Cell Antenna for Vehicular Mobile
> > Communication Systems, 8th International Student Conference on
> > Electrical Engineering, Technical University Prague, Czech, May 20,
> > 2004. [118] Schäfer, A.: Construction of a 200 MHz and 400 MHz
> > Clock-Oszillator for an Indoor Channel Sounder, 8th International
> > Student Conference on Electrical Engineering, Technical University
> > Prague, Czech, May 20, 2004. [119] Semmelrodt, S.: Spectral Analysis and
> > Linear Prediction Toolbox for Stationary and Non-Stationary Signals,
> > FREQUENZ 58 (2004) 7-8, S. 185-187. [120] Henze, N.; Weitz, M.; Hofmann,
> > P.; Bendel, C.; Kirchhof, J.; Früchting, H..: Investigation of Planar
> > Antennas with Photovoltaic Solar Cells for Mobile Communications, in
> > Proceedings of the 15th IEEE International Symposium on Personal, Indoor
> > and Mobile Radio Communications (PIMRC 2004), Barcelona, Spain,
> > September 5-8, 2004. Liste der Veröffentlichungen 2005 2004 2003 2002
> > 2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 bis1990 Impressum
> > Google-Suche über Uni-Seiten   Softlink Letzte Änderung: 29.12.2009
> > ComLab

-- 
Markus Jelsma - CTO - Openindex
