Hi Folks,

Is the aim to have identical output from parse-tika and parse-html when rendering parse metadata? Using Nutch 1.10-SNAPSHOT with no local source code modifications, if we take the following page [0] and set metatags.names to the wildcard *, with parse-tika I get

Parse Metadata:
og:type=article
test=displayAbstract
metatag.og:image= http://journals.cambridge.org/cover_images/OPL/OPL.jpg
metatag.og:image= http://journals.cambridge.org/cover_images/OPL/OPL.jpg
fb:app_id=102729586536954
Cache-Control=no-store
Pragma=no-cache
description=We present a comparative study of optical absorption, photoluminescence (PL), and photoconductivity in bulk heterojunctions comprising a high performance functionalized anthradithiophene (ADT) derivative or the benchmark polymer P3HT as donor and functionalized pentacene (Pn) derivative or PCBM as acceptor. Of all D/A blends studied, the ADT/PCBM blend exhibited the highest charge photogeneration efficiencies under 532 nm excitation, leading to the highest amplitudes of time-resolved and continuous wave (cw) photocurrents. At nanosecond time scales after photoexcitation, both ADT-TES-F-based blends and the P3HT/Pn-TIPS-F8 blend exhibited photocurrents which were higher by a factor of 2-10, depending on the blend, than that in the P3HT/PCBM blend. However, cw photocurrents showed a different trend, with the ADT-TES-F/PCBM blend exhibiting only a factor of 1.5-2.5 lower photoresponse than that in the P3HT/PCBM blend, due to other contributions, such as that of charge trap-limited transport, to cw photoresponse.
verify-v1=P40xFgT/ywJlpV7zP/etM8pJVJZ4CjdOId2dmmiCb+4=
metatag.og:type=article
metatag.og:type=article
metatag.expires=-1
metatag.expires=-1
format-detection=telephone=no
metatag.verify-v1=P40xFgT/ywJlpV7zP/etM8pJVJZ4CjdOId2dmmiCb+4=
metatag.verify-v1=P40xFgT/ywJlpV7zP/etM8pJVJZ4CjdOId2dmmiCb+4=
metatag.description=We present a comparative study of optical absorption, photoluminescence (PL), and photoconductivity in bulk heterojunctions comprising a high performance functionalized anthradithiophene (ADT) derivative or the benchmark polymer P3HT as donor and functionalized pentacene (Pn) derivative or PCBM as acceptor. Of all D/A blends studied, the ADT/PCBM blend exhibited the highest charge photogeneration efficiencies under 532 nm excitation, leading to the highest amplitudes of time-resolved and continuous wave (cw) photocurrents. At nanosecond time scales after photoexcitation, both ADT-TES-F-based blends and the P3HT/Pn-TIPS-F8 blend exhibited photocurrents which were higher by a factor of 2-10, depending on the blend, than that in the P3HT/PCBM blend. However, cw photocurrents showed a different trend, with the ADT-TES-F/PCBM blend exhibiting only a factor of 1.5-2.5 lower photoresponse than that in the P3HT/PCBM blend, due to other contributions, such as that of charge trap-limited transport, to cw photoresponse.
metatag.description=We present a comparative study of optical absorption, photoluminescence (PL), and photoconductivity in bulk heterojunctions comprising a high performance functionalized anthradithiophene (ADT) derivative or the benchmark polymer P3HT as donor and functionalized pentacene (Pn) derivative or PCBM as acceptor. Of all D/A blends studied, the ADT/PCBM blend exhibited the highest charge photogeneration efficiencies under 532 nm excitation, leading to the highest amplitudes of time-resolved and continuous wave (cw) photocurrents. At nanosecond time scales after photoexcitation, both ADT-TES-F-based blends and the P3HT/Pn-TIPS-F8 blend exhibited photocurrents which were higher by a factor of 2-10, depending on the blend, than that in the P3HT/PCBM blend. However, cw photocurrents showed a different trend, with the ADT-TES-F/PCBM blend exhibiting only a factor of 1.5-2.5 lower photoresponse than that in the P3HT/PCBM blend, due to other contributions, such as that of charge trap-limited transport, to cw photoresponse.
dc:title=Cambridge Journals Online - MRS Online Proceedings Library - Abstract - Charge carrier dynamics in small-molecule- and polymer-based donor-acceptor blends
metatag.og:url= http://journals.cambridge.org/abstract_S1946427414009567
metatag.og:url= http://journals.cambridge.org/abstract_S1946427414009567
metatag.test=displayAbstract
metatag.test=displayAbstract
metatag.content-encoding=UTF-8
metatag.content-encoding=UTF-8
Expires=-1
metatag.pragma=no-cache
metatag.pragma=no-cache
metatag.dc:title=Cambridge Journals Online - MRS Online Proceedings Library - Abstract - Charge carrier dynamics in small-molecule- and polymer-based donor-acceptor blends
metatag.dc:title=Cambridge Journals Online - MRS Online Proceedings Library - Abstract - Charge carrier dynamics in small-molecule- and polymer-based donor-acceptor blends
metatag.cache-control=no-store
metatag.cache-control=no-store
metatag.format-detection=telephone=no
metatag.format-detection=telephone=no
og:image= http://journals.cambridge.org/cover_images/OPL/OPL.jpg
og:url= http://journals.cambridge.org/abstract_S1946427414009567
metatag.content-type=text/html; charset=UTF-8
metatag.content-type=text/html; charset=UTF-8
Content-Encoding=UTF-8
metatag.fb:app_id=102729586536954
metatag.fb:app_id=102729586536954
Content-Type=text/html; charset=UTF-8

with parse-html, I get

Parse Metadata:
metatag.test=displayAbstract
caching.forbidden=content
metatag.pragma=no-cache
metatag.cache-control=no-store
metatag.title=Charge carrier dynamics in small-molecule- and polymer-based donor-acceptor blends
metatag.format-detection=telephone=no
metatag.content-type=text/html; charset=UTF-8
CharEncodingForConversion=utf-8
OriginalCharEncoding=utf-8
metatag.expires=-1
metatag.originalcharencoding=utf-8
metatag.verify-v1=P40xFgT/ywJlpV7zP/etM8pJVJZ4CjdOId2dmmiCb+4=
metatag.description=We present a comparative study of optical absorption, photoluminescence (PL), and photoconductivity in bulk heterojunctions comprising a high performance functionalized anthradithiophene (ADT) derivative or the benchmark polymer P3HT as donor and functionalized pentacene (Pn) derivative or PCBM as acceptor. Of all D/A blends studied, the ADT/PCBM blend exhibited the highest charge photogeneration efficiencies under 532 nm excitation, leading to the highest amplitudes of time-resolved and continuous wave (cw) photocurrents. At nanosecond time scales after photoexcitation, both ADT-TES-F-based blends and the P3HT/Pn-TIPS-F8 blend exhibited photocurrents which were higher by a factor of 2-10, depending on the blend, than that in the P3HT/PCBM blend. However, cw photocurrents showed a different trend, with the ADT-TES-F/PCBM blend exhibiting only a factor of 1.5-2.5 lower photoresponse than that in the P3HT/PCBM blend, due to other contributions, such as that of charge trap-limited transport, to cw photoresponse.
metatag.charencodingforconversion=utf-8

The immediate observation is that parse-tika seems to be duplicating a lot of the fields. Take the description, for example: it is repeated three times. It would be ideal if we could get a conversation going on this.

Thanks folks
Lewis

[0] http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=9493586&fulltextType=RA&fileId=S1946427414009567

--
*Lewis*
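
P.S. To put a number on the duplication, here is a minimal Python sketch. It assumes the Parse Metadata dump has been saved as text with one key=value entry per line; the function name duplicated_entries and the short sample are only illustrative (the sample entries are copied from the parse-tika output above), and none of this is part of Nutch itself.

```python
from collections import Counter


def duplicated_entries(dump: str) -> dict[str, int]:
    """Return every key=value line that occurs more than once, with its count."""
    # Keep only lines that look like metadata entries (assumption: one per line).
    lines = [line.strip() for line in dump.splitlines() if "=" in line]
    counts = Counter(lines)
    return {entry: n for entry, n in counts.items() if n > 1}


# Illustrative excerpt of the parse-tika output, one entry per line.
sample = """\
og:type=article
metatag.og:type=article
metatag.og:type=article
Expires=-1
metatag.expires=-1
metatag.expires=-1
"""

for entry, n in duplicated_entries(sample).items():
    print(f"{n}x {entry}")
```

Exact string matching only catches the doubled metatag.* entries. Fields that also show up under a differently cased or unprefixed key (Expires vs metatag.expires, say) would need the keys normalised first.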

