> stoopid question, but I can't find any info on it... can we now parse Open
> Graph metatags?

Yes, parse-tika extracts og:* meta tags, and index-metadata can pass them on to the index:

% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
    http://ogp.me/
...
Parse Metadata: og:image=http://ogp.me/logo.png og:type=website og:image:width=300
  og:image:alt=The Open Graph logo og:title=Open Graph protocol ...

% bin/nutch indexchecker -Dindex.parse.md=og:image,og:title,og:description \
    -Dplugin.includes='protocol-http|parse-tika|index-metadata' http://ogp.me/
...
og:image :      http://ogp.me/logo.png
og:title :      Open Graph protocol
digest :        f98d6d5e5894ef83561630ebef3bf060
id :    http://ogp.me/
og:description :        The Open Graph protocol enables any web page to become a rich object in a social graph.
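
If you'd rather not pass the -D options on every call, the same two properties could go into conf/nutch-site.xml. This is only a minimal sketch: the plugin list is the stripped-down one from the checker commands above, so for a real crawl you would presumably add index-metadata to your existing plugin.includes value rather than replace it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: minimal plugin set copied from the checker commands.
       A real crawl needs more plugins (URL filters, indexer, scoring, ...),
       so extend your existing value instead of overwriting it. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-tika|index-metadata</value>
  </property>

  <!-- Parse metadata keys that index-metadata copies into the index document. -->
  <property>
    <name>index.parse.md</name>
    <value>og:image,og:title,og:description</value>
  </property>
</configuration>
```

With that in place, indexchecker should report the same og:* fields without the extra -D arguments.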


On 06/11/2018 11:44 PM, BlackIce wrote:
> +1
> 
> stoopid question, but I can't find any info on it... can we now parse Open
> Graph metatags?
> 
> Greetz
> 
> On Mon, Jun 11, 2018 at 9:11 PM Roannel Fernández Hernández <roan...@uci.cu>
> wrote:
> 
>> +1
>>
>> Regards
>>
>> ----- Chris Mattmann <mattm...@apache.org> wrote:
>>> ++1!
>>>
>>> Sounds great.
>>>
>>> Cheers,
>>> Chris
>>>
>>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>>> Reply-To: "d...@nutch.apache.org" <d...@nutch.apache.org>
>>> Date: Monday, June 11, 2018 at 7:35 AM
>>> To: "user@nutch.apache.org" <user@nutch.apache.org>
>>> Cc: "d...@nutch.apache.org" <d...@nutch.apache.org>
>>> Subject: Preparing to release Nutch 1.15 ?
>>>
>>> Hi all,
>>>
>>> almost 80 fixes and improvements are done now and include:
>>>
>>> NUTCH-2375: upgrade to new mapreduce API
>>>   It was a huge change affecting more than 10,000 lines of code. Thanks, Omkar!
>>>   Well, there have been some regressions but those are resolved now. Tests in
>>>   pseudo-distributed mode [1] succeeded and also a mid-size test crawl (180
>>>   million pages) on a Hadoop cluster.
>>>   Would be great if anybody is able to test the Nutch master in combination with
>>>   a non-HDFS file system (e.g. s3://)! Please let us know whether this works. Thanks!
>>>
>>> NUTCH-1480: Multiple index writer instances with different configurations
>>>   Thanks to Roannel it's now possible to index into multiple Solr or Elasticsearch
>>>   instances. With NUTCH- (needs to be reviewed) also the routing of documents
>>>   to the index will be configurable.
>>>
>>> NUTCH-2583: Ralf contributed a huge upgrade of dependencies.
>>>    Nutch now runs and compiles on Java 9 + 10. Only errors in unit tests need
>>>    to be addressed in NUTCH-2596.
>>>
>>> And two important issues are almost ready to be committed:
>>>
>>> NUTCH-2549: a long list of fixes and improvements to protocol-http. Thanks to
>>>    Gerard Bouchard!
>>>
>>> NUTCH-2576: plugin protocol-okhttp, a new HTTP protocol implementation based
>>>    on the okhttp library. Supports HTTP/2.
>>>
>>> The full list of fixes and improvements is available at [2].
>>>
>>> I plan to work through the remaining 70 open issues over the next
>>> few days and hope to commit/resolve 15-25 of them and move the remaining
>>> ones to Nutch 1.16.
>>>
>>> Please vote for issues you want to get included. If there are open
>>> pull requests, it will help if these can be merged, the unit tests
>>> pass, and any review comments are addressed. Thanks!
>>>
>>> If there are any objections or blockers, please also let us know!
>>>
>>> I also plan to run a test crawl on Hadoop in the middle of this week.
>>> But any help in testing is welcome.
>>>
>>> Note that the tutorial needs to be updated (will be done after 1.15
>>> is finally released) to reflect the changes related to NUTCH-1480.
>>>
>>> Thanks,
>>> Sebastian
>>>
>>> [1] https://github.com/sebastian-nagel/nutch-test-single-node-cluster
>>> [2] https://issues.apache.org/jira/projects/NUTCH/versions/12342302
>>>
>>
> 
