Yes, of course, defining the properties in nutch-site.xml (but not "site.xml")
also works. It's the usual hierarchy:
 bin/nutch command -Dkey=value ...
  overrides the property in nutch-site.xml
     (must be on the classpath: runtime/local/conf, or inside the nutch.job)
   which overrides the definition in nutch-default.xml
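
As a rough sketch, the two -D options from the indexchecker call quoted
below could be put into nutch-site.xml like this. The property names and
values are copied from that command. Note that plugin.includes set this way
replaces the default value from nutch-default.xml, so a real crawl would need
to list its other plugins as well; the short value here is for illustration
only.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- parse metadata fields to be added to the index by index-metadata -->
  <property>
    <name>index.parse.md</name>
    <value>og:image,og:title,og:description</value>
  </property>
  <!-- illustration only: this value replaces the default plugin list -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-tika|index-metadata</value>
  </property>
</configuration>
```

A -Dindex.parse.md=... given on the command line would still take precedence
over the value in this file.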

On 06/12/2018 02:26 PM, BlackIce wrote:
> PS: Does this work when configured in site.xml like regular metadata?
> 
> On Tue, Jun 12, 2018 at 1:31 PM BlackIce <[email protected]> wrote:
> 
>> sweet thnx!
>>
>> On Tue, Jun 12, 2018 at 1:29 PM Sebastian Nagel <
>> [email protected]> wrote:
>>
>>>> stoopid question, but I can't find any info on it... can we now parse
>>>> Open Graph metatags?
>>>
>>> parse-tika extracts og:* metatags
>>>
>>> % bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
>>>     http://ogp.me/
>>> ...
>>> Parse Metadata: og:image=http://ogp.me/logo.png og:type=website
>>> og:image:width=300
>>>   og:image:alt=The Open Graph logo og:title=Open Graph protocol ...
>>>
>>> % bin/nutch indexchecker -Dindex.parse.md=og:image,og:title,og:description \
>>>     -Dplugin.includes='protocol-http|parse-tika|index-metadata' \
>>>     http://ogp.me/
>>> ...
>>> og:image :      http://ogp.me/logo.png
>>> og:title :      Open Graph protocol
>>> digest :        f98d6d5e5894ef83561630ebef3bf060
>>> id :    http://ogp.me/
>>> og:description :        The Open Graph protocol enables any web page to become a rich object in a social graph.
>>>
>>>
>>> On 06/11/2018 11:44 PM, BlackIce wrote:
>>>> +1
>>>>
>>>> stoopid question, but I can't find any info on it... can we now parse
>>>> Open Graph metatags?
>>>>
>>>> Greetz
>>>>
>>>> On Mon, Jun 11, 2018 at 9:11 PM Roannel Fernández Hernández <
>>> [email protected]>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Regards
>>>>>
>>>>> ----- Chris Mattmann <[email protected]> wrote:
>>>>>> ++1!
>>>>>>
>>>>>> Sounds great.
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> From: Sebastian Nagel <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Monday, June 11, 2018 at 7:35 AM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>> Subject: Preparing to release Nutch 1.15 ?
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> almost 80 fixes and improvements are done now and include:
>>>>>>
>>>>>> NUTCH-2375: upgrade to new mapreduce API
>>>>>>   It was a huge change affecting more than 10,000 lines of code. Thanks,
>>>>>>   Omkar! Well, there have been some regressions but those are resolved now.
>>>>>>   Tests in pseudo-distributed mode [1] succeeded and also a mid-size test
>>>>>>   crawl (180 million pages) on a Hadoop cluster.
>>>>>>   Would be great if anybody is able to test the Nutch master in combination
>>>>>>   with a non-HDFS file system (e.g. s3://)! Please let us know whether this
>>>>>>   works. Thanks!
>>>>>>
>>>>>> NUTCH-1480: Multiple index writer instances with different configurations
>>>>>>   Thanks to Roannel it's now possible to index into multiple Solr or
>>>>>>   Elasticsearch instances. With NUTCH- (needs to be reviewed) the routing
>>>>>>   of documents to the index will also be configurable.
>>>>>>
>>>>>> NUTCH-2583: Ralf contributed a huge upgrade of dependencies.
>>>>>>   Nutch now runs and compiles on Java 9 + 10. Only errors in unit tests
>>>>>>   need to be addressed in NUTCH-2596.
>>>>>>
>>>>>> And two important issues are almost ready to be committed:
>>>>>>
>>>>>> NUTCH-2549: a long list of fixes and improvements to protocol-http.
>>>>>>   Thanks to Gerard Bouchard!
>>>>>>
>>>>>> NUTCH-2576: plugin protocol-okhttp, a new HTTP protocol implementation
>>>>>>   based on the okhttp library. Supports HTTP/2.
>>>>>>
>>>>>> The full list of fixes and improvements is available at [2].
>>>>>>
>>>>>> I plan to work through the remaining 70 open issues over the next few
>>>>>> days and hope to commit/resolve 15-25 of them and move the remaining
>>>>>> ones to Nutch 1.16.
>>>>>>
>>>>>> Please vote for issues you want to get included. If there are open
>>>>>> pull requests, it will help if these can be merged, the unit tests
>>>>>> pass, and any review comments are addressed. Thanks!
>>>>>>
>>>>>> If there are any objections or blockers, please also let us know!
>>>>>>
>>>>>> I also plan to run a test crawl on Hadoop in the middle of this week.
>>>>>> But any help in testing is welcome.
>>>>>>
>>>>>> Note that the tutorial needs to be updated (will be done after 1.15
>>>>>> is finally released) to reflect the changes related to NUTCH-1480.
>>>>>>
>>>>>> Thanks,
>>>>>> Sebastian
>>>>>>
>>>>>> [1] https://github.com/sebastian-nagel/nutch-test-single-node-cluster
>>>>>> [2] https://issues.apache.org/jira/projects/NUTCH/versions/12342302
>>>>>>
>>>>>
>>>>
>>>
>>>
> 
