You need to edit ${NUTCH_HOME}/conf/nutch-site.xml. The other one
(${NUTCH_HOME}/runtime/local/conf/nutch-site.xml) is a copy of it that
gets created at build time (the whole runtime directory is created at
build time). However, if you are running Nutch in local mode, the conf
file that gets read at runtime is the one in runtime/local/conf/. So if
you just want to change something and test it right away without
building the project again, modify the one in the runtime/local
directory.
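A quick way to see whether the two copies have drifted apart is to compare them. This is only a minimal sketch: the helper names config_paths and runtime_copy_in_sync are made up for illustration, and it assumes NUTCH_HOME points at a source checkout laid out as described above.

```python
import filecmp
import os
from pathlib import Path


def config_paths(nutch_home):
    """Return (source, runtime_copy) paths for nutch-site.xml (layout assumed)."""
    home = Path(nutch_home)
    source = home / "conf" / "nutch-site.xml"
    runtime_copy = home / "runtime" / "local" / "conf" / "nutch-site.xml"
    return source, runtime_copy


def runtime_copy_in_sync(nutch_home):
    """True only if both files exist and have identical contents."""
    source, runtime_copy = config_paths(nutch_home)
    if not (source.is_file() and runtime_copy.is_file()):
        return False
    # shallow=False forces a content comparison instead of trusting stat data.
    return filecmp.cmp(source, runtime_copy, shallow=False)


if __name__ == "__main__":
    home = os.environ.get("NUTCH_HOME", ".")
    if runtime_copy_in_sync(home):
        print("runtime/local/conf/nutch-site.xml matches conf/nutch-site.xml")
    else:
        print("copies differ or are missing: rebuild with 'ant runtime' "
              "or check which file you edited")
```

If the check reports a difference after you edited conf/nutch-site.xml, the runtime copy has probably not been rebuilt yet.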
On 04/02/2013 09:38 PM, Yves S. Garret wrote:
Hello, I'm using 2.x. By nutch-site.xml, do you mean
${NUTCH_HOME}/conf/nutch-site.xml or
${NUTCH_HOME}/runtime/local/conf/nutch-site.xml?
There are two of them and I'm not sure which one is meant to be modified.
On Tue, Apr 2, 2013 at 11:58 PM, Tejas Patil <[email protected]>wrote:
Hi Yves,
I am able to crawl that URL at my end using Nutch 1.x trunk. The
configuration file to modify is nutch-site.xml, located in the "conf"
directory of the installation you are running Nutch from. You need to
place the property (that I mentioned in my earlier email) in that file.
If you are using 2.x, you need to modify the config file as above,
rebuild Nutch using the 'ant runtime' command and then run a crawl.
Thanks,
Tejas
On Tue, Apr 2, 2013 at 6:58 PM, Yves S. Garret <[email protected]> wrote:
Wait, one more question, which specific nutch-site.xml file should I
modify? I know that there are two conf files, so I'd like to confirm
that I'm editing the correct one.
On Tue, Apr 2, 2013 at 8:21 PM, Alvaro Cabrerizo <[email protected]> wrote:
Hello:
I have had no problem indexing the page
https://plus.google.com/+projectglass (I made a test using this single
URL) following the instructions given by Tejas. I just changed my
nutch-site.xml, removing protocol-http and adding protocol-httpclient
within the plugin.includes property.
Check in your log file (default logs/hadoop.log) whether the httpclient
plugin is loaded:
...
2013-04-03 02:03:19,489 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
...
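The log check can also be scripted. The following is a rough sketch only: plugin_logged is a hypothetical helper, and it assumes the PluginRepository line has the same shape as the excerpt quoted here (plugin id in parentheses at the end of the line).

```python
from pathlib import Path


def plugin_logged(log_text, plugin_id="protocol-httpclient"):
    """Return PluginRepository log lines that mention plugin_id in parentheses."""
    marker = f"({plugin_id})"
    return [
        line.strip()
        for line in log_text.splitlines()
        if "plugin.PluginRepository" in line and marker in line
    ]


if __name__ == "__main__":
    log_file = Path("logs/hadoop.log")  # default location mentioned in this thread
    if log_file.is_file():
        hits = plugin_logged(log_file.read_text(errors="replace"))
        print("\n".join(hits) if hits else "protocol-httpclient not found in log")
    else:
        print(f"{log_file} not found; run a crawl first")
```

An empty result suggests the plugin was not picked up from the configuration that Nutch actually read.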
Regards
On Wed, Apr 3, 2013 at 1:30 AM, Yves S. Garret <[email protected]> wrote:
I don't think anything has changed:
http://bin.cakephp.org/view/1498883341
This is my ${NUTCH_HOME}/conf/nutch-site.xml:
http://bin.cakephp.org/view/883727476
On Tue, Apr 2, 2013 at 6:51 PM, Tejas Patil <[email protected]> wrote:
It looks like the required plugin isn't present in the configuration
file [0]. Can you try adding "protocol-httpclient" to the
plugin.includes property in conf/nutch-site.xml? Like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
Let us know if that worked.
[0]
https://wiki.apache.org/nutch/HttpAuthenticationSchemes#Necessity
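To sanity-check a plugin.includes value before running a crawl, the property can be read from the XML and tested against plugin names. This is only a sketch under assumptions: plugin_includes and is_included are illustrative helpers, the sample is wrapped in a <configuration> root, and it assumes the whole plugin directory name must match the expression (Python's re is standing in for Java's regex engine, which behaves the same for a simple pattern like this one).

```python
import re
import xml.etree.ElementTree as ET

# Sample config using the plugin.includes value quoted in this thread.
SAMPLE = """<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>"""


def plugin_includes(xml_text):
    """Return the plugin.includes value from a nutch-site.xml string, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def is_included(pattern, plugin_id):
    """True if the whole plugin id matches the expression (assumed semantics)."""
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    pattern = plugin_includes(SAMPLE)
    for plugin_id in ("protocol-http", "protocol-httpclient", "protocol-file"):
        print(plugin_id, is_included(pattern, plugin_id))
```

With the value above, both protocol-http and protocol-httpclient match, which is what the https fetch needs.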
On Tue, Apr 2, 2013 at 3:27 PM, Yves S. Garret <[email protected]> wrote:
Hi all, just tried crawling this site
[ https://plus.google.com/+projectglass ]:
This is the output that nutch is showing me:
http://bin.cakephp.org/view/1701305116
It seems to be erroring out when it gets to here:
fetching https://plus.google.com/+projectglass
Unexpected error for https://plus.google.com/+projectglass
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:83)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:490)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
Why is this happening?