Re: Nutch 2.3.1 elasticsearch tstamp

2016-10-21 Thread lewis john mcgibbney
Hi Joe,

On Fri, Oct 21, 2016 at 7:34 AM,  wrote:

> From: Joe Adams 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 21 Oct 2016 10:34:15 -0400
> Subject: Nutch 2.3.1 elasticsearch tstamp
> I'm working on setting up nutch with elasticsearch and hbase to crawl a
> site and provide a dashboard in kibana for reporting. I have the
> interactions working between the components. I can crawl the site, hbase
> shows all the data, and I can index into elasticsearch. The problem is that
> the tstamp field in elasticsearch shows 1970-01-01T00:00:00.000Z and not
> data related to the fetched time of the page. I also tried adding the
> index-more plugin and that seems to add a 'date' field but this also shows
> up as epoch.
>
> I can't find much searching around the internet. The only thing I can find
> closely related is https://issues.apache.org/jira/browse/NUTCH-2045, but
> that was fixed in 2.3.1 which is the version I'm running.
>

If you are running Nutch2, my suggestion is to use the current development
branch, which is available at https://github.com/apache/nutch/tree/2.x. We
are always fixing bugs there, and it gives others using this branch a better
chance of reproducing your issue. It also lets you upgrade to ES 2.X via the
indexer-elastic2 plugin:
https://github.com/apache/nutch/tree/2.x/src/plugin/indexer-elastic2


>
> Does anyone have any idea why my dates aren't being set properly in my
> elasticsearch index?


Not yet but I will scope it out.


> The data looks good if I run readdb -url $url.


Thanks for this info.


> Can
> anyone provide some good advice to troubleshoot this further?
>

Not right now, but can you please log an issue over at Jira and also link
it to NUTCH-2045? This would help us to track it and fix it with a test if
there is definitely a bug.
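
In the meantime, here is one thing worth checking, as a minimal sketch in
Python (not part of Nutch; the function names are made up for illustration).
A value of exactly 1970-01-01T00:00:00.000Z is what a millisecond timestamp
of 0 looks like when rendered as a date, which would suggest the field is
reaching the indexer unset rather than being parsed incorrectly.

```python
from datetime import datetime, timezone


def millis_to_iso(millis: int) -> str:
    """Render epoch milliseconds as an ISO-8601 UTC string with millisecond precision."""
    # Integer division keeps the seconds part exact; the remainder is the millisecond part.
    dt = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis % 1000:03d}Z"


def looks_unset(iso_tstamp: str) -> bool:
    """True if the indexed value is exactly the rendering of timestamp 0."""
    return iso_tstamp == millis_to_iso(0)


print(millis_to_iso(0))                         # 1970-01-01T00:00:00.000Z
print(looks_unset("1970-01-01T00:00:00.000Z"))  # True
```

If every document in the index matches looks_unset, that detail would be
useful to include in the Jira issue.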


>
> Any help would be appreciated.
>
>
> Versions:
> Nutch 2.3.1
> Elasticsearch 1.7.5
> Gora: 0.6.1
> Hbase: 1.2.3
>

Please note that the supported version of HBase in Nutch 2.3.1 is
0.98.8-hadoop2. I can say with near certainty that the HBase support in
Nutch 2.3.1 will not be compatible with HBase 1.2.3.


>
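> <property>
> <name>fetcher.server.delay</name>
> <value>.1</value>
> <description>Delay between page fetches.</description>
> </property>
>
> <property>
> <name>fetcher.server.min.delay</name>
> <value>.1</value>
> </property>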
>

With delays this small you may find that servers deny you access, e.g. by
blocking your IP. This is just a friendly warning!
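
To put a rough number on that warning, here is a back-of-the-envelope sketch
in Python. It assumes successive requests to one server are spaced exactly
the configured delay apart, which is a simplification: the fetcher's real
behaviour also depends on its thread settings and on response times. The
function names are illustrative only.

```python
def max_requests_per_second(delay_seconds: float) -> float:
    """Upper bound on requests per second to one host for a given delay."""
    if delay_seconds <= 0:
        raise ValueError("delay must be positive")
    return 1.0 / delay_seconds


def max_requests_per_hour(delay_seconds: float) -> float:
    """The same bound expressed per hour."""
    return max_requests_per_second(delay_seconds) * 3600


# The value from the configuration quoted above.
print(max_requests_per_second(0.1))
print(max_requests_per_hour(0.1))
```

A delay of .1 works out to roughly ten requests per second against a single
host, which many sites will treat as abusive.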

Please log the issue in Jira and I will try to reproduce.
Thanks
Lewis


Re: I think my hbase is broken

2016-10-21 Thread lewis john mcgibbney
Hi Tom,
Please post your entire Nutch log for the inject and generate phase if
possible. It is near impossible to debug given the information you've
provided.
Thanks

On Fri, Oct 21, 2016 at 7:34 AM,  wrote:

> From: Tom Chiverton 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Thu, 20 Oct 2016 12:59:20 +0100
> Subject: I think my hbase is broken
>
> I'm using hbase with Nutch 2.3.1 and getting errors from the GeneratorJob
> step :
>
>
> GeneratorJob: java.io.IOException: Expecting at least one region.
> at org.apache.gora.hbase.store.HBaseStore.getPartitions(
> HBaseStore.java:398)
> at org.apache.gora.mapreduce.GoraInputFormat.getSplits(
> GoraInputFormat.java:94)
>
>
> I think this means hbase needs its gora-based schema reapplying or
> something? How does one do that? Using a fresh hbase install doesn't seem
> to have helped.
>
>


Nutch 2.3.1 elasticsearch tstamp

2016-10-21 Thread Joe Adams
I'm working on setting up nutch with elasticsearch and hbase to crawl a
site and provide a dashboard in kibana for reporting. I have the
interactions working between the components. I can crawl the site, hbase
shows all the data, and I can index into elasticsearch. The problem is that
the tstamp field in elasticsearch shows 1970-01-01T00:00:00.000Z and not
data related to the fetched time of the page. I also tried adding the
index-more plugin and that seems to add a 'date' field but this also shows
up as epoch.

I can't find much searching around the internet. The only thing I can find
closely related is https://issues.apache.org/jira/browse/NUTCH-2045, but
that was fixed in 2.3.1 which is the version I'm running.

Does anyone have any idea why my dates aren't being set properly in my
elasticsearch index? The data looks good if I run readdb -url $url. Can
anyone provide some good advice to troubleshoot this further?

Any help would be appreciated.


Versions:
Nutch 2.3.1
Elasticsearch 1.7.5
Gora: 0.6.1
Hbase: 1.2.3

Configuration:
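# nutch-site.xml
<configuration>

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

<property>
<name>http.agent.name</name>
<value>nutch</value>
</property>

<property>
<name>fetcher.server.delay</name>
<value>.1</value>
<description>Delay between page fetches.</description>
</property>

<property>
<name>fetcher.server.min.delay</name>
<value>.1</value>
</property>

<property>
<name>fetcher.threads.per.queue</name>
<value>10</value>
</property>

<property>
<name>http.content.limit</name>
<value>-1</value>
</property>

<property>
   <name>generate.update.crawldb</name>
   <value>true</value>
</property>

<property>
<name>elastic.host</name>
<value>elasticsearch.example.com</value>
</property>

<property>
<name>elastic.cluster</name>
<value>nutch</value>
</property>

<property>
<name>elastic.index</name>
<value>nutch</value>
</property>

<property>
<name>elastic.max.bulk.size</name>
<value>2500500</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
</property>

<property>
   <name>metatags.names</name>
   <value>*</value>
</property>

<property>
   <name>index.metadata</name>
   <value>description,keywords,robots</value>
</property>

<property>
   <name>db.parsemeta.to.crawldb</name>
   <value>description,keywords,robots</value>
</property>

</configuration>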
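
For anyone wanting to double-check which values a file like this actually
sets, here is a small sketch in Python using only the standard library. It
assumes the usual Hadoop-style layout of property entries with name and value
children inside a configuration root. The sample string is a shortened
stand-in for the file above, and load_properties is a made-up helper name. It
reads a single document only and does not merge in defaults from
nutch-default.xml.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for nutch-site.xml, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|index-(basic|anchor|more|metadata)|indexer-elastic</value>
  </property>
</configuration>"""


def load_properties(xml_text: str) -> dict:
    """Return a {name: value} dict for every property element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value  # a later duplicate overrides an earlier one
    return props


props = load_properties(SAMPLE)
print(props["elastic.index"])
print("indexer-elastic" in props["plugin.includes"])
```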