Re: Nutch 2.3.1 elasticsearch tstamp
Hi Joe,

On Fri, Oct 21, 2016 at 7:34 AM, wrote:

> From: Joe Adams
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 21 Oct 2016 10:34:15 -0400
> Subject: Nutch 2.3.1 elasticsearch tstamp
>
> I'm working on setting up nutch with elasticsearch and hbase to crawl a
> site and provide a dashboard in kibana for reporting. I have the
> interactions working between the components. I can crawl the site, hbase
> shows all the data, and I can index into elasticsearch. The problem is that
> the tstamp field in elasticsearch shows 1970-01-01T00:00:00.000Z and not
> data related to the fetched time of the page. I also tried adding the
> index-more plugin and that seems to add a 'date' field but this also shows
> up as epoch.
>
> I can't find much searching around the internet. The only thing I can find
> closely related is https://issues.apache.org/jira/browse/NUTCH-2045, but
> that was fixed in 2.3.1 which is the version I'm running.

My suggestion would be that, if you are running Nutch2, you use the current development branch, which is available at https://github.com/apache/nutch/tree/2.x. I say this because we are always fixing bugs there, and it will give others using this branch a better chance of reproducing your issue. Additionally, this will enable you to upgrade to ES 2.X as per the indexer-elastic2 plugin: https://github.com/apache/nutch/tree/2.x/src/plugin/indexer-elastic2

> Does anyone have any idea why my dates aren't being set properly in my
> elasticsearch index?

Not yet, but I will scope it out.

> The data looks good if I run readdb -url $url.

Thanks for this info.

> Can anyone provide some good advice to troubleshoot this further?

Not right now, but can you please log an issue over at Jira and also link it to NUTCH-2045? This would help us to track it and, if there is definitely a bug, fix it with a test.

> Any help would be appreciated.
>
> Versions:
> Nutch 2.3.1
> Elasticsearch 1.7.5
> Gora: 0.6.1
> Hbase: 1.2.3

Please note that the supported version of HBase in Nutch 2.3.1 is 0.98.8-hadoop2. I am fairly certain that Nutch 2.3.1 will not be compatible with HBase 1.2.3.

> <property>
>   <name>fetcher.server.delay</name>
>   <value>.1</value>
>   <description>Delay between page fetches.</description>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>.1</value>
> </property>

You may find that you are denied access at such small delay values, e.g. servers may block your IP. This is just a friendly warning!

Please log the issue in Jira and I will try to reproduce.

Thanks
Lewis
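
As a troubleshooting aid for the tstamp question above, here is a minimal sketch (not Nutch code; the helper name millis_to_iso is made up for illustration) showing which epoch-millisecond values render as which ISO strings. A value of exactly 1970-01-01T00:00:00.000Z corresponds to a timestamp of 0, i.e. most likely an unset or defaulted field. If a seconds value were being misread as milliseconds, the date would instead land a few weeks into January 1970. The sample value 1477060455 is simply the Date header of the original mail expressed in epoch seconds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_iso(ms: int) -> str:
    """Render epoch milliseconds in the ISO form seen in the index."""
    dt = EPOCH + timedelta(milliseconds=ms)
    # strftime has no millisecond directive, so build the fraction by hand.
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


# 0 ms: what an unset/zero timestamp looks like.
print(millis_to_iso(0))               # 1970-01-01T00:00:00.000Z
# Epoch *seconds* misread as milliseconds: mid-January 1970, not exactly epoch.
print(millis_to_iso(1477060455))      # 1970-01-18T02:17:40.455Z
# The same instant expressed correctly in milliseconds.
print(millis_to_iso(1477060455000))   # 2016-10-21T14:34:15.000Z
```

If the value stored in Elasticsearch is exactly the first form, that points toward a zero timestamp being handed to the indexer rather than a unit or format mismatch, though this alone does not show where the zero comes from.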
Re: I think my hbase is broken
Hi Tom,

Please post your entire Nutch log for the inject and generate phases if possible. It is nearly impossible to debug given the information you've provided.

Thanks

On Fri, Oct 21, 2016 at 7:34 AM, wrote:

> From: Tom Chiverton
> To: "user@nutch.apache.org"
> Cc:
> Date: Thu, 20 Oct 2016 12:59:20 +0100
> Subject: I think my hbase is broken
>
> I'm using hbase with Nutch 2.3.1 and getting errors from the GeneratorJob
> step:
>
> GeneratorJob: java.io.IOException: Expecting at least one region.
>     at org.apache.gora.hbase.store.HBaseStore.getPartitions(HBaseStore.java:398)
>     at org.apache.gora.mapreduce.GoraInputFormat.getSplits(GoraInputFormat.java:94)
>
> I think this means hbase needs its gora-based schema reapplying or
> something? How does one do that? Using a fresh hbase install doesn't seem
> to have helped.
Nutch 2.3.1 elasticsearch tstamp
I'm working on setting up nutch with elasticsearch and hbase to crawl a site and provide a dashboard in kibana for reporting. I have the interactions working between the components. I can crawl the site, hbase shows all the data, and I can index into elasticsearch. The problem is that the tstamp field in elasticsearch shows 1970-01-01T00:00:00.000Z and not data related to the fetched time of the page. I also tried adding the index-more plugin and that seems to add a 'date' field but this also shows up as epoch.

I can't find much searching around the internet. The only thing I can find closely related is https://issues.apache.org/jira/browse/NUTCH-2045, but that was fixed in 2.3.1 which is the version I'm running.

Does anyone have any idea why my dates aren't being set properly in my elasticsearch index? The data looks good if I run readdb -url $url. Can anyone provide some good advice to troubleshoot this further?

Any help would be appreciated.

Versions:
Nutch 2.3.1
Elasticsearch 1.7.5
Gora: 0.6.1
Hbase: 1.2.3

Configuration:

# nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>nutch</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>.1</value>
  <description>Delay between page fetches.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>.1</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
</property>
<property>
  <name>elastic.host</name>
  <value>elasticsearch.example.com</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>nutch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
<property>
  <name>elastic.max.bulk.size</name>
  <value>2500500</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic</value>
</property>
<property>
  <name>metatags.names</name>
  <value>*</value>
</property>
<property>
  <name>index.metadata</name>
  <value>description,keywords,robots</value>
</property>
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>description,keywords,robots</value>
</property>
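
Since plugin.includes is a regular expression, a quick way to sanity-check which plugins the value above would enable is to test plugin ids against it. The sketch below is not Nutch code. It assumes that Nutch matches each plugin id against the whole expression (the way Java's Matcher.matches() behaves), and the helper name is_included is made up for illustration.

```python
import re

# plugin.includes value copied from the nutch-site.xml in this message.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags)"
    "|index-(basic|anchor|more|metadata)"
    "|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-elastic"
)


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """True if the whole plugin id matches the includes expression.

    Assumption: full-string matching, not substring search.
    """
    return re.fullmatch(includes, plugin_id) is not None


for plugin_id in ("index-more", "indexer-elastic", "indexer-elastic2"):
    print(plugin_id, is_included(plugin_id))
```

Under that assumption, index-more and indexer-elastic are enabled by this value, while a differently named indexer plugin such as indexer-elastic2 would need its own entry in plugin.includes.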