RE: readseg dump and non-ASCII characters

2017-12-14 Thread Yossi Tamari
Hi Michael, Not directly answering this question, but keep in mind that as mentioned in the issue Sebastian referenced, there are many more places in Nutch that have the same problem, so setting LC_ALL is probably a good idea in general (until that issue is fixed...). If you're worried about

Re: readseg dump and non-ASCII characters

2017-12-14 Thread Michael Coffey
Not sure it's practical to go around to all the Hadoop machines and change their default encoding settings. Not sure it wouldn't break something else! I'm wondering if there's a simple fix I could make to the source code to make nutch.segment.SegmentReader use UTF-8 as a default when reading
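
The general idea behind such a fix can be sketched in plain Java: pass an explicit charset to the reader instead of relying on the JVM's platform default. This is only an illustration under assumptions, not the actual SegmentReader code; the class `Utf8ReadSketch` and the helper `readAllUtf8` are hypothetical names invented for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of Nutch.
class Utf8ReadSketch {

    // Decodes the whole stream as UTF-8, regardless of the platform default
    // charset (which is what LC_ALL / file.encoding would otherwise control).
    static String readAllUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file itself pure ASCII.
        String original = "Z\u00fcrich caf\u00e9";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        String decoded = readAllUtf8(new ByteArrayInputStream(bytes));
        System.out.println(decoded.equals(original));
    }
}
```

The same principle applies on the writing side: a writer created without an explicit charset also falls back to the platform default.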

Usage previous stage HostDb data for generate(fetched deltas)

2017-12-14 Thread Semyon Semyonov
Dear all, I plan to improve the hostdb functionality to have a DB_FETCHED delta for the generate stage. Let's say that for each website the generate condition is that the number of fetched is < 150. The problem is that for some websites that condition will (almost) never be met, because of their structure.
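
One possible reading of the delta idea, as a rough Java sketch: compare a host's fetched count from the previous round with the current one, and stop generating when either the limit is reached or the count has stopped growing. The stall rule, the class `FetchedDeltaSketch`, and both method names are assumptions made for illustration; only the limit of 150 comes from the message, and none of this is existing Nutch API.

```java
// Hypothetical sketch; not part of Nutch's HostDb or generator.
class FetchedDeltaSketch {

    // Threshold taken from the example in the message.
    static final int FETCHED_LIMIT = 150;

    // Change in the host's fetched count between two consecutive rounds.
    static int fetchedDelta(int previousFetched, int currentFetched) {
        return currentFetched - previousFetched;
    }

    // Assumed rule: keep generating only while the host is below the limit
    // AND the last round actually fetched something new.
    static boolean shouldGenerate(int previousFetched, int currentFetched) {
        if (currentFetched >= FETCHED_LIMIT) {
            return false; // limit reached
        }
        return fetchedDelta(previousFetched, currentFetched) > 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldGenerate(40, 90));   // still progressing
        System.out.println(shouldGenerate(90, 90));   // stalled below the limit
        System.out.println(shouldGenerate(120, 150)); // limit reached
    }
}
```

A host with no previous round has a delta of zero under this rule, so a real implementation would need to treat the first round separately.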

Re: crawlcomplete

2017-12-14 Thread Semyon Semyonov
The third question can be: 1) Now we have hostdb that stores all statistics per host. You can read/write to the database. Does it make sense to have both for the reporting? Sent: Monday, December 04, 2017 at 7:47 PM From: "Yossi Tamari" To: user@nutch.apache.org