RE: readseg dump and non-ASCII characters

2017-12-14 Thread Yossi Tamari
and OutputStreamWriter constructors, should that work? Is it likely to break something else? From: Sebastian Nagel <wastl.na...@googlemail.com> To: user@nutch.apache.org Sent: Wednesday,
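
For reference, a minimal Java sketch of the idea raised in this message: passing an explicit charset to an OutputStreamWriter constructor, so the bytes written do not depend on the JVM default charset (the file.encoding property). The class and method names are hypothetical, and this is not taken from the Nutch source.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
public class ExplicitCharsetWriter {

    // Encodes the text as UTF-8 no matter what the JVM default charset is.
    // The one-argument constructor OutputStreamWriter(OutputStream) would
    // fall back to the default charset instead.
    static byte[] writeUtf8(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // U+00E9 (e with acute accent) takes two bytes in UTF-8: 0xC3 0xA9.
        byte[] encoded = writeUtf8("\u00e9");
        System.out.println(encoded.length);
    }
}
```

Whether such a change would be safe inside the readseg dump code is the open question of the thread; the sketch only shows that the constructor argument, not the platform default, decides the output encoding.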

Re: readseg dump and non-ASCII characters

2017-12-14 Thread Michael Coffey
dnesday, November 15, 2017 5:18 AM Subject: Re: readseg dump and non-ASCII characters Hi Michael, from the arguments I guess you're interested in the raw/binary HTML content, right? After a closer look I have no simple answer: 1. HTML has no fixed encoding - it could be anything, pageA may have a

Re: readseg dump and non-ASCII characters

2017-11-15 Thread Michael Coffey
all nodes in the cluster? Would it work just as well, or better, to use "-Dfile.encoding=UTF8" in the bin/nutch command? From: Sebastian Nagel <wastl.na...@googlemail.com> To: user@nutch.apache.org Sent: Wednesday, November 15, 2017 5:18 AM Subject: Re: readseg dump and non-ASCII

Re: readseg dump and non-ASCII characters

2017-11-15 Thread Sebastian Nagel
Hi Michael, from the arguments I guess you're interested in the raw/binary HTML content, right? After a closer look I have no simple answer: 1. HTML has no fixed encoding - it could be anything, pageA may have a different encoding than pageB. 2. That's different for parsed text: it's a