Hey Eddie,
If I force-encode the data before putting it into the JCas:
    String content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
    jcas.setDocumentText(content2);
and then force-decode it in my first annotator:
    String content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"), "UTF-8");
the Arabic text comes through without any problem. But I have many analysis
engines in my aggregate, and I can't hardcode this snippet everywhere.
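As a rough sketch of how to avoid copying that snippet into every annotator, the
two conversions could live in one small shared helper. The class and method
names below (CasTextCodec, pack, unpack) are made up for illustration and are
not part of UIMA or DUCC. This only centralizes the workaround; it does not fix
the underlying cause.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
final class CasTextCodec {
    private CasTextCodec() {}

    // Re-express the UTF-8 bytes of `text` as one char per byte (all <= U+00FF),
    // so the result survives transports that only handle Latin-1 safely.
    static String pack(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse of pack(): rebuild the original text from the packed form.
    static String unpack(String packed) {
        return new String(packed.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

The reader would then call jcas.setDocumentText(CasTextCodec.pack(content)) and
each annotator CasTextCodec.unpack(jcas.getDocumentText()). One caveat: the
packed text has one char per UTF-8 byte, so annotation offsets computed on the
unpacked text will not line up with the document text stored in the CAS.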
Maybe there is a problem with the character encoding of the CAS that is sent
from the collection reader to the analysis engine. I was thinking that if I
could find out which encoding the CAS uses, I could encode the content to match
and it might work fine.
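As far as I understand, in the Java SDK the document text is held as a plain
Java String, so the CAS itself has no encoding to query; a charset only comes
into play when the text is turned into bytes. A small diagnostic like the one
below, run on both the JD and JP machines, might show whether the JVM default
charset there can represent Arabic at all. This is only a sketch: the class and
method names are invented, and it checks the JVM default charset, which may or
may not be what the DUCC transport actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic; class and method names are illustrative only.
final class CharsetCheck {
    private CharsetCheck() {}

    // True if the given charset can represent every character of `text`.
    static boolean canRepresent(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    // What the text looks like after being encoded and decoded with `cs`.
    // Unmappable characters are replaced by the charset's replacement bytes.
    static String roundTrip(Charset cs, String text) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // five Arabic letters
        Charset def = Charset.defaultCharset();
        System.out.println("LANG            = " + System.getenv("LANG"));
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + def);
        System.out.println("Arabic survives default charset: " + canRepresent(def, arabic));
        System.out.println("Round trip through US-ASCII: " + roundTrip(StandardCharsets.US_ASCII, arabic));
    }
}
```

If the default charset on either machine turns out not to be a Unicode one,
starting the JVMs with -Dfile.encoding=UTF-8 (or setting a UTF-8 LANG) would be
the first thing I would try, instead of the encode/decode trick.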
Best Regards
Rohit
On 2018/06/18 17:42:04, Eddie Epstein <[email protected]> wrote:
> Hi Rohit,
>
> In a DUCC job the CAS created by the user's CR in the Job Driver is serialized
> into cas.xmi format, transported to the Job Process where it is
> deserialized and given to the user's analytics. Likely the problem is in CAS
> serialization or deserialization, perhaps due to the active LANG
> environment on the JD or JP machines?
>
> Eddie
>
> On Thu, Jun 14, 2018 at 1:48 AM, Rohit yadav <[email protected]> wrote:
>
> > Hey,
> >
> > I use DUCC for English and it works without any problem.
> > But lately I tried deploying a job for Arabic, and all the Arabic text
> > is replaced by *'?'* (question marks).
> >
> > I am extracting data from Accumulo, and after processing I send it to ES6.
> >
> > When I checked the JD log files, they showed that the Arabic data is coming
> > into the CR without any problem.
> > But another log file shows that the moment the data enters my AE, the
> > Arabic content is replaced by question marks.
> > Please find the log files attached to this mail.
> >
> > I think this may be a problem with the CM, because the data is fine inside
> > the CR. The most interesting part is that if I run the same pipeline
> > through CPM it works without any problem, which suggests DUCC is facing
> > some issue.
> >
> > I look forward to your reply.
> >
> > --
> > Best Regards,
> > *Rohit Yadav*
> >
>