Hey Eddie,

If I force-encode the data before setting it on the JCas:

// Reinterpret the UTF-8 bytes as ISO-8859-1 so every char is <= 0xFF
String content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
jcas.setDocumentText(content2);

and then force-decode it in my first annotator:

// Reverse the step above: recover the original UTF-8 bytes and decode them
String content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"), "UTF-8");

the Arabic text now comes through without any problem. But I have many
analysis engines in my aggregate, and I can't hardcode this snippet
everywhere.
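
To show why the workaround above survives the trip, here is a small
self-contained sketch. It assumes the lossy step behaves like a conversion
through a charset that cannot represent Arabic; ISO-8859-1 is only a stand-in
here, since the actual conversion inside DUCC has not been identified in this
thread. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Pack the UTF-8 bytes of the text into chars in the range 0x00-0xFF.
    static String forceEncode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse forceEncode: recover the UTF-8 bytes and decode them.
    static String forceDecode(String packed) {
        return new String(packed.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Hypothetical stand-in for the lossy step: a byte conversion through a
    // charset without Arabic. Unmappable chars become '?'.
    static String lossyTransport(String s) {
        byte[] wire = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // Sent directly, every Arabic char is replaced by '?'.
        System.out.println(lossyTransport(arabic));

        // Sent packed, the chars all fit in the narrow charset and survive.
        String received = forceDecode(lossyTransport(forceEncode(arabic)));
        System.out.println(received.equals(arabic));
    }
}
```

If this model is right, the workaround only hides a byte-level conversion
somewhere between the CR and the AE, so finding that conversion would fix
all engines at once.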

Maybe there is a problem with the character encoding of the CAS that is sent
from the collection reader to the analysis engine. I was thinking that if I
could find out which encoding the CAS uses, I could encode the content to
match it, and that might work.
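
One note on that idea: the document text in the CAS is a Java String, which
is always Unicode internally, so there is no per-CAS encoding to read. What
can differ between processes is the JVM default charset. A minimal sketch for
comparing the JD and JP, assuming it is called from the CR and from the first
annotator and the result written to the logs (the class name is made up, and
whether the default charset is really the cause here is still a guess):

```java
import java.nio.charset.Charset;

class EncodingReport {
    // Summarize the settings that decide how this JVM converts
    // between Strings and bytes when no charset is given explicitly.
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", LANG=" + System.getenv("LANG")
                + ", LC_ALL=" + System.getenv("LC_ALL");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the two processes report different values, starting both JVMs with the
standard -Dfile.encoding=UTF-8 system property would be the first thing to
try.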

Best Regards
Rohit
On 2018/06/18 17:42:04, Eddie Epstein <[email protected]> wrote: 
> Hi Rohit,
> 
> In a DUCC job, the CAS created by the user's CR in the Job Driver is
> serialized into cas.xmi format and transported to the Job Process, where
> it is deserialized and given to the user's analytics. Likely the problem is
> in CAS serialization or deserialization, perhaps due to the active LANG
> environment on the JD or JP machines?
> 
> Eddie
> 
> On Thu, Jun 14, 2018 at 1:48 AM, Rohit yadav <[email protected]> wrote:
> 
> > Hey,
> >
> > I use DUCC for English and it works without any problem.
> > But lately I tried deploying a job for Arabic, and all the Arabic text
> > is replaced by *'?'* (question marks).
> >
> > I am extracting data from Accumulo, and after processing I send it to ES6.
> >
> > When I checked the JD log files, they show that the Arabic data is coming
> > into the CR without any problem.
> > But another log file shows that the moment the data enters my AE, the
> > Arabic content is replaced by question marks.
> > Please find the log files attached to this mail.
> >
> > I think this may be a problem with the CM, because the data is fine
> > inside the CR. The most interesting part is that if I run the same
> > pipeline through CPM, it works without any problem, which suggests that
> > DUCC is facing some issue.
> >
> > I'll look forward to your reply.
> >
> > --
> > Best Regards,
> > *Rohit Yadav*
> >
> 
