Hi,
I am doing NLP (Natural Language Processing) on my data. The data is in
the form of files that can be of type PDF/Text/Word/HTML. These files are
stored in a directory structure on my local disk, including nested
directories. My standalone Java-based NLP parser can read the input files,
extract the text from them, and run the NLP processing on the extracted
text.
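
For reference, the standalone version is structured roughly like this
(the input path and the two helper methods below are placeholders for my
actual code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StandaloneParser {
    public static void main(String[] args) throws IOException {
        // Walk the input directory recursively, including nested directories
        try (Stream<Path> paths = Files.walk(Paths.get("/data/input"))) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                String text = extractText(p); // format-specific extraction (PDF/Text/Word/HTML)
                runNlp(text);                 // existing NLP processing
            });
        }
    }

    private static String extractText(Path p) { return ""; } // placeholder
    private static void runNlp(String text) { }              // placeholder
}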

I am now porting my Java-based NLP parser to run on my Spark cluster. I
know that Spark can read multiple text files from a directory and convert
them into RDDs for further processing. However, my input data is not only
in text files but in a multitude of file formats. My question is: how can
I efficiently read the input files (PDF/Text/Word/HTML) in my Java-based
Spark program so they can be processed on the Spark cluster?
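
One direction I am considering is to read every file as binary with
sc.binaryFiles and extract the text with Apache Tika on the executors.
A rough, untested sketch (the input path is a placeholder, and I am
assuming the Hadoop recursive-input setting applies to binaryFiles):

import java.io.InputStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;
import org.apache.tika.Tika;

public class SparkNlpDriver {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("NlpParser"));

        // Assumption: this Hadoop setting makes binaryFiles descend into
        // nested directories; otherwise a per-level glob would be needed
        sc.hadoopConfiguration()
          .set("mapreduce.input.fileinputformat.input.dir.recursive", "true");

        // Read every file as a (path, binary stream) pair; unlike textFile,
        // this works for PDF/Word/HTML as well as plain text
        JavaPairRDD<String, PortableDataStream> files =
            sc.binaryFiles("hdfs:///data/input"); // placeholder path

        // Apache Tika auto-detects the format and extracts plain text
        JavaRDD<String> texts = files.map(pair -> {
            try (InputStream in = pair._2().open()) {
                return new Tika().parseToString(in);
            }
        });

        // ... run the existing NLP logic on `texts` ...
        sc.stop();
    }
}

My thinking is that binaryFiles keeps each file intact as a single record,
which seems necessary here because a PDF or Word document cannot be split
line by line the way textFile splits plain text.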

Regards,
