tesmai4 wrote
> I am converting my Java-based NLP parser to run on my Spark cluster. I
> know that Spark can read multiple text files from a directory and turn
> them into RDDs for further processing. My input data is not only in text
> files, but in a multitude of different file formats.
> 
> My question is: how can I efficiently read the input files
> (PDF/Text/Word/HTML) in my Java-based Spark program so that they can be
> processed on the Spark cluster?
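
Before getting to ingestion tools: for the reading side itself, Spark can
load arbitrary files as raw binary streams via binaryFiles. Below is a
minimal Java sketch; Apache Tika and the input/output paths are my own
assumptions for illustration (the original post does not name a
text-extraction library):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;
import org.apache.tika.Tika;

public class MixedFormatReader {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MixedFormatReader");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each entry is (file path, lazily opened binary stream); this
        // works for PDF/Word/HTML as well as plain text.
        JavaPairRDD<String, PortableDataStream> files =
                sc.binaryFiles("hdfs:///input/docs/*");

        // Extract plain text from each file; Tika auto-detects the format,
        // so one code path covers all of the listed file types. The Tika
        // instance is created inside the lambda to avoid serialization issues.
        JavaPairRDD<String, String> texts = files.mapValues(stream -> {
            Tika tika = new Tika();
            try (java.io.InputStream in = stream.open()) {
                return tika.parseToString(in);
            }
        });

        texts.saveAsTextFile("hdfs:///output/extracted-text");
        sc.stop();
    }
}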

I would suggest Flume <https://flume.apache.org/>. Flume is a distributed,
reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data.
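
For files sitting on disk, a spooling-directory source is the usual fit. A
minimal agent config sketch, with all directory names, paths, and component
names as illustrative placeholders of my own:

# Minimal Flume agent: watch a local directory and ship files to HDFS.
# All names and paths below are placeholders.
agent.sources = spool
agent.channels = mem
agent.sinks = hdfs

agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/ingest/docs
agent.sources.spool.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs.type = hdfs
agent.sinks.hdfs.hdfs.path = hdfs:///landing/docs
agent.sinks.hdfs.hdfs.fileType = DataStream
agent.sinks.hdfs.channel = mem

One caveat: the spooling source's default deserializer splits files into
lines, which suits text; binary formats such as PDF need a whole-file (blob)
deserializer instead.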

I would also mention Kafka <https://kafka.apache.org/>. Kafka is a
distributed streaming platform.
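
To move documents through Kafka, a producer can publish each file's bytes to
a topic. A minimal Java sketch; the broker address and topic name are
placeholders of my own:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DocProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            Path doc = Paths.get(args[0]);
            byte[] payload = Files.readAllBytes(doc);
            // Key by file name so downstream consumers can tell documents apart.
            producer.send(new ProducerRecord<>("raw-docs",
                    doc.getFileName().toString(), payload)).get();
        }
    }
}

Keep in mind that Kafka's default maximum message size is around 1 MB, so
large PDFs need either a raised limit or a store-the-file, send-a-pointer
pattern.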

It is also popular to use Flume and Kafka together ("Flafka"
<http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/>).
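
On the Spark side, the topic can then be consumed with the
spark-streaming-kafka integration. A minimal Java sketch against the 0-10
direct stream API, reusing the same placeholder broker and topic:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DocConsumer {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DocConsumer");
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "doc-parsers");

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, byte[]>Subscribe(
                                Arrays.asList("raw-docs"), kafkaParams));

        // Each record carries (file name, raw bytes); this is where the
        // NLP parser would be invoked on the bytes.
        stream.foreachRDD(rdd ->
                rdd.foreach(record ->
                        System.out.println(record.key() + " : "
                                + record.value().length + " bytes")));

        jssc.start();
        jssc.awaitTermination();
    }
}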

