Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread Sonal Goyal
Hi,

Sorry, it's not clear to me whether you want help moving the data to the
cluster or defining the best structure for your files on the cluster for
efficient processing. Are you running standalone or using HDFS?

On Tuesday, May 23, 2017, docdwarf <doc.dwar...@gmail.com> wrote:

> tesmai4 wrote:
> > I am converting my Java-based NLP parser to execute on my Spark
> > cluster. I know that Spark can read multiple text files from a
> > directory and convert them into RDDs for further processing. My input
> > data is not only in text files, but in a multitude of different file
> > formats.
> >
> > My question is: how can I efficiently read the input files
> > (PDF/Text/Word/HTML) in my Java-based Spark program for processing
> > in a Spark cluster?
>
> I suggest Flume <https://flume.apache.org/>. Flume is a distributed,
> reliable, and available service for efficiently collecting, aggregating,
> and moving large amounts of log data.
>
> I will also mention Kafka <https://kafka.apache.org/>. Kafka is a
> distributed streaming platform.
>
> It is also popular to use both Flume and Kafka together (Flafka
> <http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/>).

-- 
Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>


Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread docdwarf
tesmai4 wrote:
> I am converting my Java-based NLP parser to execute on my Spark
> cluster. I know that Spark can read multiple text files from a directory
> and convert them into RDDs for further processing. My input data is not
> only in text files, but in a multitude of different file formats.
>
> My question is: how can I efficiently read the input files
> (PDF/Text/Word/HTML) in my Java-based Spark program for processing
> in a Spark cluster?

I suggest Flume <https://flume.apache.org/>. Flume is a distributed,
reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data.
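
As a rough sketch only (the agent name, directories, and HDFS path below
are my assumptions, not anything from this thread), a Flume agent that
spools a local directory into HDFS might be configured like this. Note
that the spooling source is line-oriented by default, so binary formats
such as PDF and Word generally need a blob-style deserializer so each
file travels as a single event:

    # Hypothetical agent "a1": watch a local spool directory, land files on HDFS.
    a1.sources  = src1
    a1.channels = ch1
    a1.sinks    = snk1

    a1.sources.src1.type = spooldir
    a1.sources.src1.spoolDir = /data/incoming
    # Ship each file as one event instead of splitting it into lines.
    a1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
    a1.sources.src1.channels = ch1

    a1.channels.ch1.type = memory
    a1.channels.ch1.capacity = 10000

    a1.sinks.snk1.type = hdfs
    a1.sinks.snk1.channel = ch1
    a1.sinks.snk1.hdfs.path = hdfs://namenode:8020/data/landing
    a1.sinks.snk1.hdfs.fileType = DataStream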

I will also mention Kafka <https://kafka.apache.org/>. Kafka is a
distributed streaming platform.
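
If you go the Kafka route, a minimal producer sketch (the broker address,
topic name, and directory are hypothetical; also note Kafka's default
limits cap a single message at roughly 1 MB, so large PDFs would need
config changes or a pointer-to-storage design):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Properties;

    public class DocumentProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092"); // hypothetical broker
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
                 DirectoryStream<Path> files =
                     Files.newDirectoryStream(Paths.get("/data/incoming"))) {
                for (Path f : files) {
                    if (!Files.isRegularFile(f)) continue;
                    // One whole document per message, keyed by file name.
                    producer.send(new ProducerRecord<>("documents",
                        f.getFileName().toString(), Files.readAllBytes(f)));
                }
            }
        }
    }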

It is also popular to use both Flume and Kafka together (Flafka
<http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/>).



Reading PDF/text/word file efficiently with Spark

2017-05-19 Thread tesm...@gmail.com
Hi,
I am doing NLP (Natural Language Processing) on my data. The data is in
the form of files that can be PDF/Text/Word/HTML. These files are stored
in a directory structure on my local disk, including nested directories.
My standalone Java-based NLP parser can read the input files, extract the
text from them, and run the NLP processing on the extracted text.

I am converting my Java-based NLP parser to execute on my Spark cluster.
I know that Spark can read multiple text files from a directory and
convert them into RDDs for further processing. My input data is not only
in text files, but in a multitude of different file formats. My question
is: how can I efficiently read the input files (PDF/Text/Word/HTML) in my
Java-based Spark program for processing in a Spark cluster?

Regards,

