Let me first clarify a few things.
1. If you just want to count the words in a single text file like Don
Quixote (that is, not in a stream of data), you should use plain Spark
without Spark Streaming. If you are not comfortable with Java, I strongly
recommend the Scala API or PySpark. Scala may be a little trickier to
learn if you have never used it, but it is worth it. In Scala, the
frequency count of the words in a text file would look like this.
val sc = new SparkContext(...)
val linesInFile = sc.textFile("path_to_file")
val words = linesInFile.flatMap(line => line.split(" "))
val frequencies = words.map(word => (word, 1L)).reduceByKey(_ + _)
// collect() is costly if the file is large
println("Word frequencies = " + frequencies.collect().mkString(", "))
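Since the original question was about Java, here is a small plain-Java sketch (no Spark involved, so it only illustrates the idea on in-memory lines) of the difference between the per-word frequency and the total number of words. The class name LocalWordCount and its methods are made up for illustration, and unlike the snippet above it drops the empty strings that splitting on repeated spaces can produce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of Spark.
public class LocalWordCount {

    // Split every line on single spaces, like flatMap(line => line.split(" ")),
    // and drop empty strings (an extra step not present in the Scala snippet).
    public static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Per-word frequency: a local analogue of map(word => (word, 1L)).reduceByKey(_ + _).
    public static Map<String, Long> frequencies(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");
        List<String> ws = words(lines);
        // One entry per distinct word
        System.out.println("Word frequencies = " + frequencies(ws));
        // A single number for the whole input
        System.out.println("Total words = " + ws.size());
    }
}
```

The total is just the size of the word list, while the frequencies are a map with one entry per distinct word.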
2. Let me assume that you want to read a stream of text over the network
and then print the total number of words to a file. Note that this is the
"total number of words" and not the "frequency of each word". The Java
version would be something like this.
// words is assumed to be the JavaDStream<String> of words from your program
JavaDStream<Long> totalCounts = words.count();
totalCounts.foreachRDD(new Function2<JavaRDD<Long>, Time, Void>() {
  @Override
  public Void call(JavaRDD<Long> rdd, Time time) throws Exception {
    // each batch RDD produced by count() holds a single element
    Long totalCount = rdd.first();
    // print to screen
    System.out.println(totalCount);
    // append count to file
    ...
    return null;
  }
});
This counts how many words have been received in each batch. The Scala
version would be much simpler to read.
words.count().foreachRDD(rdd => {
  val totalCount = rdd.first()
  // print to screen
  println(totalCount)
  // append count to file
  ...
})
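For the "append count to file" step, one possible approach in plain Java is sketched below. This is only an assumption about what you might want: the class CountLog, the method appendCount, the tab-separated line format, and the file name word_counts.txt are all made up for illustration. The function passed to foreachRDD runs on the driver, so a helper like this would write to a file on the driver machine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, not part of Spark.
public class CountLog {

    // Append one "batchTimeMs<TAB>count" line; the file is created on first use.
    public static void appendCount(Path file, long batchTimeMs, long count) throws IOException {
        String line = batchTimeMs + "\t" + count + System.lineSeparator();
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("word_counts.txt"); // hypothetical output path
        // Pretend two batches arrived with these totals
        appendCount(file, 1000L, 42L);
        appendCount(file, 2000L, 17L);
    }
}
```

Inside the foreachRDD callback you could then call something like CountLog.appendCount(file, time.milliseconds(), totalCount), assuming the Time argument is the batch time you want to record.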
Hope this helps! I apologize if the code doesn't compile; I didn't test it
for syntax errors.
TD
On Thu, Jan 30, 2014 at 8:12 AM, Eduardo Costa Alfaia <
[email protected]> wrote:
> Hi Guys,
>
> I'm not a very strong Java programmer, so could anybody help me with this
> piece of code from JavaNetworkWordcount:
>
> JavaPairDStream<String, Integer> wordCounts = words.map(
> new PairFunction<String, String, Integer>() {
> @Override
> public Tuple2<String, Integer> call(String s) throws Exception {
> return new Tuple2<String, Integer>(s, 1);
> }
> }).reduceByKey(new Function2<Integer, Integer, Integer>() {
> @Override
> public Integer call(Integer i1, Integer i2) throws Exception {
> return i1 + i2;
> }
> });
>
> JavaPairDStream<String, Integer> counts =
> wordCounts.reduceByKeyAndWindow(
> new Function2<Integer, Integer, Integer>() {
> public Integer call(Integer i1, Integer i2) { return i1 + i2; }
> },
> new Function2<Integer, Integer, Integer>() {
> public Integer call(Integer i1, Integer i2) { return i1 - i2; }
> },
> new Duration(60 * 5 * 1000),
> new Duration(1 * 1000)
> );
>
> I would like to find a way of counting the words and then summing them to
> get the total number of words in a single file, for example a book in a
> .txt file such as Don Quixote. The counts function gives me the count for
> each word found, not the total number of words in the file.
> Tathagata sent me a piece of Scala code. Thank you, Tathagata, for your
> attention to my posts; I am very grateful.
>
> yourDStream.foreachRDD(rdd => {
>
> // Get and print first n elements
> val firstN = rdd.take(n)
> println("First N elements = " + firstN)
>
> // Count the number of elements in each batch
> println("RDD has " + rdd.count() + " elements")
>
> })
>
> yourDStream.count.print()
>
> Could anybody help me?
>
>
> Thanks Guys