Re: Source code JavaNetworkWordcount

Tathagata Das Thu, 30 Jan 2014 11:16:50 -0800

Let me first ask for a few clarifications.

1. If you just want to count the words in a single text file like Don
Quixote (that is, not a stream of data), plain Spark is enough; you do not
need Spark Streaming. If you are not super-comfortable with Java, then I
strongly recommend using the Scala API or pyspark. Scala may be a little
trickier to learn if you have never seen it before, but it is worth it. In
Scala, the frequency count would look like this.


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit conversions needed for reduceByKey

val sc = new SparkContext(...)
val linesInFile = sc.textFile("path_to_file")
val words = linesInFile.flatMap(line => line.split(" "))
val frequencies = words.map(word => (word, 1L)).reduceByKey(_ + _)
// collect() is costly if the file is large
println("Word frequencies = " + frequencies.collect().mkString(", "))
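
To make the difference between per-word frequencies and a total word count
concrete, here is a minimal plain-Java sketch that runs on an in-memory list
of lines, with no Spark involved. The class name WordCountSketch and its
method names are made up for illustration; they are not part of any Spark
API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark: a local analogue of
// flatMap(split) + map(word -> (word, 1)) + reduceByKey(sum).
class WordCountSketch {

    // Frequency of each word: split every line on single spaces and
    // add 1 per occurrence. Unlike the Scala snippet, empty tokens
    // (produced by consecutive spaces or empty lines) are skipped.
    static Map<String, Long> frequencies(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (word.isEmpty()) {
                    continue;
                }
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Total number of words: the sum of all per-word frequencies.
    static long totalWords(List<String> lines) {
        long total = 0;
        for (long count : frequencies(lines).values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        System.out.println("Word frequencies = " + frequencies(lines));
        System.out.println("Total words = " + totalWords(lines));
    }
}
```

The frequencies are one number per distinct word, while the total is a
single number for the whole input.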


2. Let me assume that you want to read a stream of text over the network
and then write the total number of words to a file. Note that this is the
"total number of words" and not the "frequency of each word". The Java
version would be something like this.

JavaDStream<Long> totalCounts = words.count();

totalCounts.foreachRDD(new Function2<JavaRDD<Long>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Long> rdd, Time time) throws Exception {
        // rdd holds the count for this batch
        Long totalCount = rdd.first();

        // print to screen
        System.out.println(totalCount);

        // append count to file
        ...
        return null;
    }
});

This counts how many words have been received in each batch. The Scala
version is much simpler to read.

words.count().foreachRDD(rdd => {
    val totalCount = rdd.first()

    // print to screen
    println(totalCount)

    // append count to file
    ...
})
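
The "append count to file" step is left open in both versions above. One
possible way to fill it in, sketched in plain Java with java.nio: the class
name CountLog, the method appendCount, the line format and the file name
are all made up for illustration. The idea would be to call something like
this from inside the foreachRDD body.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Spark.
class CountLog {

    // Appends one line of the form "<batchTimeMs> <count>" to the file,
    // creating the file first if it does not exist yet.
    static void appendCount(Path file, long batchTimeMs, long count) throws IOException {
        String line = batchTimeMs + " " + count + System.lineSeparator();
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("word_counts.txt");  // assumed output file name
        appendCount(file, System.currentTimeMillis(), 42L);
    }
}
```

Opening the file with APPEND for every batch keeps the sketch simple; for
very short batch intervals a writer that stays open may be preferable.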

Hope this helps! I apologize if the code doesn't compile; I didn't test it
for syntax errors.

TD



On Thu, Jan 30, 2014 at 8:12 AM, Eduardo Costa Alfaia <
[email protected]> wrote:

> Hi Guys,
>
> I'm not a very good Java programmer, so could anybody help me with this
> piece of code from JavaNetworkWordcount:
>
> JavaPairDStream<String, Integer> wordCounts = words.map(
>         new PairFunction<String, String, Integer>() {
>           @Override
>           public Tuple2<String, Integer> call(String s) throws Exception {
>             return new Tuple2<String, Integer>(s, 1);
>           }
>         }).reduceByKey(new Function2<Integer, Integer, Integer>() {
>           @Override
>           public Integer call(Integer i1, Integer i2) throws Exception {
>             return i1 + i2;
>           }
>         });
>
>       JavaPairDStream<String, Integer> counts = wordCounts.reduceByKeyAndWindow(
>         new Function2<Integer, Integer, Integer>() {
>           public Integer call(Integer i1, Integer i2) { return i1 + i2; }
>         },
>         new Function2<Integer, Integer, Integer>() {
>           public Integer call(Integer i1, Integer i2) { return i1 - i2; }
>         },
>         new Duration(60 * 5 * 1000),
>         new Duration(1 * 1000)
>       );
>
> I would like to find a way of counting the words and then summing the
> counts to get the total number of words in a single file, for example a
> book in .txt format such as Don Quixote. The counts function gives me the
> result for each word found, not the total number of words in the file.
> Tathagata has sent me a piece of Scala code. Thanks, Tathagata, for your
> attention to my posts; I am very thankful.
>
>   yourDStream.foreachRDD(rdd => {
>
>    // Get and print first n elements
>    val firstN = rdd.take(n)
>    println("First N elements = " + firstN)
>
>   // Count the number of elements in each batch
>   println("RDD has " + rdd.count() + " elements")
>
> })
>
> yourDStream.count.print()
>
> Could anybody help me?
>
>
> Thanks Guys
>
