Mich:
When you execute the statements one at a time in the Spark shell, you can
see the type of each intermediate result:
scala> val errlog = sc.textFile("/home/john/s.out")
errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24
scala> val sed = errlog.filter(line => line.contains("sed"))
sed: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:26
scala> sed.collect()
res0: Array[String] = Array([WARNING] Unrecognised ...
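
To map this onto the four parts of the chained line you asked about, here is a
minimal sketch with an explicit type on each step. It is only an illustration:
the sample lines stand in for your file, and the names `conf`, `sample` and
`matches` are my own, not anything from your session. As far as I know,
`SparkContext.getOrCreate` reuses the shell's existing context if there is one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Reuse the running SparkContext (e.g. in spark-shell) or start a local one.
val conf = new SparkConf().setAppName("sed-demo").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data standing in for sc.textFile(...).
val sample = Seq(
  "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
  "exec sp_spaceused",
  "cat trandump.sql | sed s/a/b/ > trandump.tmpsql"
)

// 1. errlog is an RDD[String]: a distributed collection of lines.
val errlog: RDD[String] = sc.parallelize(sample)

// 2. filter is a transformation: it takes a function String => Boolean
//    and returns a new RDD[String]. Nothing is computed yet (it is lazy).
val sed: RDD[String] = errlog.filter(line => line.contains("sed"))

// 3. collect() is an action: it runs the job and copies every matching
//    line back to the driver as a local Array[String].
val matches: Array[String] = sed.collect()

// 4. foreach is now an ordinary Scala method on the local Array; println
//    is passed to it as a function and is applied to each element in turn.
matches.foreach(println)
```

The shell shortens a long `Array(...)` when it echoes a result, which is why
`res0` above ends in `...`; the array itself still holds every element, and
step 4 prints each one on its own line.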
Cheers
On Wed, Feb 10, 2016 at 2:35 PM, Mich Talebzadeh <
[email protected]> wrote:
>
> Hi Chandeep,
>
> Many thanks for your help.
>
> In the line below
>
> errlog.filter(line => line.contains("sed")).collect().foreach(println)
>
> can you please clarify the components with the correct naming, as I am new
> to Scala?
>
> 1. errlog --> is the RDD?
> 2. filter(line => line.contains("sed")) is a method?
> 3. collect() is another method?
> 4. foreach(println)?
>
> Thanks
>
> On 10/02/2016 21:28, Chandeep Singh wrote:
>
> Hi Mich,
>
> If you would like to print everything to the console, you could use
>
> errlog.filter(line => line.contains("sed")).collect().foreach(println)
>
> or you could always save to a file using any of the saveAs methods.
>
> Thanks,
> Chandeep
>
> On Wed, Feb 10, 2016 at 8:14 PM, <
> [email protected]> wrote:
>
>>
>>
>> Hi,
>>
>> I have a bunch of files stored in the HDFS directory /unix_files, 319
>> files in total.
>>
>> scala> val errlog = sc.textFile("/unix_files/*.ksh")
>>
>> scala> errlog.filter(line => line.contains("sed")).count()
>> res104: Long = 1113
>>
>> So it finds 1113 lines containing the string "sed".
>>
>> If I want to see the collection, I can do
>>
>> scala> errlog.filter(line => line.contains("sed")).collect()
>>
>> res105: Array[String] = Array(" DSQUERY=${1} ;
>> DBNAME=${2} ; ERROR=0 ; PROGNAME=$(basename $0 | sed -e s/.ksh//)", # .
>> in environment based on argument for script., " exec sp_spaceused", "
>> exec sp_spaceused", PROGNAME=$(basename $0 | sed -e s/.ksh//), "
>> BACKUPSERVER=$5 # Server that is used to load the transaction dump",
>> " BACKUPSERVER=$5 # Server that is used to load the
>> transaction dump", " BACKUPSERVER=$5 # Server that is used to
>> load the transaction dump", " cat $TMPDIR/${DBNAME}_trandump.sql | sed
>> s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat
>> $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ >
>> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e
>> s/.ksh//), " B...
>> scala>
>>
>> Now, is there any way I can retrieve all these instances, or are they
>> all wrapped up so that I only see a few lines?
>>
>> Thanks,
>>
>> Mich
>>
>>
>
>
> --
>
> Dr Mich Talebzadeh