I am trying to read logs that contain many irrelevant lines, where the
relevant lines are related to one another by a thread number that appears
in each line.

Firstly, if I read from a text file using the textFile function and then
call multiple filter functions on the resulting RDD, will Spark apply all
of the filters in a single read pass?

E.g. will evaluating the second filter incur another read of log.txt?
(The conditions below are placeholders for my real predicates.)

val file = sc.textFile("log.txt")
val test = file.filter(line => line.contains("ERROR"))   // some condition
val test1 = file.filter(line => line.contains("WARN"))   // some other condition
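
For concreteness, suppose I then trigger evaluation with an action on each
RDD (count is just a placeholder action):

test.count()   // first action: this reads log.txt
test1.count()  // second action: does this read log.txt again?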

Secondly, if there are multiple reads, I was thinking that I could apply a
single filter that gets rid of all of the lines I do not need and cache the
result as a PairRDD keyed by thread number, as sketched below. From that
PairRDD I would then need to remove the keys that appear only once; is
there a recommended strategy for this? I was thinking about using distinct
to create another PairRDD and then using subtract, but this seems
inefficient.
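
A minimal sketch of the filter-and-cache step, assuming each relevant line
carries its thread id in square brackets (the regex is a placeholder for my
real log format):

val ThreadLine = """\[(\d+)\]\s+(.*)""".r  // hypothetical format: "[42] message"

val pairs = sc.textFile("log.txt")
  .flatMap {
    case ThreadLine(tid, msg) => Some((tid.toLong, msg)) // keep lines with a thread id
    case _                    => None                    // drop irrelevant lines
  }
  .cache()  // keep the cleaned PairRDD in memory for the later steps

From pairs I would then want to drop every key that occurs exactly once,
ideally without the extra passes that distinct plus subtract would seem to
require.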

Thanks


