What do you think about using a stream RDD (i.e., Spark Streaming's DStream) in this case? Assuming streaming support is available for PySpark, you could collect results based on the number of events per batch.
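Something along these lines with the Scala streaming API, as a rough sketch (untested; the input directory, batch interval, and per-batch cap are just examples):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ErrorGrep")
val ssc = new StreamingContext(conf, Seconds(10))

// Watch a directory for new log files; each 10-second batch becomes an RDD.
val lines = ssc.textFileStream("hdfs:///logs/incoming")
val errors = lines.filter(_.contains("error"))

// Print a capped number of matches per batch as they arrive, so results
// show up while the stream is still running.
errors.foreachRDD { rdd =>
  rdd.take(100).foreach(println)
}

ssc.start()
ssc.awaitTermination()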
best,
matt

On 09/02/2014 10:38 AM, Andrew Or wrote:
Spark-shell, like any other Spark application, returns the full results of a job only after it has finished executing. You could add a hook for it to write partial results to a file, but you may want to do so sparingly to incur fewer I/Os. If you have a large file and the result contains many lines, it is unlikely to fully fit in memory anyway, so it's probably not a bad idea to just write your results to a file in batches while the application is still running.

-Andrew

2014-09-01 22:16 GMT-07:00 Hao Wang <wh.s...@gmail.com>:

Hi, all

I am wondering, if I use spark-shell to scan a large file for lines containing "error", whether the shell returns results while the job is executing, or only after the job has completely finished.

Regards,
Wang Hao (王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240
Email: wh.s...@gmail.com
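A rough sketch of the batched-write approach Andrew describes, assuming the "error"-grep job from the original question (paths and batch size are illustrative; in spark-shell you would reuse the existing sc instead of creating one):

import java.io.PrintWriter
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ErrorGrepBatched"))
val errors = sc.textFile("hdfs:///logs/big.log").filter(_.contains("error"))

// toLocalIterator pulls one partition at a time to the driver, so matches
// can be appended to a local file while later partitions are still running,
// without ever holding the full result set in driver memory.
val writer = new PrintWriter("partial-errors.txt")
errors.toLocalIterator.grouped(1000).foreach { batch =>
  batch.foreach(writer.println)
  writer.flush()  // partial results visible on disk before the job finishes
}
writer.close()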