Hi all,

I've read the official Tachyon docs, and it doesn't seem to fit my use case. As I understand it, Tachyon just caches files in memory. My file has over a million lines (about 70 MB), and reading it and mapping it into a Map variable takes several minutes, which I don't want to repeat every time inside the map function. Tachyon also brings another problem: it raises an exception when I run ./bin/tachyon format.

The exception:

Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

It seems there's a compatibility problem with Hadoop, but even if I solve it, the efficiency issue I described above remains. Could somebody tell me how to persist the data in memory? For now I just broadcast it and re-submit the Spark application whenever the broadcast value becomes unavailable.

------------------ Original Message ------------------
From: "Akhil Das" <ak...@sigmoidanalytics.com>
Sent: Tuesday, December 9, 2014, 3:42 PM
To: "十六夜涙" <cr...@qq.com>
Cc: "user" <u...@spark.incubator.apache.org>
Subject: Re: spark broadcast unavailable

You cannot pass the sc object (val b = Utils.load(sc,ip_lib_path)) inside a map function, and that's why the serialization exception is popping up (sc is not serializable). You can try Tachyon's cache if you want to persist the data in memory more or less forever.

Thanks
Best Regards

On Tue, Dec 9, 2014 at 12:12 PM, 十六夜涙 <cr...@qq.com> wrote:

Hi all,

In my Spark application, I load a CSV file on the driver node, map the data into a Map variable for later use, and then broadcast it. Everything works fine until a java.io.FileNotFoundException occurs. The console log shows that the broadcast is unavailable. I googled the problem, and what I found says Spark cleans up the broadcast. There is a solution whose author mentions re-broadcasting, so I followed it and wrote some exception-handling code with `try` and `catch`. After compiling and submitting the jar, I faced another problem: it shows "task not serializable".

So I have three options:

1. find the right way to persist the broadcast
2. solve the "task not serializable" problem when re-broadcasting the variable
3. save the data to some kind of database, although I prefer keeping the data in memory

Here are some code snippets:

    val esRdd = kafkaDStreams.flatMap(_.split("\\n"))
      .map {
        case esregex(datetime, time_request) =>
          var ipInfo: Array[String] = Array.empty
          try {
            ipInfo = Utils.getIpInfo(client_ip, b.value)
          } catch {
            case e: java.io.FileNotFoundException => {
              val b = Utils.load(sc, ip_lib_path)
              ipInfo = Utils.getIpInfo(client_ip, b.value)
            }
          }
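
On the "task not serializable" problem in the snippet above: one alternative that is sometimes used instead of re-broadcasting is to load the lookup table lazily, once per executor JVM, from a singleton object, so that nothing depending on sc is captured by the closure. The following is only a minimal sketch under stated assumptions: the CSV is assumed to have the IP key in the first column, the file is assumed to be readable at the same path on every executor, and IpLib, parse, path, and table are hypothetical names, not part of the original Utils class.

```scala
import scala.io.Source

// Hypothetical helper; not the Utils class from the original post.
object IpLib {
  // Assumed location of the CSV on each executor; adjust before first use.
  @volatile var path: String = "ip_lib.csv"

  // Assumed layout: key in column 0, info fields in the remaining columns.
  def parse(lines: Iterator[String]): Map[String, Array[String]] =
    lines
      .map(_.split(","))
      .filter(_.length >= 2)                 // skip blank or malformed rows
      .map(cols => cols(0) -> cols.drop(1))
      .toMap

  // A lazy val is initialised at most once per JVM, on first access,
  // so the slow load is not repeated for every record.
  lazy val table: Map[String, Array[String]] = {
    val src = Source.fromFile(path)
    try parse(src.getLines()) finally src.close()
  }
}
```

Inside the map (or mapPartitions) closure, the lookup would then read from IpLib.table rather than b.value. A Scala object referenced this way is initialised separately in each JVM instead of being serialised from the driver, so each executor pays the load cost once and keeps its own copy of the table in memory. Whether that trade-off is acceptable for a 70 MB file depends on the cluster.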