Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Jong Wook Kim
so don't think it's valid to use it as such. > On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim <jongw...@nyu.edu> wrote: > Hi, I'm trying to evaluate a recommendation model, and found that Spark and Rival give different results, and it seems that

Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread Jong Wook Kim
Hi, I'm trying to evaluate a recommendation model, and found that Spark and Rival give different results, and it seems that Rival's is what Kaggle defines:
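For reference, a minimal sketch of the commonly cited binary-relevance NDCG@k (the definition Kaggle's evaluation pages describe). This is an illustration, not Spark's RankingMetrics code; the function name and the exact normalization convention (ideal DCG over min(k, |relevant|)) are assumptions of this sketch:

```python
import math

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k: each relevant hit at rank r contributes
    1/log2(r+1), and the sum is normalized by the ideal (best-possible) DCG."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)          # item at index i has rank i+1
              for i, item in enumerate(recommended[:k])
              if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)        # all relevant items ranked first
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; pushing a relevant item down lowers the score.
print(ndcg_at_k(["a", "b", "c"], {"a", "b", "c"}, 3))  # → 1.0
```

Discrepancies between libraries usually come down to the discount base, whether ranks start at 0 or 1, and how the ideal DCG is truncated, so comparing implementations term by term against a tiny example like this is a quick way to locate the difference.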

Re: AVRO vs Parquet

2016-03-03 Thread Jong Wook Kim
How about ORC? I have experimented briefly with Parquet and ORC, and I liked the fact that ORC has its schema within the file, which makes it handy to work with other tools. Jong Wook On 3 March 2016 at 23:29, Don Drake wrote: > My tests show Parquet has better

Spark-shell connecting to Mesos stuck at sched.cpp

2015-11-15 Thread Jong Wook Kim
I'm having a problem connecting my Spark app to a Mesos cluster; any help on the question below would be appreciated. http://stackoverflow.com/questions/33727154/spark-shell-connecting-to-mesos-stuck-at-sched-cpp Thanks, Jong Wook

Spark YARN Shuffle service wire compatibility

2015-10-22 Thread Jong Wook Kim
Hi, I’d like to know if there is a guarantee that the Spark YARN shuffle service has wire compatibility between 1.x versions. I could run a Spark 1.5 job with YARN nodemanagers running shuffle service 1.4, but it might’ve been just a coincidence. Now we’re upgrading CDH from 5.3 to 5.4, whose

Re: About extra memory on yarn mode

2015-07-14 Thread Jong Wook Kim
executor.memory only sets the maximum heap size of the executor; the JVM needs non-heap memory to store class metadata, interned strings, and other native overheads coming from networking libraries, off-heap storage levels, etc. These are (of course) legitimate uses of resources and you'll have
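The point above can be made concrete with a little arithmetic: on YARN, the container request is roughly the executor heap plus the off-heap overhead. The sketch below assumes the 10% fraction and 384 MB floor that some Spark 1.x releases used as the default for spark.yarn.executor.memoryOverhead; check the docs for your exact version:

```python
# Rough sketch of how a YARN executor container is sized: heap plus
# off-heap overhead. The 10% factor and 384 MB floor are assumptions
# modeled on spark.yarn.executor.memoryOverhead defaults in Spark 1.x.
OVERHEAD_FRACTION = 0.10
OVERHEAD_MIN_MB = 384

def container_size_mb(executor_memory_mb, overhead_mb=None):
    if overhead_mb is None:  # fall back to the default formula
        overhead_mb = max(OVERHEAD_MIN_MB, int(executor_memory_mb * OVERHEAD_FRACTION))
    return executor_memory_mb + overhead_mb

print(container_size_mb(4096))  # 4 GB heap → 4096 + 409 = 4505 MB requested
print(container_size_mb(2048))  # small heaps hit the 384 MB floor → 2432 MB
```

The practical consequence: if the process's total resident memory (heap plus all the native overheads listed above) exceeds the container size, YARN kills the container, which is why raising the overhead setting rather than the heap is often the fix.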

Re: ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Jong Wook Kim
The article you've linked is specific to an embedded system; the JVM built for that architecture (which the author didn't mention) might not be as stable and well-supported as HotSpot. ProcessBuilder is a stable Java API, and despite somewhat limited functionality it is the standard method to

Re: How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread Jong Wook Kim
Your question is not very clear, but from what I understand, you want to deal with a stream of MyTable that has parsed records from your Kafka topics. What you need is JavaDStream<MyTable>, and you can use transform()

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Jong Wook Kim
Based on my experience, YARN containers can get SIGTERM when - they produce too many logs and use up the hard drive - they use more off-heap memory than what is given by the spark.yarn.executor.memoryOverhead configuration. It might be due to too many classes loaded (less than MaxPermGen but more

Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
I just asked this question at the streaming webinar that just ended, but the speakers didn't answer, so I'm throwing it here: AFAIK checkpoints are the only recommended method for running Spark streaming without data loss. But it involves serializing the entire dstream graph, which prohibits any logic

Re: Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
, and as the transform function is processed in every batch interval, it will always use the latest filters. HTH. TD On Wed, Jul 8, 2015 at 10:02 AM, Jong Wook Kim jongw...@nyu.edu wrote: I just asked this question at the streaming webinar that just ended, but the speakers didn't answer, so

Re: Custom streaming receiver slow on YARN

2015-02-09 Thread Jong Wook Kim
replying to my own thread; I realized that this only happens when the replication level is 1. Regardless of setting MEMORY_ONLY, disk, or deserialized storage levels, I had to set the replication level to 2 to make streaming work properly on YARN. I still don't get why, because intuitively less

Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do `rdd.map(_.mkString("\t"))`, or use some other delimiter, to make it `RDD[String]`, and then call `saveAsTextFile`. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p21554.html
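The same idea in PySpark is a tab-join over each row. A minimal pure-Python sketch of the per-record formatting step (the function name is made up for illustration; in a real job it would be passed to rdd.map before saveAsTextFile):

```python
def to_tsv_line(row, sep="\t"):
    # Equivalent of Scala's _.mkString("\t"): stringify each field, then join.
    return sep.join(str(field) for field in row)

# In PySpark this would be applied per record, e.g.:
#   rdd.map(to_tsv_line).saveAsTextFile(output_path)
print(to_tsv_line([1, "a", 3.5]))  # fields joined by tab characters
```

Note that `str()` on arbitrary objects may not round-trip cleanly, so for anything beyond primitives a structured format (Parquet, Avro) is usually safer than delimited text.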

Custom streaming receiver slow on YARN

2015-02-07 Thread Jong Wook Kim
Hello people, I have an issue where my streaming receiver is laggy on YARN. Can anyone reply to my question on StackOverflow? http://stackoverflow.com/questions/28370362/spark-streaming-receiver-particularly-slow-on-yarn Thanks, Jong Wook