Re: Finding moving average using Spark and Scala

2015-07-17 Thread Anupam Bagchi
Thanks Feynman for your direction. I was able to solve this problem by calling Spark API from Java. Here is a code snippet that may help other people who might face the same challenge. if (args.length > 2) { earliestEventDate = Integer.parseInt(args[2]); } else {

Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
If your rows may have NAs in them, I would process each column individually by first projecting the column ( map(x => x.nameOfColumn) ), filtering out the NAs, then running a summarizer over each column. Even if you have many rows, after summarizing you will only have a vector of length #columns.

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Hello Feynman, Actually in my case, the vectors I am summarizing over will not have the same dimension since many devices will be inactive on some days. This is at best a sparse matrix where we take only the active days and attempt to fit a moving average over it. The reason I would like to sa

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
Dimensions mismatch when adding new sample. Expecting 8 but got 14. Make sure all the vectors you are summarizing over have the same dimension. Why would you want to write a MultivariateOnlineSummary object (which can be represented with a couple Double's) into a distributed filesystem like HDFS?

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for the lead. I was able to modify the code using clues from the RegressionMetrics example. Here is what I got now. val deviceAggregateLogs = sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache() // Calculate statistics based on bytes-transferred val deviceIdsM

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for your response. Since I am very new to Scala I may need a bit more hand-holding at this stage. I have been able to incorporate your suggestion about sorting - and it now works perfectly. Thanks again for that. I tried to use your suggestion of using MultiVariateOnlineSummar

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
A good example is RegressionMetrics 's use of of OnlineMultivariateSummarizer to aggregate statistics across labels and residuals; take a look at how aggregateByKey is use

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
> > The call to Sorting.quicksort is not working. Perhaps I am calling it the > wrong way. allaggregates.toArray allocates and creates a new array separate from allaggregates which is sorted by Sorting.quickSort; allaggregates. Try: val sortedAggregates = allaggregates.toArray Sorting.quickSort(so

Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: - Read the dataset from HDFS. A few sample lines look like this: deviceid,bytes,eventdate 15590657,246620,20150630 14066921,1907,20150621 14066921,1906,20150626 6522013,2349,20150626 652

Moving average using Spark and Scala

2015-07-11 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: Read the dataset from HDFS. A few sample lines look like this: deviceid,bytes,eventdate 15590657,246620,20150630 14066921,1907,20150621 14066921,1906,20150626 6522013,2349,20150626 6522013,252