Re: Adding an indexed column

2015-05-31 Thread ayan guha
If you are on Spark 1.3, use repartitionAndSortWithinPartitions followed by mapPartitions. In 1.4, it seems window functions will be supported. On 1 Jun 2015 04:10, Ricardo Almeida ricardo.alme...@actnowib.com wrote: That's great, and how would you create an ordered index by partition (by product in this
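
The per-partition indexing idea in the reply above can be sketched in plain Python, without Spark. This is only an illustration of the logic a function passed to mapPartitions might apply once rows are sorted by key; the helper name index_within_groups and the (product, price) row shape are hypothetical and do not come from the thread.

```python
from itertools import groupby
from operator import itemgetter


def index_within_groups(rows, key, order):
    """Yield each row with an index that restarts at 0 for every key.

    Hypothetical helper: sorts rows by (key, order), then numbers the
    rows inside each key group, mimicking a sort followed by a
    per-partition pass.
    """
    ordered = sorted(rows, key=lambda r: (key(r), order(r)))
    for _, group in groupby(ordered, key=key):
        for idx, row in enumerate(group):
            yield row + (idx,)


# Assumed sample data: (product, price) tuples.
rows = [("b", 3.0), ("a", 2.0), ("a", 1.0), ("b", 1.5)]
result = list(index_within_groups(rows, key=itemgetter(0), order=itemgetter(1)))
```

In Spark the sort would be done by the shuffle rather than by sorted(), so the per-partition function would only need the groupby and enumerate steps.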

Re: data localisation in spark

2015-05-31 Thread Sandy Ryza
Hi Shushant, Spark currently makes no effort to request executors based on data locality (although it does try to schedule tasks within executors based on data locality). We're working on adding this capability at SPARK-4352 https://issues.apache.org/jira/browse/SPARK-4352. -Sandy On Sun, May

Re: RDD staleness

2015-05-31 Thread Michael Armbrust
Each time you run a Spark SQL query we will create new RDDs that load the data and thus you should see the newest results. There is one caveat: formats that use the native Data Source API (parquet, ORC (in Spark 1.4), JSON (in Spark 1.5)) cache file metadata to speed up interactive querying. To

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread Igor Berman
Hi, We are using Spark 1.3.1. Avro-chill (will check tomorrow if it's important). We register Avro classes from Java. Avro 1.7.6. On May 31, 2015 22:37, Josh Rosen rosenvi...@gmail.com wrote: Which Spark version are you using? I'd like to understand whether this change could be caused by recent Kryo

Re: Recommended Scala version

2015-05-31 Thread Tathagata Das
Can you file a JIRA with the detailed steps to reproduce the problem? On Fri, May 29, 2015 at 2:59 AM, Alex Nakos ana...@gmail.com wrote: Hi- I’ve just built the latest spark RC from source (1.4.0 RC3) and can confirm that the spark shell is still NOT working properly on 2.11. No classes in

Re: Recommended Scala version

2015-05-31 Thread Alex Nakos
Hi- Yup, I’ve already done so here: https://issues.apache.org/jira/browse/SPARK-7944 Please let me know if this requires any more information - more than happy to provide whatever I can. Thanks Alex On Sun, May 31, 2015 at 8:45 AM, Tathagata Das t...@databricks.com wrote: Can you file a JIRA

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-31 Thread DB Tsai
Alternatively, I will give a talk at Spark Summit about LOR and LIR with the elastic-net implementation and the interpretation of those models. https://spark-summit.org/2015/events/large-scale-lasso-and-elastic-net-regularized-generalized-linear-models/ You may attend or watch online. Sincerely, DB

RDD staleness

2015-05-31 Thread Ashish Mukherjee
Hello, Since RDDs are created from data from Hive tables or HDFS, how do we ensure they are invalidated when the source data is updated? Regards, Ashish

Re: RDD staleness

2015-05-31 Thread DW @ Gmail
There is no mechanism for keeping an RDD up to date with a changing source. However, you could set up a stream that watches for changes to the directory and processes the new files, or use the Hive integration in Spark SQL to run Hive queries directly. (However, old query results will still grow

data localisation in spark

2015-05-31 Thread Shushant Arora
I want to understand how Spark takes care of data localisation in cluster mode when run on YARN. 1. The driver program asks the ResourceManager for executors. Does it tell YARN's RM to check the HDFS blocks of the input data and then allocate executors accordingly? And do executors remain fixed throughout the application or