Re: pyspark 1.4 udf change date values

2015-07-17 Thread Luis Guerra
Sure, I have created JIRA SPARK-9131 - UDF change data values (https://issues.apache.org/jira/browse/SPARK-9131). On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu dav...@databricks.com wrote: Thanks for reporting this, could you file a JIRA for it? On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra

pyspark 1.4 udf change date values

2015-07-16 Thread Luis Guerra
Hi all, I am having some trouble when using a custom UDF in dataframes with pyspark 1.4. I have rewritten the UDF to simplify the problem, and it gets even weirder. The UDFs I am using do absolutely nothing: they just receive some value and output the same value in the same format. I show you
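
For reference, a minimal sketch of the kind of do-nothing UDF described above, using the pyspark 1.4-era API (the dataframe df, the column name 'day' and the DateType return type are assumptions, not taken from the thread):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DateType

    # An identity UDF: it receives a date value and returns it unchanged
    identity = udf(lambda d: d, DateType())

    # Applying it should leave the column untouched;
    # the thread reports that the values come back changed
    df.withColumn('day2', identity(df['day'])).show()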

Apply function to all elements along each key

2015-01-20 Thread Luis Guerra
Hi all, I would like to apply a function over all elements for each key (assuming a key-value RDD). For instance, imagine I have:

    import numpy as np
    a = np.array([[1, 'hola', 'adios'], [2, 'hi', 'bye'], [2, 'hello', 'goodbye']])
    a = sc.parallelize(a)

Then I want to create a key-value RDD, using the
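
For reference, a minimal sketch of what keying and grouping that RDD could look like (taking the first field as the key is an assumption; note that numpy coerces the mixed array to strings):

    # Key by the first field, then collect all values of each key
    per_key = (a.map(lambda row: (row[0], tuple(row[1:])))
                .groupByKey()
                .mapValues(list))

    # An arbitrary per-key function can then be applied with
    # .mapValues(my_function) instead of .mapValues(list)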

Re: Spark executors resources. Blocking?

2015-01-13 Thread Luis Guerra
or Mesos to schedule the system. The same issues will come up, but they have a much broader range of approaches that you can take to solve the problem. Dave. From: Luis Guerra [mailto:luispelay...@gmail.com] Sent: Monday, January 12, 2015 8:36 AM To: user Subject: Spark executors

Spark executors resources. Blocking?

2015-01-12 Thread Luis Guerra
Hello all, I have a naive question regarding how Spark uses the executors in a cluster of machines. Imagine a scenario in which I do not know the input size of my data in execution A, so I set Spark to use 20 nodes (out of my 25, for instance). At the same time, I also launch a second execution
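
One common way to keep one execution from monopolizing the cluster is to cap the resources each application may take. A sketch, assuming a standalone or Mesos cluster (the values are illustrative, not from the thread):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName('execution A')
            # Cap the total cores this application can take,
            # leaving room for execution B to get executors
            .set('spark.cores.max', '40'))
    sc = SparkContext(conf=conf)

On YARN, the analogous knobs are the --num-executors, --executor-cores and --executor-memory options of spark-submit.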

Ungroup data

2014-09-25 Thread Luis Guerra
Hi everyone, I need some advice about how to do the following: given an RDD of vectors (each vector being Vector(Int, Int, Int, Int)), I need to group the data, then apply a function to every group, comparing each consecutive item within a group and retaining a variable (that has to
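
A minimal sketch of the group/compare/ungroup pattern being asked about, assuming the first field of each vector is the grouping key and that "consecutive" means sorted order (both are assumptions, as is the comparison itself):

    def compare_consecutive(values):
        values = sorted(values)     # impose an order so 'consecutive' is defined
        acc = 0                     # the variable retained across comparisons
        out = []
        for prev, curr in zip(values, values[1:]):
            acc += 1 if curr > prev else 0   # hypothetical comparison
            out.append((curr, acc))
        return out

    result = (rdd.map(lambda v: (v[0], tuple(v[1:])))
                 .groupByKey()
                 .flatMapValues(compare_consecutive))  # flattens, i.e. "ungroups"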

Time difference between Python and Scala

2014-09-19 Thread Luis Guerra
Hello everyone, What should the normal time difference between Scala and Python be when using Spark? I mean running the same program in the same cluster environment. In my case, I am using numpy array structures for the Python code and vectors for the Scala code, both for handling my data. The time

Re: Number of partitions when saving (pyspark)

2014-09-18 Thread Luis Guerra
is carried out only in 4 stages. What am I doing wrong? On Wed, Sep 17, 2014 at 6:20 PM, Davies Liu dav...@databricks.com wrote: On Wed, Sep 17, 2014 at 5:21 AM, Luis Guerra luispelay...@gmail.com wrote: Hi everyone, Is it possible to fix the number of tasks related to a saveAsTextFile

Number of partitions when saving (pyspark)

2014-09-17 Thread Luis Guerra
Hi everyone, Is it possible to fix the number of tasks related to a saveAsTextFile in pyspark? I am loading several files from HDFS, fixing the number of partitions to X (let's say 40, for instance). Then some transformations, like joins and filters, are carried out. The weird thing here is that
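
A hedged sketch of pinning the partition count through such a pipeline (the paths, the parse function and the other RDD are hypothetical; the figure 40 is from the thread):

    rdd = sc.textFile('hdfs:///path/to/input', minPartitions=40)
    joined = rdd.map(parse).join(other)   # a join may change the partition count
    joined.repartition(40).saveAsTextFile('hdfs:///path/to/output')

Note that join() also accepts a numPartitions argument, which avoids the extra shuffle that a trailing repartition() causes.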

Spark execution plan

2014-07-23 Thread Luis Guerra
Hi all, I was wondering how Spark deals with an execution plan. Taking Pig and its DAG execution as an example, I would like to achieve something similar with Spark. For instance, if my code has 3 different parts, with A and B being self-sufficient: Part A: ... var output_a Part
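
Since Spark only runs jobs in parallel when they are submitted concurrently, one sketch of launching two self-sufficient parts at once is driver-side threads (rdd_a, rdd_b and their transformations are hypothetical):

    import threading

    def part_a():
        global output_a
        output_a = rdd_a.map(some_function).collect()

    def part_b():
        global output_b
        output_b = rdd_b.filter(some_predicate).collect()

    ta = threading.Thread(target=part_a)
    tb = threading.Thread(target=part_b)
    ta.start(); tb.start()
    ta.join(); tb.join()   # both jobs run concurrently on the same SparkContext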

Re: Spark execution plan

2014-07-23 Thread Luis Guerra
Thanks for your answer. However, there has been a misunderstanding here. My question is about controlling the parallel execution of different parts of the code, similarly to Pig, where there is a planning phase before the execution. On Wed, Jul 23, 2014 at 1:46 PM, chutium teng@gmail.com

class after join

2014-07-17 Thread Luis Guerra
Hi all, I am a newbie Spark user with many questions, so sorry if this is a silly one. I am dealing with tabular data formatted as text files, so when I first load the data, my code is like this:

    case class data_class(
      V1: String,
      V2: String,
      V3: String,
      V4: String,
      V5: String,

Re: class after join

2014-07-17 Thread Luis Guerra
of these value classes. (Although, if you otherwise needed a class that represented all of the things in class A and class B, this could be done easily with composition: a class with an A and a B inside.) On Thu, Jul 17, 2014 at 9:15 AM, Luis Guerra luispelay...@gmail.com wrote: Hi all, I am