It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Pedro, Chuck
r 'null'. Please specify one with --class." Basically I just want the application code in one S3 path and a "common" utilities package in another path. Thanks for your help. Kind regards, Chuck Pedro

Unsubscribe

2022-11-07 Thread Pedro Tuero
Unsubscribe

Re: Java : Testing RDD aggregateByKey

2021-08-23 Thread Pedro Tuero
...@japila.pl) wrote: > Hi Pedro, > > > Anyway, maybe the behavior is weird, I would expect that repartition to > zero was not allowed or at least warned instead of just discarding all the > data. > > Interesting... > > scala> spark.version > res3: Str

Re: Java : Testing RDD aggregateByKey

2021-08-19 Thread Pedro Tuero
that repartition to zero was not allowed or at least warned instead of just discarding all the data. Thanks for your time! Regards, Pedro On Thu, Aug 19, 2021 at 07:42, Jacek Laskowski (ja...@japila.pl) wrote: > Hi Pedro, > > No idea what might be causing it. Do you per

Java : Testing RDD aggregateByKey

2021-08-17 Thread Pedro Tuero
U) => U): RDD[(K, U)] = self.withScope { aggregateByKey(zeroValue, *defaultPartitioner*(self))(seqOp, combOp) } I can't debug it properly with Eclipse, and the error occurs while threads are in Spark code ("System Editor can only open file base resources"). Does anyone know how to resolve this issue? Thanks in advance, Pedro.
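For reference, a minimal Scala sketch of aggregateByKey that can run locally in a unit test; the data, partition counts, and expected result are illustrative, not taken from this thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("aggregateByKey-test").getOrCreate()
    val sc = spark.sparkContext

    // (key, value) pairs spread across an explicit number of partitions
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), 4)

    // zeroValue = 0; seqOp folds each value into the per-partition accumulator,
    // combOp merges accumulators coming from different partitions
    val sums = pairs.aggregateByKey(0, 4)(_ + _, _ + _)

    assert(sums.collect().toMap == Map("a" -> 3, "b" -> 3))
    spark.stop()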

Coalesce vs reduce operation parameter

2021-03-18 Thread Pedro Tuero
is the same. So, is it a bug or a feature? Why doesn't Spark treat a coalesce after a reduce like a reduce with the number of output partitions passed as a parameter? Just asking to understand. Thanks, Pedro.
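A small Scala sketch of the two variants being compared; the pair RDD and the partition count are hypothetical:

    // Variant 1: the reduce itself is parameterized to write 10 output partitions
    val reducedA = pairs.reduceByKey(_ + _, 10)

    // Variant 2: reduce with the default partitioning, then a narrow coalesce down to 10
    val reducedB = pairs.reduceByKey(_ + _).coalesce(10)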

Submitting extra jars on spark applications on yarn with cluster mode

2020-11-14 Thread Pedro Cardoso
he archive property but it did not work. I got class-not-defined exceptions on classes that come from the 3 extra jars. If it helps, the jars are only required for the driver, not the executors. They will simply perform Spark-only operations. Thank you and have a good weekend. -- *Pedro Cardoso* *Researc

Distribute entire columns to executors

2020-09-24 Thread Pedro Cardoso
nippet example (not working is fine if the logic is sound) would be highly appreciated! Thank you for your time. -- *Pedro Cardoso* *Research Engineer* pedro.card...@feedzai.com

Re: Spark 2.4 partitions and tasks

2019-02-25 Thread Pedro Tuero
should need more or less parallelism. Regards, Pedro. On Sat, Feb 23, 2019 at 21:27, Yeikel (em...@yeikel.com) wrote: > I am following up on this question because I have a similar issue. > > When is it that we need to control the parallelism manually? Skewed >

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
* It is not getPartitions() but getNumPartitions(). On Tue, Feb 12, 2019 at 13:08, Pedro Tuero (tuerope...@gmail.com) wrote: > And this is happening in every job I run. It is not just one case. If I > add a forced repartition it works fine, even better than before. But

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
And this is happening in every job I run. It is not just one case. If I add a forced repartition it works fine, even better than before. But I run the same code for different inputs, so the number of partitions to use must be related to the input. On Tue, Feb 12, 2019 at 11:22, Pedro

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
the initial RDD conserves the same number of partitions; in 2.4 the number of partitions resets to the default. So RDD1, the product of the first mapToPair, prints 5580 when getPartitions() is called in 2.3.1, while it prints 128 in 2.4. Regards, Pedro On Tue, Feb 12, 2019 at 09:13, Jacek
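A hedged sketch of the kind of check described above; records and keyOf are placeholders:

    val keyed = records.map(r => (keyOf(r), r))   // keyOf is a hypothetical key extractor
    println(keyed.getNumPartitions)               // 5580 under 2.3.1 vs. 128 (the default) under 2.4, per the report above
    val forced = keyed.repartition(5580)          // forcing an explicit count instead of relying on spark.default.parallelism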

Re: Spark 2.4 partitions and tasks

2019-02-08 Thread Pedro Tuero
I did a repartition to 1 (hardcoded) before the keyBy and it finishes in 1.2 minutes. The questions remain open, because I don't want to hardcode parallelism. On Fri, Feb 8, 2019 at 12:50, Pedro Tuero (tuerope...@gmail.com) wrote: > 128 is the default parallelism defi

Re: Spark 2.4 partitions and tasks

2019-02-08 Thread Pedro Tuero
128 is the default parallelism defined for the cluster. The question now is why the keyBy operation is using the default parallelism instead of the number of partitions of the RDD created by the previous step (5580). Any clues? On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (tuerope...@gmail.com

Re: Aws

2019-02-08 Thread Pedro Tuero
have posted another thread about that in this forum recently. Regards, Pedro On Thu, Feb 7, 2019 at 21:37, Noritaka Sekiyama (moomind...@gmail.com) wrote: > Hi Pedro, > > It seems that you disabled maximize resource allocation in 5.16, but > enabled it in 5.20. >

Spark 2.4 partitions and tasks

2019-02-07 Thread Pedro Tuero
?? Thanks. Pedro.

Re: Aws

2019-02-01 Thread Pedro Tuero
. It seems that in 5.20, a full instance is wasted on the driver only, while it could also contain an executor. Regards, Pedro. On Thu, Jan 31, 2019 20:16, Hiroyuki Nagata wrote: > Hi, Pedro > > > I also started using AWS EMR, with Spark 2.4.0. I'm seeking methods for > per

Aws

2019-01-31 Thread Pedro Tuero
Hi guys, I usually run Spark jobs on AWS EMR. Recently I switched from AWS EMR label 5.16 to 5.20 (which uses Spark 2.4.0). I've noticed that a lot of steps are taking longer than before. I think it is related to the automatic configuration of cores per executor. In version 5.16, some executors took
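One way to take the decision away from EMR's automatic configuration is to pin the executor layout explicitly; these are standard Spark settings, and the values below are purely illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "4")       // cores per executor, instead of the EMR-chosen value
      .set("spark.executor.instances", "20")  // hypothetical count for the cluster
      .set("spark.executor.memory", "8g")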

unsubscribe

2018-01-16 Thread Jose Pedro de Santana Neto
unsubscribe

Broadcasted Object is empty in executors.

2017-05-22 Thread Pedro Tuero
sByWord.keys()); Prints an empty list. This works fine running locally on my computer, but fails with no match when running on AWS EMR. I usually broadcast objects and map with no problems. Can anyone give me a clue about what's happening here? Thank you very much, Pedro.
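For comparison, a minimal broadcast round trip that behaves as expected; the lookup map and words are illustrative:

    val lookup = Map("spark" -> 1, "flink" -> 2)        // built on the driver
    val bcLookup = sc.broadcast(lookup)                 // shipped to executors once

    val matched = sc.parallelize(Seq("spark", "storm"))
      .map(word => (word, bcLookup.value.get(word)))    // read inside the closure via .value

    matched.collect().foreach(println)                  // (spark,Some(1)), (storm,None)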

Kryo Exception: NegativeArraySizeException

2016-11-24 Thread Pedro Tuero
? Is there a workaround? Thanks for your comments, Pedro. Map info: INFO 2016-11-24 15:29:34,230 [main] (Logging.scala:54) - Block broadcast_3 stored as values in memory (estimated size 2.6 GB, free 5.7 GB) Error Trace: ERROR ApplicationMaster: User class threw exception
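If the failure is Kryo running out of buffer space on the ~2.6 GB broadcast, raising the maximum buffer is the usual first attempt; these are standard Spark settings with illustrative values, and since the Kryo buffer is hard-capped at 2047m, an object this large may still need to be split:

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryoserializer.buffer.max", "1024m")  // capped at 2047m, well under the 2.6 GB block above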

Broadcasting Complex Custom Objects

2016-10-17 Thread Pedro Tuero
there should be a way to use it instead of serializing and deserializing everything. Thanks, Pedro

Re: Guys is this some form of Spam or someone has left his auto-reply loose LOL

2016-07-28 Thread Pedro Rodriguez
risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destr

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
ark-s3_2.10:0.0.0" should work very soon). I would love to hear if this library solution works, otherwise I hope the blog post above is illuminating. Pedro On Wed, Jul 27, 2016 at 8:19 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I have a relatively small data set

Re: dynamic coalesce to pick file size

2016-07-26 Thread Pedro Rodriguez
here have a way of > dynamically picking the number depending on the desired file size? (around > 256 MB would be perfect) > > I am running Spark 1.6 on CDH using YARN, the files are written in Parquet > format. > > Thanks > -- Pedro Rodriguez
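A hedged sketch of deriving the partition count from an estimated input size; the sizes and output path are hypothetical, and totalBytes would come from listing the input files:

    val totalBytes  = 64L * 1024 * 1024 * 1024                 // pretend the input is ~64 GB
    val targetBytes = 256L * 1024 * 1024                       // aim for ~256 MB per output file
    val numFiles    = math.max(1, (totalBytes / targetBytes).toInt)

    df.coalesce(numFiles).write.parquet("s3://bucket/output")  // placeholder path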

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Pedro Rodriguez
:) Just realized you didn't get your original question answered though: scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> case class Person(age: Long, name: String) defined class Person scala> val df = Seq(Person(24, "pedro"), Person(22,
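Continuing from that snippet, a sketch of the difference being asked about (df as built above):

    df.foreach(row => println(row))            // runs on the executors; output lands in executor logs, not the driver console
    df.collect().foreach(row => println(row))  // pulls every row to the driver first, then prints locally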

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Pedro Rodriguez
call collect, Spark *does nothing*, so your df would not >> have any data -> you can't call foreach. >> Calling collect executes the process -> gets the data -> then foreach is OK. >> >> On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin...@gmail.com> wrote: >>

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Pedro Rodriguez
ill probably take the approach of having an S3 API call to wipe out that partition before the job starts, but it would be nice to not have to incorporate another step in the job. Pedro On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkad...@collectivei.com> wrote: > You can have a temporary file

Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Pedro Rodriguez
of duplicated data 3. Preserve data for all other dates I am guessing that overwrite would not work here, or if it does it's not guaranteed to stay that way, but I am not sure. If that's the case, is there a good/robust way to get this behavior? -- Pedro Rodriguez PhD Student in Distributed Machine

Re: Spark 2.0

2016-07-25 Thread Pedro Rodriguez
willing to go fix it myself). Should I just > create a ticket? > > Thank you, > > Bryan Jeffrey > > -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience

Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Pedro Rodriguez
If you can use a dataframe then you could use rank + a window function at the expense of an extra sort. Do you have an example of zipWithIndex not working? That seems surprising. On Jul 23, 2016 10:24 PM, "Andrew Ehrlich" wrote: > It's hard to do in a distributed system.
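A sketch of the two options mentioned; someColumn is a placeholder ordering column:

    // RDD route: zipWithIndex assigns a stable 0-based index without a shuffle
    val indexedRdd = df.rdd.zipWithIndex()

    // DataFrame route: row_number over a global window, at the cost of the extra sort noted above
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val indexedDf = df.withColumn("id", row_number().over(Window.orderBy("someColumn")))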

Re: Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Pedro Rodriguez
each or 10 of 7 cores each. You can also kick up the memory to use more of your cluster's memory. Lastly, if you are running on EC2 make sure to configure spark.local.dir to write to something that is not an EBS volume, preferably an SSD attached to something like an r3 machine. — Pedro Rodriguez

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Pedro Rodriguez
your setup that might affect networking. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 23, 2016 at 9:10:31 AM, VG (vlin...@gmail.com) wrote

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Pedro Rodriguez
a security group which allows all traffic to/from itself. If you are using something like ufw on Ubuntu then you probably need to know the IP addresses of the worker nodes beforehand. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist

Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Pedro Rodriguez
; > >> In a later part of the code I need to change a data structure and update > name with the index value generated above. > >> I am unable to figure out how to do a lookup here. > >> > >> Please suggest. > >> > >> If there i

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
need to open a connection to a database so it's better to re-use that connection for one partition's elements than to create it for each element. What are you trying to accomplish with dapply? On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang <iam...@gmail.com> wrote: > Thanks Pedro, > so t

Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
been that we cannot > download the notebooks, cannot export them and certainly cannot sync them > back to Github, without mind numbing and sometimes irritating hacks. Have > those issues been resolved? > > > Regards, > Gourav > > > On Fri, Jul 22, 2016 at 2:22 PM,

Re: How to search on a Dataset / RDD <Row, Long >

2016-07-22 Thread Pedro Rodriguez
ean by updating the data structure, I am guessing you mean replace the name column with the id column? Note: on the second line the withColumn call uses $"id", which in Scala converts to a Column. In Java maybe it's something like new Column("id"), not sure. Pedro On Fri, Ju

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
This should work and I don't think it triggers any actions: df.rdd.partitions.length On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang <iam...@gmail.com> wrote: > Seems no function does this in Spark 2.0 preview? > -- Pedro Rodriguez PhD Student in Distributed Machine Learning | C

Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
are insufficient. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 22, 2016 at 3:04:48 AM, Marco Colombo (ing.marco.colo...@gmail.com) wrote: Take

Re: How can we control CPU and Memory per Spark job operation..

2016-07-22 Thread Pedro Rodriguez
new job where the CPU/memory ratio is more favorable, which reads from the prior job's output. I am guessing this heavily depends on how expensive reloading the data set from disk/network is. Hopefully one of these helps, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boul

Re: How can we control CPU and Memory per Spark job operation..

2016-07-16 Thread Pedro Rodriguez
You could call map on an RDD which has “many” partitions, then call repartition/coalesce to drastically reduce the number of partitions so that your second map job has fewer things running. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Pedro Rodriguez
Out of curiosity, is there a way to pull all the data back to the driver to save without collect()? That is, stream the data in chunks back to the driver so that the maximum memory used is comparable to a single node's data, but all the data is saved on one node. — Pedro Rodriguez PhD Student
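One option that matches this description is toLocalIterator, which streams roughly one partition at a time back to the driver; the local sink below is a placeholder:

    val out = new java.io.PrintWriter("/tmp/output.txt")       // placeholder file on the driver
    df.rdd.toLocalIterator.foreach(row => out.println(row))    // peak driver memory is roughly one partition, not the whole dataset
    out.close()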

Re: Call http request from within Spark

2016-07-14 Thread Pedro Rodriguez
that with mapPartitions. This is useful when the initialization time of the function in the map call is expensive (e.g. it uses a connection pool for a DB or web service) as it allows you to initialize that resource once per partition and then reuse it for all the elements in the partition. Pedro On Thu, Jul 14, 2016 at 8:52 AM
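A sketch of that mapPartitions pattern; buildHttpClient and fetch are hypothetical stand-ins for the expensive resource and the per-element call:

    val results = urls.mapPartitions { iter =>
      val client = buildHttpClient()              // built once per partition
      iter.map(url => (url, fetch(client, url)))  // every element in the partition reuses the same client
    }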

Re: Tools for Balancing Partitions by Size

2016-07-13 Thread Pedro Rodriguez
A computes the size of one partition, RDD B holds all partitions except for the one from A, the parents of A and B are the original parent RDD, RDD C has parents A and B and has the overall write-balanced function. Thanks, Pedro On Wed, Jul 13, 2016 at 9:10 AM, Gourav Sengupta <gourav.sengu...@gmail.

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
that to estimate the total size. Thanks for the idea. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 12, 2016 at 7:26:17 PM, Hatim Diab (timd

Tools for Balancing Partitions by Size

2016-07-12 Thread Pedro Rodriguez
also be useful to get programmatic access to the size of the RDD in memory if it is cached. Thanks, -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha

Spark SQL: Merge Arrays/Sets

2016-07-11 Thread Pedro Rodriguez
ets('words)) -> list of distinct words Thanks, -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
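One way to get a per-group list of distinct words with the built-in functions; user and words are placeholder column names:

    import org.apache.spark.sql.functions.{col, collect_set, explode}

    val distinctWords = df
      .select(col("user"), explode(col("words")).as("word"))  // one row per (user, word)
      .groupBy("user")
      .agg(collect_set(col("word")).as("distinct_words"))     // set semantics drop duplicate words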

Re: question about UDAF

2016-07-11 Thread Pedro Rodriguez
ng them together. In this case, the buffers are "" since initialize makes it "" and update keeps it "" so the result is just "". I am not sure it matters, but you probably also want to do buffer.getString(0). Pedro On Mon, Jul 11, 2016 at 3:04 AM, <luohui20.

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
Thanks Michael, That seems like the analog to sorting tuples. I am curious, is there a significant performance penalty to the UDAF versus that? It's certainly nicer and more compact code at least. — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data

Re: problem making Zeppelin 0.6 work with Spark 1.6.1, throwing jackson.databind.JsonMappingException exception

2016-07-09 Thread Pedro Rodriguez
It would be helpful if you included relevant configuration files from each, or noted if you are using the defaults, particularly any changes to class paths. I worked through upgrading Zeppelin to 0.6.0 at work and at home without any issue, so it's hard to say more without having more details. — Pedro Rodriguez PhD

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
input at runtime? Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 9, 2016 at 1:33:18 AM, Pedro Rodriguez (ski.rodrig...@gmail.com

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
spark sql types are allowed? — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 8, 2016 at 6:06:32 PM, Xinh Huynh (xinh.hu...@gmail.com) wrote

DataFrame Min By Column

2016-07-08 Thread Pedro Rodriguez
Is there a way, on a GroupedData (from groupBy in DataFrame), to have an aggregate that returns column A based on a min of column B? For example, I have a list of sites visited by a given user and I would like to find the event with the minimum time (the first event). Thanks, -- Pedro Rodriguez PhD
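One common way to express this, assuming illustrative column names, is the struct-ordering trick: structs compare field by field from the left, so the minimum (time, event) struct carries the event of the earliest time along with it:

    import org.apache.spark.sql.functions.{col, min, struct}

    val firstEvents = df
      .groupBy("user")
      .agg(min(struct(col("time"), col("event"))).as("first"))
      .select(col("user"), col("first.event").as("first_event"))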

Re: Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-04 Thread Pedro Rodriguez
/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/S3RDD.scala#L100-L105 Reflection code:  https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/PrivateMethodUtil.scala Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine

Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-03 Thread Pedro Rodriguez
could find were some Hadoop metrics. Is there a way to simply report the number of bytes a partition read so Spark can put it on the UI? Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
That was indeed the case, using UTF8Deserializer makes everything work correctly. Thanks for the tips! On Thu, Jun 30, 2016 at 3:32 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: > Quick update, I was able to get most of the plumbing to work thanks to the > code Holden posted

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
://github.com/apache/spark/blob/v1.6.2/python/pyspark/rdd.py#L182: _pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified. On Thu, Jun 30, 2016 at 2:13 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: > Thanks Jeff a

Re: Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
Thanks Jeff and Holden, A little more context here probably helps. I am working on implementing the idea from this article to make reads from S3 faster: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 (although my name is Pedro, I am not the author of the article

Call Scala API from PySpark

2016-06-30 Thread Pedro Rodriguez
t thing I would run into is converting the JVM RDD[String] back to a Python RDD, what is the easiest way to do this? Overall, is this a good approach to calling the same API in Scala and Python? -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Al

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
e").count.select('name.as[String], 'count.as [Long]).collect() Does that seem like a correct understanding of Datasets? On Sat, Jun 18, 2016 at 6:39 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: > Looks like it was my own fault. I had spark 2.0 cloned/built, but had the &g

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Pedro Rodriguez
t($"_1", $"count").show > > +---+-+ > > | _1|count| > > +---+-+ > > | 1|1| > > | 2|1| > > +---+-+ > > > > On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> > wrote: > >>

Re: Skew data

2016-06-17 Thread Pedro Rodriguez
this is to spread data across partitions evenly. In most cases calling repartition is enough to solve the problem. If you have a special case you might need to create your own custom partitioner. Pedro On Thu, Jun 16, 2016 at 6:55 PM, Selvam Raman <sel...@gmail.com> wrote: > Hi, > > What is skew d

Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Pedro Rodriguez
it is a method/function with its name defined as $ in Scala? Lastly, are there preliminary Spark 2.0 docs? If there isn't a good description/guide on using this syntax, I would be willing to contribute some documentation. Pedro On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin@gmail.com>

Dataset Select Function after Aggregate Error

2016-06-17 Thread Pedro Rodriguez
low is the equivalent Dataframe code which works as expected: df.groupBy("uid").count().select("uid") Thanks! -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.

Understanding Spark Rebalancing

2016-01-14 Thread Pedro Rodriguez
work just fine for me, but I can't seem to find out for sure if Spark does job re-scheduling/stealing. Thanks -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com

Re: How to speed up MLlib LDA?

2015-09-22 Thread Pedro Rodriguez
/intel-analytics/TopicModeling https://github.com/intel-analytics/TopicModeling It might be worth trying out. Do you know what LDA algorithm VW uses? Pedro On Tue, Sep 22, 2015 at 1:54 AM, Marko Asplund <marko.aspl...@gmail.com> wrote: > Hi, > > I did some profiling for my LDA

Re: Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Pedro Rodriguez
using Spark 1.4.1, and I want to know how I can find the complete list of functions supported in Spark SQL; currently I only know 'sum', 'count', 'min', 'max'. Thanks a lot. -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig

Re: Spark Interview Questions

2015-07-29 Thread Pedro Rodriguez
-berkeleyx-cs100-1x https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703 Github: github.com/EntilZha | LinkedIn: https

Re: Spark SQL Table Caching

2015-07-22 Thread Pedro Rodriguez
I would be interested in the answer to this question, plus the relationship between those and registerTempTable() Pedro On Tue, Jul 21, 2015 at 1:59 PM, Brandon White bwwintheho...@gmail.com wrote: A few questions about caching a table in Spark SQL. 1) Is there any difference between caching

Python DataFrames, length of array

2015-07-15 Thread pedro
to contribute a PR with the function. Pedro Rodriguez

Python DataFrames: length of ArrayType

2015-07-15 Thread pedro
this? If this doesn't exist and seems useful, I would be happy to contribute a PR with the function. Pedro Rodriguez

Misaligned Rows with UDF

2015-07-14 Thread pedro
be great. Thanks, Pedro Rodriguez Trulia CU Boulder PhD Student

Re: Check for null in PySpark DataFrame

2015-07-02 Thread Pedro Rodriguez
idea as well Pedro On Wed, Jul 1, 2015 at 12:18 PM, Michael Armbrust mich...@databricks.com wrote: There is an isNotNull function on any column. df._1.isNotNull or from pyspark.sql.functions import * col(myColumn).isNotNull On Wed, Jul 1, 2015 at 3:07 AM, Olivier Girardot ssab

Check for null in PySpark DataFrame

2015-06-30 Thread pedro
I am trying to find the correct way to programmatically check for null values in rows of a dataframe. For example, below is the code using pyspark and sql: df = sqlContext.createDataFrame(sc.parallelize([(1, None), (2, 'a'), (3, 'b'), (4, None)])) df.where('_2 is not null').count() However,

Gradient Boosting Decision Trees

2014-07-16 Thread Pedro Silva
Hi there, I am looking for a GBM MLlib implementation. Does anyone know if there is a plan to roll it out soon? Thanks! Pedro

Re: Gradient Boosting Decision Trees

2014-07-16 Thread Pedro Silva
Hi Ameet, that's great news! Thanks, Pedro On Wed, Jul 16, 2014 at 9:33 AM, Ameet Talwalkar atalwal...@gmail.com wrote: Hi Pedro, Yes, although they will probably not be included in the next release (since the code freeze is ~2 weeks away), GBM (and other ensembles of decision trees

Variables outside of mapPartitions scope

2014-05-16 Thread pedro
I am working on some code which uses mapPartitions. It's working great, except when I attempt to use a variable within the function passed to mapPartitions which references something outside of the scope (for example, a variable declared immediately before the mapPartitions call). When this
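A hedged sketch of the usual workaround for closures like this: copy the outer value into a local val right before the call so the closure captures only that value rather than the enclosing object; Config and the threshold are hypothetical:

    import org.apache.spark.rdd.RDD

    case class Config(threshold: Int)

    class Job(config: Config) {
      def run(data: RDD[Int]): RDD[Int] = {
        val localThreshold = config.threshold  // copy the field into a local val before the closure
        data.mapPartitions(iter => iter.filter(_ > localThreshold))  // captures an Int, not `this`
      }
    }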

Re: Task not serializable?

2014-05-15 Thread pedro
I'm still fairly new to this, but I found problems using classes in maps if they used instance variables in part of the map function. It seems like for maps and such to work correctly, it needs to be purely functional programming.

Re: Variables outside of mapPartitions scope

2014-05-12 Thread pedro
Right now I am not using any class variables (references to this). All my variables are created within the scope of the method I am running. I did more debugging and found this strange behavior (pseudocode outline): variables here / for loop / mapPartitions call / use variables here / end mapPartitions / end for

Initial job has not accepted any resources

2014-05-04 Thread pedro
I have been working on a Spark program, completed it, but have spent the past few hours trying to run it on EC2 without any luck. I am hoping I can comprehensively describe my problem and what I have done, but I am pretty stuck. My code uses the following lines to configure the SparkContext, which

Re: Initial job has not accepted any resources

2014-05-04 Thread pedro
that specifies their dependencies. Thanks --  Pedro Rodriguez UCBerkeley 2014 | Computer Science BSU Cryosphere Science Research SnowGeek Founder snowgeek.org pedro-rodriguez.com ski.rodrig...@gmail.com 208-340-1703 On May 4, 2014 at 6:51:56 PM, Jeremy Freeman [via Apache Spark User List] (ml-node

Re: ClassNotFoundException

2014-05-04 Thread pedro
I just ran into the same problem. I will respond if I find out how to fix it.

Re: Initial job has not accepted any resources

2014-05-04 Thread pedro
Since it appears Breeze is going to be included by default in Spark 1.0, and I ran into the issue here: http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-td5182.html and it seems like the issues I had were recently introduced, I am cloning Spark and checking out the