Re: PySpark Pandas UDF

2019-11-12 Thread Holden Karau
Thanks for sharing that. I think we should maybe add some checks around this so it’s easier to debug. I’m CCing Bryan who might have some thoughts. On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo wrote: > SOLVED! Thanks for the help - I found the issue. It was the version of pyarrow (0.15.1) w
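The incompatibility in this thread is the Arrow binary format change in pyarrow 0.15, which broke pandas UDFs on Spark 2.4; the fix the thread arrives at is pinning pyarrow below 0.15 (or, per the Spark 2.4 docs, exporting ARROW_PRE_0_15_IPC_FORMAT=1 for the driver and executors). A minimal sketch of the kind of UDF affected, with illustrative names:

    # Assumes pyspark 2.4.x with pyarrow < 0.15 installed, or
    # ARROW_PRE_0_15_IPC_FORMAT=1 exported in conf/spark-env.sh.
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf("long", PandasUDFType.SCALAR)
    def plus_one(v):
        # v arrives as a pandas Series, one Arrow batch at a time
        return v + 1

    spark.range(10).select(plus_one("id")).show()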

Re: SPARK Suitable IDE

2020-03-04 Thread Holden Karau
I work in emacs with ensime. I think really any IDE is ok, so go with the one you feel most at home in. On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio wrote: > We use IntelliJ IDEA, whether it's Java, Scala or Python

Re: Going it alone.

2020-04-16 Thread Holden Karau
I want to be clear I believe the language in janethrope1's email is unacceptable for the mailing list and possibly a violation of the Apache code of conduct. I’m glad we don’t see messages like this often. I know this is a stressful time for many of us, but let’s try and do our best to not take it

Re: Copyright Infringment

2020-04-25 Thread Holden Karau
…ion because I do not want to commit an unlawful act. > Can you please clarify if I would be infringing copyright due to this text. > Book: High Performance Spark > Authors: Holden Karau, Rachel Warren > Page xii:

Re: Copyright Infringment

2020-04-25 Thread Holden Karau
…Apache foundation's free licence agreement? On Sat, 25 Apr 2020, 16:18 Sean Owen wrote: > You'll want to ask the authors directly; the book is not produced by the project

Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-04-25 Thread Holden Karau
Also it’s ok if Spark and Flink evolve in different directions; we’re both part of the same open source foundation. Sometimes being everything to everyone isn’t as important as being the best at what you need. I like to think of our relationship with other Apache projects as less competitive and mo

Re: Spark API and immutability

2020-05-25 Thread Holden Karau
So even on RDDs cache/persist mutate the RDD object. The important thing for Spark is that the data represented/in the RDD/Dataframe isn’t mutated. On Mon, May 25, 2020 at 10:56 AM Chris Thomas wrote: > > The cache() method on the DataFrame API caught me out. > > Having learnt that DataFrames a
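A quick way to see the distinction Holden describes, sketched in PySpark and assuming an existing SparkContext:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(5))

    cached = rdd.cache()               # mutates the RDD object's storage level...
    assert cached is rdd               # ...and returns the very same object
    doubled = rdd.map(lambda x: x * 2) # a transformation returns a NEW RDD
    assert doubled is not rdd          # the data in `rdd` is never mutated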

[ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Holden Karau
We are happy to announce the availability of Spark 2.4.6! Spark 2.4.6 is a maintenance release containing stability, correctness, and security fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend that all 2.4 users upgrade to this stable release. To down

Re: Spark Streaming testing strategies

2015-03-01 Thread Holden Karau
There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan wrote: > I have started
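For readers who want the flavor without the package, here is a hand-rolled sketch of the same idea using queueStream — this is not the spark-testing-base API, just the underlying pattern of feeding known RDDs through a streaming job and collecting the output:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, 1)   # 1-second batches

    batches = [sc.parallelize([1, 2, 3]), sc.parallelize([4, 5])]
    results = []
    ssc.queueStream(batches).map(lambda x: x * 2) \
       .foreachRDD(lambda rdd: results.append(rdd.collect()))

    ssc.start()
    ssc.awaitTerminationOrTimeout(5)
    ssc.stop(stopSparkContext=False)
    # results should now hold [[2, 4, 6], [8, 10]], batch timing permitting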

Re: Spark Streaming testing strategies

2015-03-10 Thread Holden Karau
> Right now I think a package is probably a good place for this to live since the internal Spark testing code is changing/evolving rapidly, but I think once we have the trait fleshed out a bit more we could see if there is enough interest to try and merge it in (just my personal thoug

Re: IOUtils cannot write anything in Spark?

2015-04-22 Thread Holden Karau
It seems like saveAsTextFile might do what you are looking for. On Wednesday, April 22, 2015, Xi Shen wrote: > Hi, > > I have a RDD of some processed data. I want to write these files to HDFS, > but not for future M/R processing. I want to write plain old style text > file. I tried: > > rdd fore
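For completeness, a minimal sketch of the suggestion, assuming an existing SparkContext `sc` (output path illustrative):

    rdd = sc.parallelize(["plain", "old", "text"])
    rdd.saveAsTextFile("hdfs:///tmp/plain_text_out")  # one part-file per partition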

Re: swap tuple

2015-05-14 Thread Holden Karau
Can you paste your code? transformations return a new RDD rather than modifying an existing one, so if you were to swap the values of the tuple using a map you would get back a new RDD and then you would want to try and print this new RDD instead of the original one. On Thursday, May 14, 2015, Yas
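The swap itself is a one-liner; the point of the thread is that it yields a new RDD, which is what you must print. A sketch, assuming an existing SparkContext `sc`:

    pairs = sc.parallelize([("a", 1), ("b", 2)])
    swapped = pairs.map(lambda kv: (kv[1], kv[0]))  # new RDD; `pairs` is unchanged
    print(swapped.collect())                        # [(1, 'a'), (2, 'b')]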

Re: ElasticSearch enrich

2014-06-24 Thread Holden Karau
So I'm giving a talk at the Spark summit on using Spark & ElasticSearch, but for now if you want to see a simple demo which uses elasticsearch for geo input you can take a look at my quick & dirty implementation with TopTweetsInALocation ( https://github.com/holdenk/elasticsearchspark/blob/master/s
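Holden's linked demo is Scala; for PySpark readers, a commonly cited elasticsearch-hadoop recipe for writing an RDD to Elasticsearch looks roughly like this — index name, node address, and document shape are all illustrative, and the es-hadoop connector jar must be on the classpath:

    # Assumes the elasticsearch-hadoop jar is on the Spark classpath.
    docs = sc.parallelize([("1", {"text": "hello", "loc": "SF"})])
    docs.saveAsNewAPIHadoopFile(
        path="-",  # ignored by EsOutputFormat
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={"es.nodes": "localhost", "es.resource": "tweets/doc"})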

Re: ElasticSearch enrich

2014-06-25 Thread Holden Karau
On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau wrote: > So I'm giving a talk at the Spark summit on using Spark & ElasticSearch, but for now if you want to see a simple demo which uses elasticsear

Re: ElasticSearch enrich

2014-06-26 Thread Holden Karau
On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau wro

Re: ElasticSearch enrich

2014-06-26 Thread Holden Karau
On Thu, Jun 26, 2014 at 11:48 PM, Holden Karau wrote: > Hi b0c1, I have an example of

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
On Fri, Jun 27, 2014 at 12:30 AM, Holden Karau wrote: > Just your lu

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
…it's run in IDEA, no magic trick. > b0c1 > On Fri, Jun 27, 2014 at 11:11 PM, Holde

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
I can)... > Maybe later I will create a demo project based on my solution. > b0c1

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Holden Karau
Me 3. On Fri, Aug 1, 2014 at 11:15 AM, nit wrote: > I also ran into the same issue. What is the solution?

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Holden Karau
Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems; however, if you have it in your Ivy cache it should work. On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau wrote: > Me 3 > On Fri, Aug 1, 2014 at 11:15 AM, nit wrote: > I also ran into same i

Re: MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Holden Karau
Hi Steve, The _ notation can be a bit confusing when starting with Scala; we can rewrite the code to avoid it here. So instead of val numUsers = ratings.map(_._2.user) we can write val numUsers = ratings.map(x => x._2.user). ratings is a key-value RDD (an RDD comprised of tuples) and so

Re: Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-22 Thread Holden Karau
Hi Michael Campbell, Are you deploying against YARN or in standalone mode? In YARN, try setting the shell variable SPARK_EXECUTOR_MEMORY=2G; in standalone mode, try setting SPARK_WORKER_MEMORY=2G. Cheers, Holden :) On Thu, Oct 16, 2014 at 2:22 PM, Michael Campbell <michael.campb...@gmail.com> wrote: >
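The same settings can also be expressed as Spark properties instead of shell variables; a sketch, with the memory value illustrative:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.executor.memory", "2g")  # per-executor heap
    sc = SparkContext(conf=conf)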

Re: saveasSequenceFile with codec and compression type

2014-10-22 Thread Holden Karau
Hi gpatcham, If you want to save as a sequence file with a custom compression type, you can use saveAsHadoopFile along with setting "mapred.output.compression.type" on the JobConf. If you want to keep using saveAsSequenceFile (the syntax is much nicer), you could also set that property on t
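A hedged PySpark sketch of the saveAsHadoopFile route — the path, codec, and key/value types are illustrative, not from the thread:

    pairs = sc.parallelize([("k1", "v1"), ("k2", "v2")])
    pairs.saveAsHadoopFile(
        "hdfs:///tmp/seq_out",
        "org.apache.hadoop.mapred.SequenceFileOutputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.Text",
        conf={"mapred.output.compress": "true",
              "mapred.output.compression.type": "BLOCK",   # the property in question
              "mapred.output.compression.codec":
                  "org.apache.hadoop.io.compress.GzipCodec"})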

Re: version mismatch issue with spark breeze vector

2014-10-22 Thread Holden Karau
Hi Yang, It looks like your build file targets a different version of Spark than the one you are running against. I'd try building against the same version of Spark you run the application on (1.1.0). Also, what is your assembly/shading configuration for your build? Cheers, Holden

Re: OutOfMemory in "cogroup"

2014-10-27 Thread Holden Karau
On Monday, October 27, 2014, Shixiong Zhu wrote: > We encountered some special OOM cases of "cogroup" when the data in one > partition is not balanced. > > 1. The estimated size of used memory is inaccurate. For example, there are > too many values for some special keys. Because SizeEstimator.vis

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-27 Thread Holden Karau
Can you post the error message you get when trying to save the sequence file? If you call first() on the RDD does it result in the same error? On Mon, Oct 27, 2014 at 6:13 AM, buring wrote: > Hi: > After updating Spark to version 1.1.0, I experienced a snappy error which was posted her

Re: exact count using rdd.count()?

2014-10-27 Thread Holden Karau
Hi Josh, The count() call will produce the correct number for each RDD; however, foreachRDD doesn't return the result of its computation anywhere (it's intended for things which cause side effects, like updating an accumulator or making a web request). You might want to look at transform or the c
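A sketch of the two options, assuming an existing DStream named `stream`:

    # Option 1: side effect -- ship each batch's count out explicitly.
    batch_counts = []
    stream.foreachRDD(lambda rdd: batch_counts.append(rdd.count()))

    # Option 2: stay in DStream land -- DStream.count() yields a new
    # DStream of per-batch counts that can be printed or saved.
    stream.count().pprint()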

Re: Filtering URLs from incoming Internet traffic(Stream data). feasible with spark streaming?

2014-10-27 Thread Holden Karau
On Mon, Oct 27, 2014 at 9:15 PM, Nasir Khan wrote: > According to my knowledge, Spark Streaming uses mini-batches for processing. > Q: Is it a good idea to use my ML-trained model on a web server for filtering purposes, to classify URLs as obscene or benign? If Spark Streaming handles data as mini

Re: Filtering URLs from incoming Internet traffic(Stream data). feasible with spark streaming?

2014-10-27 Thread Holden Karau
On Mon, Oct 27, 2014 at 10:19 PM, Nasir Khan wrote: > I am kinda stuck with Spark now :/ I already proposed this model in my synopsis and it's already accepted :D Spark is a new thing for a lot of people. What alternate tool should I use now? You could use Spark to train your model and then m
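A sketch of the split Holden suggests — train with Spark's MLlib offline, then score individual URLs on the web server without Spark. The features and labels below are placeholders, not a real URL featurizer:

    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # Offline training job (feature extraction not shown).
    training = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0]),
                               LabeledPoint(0.0, [0.0, 1.0])])
    model = LogisticRegressionWithSGD.train(training, iterations=100)

    # On the web server: apply the learned weights per request, no Spark needed.
    w, b = model.weights, model.intercept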

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :) Cheers, Holden :) On Tuesday, October 28, 2014, Csaba Ragany wrote: > Dear Spark Community, > > Is it possible to convert text files (.log or .txt files) into > sequencefiles in Python? > > Using PySpark I can create
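Putting the two calls together, with illustrative paths:

    # (filename, contents) pairs from every file under the directory.
    pairs = sc.wholeTextFiles("hdfs:///logs")
    pairs.saveAsSequenceFile("hdfs:///logs_as_sequencefiles")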

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs :) (You can think of union as similar to the union operator on sets except without the unique element restrictions). On Wed, Oct 29, 2014 at 3:15 PM, spr wrote: > The
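The set analogy in RDD terms, assuming an existing SparkContext `sc` — note that duplicates are kept:

    a = sc.parallelize([1, 2])
    b = sc.parallelize([2, 3])
    print(a.union(b).collect())  # [1, 2, 2, 3] -- no deduplication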

Re: how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread Holden Karau
On Wed, Oct 29, 2014 at 3:29 PM, spr wrote: > I am processing a log file, from each line of which I want to extract the > zeroth and 4th elements (and an integer 1 for counting) into a tuple. I > had > hoped to be able to index the Array for elements 0 and 4, but Arrays appear > not to support v
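The extraction being asked about, sketched for an RDD or DStream of lines named `lines` — the whitespace-separated log format is an assumption:

    # ((field0, field4), 1) tuples, ready for counting by key.
    tuples = lines.map(lambda line: line.split()) \
                  .map(lambda f: ((f[0], f[4]), 1))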

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
The union function works on a DStream of the same templated type. If you have heterogeneous data you can first map each DStream to a case class with options, or try something like http://stackoverflow.com/questions/3508077/does-scala-have-type-disjunction-union-types

Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread Holden Karau
This is the expected behavior. Spark Streaming only reads new files once, this is why they must be created through an atomic move so that Spark doesn't accidentally read a partially written file. I'd recommend looking at "Basic Sources" in the Spark Streaming guide ( http://spark.apache.org/docs/la
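The atomic-move pattern in practice — paths are illustrative, and note that os.rename is only atomic within a single filesystem:

    import os

    tmp = "/data/_staging/events-0001.log"    # written outside the watched dir
    final = "/data/incoming/events-0001.log"  # the dir textFileStream monitors
    with open(tmp, "w") as f:
        f.write("new records\n")
    os.rename(tmp, final)  # the file appears in the watched dir all at once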

Re: Nesting RDD

2014-11-06 Thread Holden Karau
Hi Naveen, Nesting RDDs inside of transformations or actions is not supported. Instead if you need access to the other RDDs contents you can try doing a join or (if the data is small enough) collecting and broadcasting the second RDD. Cheers, Holden :) On Thu, Nov 6, 2014 at 10:28 PM, Naveen Ku
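Both workarounds sketched, assuming two pair RDDs `rdd1` and `rdd2` keyed the same way:

    # Option 1: keep everything distributed with a join.
    joined = rdd1.join(rdd2)

    # Option 2: if rdd2 is small, collect it and broadcast a lookup table.
    lookup = sc.broadcast(dict(rdd2.collect()))
    enriched = rdd1.map(lambda kv: (kv[0], (kv[1], lookup.value.get(kv[0]))))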

Re: Parallelize on spark context

2014-11-06 Thread Holden Karau
Hi Naveen, By default, when we call parallelize the data is split into the default number of partitions (which we can control with the property spark.default.parallelism); if we want a specific instance of parallelize to use a different number of partitions, we can instead call sc.parallelize(data
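Side by side, assuming an existing SparkContext `sc` and a list `data`:

    rdd_default = sc.parallelize(data)   # spark.default.parallelism partitions
    rdd_ten = sc.parallelize(data, 10)   # exactly 10 partitions
    print(rdd_ten.getNumPartitions())    # 10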

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Holden Karau
There is also https://github.com/apache/spark/pull/18 against the current repo which may be easier to apply. On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh wrote: > Hi Jaonary, > > You can find the code for k-fold CV in > https://github.com/apache/incubator-spark/pull/448. I have not found the
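A hand-rolled sketch of k-fold splitting with randomSplit — not the API from the linked pull requests, just the idea, assuming an existing RDD `rdd` and SparkContext `sc`:

    k = 5
    folds = rdd.randomSplit([1.0] * k, seed=42)  # k roughly equal slices
    for i, test in enumerate(folds):
        train = sc.union([f for j, f in enumerate(folds) if j != i])
        # fit on `train`, evaluate on `test`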
