Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-27 Thread Holden Karau
w.com/users/1305344/jacek-laskowski > > > On Wed, Oct 21, 2015 at 12:55 AM, Holden Karau > wrote: > > Hi SF based folks, > > > > I'm going to try doing some simple office hours this Friday afternoon > > outside of Paramo Coffee. If no one comes by I'll ju

Re: Spark-Testing-Base Q/A

2015-10-28 Thread Holden Karau
And now (before 1am California time :p) there is a new version of spark-testing-base which adds a Java base class for streaming tests. I noticed you were using 1.3, so I put in the effort to make this release for Spark 1.3 to 1.5 (inclusive). On Wed, Oct 21, 2015 at 4:16 PM, Holden Karau wrote

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-11-10 Thread Holden Karau
Oct 27, 2015 at 11:43 AM, Holden Karau wrote: > So I'm going to try and do these again, with an on-line ( > http://doodle.com/poll/cr9vekenwims4sna ) and SF version ( > http://doodle.com/poll/ynhputd974d9cv5y ). You can help me pick a day > that works for you by filling out the

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Holden Karau
I just pushed some code that does this for spark-testing-base ( https://github.com/holdenk/spark-testing-base ) (it's in master) and will publish an updated artifact with it tonight. On Fri, Aug 14, 2015 at 3:35 PM, Tathagata Das wrote: > A hacky workaround is to create a custom InputDStre

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Holden Karau
So println of any array of strings will look like that. The java.util.Arrays class has some options to print arrays nicely. On Thu, Aug 27, 2015 at 2:08 PM, Arun Luthra wrote: > What types of RDD can saveAsObjectFile(path) handle? I tried a naive test > with an RDD[Array[String]], but when I tri

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Holden Karau
Yes, any Java serializable object. It's important to note that since it's saving serialized objects, it is as brittle as Java serialization when it comes to version changes, so if you can make your data fit in something like sequence files, parquet, avro, or similar it can be not only more space effic

Re: tweet transformation ideas

2015-08-27 Thread Holden Karau
It seems like this might be better suited to a broadcast hash map since 200k entries isn't that big. You can then map over the tweets and look up each word in the broadcast map. On Thursday, August 27, 2015, Jesse F Chen wrote: > This is a question on general usage/best practice/best transfor
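A minimal sketch of that approach, assuming an active SparkContext sc, an RDD of tweet strings called tweets, and a hypothetical word_scores dict standing in for the ~200k-entry table:

```python
# Broadcast the lookup table once so each executor gets a single copy.
word_scores = {"spark": 1.0, "streaming": 0.5}  # placeholder for the ~200k entries
bc_scores = sc.broadcast(word_scores)

def score_tweet(tweet):
    # Look each word up in the broadcast map on the executor side.
    return [(w, bc_scores.value[w]) for w in tweet.split() if w in bc_scores.value]

scored = tweets.map(score_tweet)
```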

Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
A common practice to do this is to use foreachRDD with a local var to accumulate the data (you can see it in the Spark Streaming test code). That being said, I am a little curious why you want the driver to create the file specifically. On Friday, September 11, 2015, allonsy > wrote: > Hi everyo
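A rough sketch of that pattern, assuming a DStream named stream; the function passed to foreachRDD (and the collect() inside it) runs on the driver, so appending to a driver-local list is safe:

```python
# Driver-local accumulation: foreachRDD's body executes on the driver.
collected = []

def grab_batch(rdd):
    collected.extend(rdd.collect())

stream.foreachRDD(grab_batch)
```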

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
without the need of repartitioning data? > > Hope I have been clear, I am pretty new to Spark. :) > > 2015-09-11 18:19 GMT+02:00 Holden Karau >: > >> A common practice to do this is to use foreachRDD with a local var to >> accumulate the data (you can see it in the Spark

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-04 Thread Holden Karau
You really shouldn't mix different versions of Spark between the master and worker nodes; if you're going to upgrade, upgrade all of them. Otherwise you may get very confusing failures. On Monday, September 5, 2016, Rex X wrote: > Wish to use the Pivot Table feature of data frame which is availab

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-10 Thread Holden Karau
> > On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" > wrote: > > You really shouldn't mix different versions of Spark between the master > and worker nodes; if you're going to upgrade, upgrade all of them. Otherwise > you may get very confusing failures. >

Re: Strings not converted when calling Scala code from a PySpark app

2016-09-12 Thread Holden Karau
Ah yes, so the Py4J conversions only apply in the driver program - your DStream, however, is RDDs of pickled objects. If you want to work with a transform function, using Spark SQL and transferring DataFrames back and forth between Python and Scala can be much easier. On Monday, September 12, 2016, Alexis

Re: databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Holden Karau
So the good news is the csv library has been integrated into Spark 2.0, so you don't need to use that package. On the other hand, if you're on an older version you can include it using the standard sbt or maven package configuration. On Friday, September 23, 2016, Dan Bikle wrote: > hello world-of

PySpark UDF Performance Exploration w/Jython (Early/rough 2~3X improvement*) [SPARK-15369]

2016-10-05 Thread Holden Karau
Hi Python Spark Developers & Users, As Datasets/DataFrames are becoming the core building block of Spark, and as someone who cares about Python Spark performance, I've been looking more at PySpark UDF performance. I've got an early WIP/request for comments pull request open

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful and I'd also completely forgotten about the lack of built in UDAF support which is rather important. There is a PR to make it easier to call/register JVM UDFs from Python which will hopefully help a bit there too. I'm getting

Re: detecting last record of partition

2016-10-13 Thread Holden Karau
It sounds like mapPartitionsWithIndex will give you the information you want over flatMap. On Thursday, October 13, 2016, Shushant Arora wrote: > Hi > > I have a transformation on a pair rdd using flatmap function. > > 1.Can I detect in flatmap whether the current record is last record of > part
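A sketch of how mapPartitionsWithIndex can surface that information, using one element of lookahead to flag the last record of each partition (rdd is assumed to be the pair RDD from the question):

```python
# Tag each record with its partition index and whether it is the
# partition's final record, by holding one element of lookahead.
def tag_last(index, iterator):
    prev = next(iterator, None)
    if prev is None:
        return  # empty partition
    for record in iterator:
        yield (index, prev, False)
        prev = record
    yield (index, prev, True)  # iterator exhausted, so prev was the last record

tagged = rdd.mapPartitionsWithIndex(tag_last)
```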

Re: Aggregate UDF (UDAF) in Python

2016-10-16 Thread Holden Karau
I don't believe UDAFs are available in PySpark as this came up on the developer list while I was asking for what features people were missing in PySpark - see http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html . T

Re: Aggregate UDF (UDAF) in Python

2016-10-16 Thread Holden Karau
eventually > abandon python for scala. It just takes too long for features to get ported > over from scala/java. > > > On Sun, Oct 16, 2016 at 8:42 AM, Holden Karau > wrote: > >> I don't believe UDAFs are available in PySpark as this came up on the >> develop

Re: Contributing to PySpark

2016-10-18 Thread Holden Karau
Hi Krishna, Thanks for your interest contributing to PySpark! I don't personally use either of those IDEs so I'll leave that part for someone else to answer - but in general you can find the building spark documentation at http://spark.apache.org/docs/latest/building-spark.html which includes note

Re: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$ . Please Help!!!!!!!

2016-11-04 Thread Holden Karau
It seems like you've marked the Spark jars as provided; in this case they are only provided when you run your application with spark-submit or otherwise have Spark's JARs on your class path. How are you launching your application? On Fri, Nov 4, 2016 at 2:00 PM, shyla deshpande wrote: > object A

Re: Spark-packages

2016-11-06 Thread Holden Karau
I think there is a bit more life in the connector side of things for spark-packages, but there seem to be some outstanding issues with Python support that are waiting on progress (see https://github.com/databricks/sbt-spark-package/issues/26 ). It's possible others are just distributing on maven ce

Re: SparkILoop doesn't run

2016-11-17 Thread Holden Karau
Moving to the user list. So this might be a better question for the user list - but is there a reason you are trying to use the SparkILoop for tests? On Thu, Nov 17, 2016 at 5:47 PM Mohit Jaggi wrote: > > > I am trying to use SparkILoop to write some tests (shown below) but the > test hangs with the

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
Hi, The TaskContext isn't currently exposed in PySpark but I've been meaning to look at exposing at least some of TaskContext for parity in PySpark. Is there a particular use case which you want this for? Would help with crafting the JIRA :) Cheers, Holden :) On Thu, Nov 24, 2016 at 1:39 AM, of

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
YARN will kill your processes if the child processes you start via pipe() consume too much memory. You can configure the amount of memory Spark leaves aside for other processes besides the JVM in the YARN containers with spark.yarn.executor.memoryOverhead. On Wed, Nov 23, 2016 at 10:38 PM, Sameer C
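A sketch of setting that property from PySpark; the 4096 MB value is an arbitrary example, not a recommendation:

```python
from pyspark import SparkConf, SparkContext

# Reserve extra container memory for the pipe()'d child processes.
conf = (SparkConf()
        .setAppName("pipe-job")
        .set("spark.yarn.executor.memoryOverhead", "4096"))  # value is in MB
sc = SparkContext(conf=conf)
```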

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
3 AM, Ofer Eliassaf wrote: > Since we can't work with log4j in pyspark executors we build our own > logging infrastructure (based on logstash/elastic/kibana). > Would help to have TID in the logs, so we can drill down accordingly. > > > On Thu, Nov 24, 2016 at 11:48 AM, Holden K

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
https://issues.apache.org/jira/browse/SPARK-18576 On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau wrote: > Cool - thanks. I'll circle back with the JIRA number once I've got it > created - will probably take awhile before it lands in a Spark release > (since 2.1 has already b

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
n me finding a committer who shares my view - but I've over a hundred commits so it happens more often than not :) On Thu, Nov 24, 2016 at 3:15 AM, Ofer Eliassaf wrote: > thank u so much for this! Great to see that u listen to the community. > > On Thu, Nov 24, 2016 at 12:10 PM, Holde

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
> memory from YARN executor memory overhead, as well? How will YARN know that > the container launched by the docker daemon is linked to an executor? > > Best, > Sameer > > On Thu, Nov 24, 2016 at 1:59 AM Holden Karau wrote: > >> YARN will kill your processes if the child

Re: unit testing in spark

2016-12-08 Thread Holden Karau
There are also libraries designed to simplify testing Spark in the various platforms, spark-testing-base for Scala/Java/Python (& video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused property ba

Re: unit testing in spark

2016-12-08 Thread Holden Karau
I've created my own library for > this as well. In my blog post I talk about testing with Spark in RSpec > style: > https://medium.com/@therevoltingx/test-driven-development-w-apache-spark- > 746082b44941 > > Sent from my iPhone > > On Dec 8, 2016, at 4:09 PM, Holden Karau

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Holden Karau
How many workers are in the cluster? On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma wrote: > Hi All, > I am iterating over data frame's paritions using df.foreachPartition . > Upon each iteration of row , i am initializing DAO to insert the row into > cassandra. > Each of these iteration takes a

Re: [PySpark - 1.6] - Avoid object serialization

2016-12-29 Thread Holden Karau
Alternatively, using the broadcast functionality can also help with this. On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern wrote: > 2016-12-28 20:17 GMT+01:00 Chawla,Sumit : > > Would this work for you? > > def processRDD(rdd): > analyzer = ShortTextAnalyzer(root_dir) > rdd.foreach(lambd
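A sketch of the broadcast alternative, reusing the ShortTextAnalyzer and root_dir names from the quoted code (the process_rdd_element method is a hypothetical stand-in for whatever the lambda does):

```python
# Broadcast one copy of the analyzer per executor instead of
# re-serializing it into every task closure.
analyzer_bc = sc.broadcast(ShortTextAnalyzer(root_dir))

rdd.foreach(lambda elem: analyzer_bc.value.process_rdd_element(elem))
```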

Re: Efficient look up in Key Pair RDD

2017-01-08 Thread Holden Karau
To start with, caching and having a known partitioner will help a bit; then there is also the IndexedRDD project, but in general Spark might not be the best tool for the job. Have you considered having Spark output to something like memcache? What is the goal you are trying to accomplish? On Sun,
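A sketch of the first suggestion, assuming a pair RDD named pairs; with a known partitioner in place, lookup() only scans the single partition that can contain the key:

```python
# Hash-partition and cache so repeated lookups hit one cached partition.
indexed = pairs.partitionBy(100).cache()   # 100 partitions is an example value
indexed.count()                            # an action to materialize the cache
values = indexed.lookup("some-key")        # returns all values for the key
```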

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Holden Karau
Hi Gilad, Spark uses the sample standard deviation inside of the StandardScaler (see https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler ), which I think would explain the results you are seeing. I believe the scalers are intended to
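The distinction, illustrated with numpy; ddof=1 is the corrected sample estimate, which matches StandardScaler's documented behaviour:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(x, ddof=0))  # population std dev: ~1.118
print(np.std(x, ddof=1))  # sample std dev:     ~1.291 (what StandardScaler uses)
```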

Re: handling of empty partitions

2017-01-08 Thread Holden Karau
Hi Georg, Thanks for the question along with the code (as well as posting to Stack Overflow). In general, if a question is well suited for Stack Overflow it's probably better suited to the user@ list instead of the dev@ list, so I've cc'd the user@ list for you. As far as handling empty partitions wh
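One common way to guard against empty partitions, as a sketch: peek at the iterator before doing any per-partition setup (expensive_init and handle are hypothetical stand-ins):

```python
def process_partition(iterator):
    first = next(iterator, None)
    if first is None:
        return                     # empty partition: skip setup entirely
    state = expensive_init()       # hypothetical per-partition setup
    yield handle(state, first)
    for record in iterator:
        yield handle(state, record)

result = rdd.mapPartitions(process_partition)
```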

Re: Why StringIndexer uses double instead of int for indexing?

2017-01-21 Thread Holden Karau
In downstream stages the labels & features are generally expected to be doubles, so it's easier to use a double. On Sat, Jan 21, 2017 at 5:32 PM Shiyuan wrote: > Hi Spark, > StringIndexer uses double instead of int for indexing > http://spark.apache.org/docs/latest/ml-features.html#stringindexe

Re: ML version of Kmeans

2017-01-31 Thread Holden Karau
You most likely want the transform function on KMeansModel (although that works on a dataset input rather than a single element at a time). On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I am not able to find predict method on "ML" version of Km
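A sketch of the transform route, assuming DataFrames train_df and test_df that each have a "features" vector column:

```python
from pyspark.ml.clustering import KMeans

model = KMeans(k=2, seed=1).fit(train_df)  # trains on the "features" column
predictions = model.transform(test_df)     # appends a "prediction" column
```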

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Holden Karau
Which version of Spark is this (or is it a dev build)? We've recently made some improvements with PYTHONHASHSEED propagation. On Tue, Apr 4, 2017 at 7:49 AM Eike von Seggern wrote: 2017-04-01 21:54 GMT+02:00 Paul Tremblay : When I try to to do a groupByKey() in my spark environment, I get the e

Spark Testing Library Discussion

2017-04-14 Thread Holden Karau
Hi Spark Users (+ Some Spark Testing Devs on BCC), Awhile back on one of the many threads about testing in Spark there was some interest in having a chat about the state of Spark testing and what people want/need. So if you are interested in joining an online (with maybe an IRL component if enoug

Re: Spark Testing Library Discussion

2017-04-24 Thread Holden Karau
are alternative ideas. I'll record the hangout and if it isn't terrible I'll post it for those who weren't able to make it (and for next time I'll include more European friendly time options - Doodle wouldn't let me update it once posted). On Fri, Apr 14, 2017 at 11:

Re: Spark Testing Library Discussion

2017-04-24 Thread Holden Karau
The (tentative) link for those interested is https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue . On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau wrote: > So 14 people have said they are available on Tuesday the 25th at 1PM > pacific so we will do this meeting then (

Re: Spark Testing Library Discussion

2017-04-25 Thread Holden Karau
Urgh hangouts did something frustrating, updated link https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau wrote: > The (tentative) link for those interested is https://hangouts.google. > com/hangouts/_/oyjvcnffejcjhi6qazf3l

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
And the recording of our discussion is at https://www.youtube.com/watch?v=2q0uAldCQ8M A few of us have follow up things and we will try and do another meeting in about a month or two :) On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau wrote: > Urgh hangouts did something frustrating, updated l

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
Sorry about that, hangouts on air broke in the first one :( On Wed, Apr 26, 2017 at 8:41 AM, Marco Mistroni wrote: > Uh, I stayed online in the other link but nobody joined. Will follow the > transcript > Kr > > On 26 Apr 2017 9:35 am, "Holden Karau" wrote: >

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
In PySpark the filter and then map steps are combined into a single transformation from the JVM point of view. This allows us to avoid copying the data back to Scala in between the filter and the map steps. The debugging experience is certainly much harder in PySpark and I think is an interesting

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
would be a good intro (of course I'm pretty biased about that). On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov wrote: > Thanks for the quick answer, Holden! > > Are there any other tricks with PySpark which are hard to debug using UI > or toDebugString? > > On Wed, May 10, 2

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Holden Karau
In non-streaming Spark, checkpoints aren't for inter-application recovery; rather, you can think of them as doing a persist but to HDFS rather than each node's local memory / storage. On Fri, May 26, 2017 at 3:06 PM Priya wrote: > Hi, > > With nonstreaming spark application, did checkpoint the RDD
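A minimal sketch of a non-streaming checkpoint; the HDFS path is an assumed example:

```python
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # assumed path
rdd.cache()        # cache first so the RDD isn't recomputed for the checkpoint
rdd.checkpoint()
rdd.count()        # checkpointing actually happens when an action runs
```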

Re: Can we access files on Cluster mode

2017-06-24 Thread Holden Karau
addFile is supposed to not depend on a shared FS unless the semantics have changed recently. On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri wrote: > Hi Sudhir, > > I believe you have to use a shared file system that is accessed by all > nodes. > > > On Jun 24, 2017, at 1:30 PM, sudhir k wrote:
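A sketch of the addFile flow with a hypothetical local file; the driver ships the file to every node, and tasks resolve their node-local copy through SparkFiles:

```python
from pyspark import SparkFiles

sc.addFile("/local/path/lookup.txt")  # hypothetical file on the driver

def filter_with_file(rows):
    # Each executor opens its own local copy of the shipped file.
    with open(SparkFiles.get("lookup.txt")) as f:
        keep = set(line.strip() for line in f)
    return (r for r in rows if r in keep)

filtered = rdd.mapPartitions(filter_with_file)
```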

With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Holden Karau
Hi wonderful Python + Spark folks, I'm excited to announce that with Spark 2.2.0 we finally have PySpark published on PyPI (see https://pypi.python.org/pypi/pyspark / https://twitter.com/holdenkarau/status/885207416173756417). This has been a long time coming (previous releases included pip instal

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Holden Karau
The memory overhead is based less on the total amount of data and more on what you end up doing with the data (e.g. if you're doing a lot of off-heap processing or using Python you need to increase it). Honestly most people find this number for their job "experimentally" (e.g. they try a few differen

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
So performing complete output without an aggregation would require building up a table of the entire input to write out at each micro batch. This would get prohibitively expensive quickly. With an aggregation we just need to keep track of the aggregates and update them every batch, so the memory re
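A sketch of what that looks like in practice, assuming an active SparkSession spark; with the aggregation, only the per-key counts need to be held in state between micro batches:

```python
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999)
         .load())
counts = lines.groupBy("value").count()   # the required aggregation

query = (counts.writeStream
         .outputMode("complete")          # re-emit the full (small) result table
         .format("console")
         .start())
```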

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Fri, Aug 18, 2017 at 6:35 PM, Holden Karau > wrote: > > So performing complete ou

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Holden Karau
My assumption is it would be similar though: an in-memory sink of all of your records would quickly overwhelm your cluster, but with an aggregation it could be reasonable. But there might be additional reasons on top of that. On Fri, Aug 18, 2017 at 11:44 AM Holden Karau wrote: > Ah yes I'm

[ANNOUNCE] Apache Spark 2.1.2

2017-10-25 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.2! Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. To download Apache Spark 2.1.2 visit http://spark.apache.org/downloa

Re: Use of Accumulators

2017-11-13 Thread Holden Karau
So you want to set an accumulator to 1 after a transformation has fully completed? Or what exactly do you want to do? On Mon, Nov 13, 2017 at 9:47 PM vaquar khan wrote: > Confirmed ,you can use Accumulators :) > > Regards, > Vaquar khan > > On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit < > k

Re: Use of Accumulators

2017-11-14 Thread Holden Karau
t toggle it saying there is some change while > processing the data. > > > > Please let me know if we can runtime do this. > > > > > > Thanks! > > *~Kedar Dixit* > > Bigdata Analytics at Persistent Systems Ltd. > > > > *From:* Holden Karau [via

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Holden Karau
What command did you use to launch your Spark application? The https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying documentation suggests using spark-submit with the `--packages` flag to include the required Kafka package. e.g. ./bin/spark-submit --packages o
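Once the package is on the classpath, the read itself looks roughly like this per the linked guide (broker and topic names are placeholders, and an active SparkSession spark is assumed):

```python
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")  # placeholder broker
      .option("subscribe", "topic1")                    # placeholder topic
      .load())
parsed = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```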

What do you pay attention to when validating Spark jobs?

2017-11-21 Thread Holden Karau
Hi Folks, I'm working on updating a talk and I was wondering if any folks in the community wanted to share their best practices for validating your Spark jobs? Are there any counters folks have found useful for monitoring/validating your Spark jobs? Cheers, Holden :) -- Twitter: https://twitte

Re: NLTK with Spark Streaming

2017-11-26 Thread Holden Karau
So it’s certainly doable (it’s not super easy mind you), but until the arrow udf release goes out it will be rather slow. On Sun, Nov 26, 2017 at 8:01 AM ashish rawat wrote: > Hi, > > Has someone tried running NLTK (python) with Spark Streaming (scala)? I > was wondering if this is a good idea a

Re: Is Databricks REST API open source ?

2017-12-02 Thread Holden Karau
That API is not open source. There are some other options as separate projects you can check out (like Livy,spark-jobserver, etc). On Sat, Dec 2, 2017 at 8:30 PM kant kodali wrote: > HI All, > > Is REST API (https://docs.databricks.com/api/index.html) open source? > where I can submit spark jobs

Re: Recommended way to serialize Hadoop Writables' in Spark

2017-12-03 Thread Holden Karau
So is there a reason you want to shuffle Hadoop types rather than the Java types? As for your specific question, for Kryo you also need to register your serializers; did you do that? On Sun, Dec 3, 2017 at 10:02 AM pradeepbaji wrote: > Hi, > > Is there any recommended way of serializing Hadoop

Re: Access to Applications metrics

2017-12-05 Thread Holden Karau
I've done a SparkListener to record metrics for validation (it's a bit out of date). Are you just looking to have graphing/alerting set up on the Spark metrics? On Tue, Dec 5, 2017 at 1:53 PM, Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > You can also get the metrics from the Spark app

Re: Spark Tuning Tool

2018-01-22 Thread Holden Karau
That's very interesting, and might also get some interest on the dev@ list if it was open source. On Tue, Jan 23, 2018 at 4:02 PM, Roger Marin wrote: > I'd be very interested. > > On 23 Jan. 2018 4:01 pm, "Rohit Karlupia" wrote: > >> Hi, >> >> I have been working on making the performance tunin

FOSDEM mini-office hour?

2018-01-31 Thread Holden Karau
Hi Spark Friends, If any folks are around for FOSDEM this year I was planning on doing a coffee office hour on the last day after my talks. Maybe like 6pm? I'm also going to see if any BEAM folks are around and interested :) Cheers, Holden

Re: pyspark+spacy throwing pickling exception

2018-02-15 Thread Holden Karau
So you left out the exception. On one hand I'm also not sure how well spacy serializes, so to debug this I would start off by moving the nlp = spacy.load('en') line inside of your function and seeing if it still fails. On Thu, Feb 15, 2018 at 9:08 PM Selvam Raman wrote: > import spacy > > nlp = spacy.load('en') > > > > de
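A sketch of the suggested restructuring: loading the model inside the function means each executor builds it locally instead of unpickling it from the driver:

```python
def tag_partition(rows):
    import spacy
    nlp = spacy.load('en')   # loaded on the executor, once per partition
    for row in rows:
        yield [token.text for token in nlp(row)]

tagged = rdd.mapPartitions(tag_partition)
```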

Re: Can spark handle this scenario?

2018-02-16 Thread Holden Karau
I'm not sure what you mean by "it could be hard to serialize complex operations"? Regardless, I think the question is: do you want to parallelize this on multiple machines or just one? On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: > Thanks Ayan. RDD may support map better than Dataset/DataFrame. How

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread Holden Karau
You can also look at the shuffle file cleanup tricks we do inside of the ALS algorithm in Spark. On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote: > have you looked at > http://apache-spark-user-list.1001560.n3.nabble.com/Limit- > Spark-Shuffle-Disk-Usage-td23279.html > > and the post mentioned

Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
Hi folks, If you're curious about learning more about how Spark is developed, I'm going to experiment with doing a live code review where folks can watch and see how that part of our process works. I have two volunteers already for having their PRs looked at live, and if you have a Spark PR you're working on

Re: Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
If anyone wants to watch the recording: https://www.youtube.com/watch?v=lugG_2QU6YU I'll do one next week as well - March 16th @ 11am - https://www.youtube.com/watch?v=pXzVtEUjrLc On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau wrote: > Hi folks, > > If you're curious about learning

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Holden Karau
Super exciting! I look forward to digging through it this weekend. On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Excellent. You filled a missing link. > > Best, > Passion > > On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia > wrote: > >> Hi, >> >>

Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Hi Y'all, If you're interested in learning more about how the development process in Apache Spark works, I've been doing a weekly live streamed code review most Fridays at 11am. This week's will be on twitch/youtube ( https://www.twitch.tv/holdenkarau / https://www.youtube.com/watch?v=vGVSa9KnD80 ). I

Re: Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
timezone? >> >> On Thu, 12 Apr 2018, 21:23 Holden Karau wrote: >> >>> Hi Y'all, >>> >>> If you're interested in learning more about how the development process in >>> Apache Spark works I've been doing a weekly live streamed code re

Re: Live Stream Code Reviews :)

2018-04-13 Thread Holden Karau
> zone I guess. > > Regards, > Gourav Sengupta > > On Thu, Apr 12, 2018 at 8:23 PM, Holden Karau > wrote: > >> Hi Y'all, >> >> If you're interested in learning more about how the development process in >> Apache Spark works I've been doin

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Holden Karau
So, while it's not perfect, I have a guide focused on running custom Spark on GKE https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc and if you want to run pre-built Spark on GKE there is a solutions article

Re: testing frameworks

2018-05-21 Thread Holden Karau
So I’m biased as the author of spark-testing-base but I think it’s pretty ok. Are you looking for unit or integration or something else? On Mon, May 21, 2018 at 5:24 AM Steve Pruitt wrote: > Hi, > > > > Can anyone recommend testing frameworks suitable for Spark jobs. > Something that can be inte

Re: testing frameworks

2018-05-30 Thread Holden Karau
So Jesse has an excellent blog post on how to use it with Java applications - http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/ On Wed, May 30, 2018 at 4:14 AM Spico Florin wrote: > Hello! > I'm also looking for unit testing spark Java application. I've seen the > great work

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Holden Karau
If it’s one 33mb file which decompressed to 1.5g then there is also a chance you need to split the inputs since gzip is a non-splittable compression format. On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias wrote: > Are you sure that your JSON file has the right format? > > spark.read.json(...)
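A sketch of that workaround: since the whole .json.gz lands in a single partition, repartition right after the read (the path and partition count below are placeholders, and an active SparkSession spark is assumed):

```python
# gzip is non-splittable, so the read produces one partition;
# repartitioning spreads the parsed rows across the cluster.
df = spark.read.json("data/input.json.gz").repartition(64)
```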

Spark ML online serving

2018-06-06 Thread Holden Karau
At Spark Summit some folks were talking about model serving and we wanted to collect requirements from the community. -- Twitter: https://twitter.com/holdenkarau

Re: Live Streamed Code Review today at 11am Pacific

2018-06-07 Thread Holden Karau
I'll be doing another one tomorrow morning at 9am pacific focused on Python + K8s support & improved JSON support - https://www.youtube.com/watch?v=Z7ZEkvNwneU & https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :) On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau wrote: > If anyon

Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
and the other will be the regular Friday code review ( https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am. On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau wrote: > I'll be doing another one tomorrow morning at 9am pacific focused on &

Re: Live Streamed Code Review today at 11am Pacific

2018-06-27 Thread Holden Karau
.com/user/holdenkarau & https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage more folks to help with RC validation & PR reviews :) On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau wrote: > Next week is pride in San Francisco but I'm still going to do two quick > sess

[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3! Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. The release notes are available at http://spark.apache.org/releases/s

Re: Live Streamed Code Review today at 11am Pacific

2018-07-13 Thread Holden Karau
PySpark and working on Sparkling ML - https://www.youtube.com/watch?v=kCnBDpNce9A&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=32 On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau wrote: > Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and > see how we validate S

Re: Pyspark access to scala/java libraries

2018-07-15 Thread Holden Karau
If you want to see some examples, a library that shows a way to do it is https://github.com/sparklingpandas/sparklingml , and High Performance Spark also talks about it. On Sun, Jul 15, 2018, 11:57 AM <0xf0f...@protonmail.com.invalid> wrote: > Check > https://stackoverflow.com/questions/31684842/callin
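For a rough idea of the py4j route; note that _jvm is a private hook into the gateway, and com.example.MyLib is a hypothetical class assumed to already be on the driver's classpath:

```python
jvm = spark.sparkContext._jvm                          # py4j gateway into the JVM
result = jvm.com.example.MyLib.process("some input")   # hypothetical JVM-side call
```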

Re: Live Streamed Code Review today at 11am Pacific

2018-07-19 Thread Holden Karau
Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30 am because I had to move some flights around. On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau wrote: > This afternoon @ 3pm pacific I'll be looking at review tooling for Spark & > Beam https://www.yout

Live Code Reviews, Coding, and Dev Tools

2018-07-24 Thread Holden Karau
Tomorrow afternoon @ 3pm pacific I'll be doing some dev tools poking for Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for mention-bot. On Friday I'll be doing my normal code reviews - https://www.youtube.com/watch?v=O4rRx-3PTiM On Monday July 30th @ 9:30am I'll be doing some more

Re: Use Arrow instead of Pickle without pandas_udf

2018-07-25 Thread Holden Karau
Not currently. What's the problem with pandas_udf for your use case? On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi wrote: > Hi There, > > > Is there a way to use Arrow format instead of Pickle but without using > pandas_udf ? > > > Thank for your help, > > > Hichame > -- Twitter: https:

Re: Live Streamed Code Review today at 11am Pacific

2018-09-20 Thread Holden Karau
order batches) is my current plan to start with :) On Thu, Jul 19, 2018 at 11:38 PM Holden Karau wrote: > Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30 > am because I had to move some flights around. > > On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau &

Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today - https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the afternoon at 3pm Pacific I’ll be doing some live coding around WIP graceful decommissioning PR - https://youtu.be/4FKuYk2sbQ8 -- Twitter: https://twitter.com/h

Re: Is there any Spark source in Java

2018-11-03 Thread Holden Karau
Parts of it are indeed written in Java. You probably want to reach out to the developers list to talk about changing Spark. On Sat, Nov 3, 2018, 11:42 AM Soheil Pourbafrani wrote: > Hi, I want to customize some part of Spark. I was wondering if there is any > Spark source written in the Java language, or all

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
If folks are interested, while it's not on Amazon, I've got a live stream of getting client mode with a Jupyter notebook to work on GCP/GKE: https://www.youtube.com/watch?v=eMj0Pv1-Nfo&index=3&list=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi wrote: > Hi Li, > > >

Re: How to preserve event order per key in Structured Streaming Repartitioning By Key?

2018-12-11 Thread Holden Karau
So it's been awhile since I poked at the streaming code base, but I don't think we make any promises about stable sort during repartition, and there are notes in there about how some of these components should be re-written into core, so even if we did have stable sort I wouldn't depend on it unless it

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Holden Karau
+1 On Fri, May 31, 2019 at 5:41 PM Bryan Cutler wrote: > +1 and the draft sounds good > > On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: > >> Here is the draft announcement: >> >> === >> Plan for dropping Python 2 support >> >> As many of you already knew, Python core development team and

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Holden Karau
+1 Does anyone have any critical fixes they’d like to see in 2.4.4? On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > Seems fine to me if there are enough valuable fixes to justify another > release. If there are any other important fixes imminent, it's fine to > wait for those. > > > On Tue, A

Re: Release Apache Spark 2.4.4

2019-08-14 Thread Holden Karau
>> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in >> EpochTracker (to support Python UDFs) >> <https://github.com/apache/spark/pull/24946> >> >> Thanks, >> Terry >> >> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan wrote: >

Re: Announcing .NET for Apache Spark 0.5.0

2019-09-30 Thread Holden Karau
Congratulations on the release :) On Mon, Sep 30, 2019 at 9:38 AM Terry Kim wrote: > We are thrilled to announce that .NET for Apache Spark 0.5.0 has just been > released! > > > > Some of the highlights of this release include: > >- Delta

Re: Loop through Dataframes

2019-10-06 Thread Holden Karau
So if you want to process the contents of a dataframe locally but not pull all of the data back at once, toLocalIterator is probably what you're looking for. It's still not great though, so maybe you can share the root problem you're trying to solve and folks might have some suggestions there. O
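A sketch of that approach; toLocalIterator() pulls partitions to the driver one at a time rather than all at once (handle is a placeholder for the local processing):

```python
for row in df.toLocalIterator():
    handle(row)  # local, driver-side processing, one row at a time
```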

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-11-01 Thread Holden Karau
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris wrote: > have you deactivated the spark.ui? > I have read several threads explaining the UI can lead to OOM because it > stores 1000 DAGs by default > > > On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote: > > Dear List, > > > > I've observed

Re: Why Spark generates Java code and not Scala?

2019-11-10 Thread Holden Karau
If you look inside of the code generation, we generate Java code and compile it with Janino. For interested folks, the conversation moved over to the dev@ list. On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin wrote: > What do you mean by this? Spark is written in a combination of Scala and > Java, and the

Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you switch the write for a count just so we can isolate whether it's the write or the count? Also, what's the output path you're using? On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo wrote: > > > Hi, > > > > I'm using pandas_udf and not able to run it from cluster mode, even though > the same code wo
