Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Brandon White
Hello all, I will be starting a new Spark codebase and I would like to get opinions on using Python over Scala. Historically, the Scala API has always been the strongest interface to Spark. Is this still true? Are there still many benefits and additional features in the Scala API that are not avai

Encrypting Airflow Communications

2016-10-27 Thread Brandon White
From what I see, Airflow communicates with a couple of sources: 1) SQL Store 2) Celery Broker. Does Airflow have any configurations which make it easy to encrypt all of its communications, or do we need to build custom solutions into Airflow?

Tracking metrics in a task

2016-10-04 Thread Brandon White
Hello! Airflow does a great job of tracking metrics at the task level and I am wondering if there is any support for tracking metrics within a task. Say I have a task which downloads data, processes it, then stores it. Are there any Airflow features which allow me to track how long these subtasks

Re: Using spark package XGBoost

2016-08-14 Thread Brandon White
The XGBoost integration with Spark currently only supports RDDs; there is a ticket for DataFrame support and folks claim to be working on it. On Aug 14, 2016 8:15 PM, "Jacek Laskowski" wrote: > Hi, > > I've never worked with the library and speaking about sbt setup only. > > It appears that the p

Setting spark.sql.shuffle.partitions Dynamically

2016-07-26 Thread Brandon White
Hello, My platform runs hundreds of Spark jobs every day, each with its own data size from 20 MB to 20 TB. This means that we need to set resources dynamically. One major pain point for doing this is spark.sql.shuffle.partitions, the number of partitions to use when shuffling data for joins or aggrega
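A sketch of deriving the setting from the input size before each job; the path, the 128 MB-per-partition target, and the 200-partition floor are all assumptions, not Spark defaults:

```scala
import org.apache.hadoop.fs.Path

// total bytes under the input directory (hypothetical path)
val inputPath = new Path("s3://bucket/input/")
val fs = inputPath.getFileSystem(sc.hadoopConfiguration)
val bytes = fs.getContentSummary(inputPath).getLength

// assumed target: ~128 MB of shuffle input per partition, floor of 200
val partitions = math.max(200, (bytes / (128L << 20)).toInt)
sqlContext.setConf("spark.sql.shuffle.partitions", partitions.toString)
```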

Optimal Amount of Tasks Per size of data in memory

2016-07-20 Thread Brandon White
What is the best heuristic for setting the number of partitions/tasks on an RDD based on the size of the RDD in memory? The Spark docs say that the number of partitions/tasks should be 2-3x the number of CPU cores, but this does not make sense for all data sizes. Sometimes, this number is way too muc
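A sketch of a size-based heuristic instead of the fixed core multiple; the ~128 MB-per-partition target and the floor are assumptions, not documented Spark constants:

```scala
// aim for roughly 100-200 MB per partition; minPartitions keeps small
// data from collapsing to too few tasks
def numPartitions(estimatedBytes: Long,
                  bytesPerPartition: Long = 128L << 20,
                  minPartitions: Int = 48): Int =
  math.max(minPartitions,
           math.ceil(estimatedBytes.toDouble / bytesPerPartition).toInt)

// e.g. repartitioning a ~20 GB RDD:
// rdd.repartition(numPartitions(20L << 30))
```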

Size of cached dataframe

2016-07-15 Thread Brandon White
Is there any public API to get the size of a dataframe in cache? It's shown in the Spark UI, but I don't see an API to access this information. Do I need to build it myself using a forked version of Spark?
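One possibility short of forking: SparkContext.getRDDStorageInfo, a @DeveloperApi rather than a stable public API, exposes roughly what the Storage tab shows. A sketch (path is hypothetical):

```scala
val df = sqlContext.read.parquet("/path/to/file")
df.cache()
df.count() // materialize the cache first

sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory, " +
          s"${info.diskSize} bytes on disk")
}
```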

Difference between Dataframe and RDD Persisting

2016-06-26 Thread Brandon White
What is the difference between persisting a dataframe and an RDD? When I persist my RDD, the UI says it takes 50G or more of memory. When I persist my dataframe, the UI says it takes 9G or less of memory. Does the dataframe not persist the actual content? Is it better / faster to persist an RDD when
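The size gap is consistent with DataFrame caching using Spark SQL's compressed in-memory columnar format, while RDD persistence stores deserialized Java objects. A sketch for comparing the two on the Storage tab (paths are hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/path/to/text")
rdd.persist(StorageLevel.MEMORY_ONLY) // deserialized Java objects
rdd.count()

val df = sqlContext.read.parquet("/path/to/parquet")
df.cache() // compressed in-memory columnar format
df.count()
// Compare both entries' "Size in Memory" on the UI's Storage tab.
```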

Re: What does it mean when a executor has negative active tasks?

2016-06-18 Thread Brandon White

Spark ML - Is it safe to schedule two trainings job at the same time or will worker state be corrupted?

2016-06-09 Thread Brandon White
For example, say I want to train two Linear Regressions and two GBT (gradient-boosted tree) Regressions. Using different threads, Spark allows you to submit jobs at the same time (see: http://spark.apache.org/docs/latest/job-scheduling.html). If I schedule two or more training jobs and they are running at the same t
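A sketch of concurrent training via Scala Futures, which is what the job-scheduling doc's "multiple threads" amounts to; trainingDF (with the usual label/features columns) is hypothetical:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.ml.regression.{GBTRegressor, LinearRegression}

// each fit() submits its own Spark jobs; the scheduler interleaves them
val lrFuture  = Future { new LinearRegression().fit(trainingDF) }
val gbtFuture = Future { new GBTRegressor().fit(trainingDF) }

val lrModel  = Await.result(lrFuture, Duration.Inf)
val gbtModel = Await.result(gbtFuture, Duration.Inf)
```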

Re: BlockManager crashing applications

2016-05-08 Thread Brandon White
May 8, 2016 5:55 PM, "Ashish Dubey" wrote: Brandon, how much memory are you giving to your executors - did you check if there were dead executors in your application logs.. Most likely you require higher memory for executors.. Ashish On Sun, May 8, 2016 at 1:01 PM, Brandon White wrot

BlockManager crashing applications

2016-05-08 Thread Brandon White
Hello all, I am running a Spark application which schedules multiple Spark jobs. Something like: val df = sqlContext.read.parquet("/path/to/file") filterExpressions.par.foreach { expression => df.filter(expression).count() } When the block manager fails to fetch a block, it throws an excepti

QueryExecution to String breaks with OOM

2016-05-02 Thread Brandon White
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2367) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuild

Is DataFrame randomSplit Deterministic?

2016-05-01 Thread Brandon White
If I have the same data, the same ratios, and same sample seed, will I get the same splits every time?
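A quick sanity check, with the caveat (worth verifying) that determinism also requires the DataFrame's partitioning to be identical across runs, since randomSplit seeds its sampler per partition:

```scala
val Array(train1, test1) = df.randomSplit(Array(0.7, 0.3), seed = 42L)
val Array(train2, test2) = df.randomSplit(Array(0.7, 0.3), seed = 42L)
println(train1.count() == train2.count()) // should print true
```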

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Brandon White
80 executors with 7g memory. On Apr 30, 2016 1:22 PM, "Ted Yu" wrote: > Can you provide a bit more information: > > Does the smaller dataset have skew ? > > Which release of Spark are you using ? > > How much memory did you specify ? > > Thanks > > On Sat, A

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Brandon White
randomSplit instead of randomSample On Apr 30, 2016 1:51 PM, "Brandon White" wrote: > val df = globalDf > val filteredDfs= filterExpressions.map { expr => > val filteredDf = df.filter(expr) > val samples = filteredDf.randomSample([.7, .3]) >(samples(0), sam

Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Brandon White
Hello, I am writing two datasets. One dataset is 2x larger than the other. Both datasets are written to parquet the exact same way using df.write.mode("Overwrite").parquet(outputFolder) The smaller dataset OOMs while the larger dataset writes perfectly fine. Here is the stack trace: Any ideas wha

DataFrame to DataSet without Predefined Class

2016-04-26 Thread Brandon White
I am reading parquet files into a dataframe. The schema varies depending on the data, so I have no way to write a predefined class. Is there any way to go from DataFrame to DataSet without predefining a case class? Can I build a class from my dataframe schema?

How can I bucketize / group a DataFrame from parquet files?

2016-04-25 Thread Brandon White
I am creating a dataFrame from parquet files. The schema is based on the parquet files, so I do not know it beforehand. What I want to do is group the entire DF into buckets based on a column. val df = sqlContext.read.parquet("/path/to/files") val groupedBuckets: DataFrame[String, Array[Rows]] = df.
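A sketch using groupBy with collect_list; the result is a DataFrame of (key, array) rows rather than a Map, column names are hypothetical, and in Spark 1.6 collect_list requires a HiveContext:

```scala
import org.apache.spark.sql.functions.collect_list

val df = sqlContext.read.parquet("/path/to/files")
// one array of values per bucket key; "bucketCol" and "valueCol" are
// hypothetical column names from the parquet schema
val buckets = df.groupBy("bucketCol")
  .agg(collect_list("valueCol").as("values"))
buckets.show()
```

Collecting whole rows per key (collect_list over a struct of all columns) only works natively from Spark 2.0 on.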

Re: subscribe

2015-08-22 Thread Brandon White
https://www.youtube.com/watch?v=umDr0mPuyQc On Sat, Aug 22, 2015 at 8:01 AM, Ted Yu wrote: > See http://spark.apache.org/community.html > > Cheers > > On Sat, Aug 22, 2015 at 2:51 AM, Lars Hermes < > li...@hermes-it-consulting.de> wrote: > >> subscribe >> >> -

Re: How to save a string to a text file ?

2015-08-14 Thread Brandon White
Convert it to a rdd then save the rdd to a file val str = "dank memes" sc.parallelize(List(str)).saveAsTextFile("str.txt") On Fri, Aug 14, 2015 at 7:50 PM, go canal wrote: > Hello again, > online resources have sample code for writing RDD to a file, but I have a > simple string, how to save to

Re: subscribe

2015-08-10 Thread Brandon White
https://www.youtube.com/watch?v=H07zYvkNYL8 On Mon, Aug 10, 2015 at 10:55 AM, Ted Yu wrote: > Please take a look at the first section of > https://spark.apache.org/community > > Cheers > > On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos > wrote: > >> please >> > >

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
So there is no good way to merge spark files in a managed hive table right now? On Wed, Aug 5, 2015 at 10:02 AM, Michael Armbrust wrote: > This feature isn't currently supported. > > On Wed, Aug 5, 2015 at 8:43 AM, Brandon White > wrote: > >> Hello, >> >&g

Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
Hello, I would love to have Hive merge the small files in my managed Hive table after every query. Right now, I am setting the Hive configuration in my Spark job configuration, but Hive is not merging the files. Do I need to set the Hive fields in another place? How do you set Hive configurations
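For reference, the standard Hive merge settings look like the sketch below; per Michael Armbrust's reply above, Spark 1.x apparently ignores them, so the coalesce workaround is the reliable path. Table name and file counts are hypothetical:

```scala
// standard Hive merge options, set on the HiveContext
sqlContext.setConf("hive.merge.mapfiles", "true")
sqlContext.setConf("hive.merge.mapredfiles", "true")
sqlContext.setConf("hive.merge.smallfiles.avgsize", (128L << 20).toString)

// workaround that always works: bound the file count yourself
df.coalesce(16).write.mode("append").saveAsTable("managed_table")
```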

Combining Spark Files with saveAsTextFile

2015-08-04 Thread Brandon White
What is the best way to make saveAsTextFile save as only a single file?
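The usual approach is coalesce(1), with an out-of-band merge if you need a literal single file rather than a one-part directory; a sketch:

```scala
// funnel everything through one task; only sensible when the output
// fits comfortably on a single node
rdd.coalesce(1).saveAsTextFile("/tmp/single-output")
// This still produces a directory with one part-00000 file; for a true
// single file, merge afterwards, e.g. with `hdfs dfs -getmerge`.
```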

Turn Off Compression for Textfiles

2015-08-04 Thread Brandon White
How do you turn off gz compression for saving as textfiles? Right now, I am reading .gz files and it is saving them as .gz. I would love to not compress them when I save. 1) DStream.saveAsTextFiles() //no compression 2) RDD.saveAsTextFile() //no compression Any ideas?
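The plain saveAsTextFile(path) overload writes uncompressed text unless the cluster's Hadoop output settings force compression; a sketch that clears them (assuming that is where the compression is coming from; the property names cover both the old and new Hadoop APIs):

```scala
sc.hadoopConfiguration.set("mapred.output.compress", "false")
sc.hadoopConfiguration.set(
  "mapreduce.output.fileoutputformat.compress", "false")

rdd.saveAsTextFile("/tmp/uncompressed-output")
// DStream equivalent:
// dstream.saveAsTextFiles("/tmp/uncompressed-prefix")
```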

Re: Schema evolution in tables

2015-08-03 Thread Brandon White
Sim did you find anything? :) On Sun, Jul 26, 2015 at 9:31 AM, sim wrote: > The schema merging > > section of the Spark SQL documentation shows an example of schema evolution > in a partitioned table. > > Is this fun

Re: Unsubscribe

2015-08-03 Thread Brandon White
YOU SHALL NOT PASS! On Aug 3, 2015 1:23 PM, "Aries Kay" wrote: > > > On Mon, Aug 3, 2015 at 1:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > >> Thanks a lot for all these documents. Appreciate your effort & time. >> >> On Mon, Aug 3, 2015 at 10:15 AM, Christian Tzolov >> wrote: >> >>> ÐΞ€ρ@Ҝ (๏̯͡๏), >>> >>> I'v

What happens when you create more DStreams then nodes in the cluster?

2015-07-31 Thread Brandon White
Since one input DStream creates one receiver, and one receiver uses one executor / node, what happens if you create more DStreams than nodes in the cluster? Say I have 30 DStreams on a 15-node cluster. Will ~10 streams get assigned to ~10 executors / nodes, and then the other ~20 streams will be queue

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Brandon White
bs in the batch in parallel in thread pool, > but it would also make sure all the jobs finish before the batch is marked > as completed. > > On Tue, Jul 28, 2015 at 4:05 PM, Brandon White > wrote: > >> Thank you Tathagata. My main use case for the 500 streams is to append &

Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Brandon White
Is this a known bottleneck for Spark Streaming's textFileStream? Does it need to list all the current files in a directory before it gets the new files? Say I have 500k files in a directory, does it list them all in order to get the new files?

Re: unsubscribe

2015-07-30 Thread Brandon White
https://www.youtube.com/watch?v=JncgoPKklVE On Thu, Jul 30, 2015 at 1:30 PM, wrote: > > > -- > > This message is for the designated recipient only and may contain > privileged, proprietary, or otherwise confidential information. If you have > received it in error, ple

Re: unsubscribe

2015-07-28 Thread Brandon White
NO! On Tue, Jul 28, 2015 at 5:03 PM, Harshvardhan Chauhan wrote: > > > -- > *Harshvardhan Chauhan* | Software Engineer > *GumGum* | *Ads that stick* > 310-260-9666 | ha...@gumgum.com >

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Brandon White
onedStream.foreachRDD { rdd => > // do something > } > > Then there is only one foreachRDD executed in every batch that will > process in parallel all the new files in each batch interval. > TD > > > On Tue, Jul 28, 2015 at 3:06 PM, Brandon White > wrote: > >&

Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Brandon White
val ssc = new StreamingContext(sc, Minutes(10)) //500 textFile streams watching S3 directories val streams = streamPaths.par.map { path => ssc.textFileStream(path) } streams.par.foreach { stream => stream.foreachRDD { rdd => //do something } } ssc.start() Would something like this sca

Re: Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Brandon White
ypesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Fri, Jul 24, 2015 at 11:23 AM, Brandon White > wrote: > >> Hello, >> >> So I have about 500 Spark Streams and I want to know the fastest a

Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Brandon White
Hello, So I have about 500 Spark Streams and I want to know the fastest and most reliable way to process each of them. Right now, I am creating and process them in a list: val ssc = new StreamingContext(sc, Minutes(10)) val streams = paths.par.map { nameAndPath => (path._1, ssc.textFileStream

Spark SQL Table Caching

2015-07-21 Thread Brandon White
A few questions about caching a table in Spark SQL. 1) Is there any difference between caching the dataframe and the table? df.cache() vs sqlContext.cacheTable("tableName") 2) Do you need to "warm up" the cache before seeing the performance benefits? Is the cache LRU? Do you need to run some que
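A sketch of both calls; they populate the same in-memory columnar cache for the underlying plan. Whether the cache fills eagerly has varied by version (the CACHE TABLE SQL statement became eager in Spark 1.2), so scanning once guarantees it is warm:

```scala
df.registerTempTable("tableName")

sqlContext.cacheTable("tableName")
sqlContext.sql("SELECT COUNT(*) FROM tableName").collect() // warm-up scan

// equivalent via the DataFrame handle:
df.cache()
df.count() // warm-up
```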

DataFrame Union not passing optimizer assertion

2015-07-19 Thread Brandon White
Hello! So I am doing a union of two dataframes with the same schema but a different number of rows. However, I am unable to pass an assertion. I think it is this one here

Nullpointer when saving as table with a timestamp column type

2015-07-16 Thread Brandon White
So I have a very simple dataframe that looks like df: [name: String, Place: String, time: timestamp] I build this java.sql.Timestamp from a string and it works really well except when I call saveAsTable("tableName") on this df. Without the timestamp, it saves fine but with the timestamp, it th

Running foreach on a list of rdds in parallel

2015-07-15 Thread Brandon White
Hello, I have a list of rdds List(rdd1, rdd2, rdd3, rdd4) and I would like to save these rdds in parallel. Right now, it is running each operation sequentially. I tried using an RDD of RDDs but that does not work. list.foreach { rdd => rdd.saveAsTextFile("/tmp/cache/") } Any ideas?
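The snippet above also writes every RDD to the same directory, which saveAsTextFile rejects once the path exists. A sketch that saves from parallel threads with distinct paths (paths are hypothetical):

```scala
val rdds = List(rdd1, rdd2, rdd3, rdd4)
// .par triggers each save job from its own thread, so the scheduler
// can run them concurrently
rdds.zipWithIndex.par.foreach { case (rdd, i) =>
  rdd.saveAsTextFile(s"/tmp/cache/output-$i")
}
```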

Re: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
ecute the sql query like > > “cache table mytable as SELECT xxx” in the JDBC connection also. > > > > Cheng Hao > > > > *From:* Brandon White [mailto:bwwintheho...@gmail.com] > *Sent:* Wednesday, July 15, 2015 8:26 AM > *To:* user > *Subject:* How do yo

How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
Hello there, I have a JDBC connection set up to my Spark cluster but I cannot see the tables that I cache in memory. The only tables I can see are those that are in my Hive instance. I use a HiveContext to register a table and cache it in memory. How can I enable my JDBC connection to query this in

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Brandon White
data files? > > Thanks, > > Yin > > On Fri, Jul 10, 2015 at 11:55 AM, Brandon White > wrote: > >> Why does this not work? Is insert into broken in 1.3.1? It does not throw >> any errors, fail, or throw exceptions. It simply does not work. >> >> >

Spark Streaming - Inserting into Tables

2015-07-10 Thread Brandon White
Why does this not work? Is insert into broken in 1.3.1? It does not throw any errors, fail, or throw exceptions. It simply does not work. val ssc = new StreamingContext(sc, Minutes(10)) val currentStream = ssc.textFileStream(s"s3://textFileDirectory/") val dayBefore = sqlContext.jsonFile(s"s3://

What is faster for SQL table storage, On-Heap or off-heap?

2015-07-09 Thread Brandon White
Is the read / aggregate performance better when caching Spark SQL tables on-heap with sqlContext.cacheTable() or off heap by saving it to Tachyon? Has anybody tested this? Any theories?
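Nobody benchmarks it in the thread, but a sketch of how one might measure both sides, assuming Spark 1.x where StorageLevel.OFF_HEAP was Tachyon-backed; paths and column names are hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

def time[T](label: String)(f: => T): T = {
  val start = System.nanoTime(); val result = f
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s"); result
}

// on-heap columnar cache
val onHeap = sqlContext.read.parquet("/path/to/data")
onHeap.cache(); onHeap.count() // warm up
time("on-heap")(onHeap.groupBy("k").count().collect())
onHeap.unpersist()

// off-heap (Tachyon-backed in Spark 1.x)
val offHeap = sqlContext.read.parquet("/path/to/data")
offHeap.persist(StorageLevel.OFF_HEAP); offHeap.count() // warm up
time("off-heap")(offHeap.groupBy("k").count().collect())
```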

S3 vs HDFS

2015-07-08 Thread Brandon White
Are there any significant performance differences between reading text files from S3 and hdfs?

Re: Spark query

2015-07-08 Thread Brandon White
Convert the column to a column of java Timestamps. Then you can do the following import java.sql.Timestamp import java.util.Calendar def date_trunc(timestamp:Timestamp, timeField:String) = { timeField match { case "hour" => val cal = Calendar.getInstance() cal.setTimeInMillis
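A hedged completion of the truncated helper above, guessing the intent (zero out the fields below the requested granularity); only the "hour" case appears in the original, so the "day" case is an assumption:

```scala
import java.sql.Timestamp
import java.util.Calendar

def date_trunc(timestamp: Timestamp, timeField: String): Timestamp = {
  val cal = Calendar.getInstance()
  cal.setTimeInMillis(timestamp.getTime)
  timeField match {
    case "hour" =>
      cal.set(Calendar.MINUTE, 0)
      cal.set(Calendar.SECOND, 0)
      cal.set(Calendar.MILLISECOND, 0)
    case "day" => // assumed extension of the same pattern
      cal.set(Calendar.HOUR_OF_DAY, 0)
      cal.set(Calendar.MINUTE, 0)
      cal.set(Calendar.SECOND, 0)
      cal.set(Calendar.MILLISECOND, 0)
    case _ => // unsupported field: leave the timestamp as-is
  }
  new Timestamp(cal.getTimeInMillis)
}
```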

Re: Real-time data visualization with Zeppelin

2015-07-08 Thread Brandon White
Can you use a cron job to update it every X minutes? On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya wrote: > Hi all – I’m just wondering if anyone has had success integrating Spark > Streaming with Zeppelin and actually dynamically updating the data in near > real-time. From my investigation, it s

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Brandon White
; it in parallel though. > > Thanks > Best Regards > > On Wed, Jul 8, 2015 at 5:34 AM, Brandon White > wrote: > >> Say I have a spark job that looks like following: >> >> def loadTable1() { >> val table1 = sqlContext.jsonFile(s"s3://textfiledi

Why can I not insert into TempTables in Spark SQL?

2015-07-07 Thread Brandon White
Why does this not work? Is insert into broken in 1.3.1? val ssc = new StreamingContext(sc, Minutes(10)) val currentStream = ssc.textFileStream(s"s3://textFileDirectory/") val dayBefore = sqlContext.jsonFile(s"s3://textFileDirectory/") dayBefore.saveAsParquetFile("/tmp/cache/dayBefore.parquet")

Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-07 Thread Brandon White
Say I have a spark job that looks like following: def loadTable1() { val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/") table1.cache().registerTempTable("table1")} def loadTable2() { val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/") table2.cache().registerTempTable("t

Not Blocking Other Notebooks when running all on a notebook

2015-07-03 Thread Brandon White
Hello there, Whenever I run all for one notebook, it blocks all queries and execution for all the other notebooks. How do I turn this off? I need to be able to run all and still have the other notebooks work. I set zeppelin.spark.concurrentSQL to true but it is still blocking. Is there any other

Grouping elements in a RDD

2015-06-20 Thread Brandon White
How would you do a .grouped(10) on an RDD — is it possible? Here is an example for a Scala list: scala> List(1,2,3,4).grouped(2).toList res1: List[List[Int]] = List(List(1, 2), List(3, 4)) I would like to group every n elements.
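RDDs have no .grouped, but there are two common substitutes; a sketch (the group size is illustrative):

```scala
// within each partition (cheap, no shuffle -- groups never span
// partition boundaries)
val groupedLocally = rdd.mapPartitions(_.grouped(10).map(_.toList))

// globally, in element order (requires a shuffle)
val n = 10
val groupedGlobally = rdd.zipWithIndex()
  .map { case (x, i) => (i / n, x) }
  .groupByKey()
  .map { case (_, xs) => xs.toList }
```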

[jira] (MNG-4715) version expression constant

2015-02-22 Thread Brandon White (JIRA)
[ https://jira.codehaus.org/browse/MNG-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=363737#comment-363737 ] Brandon White commented on MNG-4715: Prior comments on this bug report from Chris Price,

Re: [coreboot] Support for Google CR-48/Atom N455

2011-01-04 Thread Brandon White
Thanks Cristi for getting back to me. I have indeed flashed it with flashrom in Ubuntu. I will see what I can do about details of the northbridge. On Mon, Jan 3, 2011 at 12:07 PM, Cristi Magherusan < cristi.magheru...@net.utcluj.ro> wrote: > În Dum, Ianuarie 2, 2011 4:31, Brandon Whit

[coreboot] Support for Google CR-48/Atom N455

2011-01-02 Thread Brandon White
Hello all. Someone was accidentally sent a CR-48 that had Windows 7 pre-installed and an actual BIOS instead of Google's EFI. Anyway, he uploaded the BIOS, I was able to flash it, and after that I installed Ubuntu as a dual boot with Chrome OS. I was wondering if Coreboot was possible on this machin

[Bug 593152] [NEW] Remote Desktop Cannot Run W/O being plugged into a monitor.

2010-06-12 Thread Brandon White
Public bug reported: I have been trying to figure out how to get a Ubuntu 10.04 Desktop Edition to work with remote desktop without using a monitor for the Ubuntu system. Unfortunately, I have not succeeded, can somebody PLEASE HELP?!? ** Affects: ubuntu Importance: Undecided Status

[laptop-discuss] problem with wificonfig

2007-12-30 Thread Brandon White
I am following the instructions to install ipw driver for intel 3945 here: http://opensolaris.org/os/community/laptop/wireless/ipw/ Everything went fine until I try to connect to created profile, I get this message: "wificonfig: failed to open 'wpi0': No such file or directory" However, wpi0 i

Re: [osol-help] wpi driver install problem

2007-12-30 Thread Brandon White
Thanks for your response. I am very new to Solaris, coming from Linux. I downloaded and installed Solaris 10 from sun.com. Is this the newest version? This message posted from opensolaris.org

[osol-help] wpi driver install problem

2007-12-30 Thread Brandon White
I am following the instructions here: http://opensolaris.org/os/community/laptop/wireless/wpi/ However, when I try the command: pkgadd -d packages/i386/nightly/ SUNWwpi I get this error in the terminal: pkgadd: ERROR: attempt to process datastream failed - open of failed, errno=2 pkgadd:

Bug#420530: installation report

2007-04-22 Thread Brandon White
Package: installation-reports Boot method: CD Image version: JIGDO http://cdimage.debian.org/debian-cd/4.0_r0/i386/jigdo-cd/debian-40r0-i386-CD-1.jigdo Date: 22 APR 2007, 2200 Machine: Dell Inspiron E1505 Processor: Intel Centrino Core Duo, 1.8ghz Memory: 1GB Partitions: /dev/sda4 ext324675

Bug#24950: Please respond before August 11

2005-08-04 Thread Brandon White
THIS IS GOING TO BE OUR ABSOLUTE ATTEMPT We have endevored to speak to you on many periods and we await your response now! Your current finanncial loann situation meets the requirements for you for up to a 3.1 % lower rate. However, based on the fact that our previous attempts to speak to you

RE: Protect XMLPlatformUtils::Initialize and XMLPlatformUtils::Terminate

2005-05-19 Thread Brandon White
Doing it like this is the only safe way to guarantee that Init/Term are thread-safe though because we know that the static mutex instance will be initialized before any calls can be made into the library. And honestly, if the static global mutex object fails to initialize a synchronization primi

RE: Protect XMLPlatformUtils::Initialize and XMLPlatformUtils::Terminate

2005-05-19 Thread Brandon White
Weiß [mailto:[EMAIL PROTECTED] Sent: May 19, 2005 2:23 AM To: c-dev@xerces.apache.org Subject: Re: Protect XMLPlatformUtils::Initialize and XMLPlatformUtils::Terminate Brandon White wrote: > Is there any reason why this wouldn't be acceptable? It would prevent > clients from having

Protect XMLPlatformUtils::Initialize and XMLPlatformUtils::Terminate

2005-05-18 Thread Brandon White
This post is in regards to a bug that has been in the system for a long time: http://issues.apache.org/jira/browse/XERCESC-42 (XMLPlatformUtils::Initialize and XMLPlatformUtils::Terminate are not thread safe). My question is why can't the Initialize and Terminate functions be made thread-safe w