New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake
https://issues.apache.org/jira/browse/SPARK-7182 Can anyone suggest a workaround for the above issue? Thanks. -Don -- Donald Drake Drake Consulting http://www.drakeconsulting.com/ http://www.MailLaunder.com/
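A hedged workaround sketch (not from the thread; df1, df2, and the column names are hypothetical): rename the conflicting key column on one side before the join so the joined DataFrame has no duplicate column names.

  val right  = df2.withColumnRenamed("id", "r_id")
  val joined = df1.join(right, df1("id") === right("r_id"))
  val result = joined.select(df1("id"), df1("a"), right("b"))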

Re: Spark sql and csv data processing question

2015-05-16 Thread Don Drake
Your parentheses don't look right; you're embedding the filter inside the Row.fromSeq(). Try this: val trainRDD = rawTrainData .filter(!_.isEmpty) .map(rawRow => Row.fromSeq(rawRow.split(","))) .filter(_.length == 15) .map(_.toString).map(_.trim) -Don On Fri,

GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Don Drake
I'm running Spark v1.3.1 and when I run the following against my dataset: model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) The job will fail with the following message: Traceback (most recent call last): File

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Don Drake
for it? The maxBins input is missing for the Python API. Is it possible for you to use the current master? In the current master, you should be able to use trees with the Pipeline API and DataFrames. Best, Burak On Wed, May 20, 2015 at 2:44 PM, Don Drake dondr...@gmail.com wrote: I'm running
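For reference, a sketch of the rough Scala equivalent (MLlib 1.3-era API), where maxBins is settable; trainingData is assumed to be an RDD[LabeledPoint], and the maxBins value and categorical arities are made-up placeholders.

  import org.apache.spark.mllib.tree.GradientBoostedTrees
  import org.apache.spark.mllib.tree.configuration.BoostingStrategy

  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.numIterations = 3
  boostingStrategy.treeStrategy.maxDepth = 6
  boostingStrategy.treeStrategy.maxBins = 100                                   // placeholder
  boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map(0 -> 4, 3 -> 12)  // placeholder arities
  val model = GradientBoostedTrees.train(trainingData, boostingStrategy)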

Problem reading Parquet from 1.2 to 1.3

2015-06-03 Thread Don Drake
As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that Spark is behaving differently when reading Parquet directories that contain a .metadata directory. It seems that in spark 1.2.x, it would just ignore the .metadata directory, but now that I'm using Spark 1.3, reading these

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Don Drake
/6581 Cheng On 6/5/15 12:38 AM, Marcelo Vanzin wrote: I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real issue here. On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote

Re: importerror using external library with pyspark

2015-06-04 Thread Don Drake
I would try setting PYSPARK_DRIVER_PYTHON environment variable to the location of your python binary, especially if you are using a virtual environment. -Don On Wed, Jun 3, 2015 at 8:24 PM, AlexG swift...@gmail.com wrote: I have libskylark installed on both machines in my two node cluster in

Re: Parsing a tsv file with key value pairs

2015-06-25 Thread Don Drake
Use this package: https://github.com/databricks/spark-csv and change the delimiter to a tab. The documentation is pretty straightforward; you'll get a DataFrame back from the parser. -Don On Thu, Jun 25, 2015 at 4:39 AM, Ravikant Dindokar ravikant.i...@gmail.com wrote: So I have a file
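A minimal sketch of the spark-csv usage being suggested; the input path and the header/inferSchema settings are assumptions.

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("hdfs:///data/input.tsv")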

Re: --packages Failed to load class for data source v1.4

2015-06-14 Thread Don Drake
I looked at this again, and when I use the Scala spark-shell and load a CSV using the same package it works just fine, so this seems specific to pyspark. I've created the following JIRA: https://issues.apache.org/jira/browse/SPARK-8365 -Don On Sat, Jun 13, 2015 at 11:46 AM, Don Drake dondr

Re: --packages Failed to load class for data source v1.4

2015-06-17 Thread Don Drake
curious whether it's the same issue. Thanks for opening the Jira, I'll take a look. Best, Burak On Jun 14, 2015 2:40 PM, Don Drake dondr...@gmail.com wrote: I looked at this again, and when I use the Scala spark-shell and load a CSV using the same package it works just fine, so this seems

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Don Drake
Take a look at https://github.com/databricks/spark-csv to read in the tab-delimited file (change the default delimiter) and once you have that as a DataFrame, SQL can do the rest. https://spark.apache.org/docs/latest/sql-programming-guide.html -Don On Fri, Jun 12, 2015 at 8:46 PM, Rex X
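A short sketch of the load-then-SQL flow described above, assuming the two tab-delimited files have already been read into DataFrames dfA and dfB; the table and column names are placeholders.

  dfA.registerTempTable("a")
  dfB.registerTempTable("b")
  val joined = sqlContext.sql(
    "SELECT a.id, a.val1, b.val2 FROM a JOIN b ON a.id = b.id")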

--packages Failed to load class for data source v1.4

2015-06-13 Thread Don Drake
I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it and got the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o90.save. : java.lang.RuntimeException: Failed to load class for data source:

Re: Does feature parity exist between Spark and PySpark

2015-10-06 Thread Don Drake
If you are using DataFrames in PySpark, then the performance will be the same as Scala. However, if you need to implement your own UDF, or run a map() against a DataFrame in Python, then you will pay a performance penalty when executing those functions, since all of your data has to go

Utility for PySpark DataFrames - smartframes

2015-10-05 Thread Don Drake
I would like to announce a Python package that makes creating rows in DataFrames in PySpark as easy as creating an object. Code is available on GitHub, PyPi, and soon to be on spark-packages.org. https://github.com/dondrake/smartframes Motivation Spark DataFrames provide a nice interface to

Re: Jupyter configuration

2015-12-02 Thread Don Drake
Here's what I set in a shell script to start the notebook: export PYSPARK_PYTHON=~/anaconda/bin/python export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/ipython export PYSPARK_DRIVER_PYTHON_OPTS='notebook' If you want to use HiveContext w/CDH: export HADOOP_CONF_DIR=/etc/hive/conf Then just run

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Don Drake
You will need to use the HDFS API to do that. Try something like: val conf = sc.hadoopConfiguration val fs = org.apache.hadoop.fs.FileSystem.get(conf) fs.rename(new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"), new org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt")) Full API for
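If a copy rather than a rename is needed, the same FileSystem handle can be reused with FileUtil.copy; a sketch, with placeholder paths and deleteSource = false so the source file is kept.

  val src = new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt")
  val dst = new org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt")
  org.apache.hadoop.fs.FileUtil.copy(fs, src, fs, dst, false, conf)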

df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done. After the Spark tasks are done, the job appears to be running for over an
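For context, a sketch of the kind of write being described (DataFrameWriter.partitionBy); the partition columns and output path are placeholders, not the actual job.

  df.write
    .partitionBy("year", "month")
    .parquet("hdfs:///data/out/partitioned_table")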

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Don Drake
I'm seeing similar slowness in saveAsTextFile(), but only in Python. I'm sorting data in a DataFrame, then transforming it to get an RDD, and then calling coalesce(1).saveAsTextFile(). I converted the Python to Scala and the run-times were similar, except for the saveAsTextFile() stage. The Scala version

Re: Spark 1.6 and Application History not working correctly

2016-01-13 Thread Don Drake
I noticed a similar problem going from 1.5.x to 1.6.0 on YARN. I resolved it by setting the following command-line parameters: spark.eventLog.enabled=true spark.eventLog.dir= -Don On Wed, Jan 13, 2016 at 8:29 AM, Darin McBeath wrote: > I tried using Spark 1.6 in

Re: output the datas(txt)

2016-02-28 Thread Don Drake
If you use the spark-csv package: $ spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 scala> val df = sc.parallelize(Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))).map(x => (x(0), x(1), x(2))).toDF() df: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int] scala>
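The archived message is cut off; a plausible continuation, writing the DataFrame back out with the same package, with the output path as a placeholder.

  df.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("/tmp/out_csv")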

Building a REST Service with Spark back-end

2016-03-01 Thread Don Drake
I'm interested in building a REST service that utilizes a Spark SQL Context to return records from a DataFrame (or IndexedRDD?) and even add/update records. This will be a simple REST API, with only a few end-points. I found this example: https://github.com/alexmasselot/spark-play-activator

Re: AVRO vs Parquet

2016-03-03 Thread Don Drake
My tests show Parquet has better performance than Avro in just about every test. It really shines when you are querying a subset of columns in a wide table. -Don On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann wrote: > Which format is the best format for SparkSQL adhoc

Spark 2.0 Aggregator problems

2016-04-23 Thread Don Drake
I've downloaded a nightly build of Spark 2.0 (from today 4/23) and was attempting to create an aggregator that will create a Seq[Rows], or specifically a Seq[Class1], my custom class. When I attempt to run the following code in a spark-shell, it errors out: Gist:
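The gist content is not preserved in the archive; below is only a rough sketch of an Aggregator of that shape, with Class1 as a stand-in case class and Kryo buffer/output encoders as an assumed workaround, not the original code.

  import org.apache.spark.sql.{Encoder, Encoders}
  import org.apache.spark.sql.expressions.Aggregator

  case class Class1(id: Long, value: String)

  object CollectClass1 extends Aggregator[Class1, Seq[Class1], Seq[Class1]] {
    def zero: Seq[Class1] = Seq.empty
    def reduce(buf: Seq[Class1], row: Class1): Seq[Class1] = buf :+ row
    def merge(b1: Seq[Class1], b2: Seq[Class1]): Seq[Class1] = b1 ++ b2
    def finish(buf: Seq[Class1]): Seq[Class1] = buf
    def bufferEncoder: Encoder[Seq[Class1]] = Encoders.kryo[Seq[Class1]]
    def outputEncoder: Encoder[Seq[Class1]] = Encoders.kryo[Seq[Class1]]
  }
  // e.g. ds.groupByKey(_.id).agg(CollectClass1.toColumn)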

Wide Datasets (v1.6.1)

2016-05-20 Thread Don Drake
I have been working to create a DataFrame that contains a nested structure. The first attempt is to create an array of structures. I've written previously on this list how it doesn't work in DataFrames in 1.6.1, but it does in 2.0. I've continued my experimenting and have it working in
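For reference, a minimal sketch of the array-of-structures shape being discussed (reported to work on 2.0 but not 1.6.1); the case class names are hypothetical.

  case class Item(name: String, qty: Int)
  case class Order(id: Long, items: Seq[Item])

  import spark.implicits._   // sqlContext.implicits._ on 1.6.x
  val ds = Seq(Order(1L, Seq(Item("a", 2))), Order(2L, Seq.empty)).toDS()
  ds.printSchema()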

Re: Wide Datasets (v1.6.1)

2016-05-21 Thread Don Drake
I was able to verify that similar exceptions occur in Spark 2.0.0-preview. I have created this JIRA: https://issues.apache.org/jira/browse/SPARK-15467 You mentioned using beans instead of case classes, do you have an example (or test case) that I can see? -Don On Fri, May 20, 2016 at 3:49 PM,

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
You can call rdd.coalesce(10, shuffle = true) and the returning rdd will be evenly balanced. This obviously triggers a shuffle, so be advised it could be an expensive operation depending on your RDD size. -Don On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil wrote: >

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Don Drake
ue) out of curiosity, the rdd partitions became even more imbalanced: >> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6, >> 5120), (7, 5120), (8, 5120), (9, *6144*)] >> >> >> On Tue, May 10, 2016 at 10:16 PM, Don Drake <dondr...@gm

Outer Explode needed

2016-07-24 Thread Don Drake
I have a nested data structure (array of structures) that I'm using the DSL df.explode() API to flatten the data. However, when the array is empty, I'm not getting the rest of the row in my output as it is skipped. This is the intended behavior, and Hive supports a SQL "OUTER explode()" to
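One workaround in the spirit of the Hive feature mentioned is LATERAL VIEW OUTER explode() through Spark SQL (via a HiveContext on 1.6), which keeps rows whose array is empty; a sketch with placeholder table and column names.

  df.registerTempTable("events")
  val exploded = sqlContext.sql(
    """SELECT e.id, item
       FROM events e
       LATERAL VIEW OUTER explode(e.items) t AS item""")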

Re: how to order data in descending order in spark dataset

2016-07-30 Thread Don Drake
Try: ts.groupBy("b").count().orderBy(col("count").desc()); -Don On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane wrote: > just to clarify I am try to do this in java > > ts.groupBy("b").count().orderBy("count"); > > > > On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Don Drake
/docs.scala-lang.org/tutorials/FAQ/context-bounds>. > > import org.apache.spark.sql.Encoder > abstract class RawTable[A : Encoder](inDir: String) { > ... > } > > On Tue, Jan 31, 2017 at 8:12 PM, Don Drake <dondr...@gmail.com> wrote: > >> I have a set
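An expanded sketch of the context-bound suggestion quoted above; the schema method, the CSV options, and the way the SparkSession is passed in are assumptions, not the original code.

  import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
  import org.apache.spark.sql.types.StructType

  abstract class RawTable[A : Encoder](inDir: String)(spark: SparkSession) {
    def schema: StructType
    def load(): Dataset[A] =
      spark.read.schema(schema).option("header", "true").csv(inDir).as[A]
  }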

Spark 2 - Creating datasets from dataframes with extra columns

2017-02-02 Thread Don Drake
In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset. For example in 1.6, the column c4 is gone: scala> case class F(f1: String, f2: String, f3:String) defined class F scala> import sqlContext.implicits._
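A sketch of one workaround on 2.x (using the column names from the message): select only the case class fields before converting, so c4 is dropped just as it was in 1.6.

  import spark.implicits._
  val ds = df.select("f1", "f2", "f3").as[F]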

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Don Drake
oudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/204687029790319/2840265927289860/latest.html> > > On Wed, Feb 1, 2017 at 3:34 PM, Don Drake <dondr...@gmail.com> wrote: > >> Thanks for the reply. I did give that syntax a try [A : Encoder] >> yesterda

Parameterized types and Datasets - Spark 2.1.0

2017-01-31 Thread Don Drake
I have a set of CSV files that I need to perform ETL on, with the plan to re-use a lot of code between each file in a parent abstract class. I tried creating the following simple abstract class that will have a parameterized type of a case class that represents the schema being read in. This won't

Re: Converting timezones in Spark

2017-01-31 Thread Don Drake
ter)String convertToTZFullTimestamp: org.apache.spark.sql.expressions.UserDefinedFunction df: org.apache.spark.sql.DataFrame = [id: bigint, dts: string] df2: org.apache.spark.sql.DataFrame = [id: bigint, dts: string ... 2 more fields] scala> On Fri, Jan 27, 2017 at 12:01 PM, Don Drake <dondr...@gmail.

Re: Spark 2 - Creating datasets from dataframes with extra columns

2017-02-08 Thread Don Drake
Please see: https://issues.apache.org/jira/browse/SPARK-19477 Thanks. -Don On Wed, Feb 8, 2017 at 6:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > I checked it; it seems to be a bug. Could you create a JIRA now, please? > > ---Original--- > *From:* "Don Drake"<dondr...@gmai

Re: Spark 2 - Creating datasets from dataframes with extra columns

2017-02-06 Thread Don Drake
der.schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true)) I'll open a JIRA. -Don On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote: > In 1.6, when you created a Dataset from a Dataframe

Converting timezones in Spark

2017-01-27 Thread Don Drake
I'm reading CSV with a timestamp clearly identified in the UTC timezone, and I need to store this in a parquet format and eventually read it back and convert to different timezones as needed. Sounds straightforward, but this involves some crazy function calls and I'm seeing strange results as I
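One possible approach (a sketch, not necessarily the UDF from the follow-up message): cast the string to a timestamp and shift it with from_utc_timestamp. The column names and target zone are placeholders, and the cast itself uses the JVM's default time zone, so this assumes the job runs with its time zone set to UTC.

  import org.apache.spark.sql.functions.{col, from_utc_timestamp}

  val df2 = df
    .withColumn("ts_utc", col("dts").cast("timestamp"))          // assumes JVM time zone is UTC
    .withColumn("ts_chicago", from_utc_timestamp(col("ts_utc"), "America/Chicago"))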

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
this PR. https://github.com/apache/spark/pull/14339 > > On 1 Sep 2016 2:48 a.m., "Don Drake" <dondr...@gmail.com> wrote: > >> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark >> 2.0 and have encountered some interesting issues. >>

Spark 2.0.0 - SQL - Running query with outer join from 1.6 fails

2016-09-01 Thread Don Drake
So I was able to reproduce in a simple case the issue I'm seeing with a query from Spark 1.6.2 that would run fine that is no longer working on Spark 2.0. Example code: https://gist.github.com/dondrake/c136d61503b819f0643f8c02854a9cdf Here's the code for Spark 2.0 that doesn't run (this runs

Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0 and have encountered some interesting issues. First, it seems the SQL parsing is different, and I had to rewrite some SQL that was doing a mix of inner joins (using where syntax, not inner) and outer joins to get the SQL
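For the field-names-with-periods part, backticks are the usual way to reference such columns from the DSL or SQL; a sketch, with "a.b" as a made-up column name.

  import org.apache.spark.sql.functions.col
  df.select(col("`a.b`")).show()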

Re: Sqoop vs spark jdbc

2016-09-21 Thread Don Drake
We just had this conversation at work today. We have a long Sqoop pipeline and I argued to keep it in Sqoop, since we can take advantage of OraOop (direct mode) for performance and Spark can't match that AFAIK. Sqoop also allows us to write directly into Parquet format, which then Spark can read

Re: sbt shenanigans for a Spark-based project

2016-11-13 Thread Don Drake
I would upgrade your Scala version to 2.11.8 as Spark 2.0 uses Scala 2.11 by default. On Sun, Nov 13, 2016 at 3:01 PM, Marco Mistroni wrote: > HI all > i have a small Spark-based project which at the moment depends on jar > from Spark 1.6.0 > The project has few Spark
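A minimal build.sbt sketch along the lines of this advice; the Spark version (2.0.x) and the module list are assumptions about the project.

  scalaVersion := "2.11.8"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"  % "2.0.2" % "provided",
    "org.apache.spark" %% "spark-sql"   % "2.0.2" % "provided",
    "org.apache.spark" %% "spark-mllib" % "2.0.2" % "provided"
  )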

Re: sbt shenanigans for a Spark-based project

2016-11-14 Thread Don Drake
rk.ml > [error] import org.apache.spark.ml.tuning.{ CrossValidator, > ParamGridBuilder } > [error]^ > [error] C:\Users\marco\SparkExamples\src\main\scala\ > DecisionTreeExampleML.scala:10: object tuning is not a member of package > org.apache.spark.ml >

Re: how can I set the log configuration file for spark history server ?

2016-12-08 Thread Don Drake
You can update $SPARK_HOME/spark-env.sh by setting the environment variable SPARK_HISTORY_OPTS. See http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options for options (spark.history.fs.logDirectory) you can set. There is log rotation built in (by time, not size) to the

Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Don Drake
Try passing maxID.toString; I think it wants the number as a string. On Mon, May 29, 2017 at 3:12 PM, Mich Talebzadeh wrote: > thanks Gents but no luck! > > scala> val s = HiveContext.read.format("jdbc").options( > | Map("url" -> _ORACLEserver, > | "dbtable"
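A sketch of the point being made: every value in the JDBC options map must be a String, so a computed numeric bound needs .toString. The names mirror the quoted snippet; the table, partition column, and credentials are placeholders.

  val s = HiveContext.read.format("jdbc").options(Map(
    "url" -> _ORACLEserver,
    "dbtable" -> "scratchpad.mytable",      // placeholder
    "partitionColumn" -> "id",              // placeholder
    "lowerBound" -> "1",
    "upperBound" -> maxID.toString,
    "numPartitions" -> "10",
    "user" -> _username,                    // placeholder
    "password" -> _password)).load()        // placeholder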

flatMap() returning large class

2017-12-14 Thread Don Drake
I'm looking for some advice. I have a flatMap on a Dataset that creates and returns a sequence of a new case class (Seq[BigDataStructure]) containing a very large amount of data, much larger than the single input record (think images). In Python, you can use generators (yield) to

Re: flatMap() returning large class

2017-12-17 Thread Don Drake
rd Garris > > Principal Architect > > Databricks, Inc > > 650.200.0840 <(650)%20200-0840> > > rlgar...@databricks.com > > On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com) > wrote: > > This sounds like something mapPartitions should
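A sketch of the mapPartitions-style suggestion, assuming ds is a Dataset[InputRecord] and using hypothetical case classes plus a stub loader: returning results through a lazy Iterator instead of building a full Seq[BigDataStructure] keeps only one large record in memory at a time per task.

  case class InputRecord(path: String, n: Int)
  case class BigDataStructure(path: String, idx: Int, payload: Array[Byte])

  // stand-in for the real (expensive) image loader
  def loadChunk(path: String, i: Int): Array[Byte] = Array.emptyByteArray

  import spark.implicits._
  val out = ds.mapPartitions { records: Iterator[InputRecord] =>
    records.flatMap { rec =>
      Iterator.tabulate(rec.n)(i => BigDataStructure(rec.path, i, loadChunk(rec.path, i)))
    }
  }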