https://issues.apache.org/jira/browse/SPARK-7182
Can anyone suggest a workaround for the above issue?
Thanks.
-Don
--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
Your parentheses don't look right; you're embedding the filter inside
Row.fromSeq().
Try this:
val trainRDD = rawTrainData
  .filter(!_.isEmpty)
  .map(rawRow => Row.fromSeq(rawRow.split(",")))
  .filter(_.length == 15)
  .map(_.toString).map(_.trim)
-Don
On Fri,
I'm running Spark v1.3.1 and when I run the following against my dataset:
model = GradientBoostedTrees.trainRegressor(trainingData,
    categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
The job will fail with the following message:
Traceback (most recent call last):
File
for it? The maxBins input is missing from the Python API.
Is it possible for you to use the current master? In the current master,
you should be able to use trees with the Pipeline API and DataFrames.
Best,
Burak
On Wed, May 20, 2015 at 2:44 PM, Don Drake dondr...@gmail.com wrote:
I'm running
As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that
Spark is behaving differently when reading Parquet directories that contain
a .metadata directory.
It seems that in Spark 1.2.x, it would just ignore the .metadata directory,
but now that I'm using Spark 1.3, reading these
/6581
Cheng
On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
I talked to Don outside the list and he says that he's seeing this issue
with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a
real issue here.
On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote
I would try setting PYSPARK_DRIVER_PYTHON environment variable to the
location of your python binary, especially if you are using a virtual
environment.
-Don
On Wed, Jun 3, 2015 at 8:24 PM, AlexG swift...@gmail.com wrote:
I have libskylark installed on both machines in my two node cluster in
Use this package:
https://github.com/databricks/spark-csv
and change the delimiter to a tab.
The documentation is pretty straightforward; you'll get a DataFrame back
from the parser.
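For example, a minimal sketch for a tab-delimited file (assuming Spark 1.x
with the spark-csv package on the classpath; the input path is made up):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .load("/path/to/file.tsv")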
-Don
On Thu, Jun 25, 2015 at 4:39 AM, Ravikant Dindokar ravikant.i...@gmail.com
wrote:
So I have a file
I looked at this again, and when I use the Scala spark-shell and load a CSV
using the same package it works just fine, so this seems specific to
pyspark.
I've created the following JIRA:
https://issues.apache.org/jira/browse/SPARK-8365
-Don
On Sat, Jun 13, 2015 at 11:46 AM, Don Drake dondr
curious whether it's the same
issue. Thanks for opening the Jira, I'll take a look.
Best,
Burak
On Jun 14, 2015 2:40 PM, Don Drake dondr...@gmail.com wrote:
I looked at this again, and when I use the Scala spark-shell and load a
CSV using the same package it works just fine, so this seems
Take a look at https://github.com/databricks/spark-csv to read in the
tab-delimited file (change the default delimiter)
and once you have that as a DataFrame, SQL can do the rest.
https://spark.apache.org/docs/latest/sql-programming-guide.html
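As a rough sketch of the SQL side (the table and column names below are
made up for illustration):

df.registerTempTable("events")
val counts = sqlContext.sql("SELECT key_col, COUNT(*) AS cnt FROM events GROUP BY key_col")
counts.show()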
-Don
On Fri, Jun 12, 2015 at 8:46 PM, Rex X
I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing
Python Spark application against it and got the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.RuntimeException: Failed to load class for data source:
If you are using DataFrames in PySpark, then the performance will be the
same as Scala. However, if you need to implement your own UDF, or run a
map() against a DataFrame in Python, then you will pay a performance
penalty when executing those functions, since all of your data has to go
between the JVM and the Python worker processes.
I would like to announce a Python package that makes creating rows in
DataFrames in PySpark as easy as creating an object.
Code is available on GitHub, PyPI, and soon to be on spark-packages.org.
https://github.com/dondrake/smartframes
Motivation
Spark DataFrames provide a nice interface to
Here's what I set in a shell script to start the notebook:
export PYSPARK_PYTHON=~/anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
If you want to use HiveContext w/CDH:
export HADOOP_CONF_DIR=/etc/hive/conf
Then just run pyspark.
You will need to use the HDFS API to do that.
Try something like:
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
fs.rename(new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"),
  new org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt"))
Full API for
I have a 2TB dataset that I have in a DataFrame that I am attempting to
partition by 2 fields and my YARN job seems to write the partitioned
dataset successfully. I can see the output in HDFS once all Spark tasks
are done.
After the Spark tasks are done, the job appears to be running for over an
hour.
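For context, the write is essentially of this shape (a sketch; the field
names and output path are placeholders, not the actual job):

df.write.partitionBy("field1", "field2").parquet("/path/to/output")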
I'm seeing similar slowness in saveAsTextFile(), but only in Python.
I'm sorting data in a DataFrame, then transforming it to get an RDD, and
then calling coalesce(1).saveAsTextFile().
I converted the Python to Scala and the run-times were similar, except for
the saveAsTextFile() stage. The Scala version
I noticed a similar problem going from 1.5.x to 1.6.0 on YARN.
I resolved it by setting the following command-line parameters:
spark.eventLog.enabled=true
spark.eventLog.dir=
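For example, passed to spark-submit as --conf flags (the log directory
below is only a placeholder; point it at whatever HDFS path you use for
event logs):

spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory \
  ... (your usual class/jar arguments)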
-Don
On Wed, Jan 13, 2016 at 8:29 AM, Darin McBeath
wrote:
> I tried using Spark 1.6 in
If you use the spark-csv package:
$ spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
scala> val df = sc.parallelize(Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))).map(x => (x(0), x(1), x(2))).toDF()
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int]
scala>
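From there, a hedged sketch of writing the DataFrame back out with
spark-csv (the output path is made up, not from the original session):

scala> df.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/out.csv")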
I'm interested in building a REST service that utilizes a Spark SQL Context
to return records from a DataFrame (or IndexedRDD?) and even add/update
records.
This will be a simple REST API, with only a few end-points. I found this
example:
https://github.com/alexmasselot/spark-play-activator
My tests show Parquet has better performance than Avro in just about every
test. It really shines when you are querying a subset of columns in a wide
table.
-Don
On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann
wrote:
> Which format is the best format for SparkSQL adhoc
I've downloaded a nightly build of Spark 2.0 (from today 4/23) and was
attempting to create an Aggregator that will produce a Seq[Row], or
specifically a Seq[Class1], where Class1 is my custom class.
When I attempt to run the following code in a spark-shell, it errors out:
Gist:
I have been working to create a Dataframe that contains a nested
structure. The first attempt is to create an array of structures. I've
written previously on this list how it doesn't work in DataFrames in 1.6.1,
but it does in 2.0.
I've continued my experimenting and have it working in
I was able to verify that similar exceptions occur in Spark 2.0.0-preview.
I have created this JIRA: https://issues.apache.org/jira/browse/SPARK-15467
You mentioned using beans instead of case classes, do you have an example
(or test case) that I can see?
-Don
On Fri, May 20, 2016 at 3:49 PM,
You can call rdd.coalesce(10, shuffle = true) and the returned RDD will be
evenly balanced. This obviously triggers a shuffle, so be advised it could
be an expensive operation depending on your RDD size.
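A quick sketch of that call (the partition count of 10 just mirrors your
example; glom() is only used here to inspect per-partition sizes):

val balanced = rdd.coalesce(10, shuffle = true)
// rdd.repartition(10) is equivalent shorthand
balanced.glom().map(_.length).collect()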
-Don
On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil
wrote:
>
ue) out of curiosity, the rdd partitions became even more imbalanced:
>> [(0, 5120), (1, 5120), (2, 5120), (3, 5120), (4, *3920*), (5, 4096), (6,
>> 5120), (7, 5120), (8, 5120), (9, *6144*)]
>>
>>
>> On Tue, May 10, 2016 at 10:16 PM, Don Drake <dondr...@gm
I have a nested data structure (array of structures) that I'm using the DSL
df.explode() API to flatten the data. However, when the array is empty,
I'm not getting the rest of the row in my output as it is skipped.
This is the intended behavior, and Hive supports a SQL "OUTER explode()" to
Try:
ts.groupBy("b").count().orderBy(col("count").desc());
-Don
On Sat, Jul 30, 2016 at 1:30 PM, Tony Lane wrote:
> just to clarify I am try to do this in java
>
> ts.groupBy("b").count().orderBy("count");
>
>
>
> On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane
/docs.scala-lang.org/tutorials/FAQ/context-bounds>.
>
> import org.apache.spark.sql.Encoder
> abstract class RawTable[A : Encoder](inDir: String) {
> ...
> }
>
> On Tue, Jan 31, 2017 at 8:12 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> I have a set
In 1.6, when you created a Dataset from a DataFrame that had extra columns,
the columns not in the case class were dropped from the Dataset.
For example in 1.6, the column c4 is gone:
scala> case class F(f1: String, f2: String, f3:String)
defined class F
scala> import sqlContext.implicits._
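For reference, the pattern being described is roughly this (a sketch built
from the example above; c4 is the extra column not present in the case
class):

scala> val df = Seq(("a", "b", "c", "d")).toDF("f1", "f2", "f3", "c4")
scala> val ds = df.as[F]
scala> ds.schema  // in 1.6 this no longer contains c4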
oudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/204687029790319/2840265927289860/latest.html>
>
> On Wed, Feb 1, 2017 at 3:34 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> Thanks for the reply. I did give that syntax a try [A : Encoder]
>> yesterda
I have a set of CSV files that I need to perform ETL on, with the plan to
re-use a lot of code between files by putting shared logic in a parent abstract class.
I tried creating the following simple abstract class that will have a
parameterized type of a case class that represents the schema being read in.
This won't
ter)String
convertToTZFullTimestamp: org.apache.spark.sql.expressions.UserDefinedFunction
df: org.apache.spark.sql.DataFrame = [id: bigint, dts: string]
df2: org.apache.spark.sql.DataFrame = [id: bigint, dts: string ... 2 more fields]
scala>
On Fri, Jan 27, 2017 at 12:01 PM, Don Drake <dondr...@gmail.
Please see: https://issues.apache.org/jira/browse/SPARK-19477
Thanks.
-Don
On Wed, Feb 8, 2017 at 6:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> I checked it; it seems to be a bug. Could you create a JIRA now, please?
>
> ---Original---
> *From:* "Don Drake"<dondr...@gmai
der.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
I'll open a JIRA.
-Don
On Thu, Feb 2, 2017 at 2:46 PM, Don Drake <dondr...@gmail.com> wrote:
> In 1.6, when you created a Dataset from a Dataframe
I'm reading CSV with a timestamp clearly identified in the UTC timezone,
and I need to store this in Parquet format and eventually read it back
and convert it to different timezones as needed.
Sounds straightforward, but this involves some crazy function calls and I'm
seeing strange results as I
this PR. https://github.com/apache/spark/pull/14339
>
> On 1 Sep 2016 2:48 a.m., "Don Drake" <dondr...@gmail.com> wrote:
>
>> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark
>> 2.0 and have encountered some interesting issues.
>>
So I was able to reproduce, in a simple case, the issue I'm seeing: a
query that ran fine on Spark 1.6.2 is no longer working on
Spark 2.0.
Example code:
https://gist.github.com/dondrake/c136d61503b819f0643f8c02854a9cdf
Here's the code for Spark 2.0 that doesn't run (this runs
I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0
and have encountered some interesting issues.
First, it seems the SQL parsing is different, and I had to rewrite some SQL
that was doing a mix of inner joins (using where syntax, not inner) and
outer joins to get the SQL
We just had this conversation at work today. We have a long Sqoop pipeline
and I argued to keep it in Sqoop, since we can take advantage of OraOop
(direct mode) for performance and Spark can't match that AFAIK. Sqoop also
allows us to write directly into Parquet format, which Spark can then read
I would upgrade your Scala version to 2.11.8 as Spark 2.0 uses Scala 2.11
by default.
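A rough build.sbt sketch under that assumption (the Spark version below is
illustrative, so pick the actual 2.0.x release you target; spark-mllib
covers the org.apache.spark.ml.tuning imports in your errors, and you can
drop the "provided" scope if you run with `sbt run` instead of
spark-submit):

scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"   % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.2" % "provided"
)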
On Sun, Nov 13, 2016 at 3:01 PM, Marco Mistroni wrote:
> Hi all
> I have a small Spark-based project which at the moment depends on jars
> from Spark 1.6.0
> The project has few Spark
rk.ml
> [error] import org.apache.spark.ml.tuning.{ CrossValidator,
> ParamGridBuilder }
> [error]^
> [error] C:\Users\marco\SparkExamples\src\main\scala\
> DecisionTreeExampleML.scala:10: object tuning is not a member of package
> org.apache.spark.ml
>
You can update $SPARK_HOME/conf/spark-env.sh by setting the environment
variable SPARK_HISTORY_OPTS.
See
http://spark.apache.org/docs/latest/monitoring.html#spark-configuration-options
for options (spark.history.fs.logDirectory) you can set.
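For example, a sketch of the relevant spark-env.sh line (the HDFS path is
just a placeholder):

export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory"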
There is log rotation built in (by time, not size) to the
Try passing maxID.toString; I think it wants the number as a string.
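In other words, something along these lines (only the bound-related options
matter here; the dbtable and partition column names are placeholders for
whatever you already have):

val s = HiveContext.read.format("jdbc").options(
  Map("url" -> _ORACLEserver,
      "dbtable" -> "schema.table",
      "partitionColumn" -> "id",
      "lowerBound" -> "1",
      "upperBound" -> maxID.toString,
      "numPartitions" -> "10")).load()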
On Mon, May 29, 2017 at 3:12 PM, Mich Talebzadeh
wrote:
> thanks Gents but no luck!
>
> scala> val s = HiveContext.read.format("jdbc").options(
> | Map("url" -> _ORACLEserver,
> | "dbtable"
I'm looking for some advice when I have a flatMap on a Dataset that is
creating and returning a sequence of a new case class
(Seq[BigDataStructure]) that contains a very large amount of data, much
larger than the single input record (think images).
In Python, you can use generators (yield) to
rd Garris
>
> Principal Architect
>
> Databricks, Inc
>
> 650.200.0840
>
> rlgar...@databricks.com
>
> On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com)
> wrote:
>
> This sounds like something mapPartitions should