Re: Spark books

2017-05-03 Thread Tobi Bosede
Well, that is the nature of technology: ever evolving. There will always be new concepts. If you're trying to get started ASAP and the internet isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of production stacks are still on that version, and the knowledge from mastering 1.6 is…

Splines or Smoothing Kernels for Linear Regression

2016-11-08 Thread Tobi Bosede
Hi fellow users, has anyone ever used splines or smoothing kernels for linear regression in Spark? If not, does anyone have ideas on how this can be done, or what suitable alternatives exist? I am on Spark 1.6.1 with Python. Thanks, Tobi
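As far as I know, neither spark.ml nor spark.mllib ships a spline transformer, but spline regression reduces to ordinary linear regression on an expanded basis, so one workaround is to build the basis by hand. A sketch under 1.6-era APIs, with hypothetical column names x/y and hypothetical knot locations:

    # Sketch: cubic truncated-power spline basis for one predictor, then a
    # plain LinearRegression on the expanded features.
    from pyspark.sql.functions import udf, col
    from pyspark.mllib.linalg import Vectors, VectorUDT  # ml used mllib vectors on 1.6
    from pyspark.ml.regression import LinearRegression

    knots = [1.0, 2.0, 3.0]  # hypothetical knot locations

    def spline_basis(x):
        base = [x, x ** 2, x ** 3]
        base += [max(0.0, x - k) ** 3 for k in knots]
        return Vectors.dense(base)

    basis_udf = udf(spline_basis, VectorUDT())
    train = df.withColumn("features", basis_udf(col("x")))
    model = LinearRegression(featuresCol="features", labelCol="y").fit(train)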

Re: Aggregate UDF (UDAF) in Python

2016-10-18 Thread Tobi Bosede
…from Python (assuming spark is your Spark session):

    # get a reference dataframe to do the example on:
    df = spark.range(20)
    # get the jvm pointer
    jvm = spark.sparkContext._gateway.jvm
    # import the class
    from py4j.java_gateway import j…
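A commonly cited completion of that recipe — calling a Scala/Java UDAF from PySpark through py4j — looks roughly like the sketch below. Everything past the imports is an assumption for illustration: com.example.MyIqrUDAF is a hypothetical compiled UDAF that must already be on the JVM classpath (e.g. via --jars), and spark is a SparkSession as in the snippet.

    # Sketch: wrap a JVM-side UDAF as a PySpark Column and aggregate with it.
    from py4j.java_gateway import java_import
    from pyspark.sql.column import Column, _to_java_column, _to_seq

    df = spark.range(20)                       # reference DataFrame for the example
    jvm = spark.sparkContext._gateway.jvm
    java_import(jvm, "com.example.MyIqrUDAF")  # hypothetical UDAF class
    udaf = jvm.com.example.MyIqrUDAF()

    # UserDefinedAggregateFunction.apply takes Column varargs; build the seq
    jcol = udaf.apply(_to_seq(spark.sparkContext, [df["id"]], _to_java_column))
    df.agg(Column(jcol).alias("iqr")).show()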

Re: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Tobi Bosede
…for Python use. If you need an example of how to do so, I can provide one. Assaf.

Re: Aggregate UDF (UDAF) in Python

2016-10-16 Thread Tobi Bosede
…hasn't made the hop to Python. While I work a fair amount on PySpark, it's mostly in core & ML and not a lot with SQL, so there could be good reasons I'm just not familiar with. We can try pinging Davies or Michael on the JIRA to see what their thoughts are.

Re: Aggregate UDF (UDAF) in Python

2016-10-16 Thread Tobi Bosede
…nt-Proposals-td19422.html. The JIRA for tracking this issue is at https://issues.apache.org/jira/browse/SPARK-10915

Re: 回复:Spark-submit Problems

2016-10-15 Thread Tobi Bosede
…ewayConnection.java:209) at java.lang.Thread.run(Unknown Source). On Sat, Oct 15, 2016 at 10:06 PM, Mekal Zheng wrote: Show me your code. On 2016-10-16 08:24 +0800, hxfeng <980548...@qq.com> wrote: Show your pi.py code, and what is the exception message?

Aggregate UDF (UDAF) in Python

2016-10-15 Thread Tobi Bosede
Hello, I am trying to use a UDF that calculates the inter-quartile range (IQR) with pivot() and SQL in PySpark, and in both scenarios I got an error that my function wasn't an aggregate function. Does anyone know if UDAF functionality is available in Python? If not, what can I do as a workaround? Thank…
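For readers looking for the workaround itself: one option that avoids Python UDAFs entirely is to compute the IQR in SQL with Hive's percentile_approx aggregate (available through a HiveContext on 1.6, and natively in later SQL versions). A sketch with hypothetical table and column names:

    # Sketch of a UDAF-free workaround: IQR via Hive's percentile_approx.
    # "trades", "sym", and "price" are hypothetical names for this example.
    df.registerTempTable("trades")
    iqr_df = sqlContext.sql("""
        SELECT sym,
               percentile_approx(price, 0.75) - percentile_approx(price, 0.25) AS iqr
        FROM trades
        GROUP BY sym
    """)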

Spark-submit Problems

2016-10-15 Thread Tobi Bosede
Hi everyone, I am having problems submitting an app through spark-submit when the master is not "local". However, the pi.py example that comes with Spark works with any master. I believe my script has the same structure as pi.py, but for some reason my script is not as flexible. Specifically, the…
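One common cause, offered here as an educated guess since the script itself is not shown: pi.py never hardcodes a master, so whatever --master is passed to spark-submit takes effect, whereas a master set in code overrides the flag. A minimal sketch of the flexible structure:

    # Minimal sketch mirroring pi.py's structure. The key point: the master is
    # NOT set in code, so spark-submit's --master flag controls where it runs.
    from pyspark import SparkContext

    sc = SparkContext(appName="MyApp")      # flexible: master comes from spark-submit
    # sc = SparkContext("local", "MyApp")   # inflexible: hardcodes local mode
    rdd = sc.parallelize(range(1000))
    print(rdd.map(lambda x: x * x).sum())
    sc.stop()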

Re: MLib Documentation Update Needed

2016-09-28 Thread Tobi Bosede
…The loss is summed in the case of log-loss, not multiplied (if that's what you're saying). Those are decent improvements; feel free to open a pull request / JIRA.

MLib Documentation Update Needed

2016-09-25 Thread Tobi Bosede
The loss function here for logistic regression is confusing. It seems to imply that Spark uses only -1 and 1 class labels. However, it uses 0 and 1, as the very inconspicuous note quoted below (under Classification) says.
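For context, the formula at issue — reproduced here from the MLlib linear-methods guide, so treat the typesetting as approximate — defines the logistic loss over labels y in {-1, +1}:

    L(\mathbf{w}; \mathbf{x}, y) := \log\left(1 + \exp\left(-y\,\mathbf{w}^{T}\mathbf{x}\right)\right), \qquad y \in \{-1, +1\}

while the training APIs actually expect labels encoded as 0 and 1, which is the mismatch the thread points out.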

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
…'offset'. That's orthogonal. You're also suggesting making withMean=True the default, which we don't want. The point is that if this is *explicitly* requested, the scaler shouldn't refuse to subtract the mean from a sparse vector, and…

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
…is that if this is *explicitly* requested, the scaler shouldn't refuse to subtract the mean from a sparse vector, and fail. On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede wrote: Sean, I have created a JIRA; I hope you don't mind that…

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
…the same, that perhaps it's fine to let the StandardScaler proceed, if it's explicitly asked to center, rather than refuse to. It's not really much more rope to let a user hang herself with, and blocks legitimate usages (we ran into…

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
…'offset' value applied to all elements. That would let you shift all values while preserving a sparse representation. I'm not sure if it's worth implementing, but it would help this case.

Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Hi everyone, I am doing some standardization using StandardScaler on data from VectorAssembler, which is represented as sparse vectors. I plan to fit a regularized model. However, StandardScaler does not allow the mean to be subtracted from sparse vectors. It will only divide by the standard devia…
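One workaround, at the cost of materializing dense vectors (which may be exactly what the sparse representation was avoiding): densify the features first, then scale with withMean=True. A sketch assuming Spark 2.x's pyspark.ml and a features column named "features":

    # Sketch: convert sparse vectors to dense so StandardScaler can center them.
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())
    dense_df = df.withColumn("dense_features", to_dense("features"))

    scaler = StandardScaler(inputCol="dense_features", outputCol="scaled",
                            withMean=True, withStd=True)
    scaled_df = scaler.fit(dense_df).transform(dense_df)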

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-19 Thread Tobi Bosede
    …from pyspark.sql import Row
    rdd2 = filteredRDD.map(lambda v: Row(features=v))
    df = rdd2.toDF()

Thanks, Yanbo. 2016-07-16 14:51 GMT-07:00, Tobi Bosede: Hi Yanbo, appreciate the response. I might not have phrased this correctly, but I…

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Tobi Bosede
…However, we strongly recommend that you migrate to the DataFrame-based API, since the RDD-based API has switched to maintenance mode. Thanks, Yanbo.

Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-14 Thread Tobi Bosede
Hi everyone, I am trying to filter my features based on the spark.mllib ChiSqSelector.

    filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label, model.transform(lp.features)))

However, when I do the following I get the error below. Is there any other way to filter my data to avoid th…
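Pulling the pieces of this thread together, the end-to-end flow under the RDD-based API looks roughly like this sketch; vectorizedTestPar is assumed to be an RDD of LabeledPoint as in the question, and numTopFeatures is arbitrary:

    # Sketch: fit a ChiSqSelector, then rebuild LabeledPoints with the
    # selected features (pyspark.mllib, Spark 1.x-era RDD API).
    from pyspark.mllib.feature import ChiSqSelector
    from pyspark.mllib.regression import LabeledPoint

    model = ChiSqSelector(numTopFeatures=50).fit(vectorizedTestPar)
    filteredData = vectorizedTestPar.map(
        lambda lp: LabeledPoint(lp.label, model.transform(lp.features)))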

chisqSelector in Python

2016-07-11 Thread Tobi Bosede
Hi all, there is no Python example for chisqSelector at the link below. https://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#chisqselector So I am converting the Scala code to Python. I "translated" the following code:

    val discretizedData = data.map { lp => LabeledPoint(l…
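For completeness, a line-by-line Python rendering of that Scala snippet (assuming, as in the docs example, that data is an RDD of LabeledPoint):

    # Python translation of the Scala discretization: divide each feature by
    # 16 and floor it, keeping the original label.
    from math import floor
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    discretizedData = data.map(lambda lp: LabeledPoint(
        lp.label,
        Vectors.dense([floor(x / 16) for x in lp.features.toArray()])))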

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
…you write out invalid files that can't be read back, so we added this check. You can call .format("csv") (in Spark 2.0) to switch it to CSV.

Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Hi everyone, I am trying to save a data frame with special characters in the column names as a table in Hive. However, I am getting the following error. Is the only solution to rename all the columns? Or is there some argument that can be passed into the saveAsTable() or write.parquet() function…
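The rename route is straightforward if it comes to that. Parquet rejects the characters " ,;{}()\n\t=" in column names, so replacing them wholesale works; a sketch (table name hypothetical):

    # Sketch: replace every character Parquet rejects with an underscore,
    # then write the table. "my_table" is a hypothetical name.
    import re

    cleaned = [re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df.columns]
    df.toDF(*cleaned).write.saveAsTable("my_table")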