Re: Spark books

2017-05-03 Thread Tobi Bosede
Well, that is the nature of technology: ever evolving. There will always be new concepts. If you're trying to get started ASAP and the internet isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of production stacks are still on that version, and the knowledge from mastering 1.6 is…

Splines or Smoothing Kernels for Linear Regression

2016-11-08 Thread Tobi Bosede
Hi fellow users, has anyone ever used splines or smoothing kernels for linear regression in Spark? If not, does anyone have ideas on how this can be done, or what suitable alternatives exist? I am on Spark 1.6.1 with Python. Thanks, Tobi
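Spark 1.6 ships no spline transformer, so one workaround is to expand each predictor into a spline basis by hand and fit an ordinary linear model on the expanded features. A minimal sketch, assuming an RDD `points` of (label, x) pairs and hand-picked knot locations:

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    knots = [0.25, 0.5, 0.75]  # assumed knot locations

    def spline_basis(x, degree=3):
        # truncated power basis for a cubic spline in one variable
        base = [x ** d for d in range(1, degree + 1)]
        base += [max(0.0, x - k) ** degree for k in knots]
        return base

    expanded = points.map(lambda p: LabeledPoint(p[0], spline_basis(p[1])))
    model = LinearRegressionWithSGD.train(expanded, iterations=100, intercept=True)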

Re: Aggregate UDF (UDAF) in Python

2016-10-18 Thread Tobi Bosede
…we want to call this from Python (assuming spark is your Spark session):

    # get a reference dataframe to do the example on:
    df = spark.range(20)

    # get the jvm pointer
    jvm = spark.sparkContext._gateway.jvm
    # import the class
    from py4…
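For reference, a fuller sketch of this pattern. The Scala class name com.example.MyUdaf and the jar it ships in (passed via --jars) are assumptions, and the wrapper leans on pyspark's internal _to_java_column/_to_seq helpers:

    from pyspark.sql.column import Column, _to_java_column, _to_seq

    def my_udaf(col):
        # com.example.MyUdaf is a hypothetical Scala UDAF on the classpath
        sc = spark.sparkContext
        udaf = sc._gateway.jvm.com.example.MyUdaf()
        # hand the JVM UDAF a Seq of java columns and wrap the result back
        return Column(udaf.apply(_to_seq(sc, [col], _to_java_column)))

    df = spark.range(20)
    df.agg(my_udaf(df["id"]).alias("result")).show()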

Re: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Tobi Bosede
…UDAF in java/scala and wrap it for python use. If you need an example on how to do so I can provide one.
Assaf.
From: Tobi Bosede [mailto:ani.to...@gmail.com]  Sent: Sunday, October 16, 2016 7:49 PM  To: Holden Karau  Cc: user  Subjec…

Re: Aggregate UDF (UDAF) in Python

2016-10-16 Thread Tobi Bosede
On Sunday, October 16, 2016, Tobi Bosede <ani.to...@gmail.com> wrote:
Thanks for the info Holden.
So it seems both the JIRA and the comment on the developer list are over a year old. More surprising, the JIRA has no assignee. Any particular…

Re: Aggregate UDF (UDAF) in Python

2016-10-16 Thread Tobi Bosede
…Spark-Improvement-Proposals-td19422.html. The JIRA for tracking this issue is at https://issues.apache.org/jira/browse/SPARK-10915
On Sat, Oct 15, 2016 at 7:20 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
Hello, I am trying to use a UD…

Re: 回复:Spark-submit Problems

2016-10-15 Thread Tobi Bosede
…GatewayConnection.java:209) at java.lang.Thread.run(Unknown Source)
On Sat, Oct 15, 2016 at 10:06 PM, Mekal Zheng <mekal.zh...@gmail.com> wrote:
Show me your code
2016-10-16 08:24 +0800, hxfeng <980548...@qq.com> wrote:
show you pi.py code and what is the…

Aggregate UDF (UDAF) in Python

2016-10-15 Thread Tobi Bosede
Hello, I am trying to use a UDF that calculates the inter-quartile range (IQR) for pivot() and SQL in pyspark, and in both scenarios I got an error saying that my function wasn't an aggregate function. Does anyone know if UDAF functionality is available in Python? If not, what can I do as a workaround?
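One workaround that avoids a Python UDAF entirely, assuming a HiveContext-backed DataFrame df with columns group and value (both assumptions): Hive's built-in percentile_approx aggregate can supply the quartiles for the IQR.

    from pyspark.sql import functions as F

    # IQR per group via Hive's percentile_approx aggregate
    iqr = (F.expr("percentile_approx(value, 0.75)")
           - F.expr("percentile_approx(value, 0.25)")).alias("iqr")
    df.groupBy("group").agg(iqr).show()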

Spark-submit Problems

2016-10-15 Thread Tobi Bosede
Hi everyone, I am having problems submitting an app through spark-submit when the master is not "local". However, the pi.py example which comes with Spark works with any master. I believe my script has the same structure as pi.py, but for some reason my script is not as flexible. Specifically, the…
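For comparison, pi.py builds its SparkContext without hard-coding a master, which is what lets spark-submit --master decide where the app runs; a script that calls setMaster("local") itself loses that flexibility (a common cause of this symptom, though only an assumption about the script here). A minimal sketch of the flexible setup:

    from pyspark import SparkContext

    # no setMaster() call here, so --master on the command line wins
    sc = SparkContext(appName="MyApp")
    print(sc.parallelize(range(100)).sum())
    sc.stop()

Submitted with, e.g., spark-submit --master spark://host:7077 my_app.py (host:7077 is a placeholder).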

Re: MLib Documentation Update Needed

2016-09-28 Thread Tobi Bosede
…the case of log-loss, not multiplied (if that's what you're saying). Those are decent improvements; feel free to open a pull request / JIRA.
On Mon, Sep 26, 2016 at 6:22 AM, Tobi Bosede <ani.to...@gmail.com> wrote:
The loss function here for logisti…

MLib Documentation Update Needed

2016-09-25 Thread Tobi Bosede
The loss function here for logistic regression is confusing. It seems to imply that Spark uses only -1 and 1 class labels. However, it actually uses 0 and 1, as the very inconspicuous note quoted below (under Classification)…
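For reference, the logistic loss as the MLlib docs write it, which is the source of the confusion: the math denotes the binary labels as +1 and -1, while spark.mllib represents the negative label as 0 for consistency with multiclass labeling.

    L(\mathbf{w}; \mathbf{x}, y) = \log\left(1 + \exp(-y\, \mathbf{w}^{T} \mathbf{x})\right), \qquad y \in \{-1, +1\}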

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
…'offset'. That's orthogonal. You're also suggesting making withMean=True the default, which we don't want. The point is that if this is *explicitly* requested, the scaler shouldn't refuse to subtract the mean from a sparse vector, and fail.

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
…if this is *explicitly* requested, the scaler shouldn't refuse to subtract the mean from a sparse vector, and fail.
On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
Sean, I have created a JIRA; I hope you don't mind that I b…

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
…hang herself with, and blocks legitimate usages (we ran into this last week and couldn't use StandardScaler as a result).
I'm personally supportive of the change and don't see a JIRA. I think you could at least…

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
…elements. That would let you shift all values while preserving a sparse representation. I'm not sure if it's worth implementing, but it would help this case.
On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
Hi everyone,…

Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Hi everyone, I am doing some standardization using StandardScaler on data from VectorAssembler, which is represented as sparse vectors. I plan to fit a regularized model. However, StandardScaler does not allow the mean to be subtracted from sparse vectors. It will only divide by the standard deviation…
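Until that restriction changes, one workaround is to densify the assembled vectors before scaling, at the cost of memory proportional to the full dimensionality. A minimal sketch for Spark 1.6's ml pipeline, assuming `assembled` is the VectorAssembler output with a sparse `features` column:

    from pyspark.ml.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    # convert each sparse vector to a dense one so withMean=True is accepted
    to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

    dense_df = assembled.withColumn("dense_features", to_dense("features"))
    scaler = StandardScaler(inputCol="dense_features", outputCol="scaled",
                            withMean=True, withStd=True)
    scaled = scaler.fit(dense_df).transform(dense_df)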

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-19 Thread Tobi Bosede
    from pyspark.sql import Row
    rdd2 = filteredRDD.map(lambda v: Row(features=v))
    df = rdd2.toDF()
Thanks, Yanbo
2016-07-16 14:51 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
Hi Yanbo, appreciate the response. I mig…

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Tobi Bosede
filteredRDD.collect()
However, we strongly recommend you migrate to the DataFrame-based API, since the RDD-based API has been switched to maintenance mode.
Thanks, Yanbo
2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
Hi everyone,…
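A sketch of the DataFrame-based equivalent the reply recommends (pyspark.ml.feature.ChiSqSelector, available to Python from Spark 2.0; the column names and numTopFeatures value are assumptions):

    from pyspark.ml.feature import ChiSqSelector

    selector = ChiSqSelector(numTopFeatures=50, featuresCol="features",
                             labelCol="label", outputCol="selected")
    result = selector.fit(df).transform(df)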

Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-14 Thread Tobi Bosede
Hi everyone, I am trying to filter my features based on the spark.mllib ChiSqSelector:
    filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label, model.transform(lp.features)))
However, when I do the following I get the error below. Is there any other way to filter my data to avoid…
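One way around the error, if it stems from calling the JVM-backed model inside a map closure (which PySpark cannot serialize), is to transform the whole features RDD at once and zip the labels back on. A sketch:

    from pyspark.mllib.regression import LabeledPoint

    features = vectorizedTestPar.map(lambda lp: lp.features)
    labels = vectorizedTestPar.map(lambda lp: lp.label)

    # ChiSqSelectorModel.transform accepts an RDD of vectors as a whole
    filteredData = labels.zip(model.transform(features)).map(
        lambda pair: LabeledPoint(pair[0], pair[1]))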

chisqSelector in Python

2016-07-11 Thread Tobi Bosede
Hi all, there is no Python example for ChiSqSelector at the link below: https://spark.apache.org/docs/1.4.1/mllib-feature-extraction.html#chisqselector
So I am converting the Scala code to Python. I "translated" the following code:
    val discretizedData = data.map { lp =>…
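A possible Python rendering of that docs snippet, which bins each feature value into 16-unit buckets before the chi-squared test (data is assumed to be an RDD of LabeledPoint with nonnegative features, as in the docs example):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    # int() truncation equals floor for the nonnegative pixel values used here
    discretizedData = data.map(lambda lp: LabeledPoint(
        lp.label,
        Vectors.dense([float(int(x / 16)) for x in lp.features.toArray()])))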

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
…Parquet. The library will let you write out invalid files that can't be read back, so we added this check.
You can call .format("csv") (in Spark 2.0) to switch it to CSV.
On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede <ani.to...@gmail.com> wrote:
Hi ev…
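The suggested switch, sketched for Spark 2.0 (the output path is a placeholder):

    # CSV headers carry no restriction on column-name characters,
    # unlike the Parquet writer's check described above
    df.write.format("csv").option("header", "true").save("/tmp/out")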

Saving Table with Special Characters in Columns

2016-07-11 Thread Tobi Bosede
Hi everyone, I am trying to save a data frame with special characters in the column names as a table in Hive. However, I am getting the following error. Is the only solution to rename all the columns? Or is there some argument that can be passed into saveAsTable() or write.parquet()…
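If renaming does turn out to be the only route, it can at least be done in one pass rather than column by column. A sketch, where the sanitization rule and the table name are assumptions:

    import re

    def sanitize(name):
        # replace anything the Parquet/Hive writers may reject with an underscore
        return re.sub(r"[^0-9a-zA-Z_]", "_", name)

    clean = df.toDF(*[sanitize(c) for c in df.columns])
    clean.write.saveAsTable("my_table")  # hypothetical table name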