Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-15 Thread Nicholas Sharkey

On Thu, Mar 15, 2018 at 8:00 PM, Alan Featherston Lago 
wrote:

> I'm a pretty new user of Spark and I've run into this issue with the
> PySpark docs:
>
> The functions pyspark.sql.functions.to_date and
> pyspark.sql.functions.to_timestamp
> behave in the same way: both convert a Column of
> pyspark.sql.types.StringType or pyspark.sql.types.TimestampType into
> pyspark.sql.types.DateType.
>
> Shouldn't the function `to_timestamp` return
> pyspark.sql.types.TimestampType?
> Also, the to_timestamp docs say that "By default, it follows casting rules
> to pyspark.sql.types.TimestampType if the format is omitted (equivalent
> to col.cast("timestamp"))", which doesn't seem to be right, i.e.:
>
> to_timestamp(current_timestamp()) <> current_timestamp().cast("timestamp")
>
>
> This is wrong, right? Or am I missing something? (Is this due to the
> underlying JVM data types?)
>
>
> Cheers,
> alan
>
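
A quick way to check the actual return types (a minimal sketch, assuming
Spark 2.2+ and an active SparkSession named `spark`):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, to_timestamp

spark = SparkSession.builder.getOrCreate()

# One string column holding a timestamp-formatted value.
df = spark.createDataFrame([("1997-02-28 10:30:00",)], ["t"])

# Per the docs, `d` should come back as date and `ts` as timestamp;
# per the report above, both come back as date.
df.select(to_date(df.t).alias("d"),
          to_timestamp(df.t).alias("ts")).printSchema()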


Re: H2O DataFrame to Spark RDD/DataFrame

2017-01-12 Thread Nicholas Sharkey
Page 33 of the Sparkling Water Booklet:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/SparklingWaterBooklet.pdf

# Either form reads an H2O frame into a Spark DataFrame via the "h2o" source:
df = sqlContext.read.format("h2o").option("key", frame.frame_id).load()

df = sqlContext.read.format("h2o").load(frame.frame_id)
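
The same conversion is also exposed through H2OContext in PySparkling (a
sketch, assuming Sparkling Water's Python API, an existing SparkContext
`sc`, and an H2O frame `frame`):

from pysparkling import H2OContext

hc = H2OContext.getOrCreate(sc)

# H2O frame -> Spark DataFrame (the direction asked about below).
spark_df = hc.as_spark_frame(frame)

# Spark DataFrame -> H2O frame (the direction the booklet covers in detail).
h2o_frame = hc.as_h2o_frame(spark_df)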

On Thu, Jan 12, 2017 at 1:17 PM, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi there,
>
> Is there any way to convert an H2O DataFrame to an equivalent Spark RDD or
> DataFrame? I found good documentation on "*Machine Learning with
> Sparkling Water: H2O + Spark*" here:
> 
>
> However, it discusses how to convert a Spark RDD or DataFrame to an H2O
> DataFrame, but not vice versa.
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>


Re: Spark ML : One hot Encoding for multiple columns

2016-11-13 Thread Nicholas Sharkey
Amen 

> On Nov 13, 2016, at 7:55 PM, janardhan shetty  wrote:
> 
> These JIRAs are still unresolved:
> https://issues.apache.org/jira/browse/SPARK-11215
> 
> Also there is https://issues.apache.org/jira/browse/SPARK-8418
> 
>> On Wed, Aug 17, 2016 at 11:15 AM, Nisha Muktewar  wrote:
>> 
>> The OneHotEncoder does not accept multiple columns.
>> 
>> You can use Michal's suggestion, where he uses a Pipeline to set the stages
>> and then executes them.
>> 
>> The other option is to write a function that performs one-hot encoding on a
>> column and returns a dataframe with the encoded column, and then call it
>> multiple times for the rest of the columns.
>> 
>> 
>> 
>> 
>>> On Wed, Aug 17, 2016 at 10:59 AM, janardhan shetty  
>>> wrote:
>>> I had already tried this way :
>>> 
>>> scala> val featureCols = Array("category","newone")
>>> featureCols: Array[String] = Array(category, newone)
>>> 
>>> scala>  val indexer = new 
>>> StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
>>> <console>:29: error: type mismatch;
>>>  found   : Array[String]
>>>  required: String
>>> val indexer = new 
>>> StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
>>> 
>>> 
 On Wed, Aug 17, 2016 at 10:56 AM, Nisha Muktewar  
 wrote:
 I don't think it does. From the documentation: 
 https://spark.apache.org/docs/2.0.0-preview/ml-features.html#onehotencoder,
  I see that it still accepts one column at a time.
 
> On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty 
>  wrote:
> 2.0:
> 
> One-hot encoding currently accepts a single input column; is there a way to
> include multiple columns?
 
>>> 
>> 
> 
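
A sketch of the per-column function Nisha describes above (assuming the
PySpark ML API in Spark 2.x, where OneHotEncoder is a plain Transformer but
needs a numeric index, so each string column is indexed first; `df1` and the
column names are the examples from this thread):

from pyspark.ml.feature import OneHotEncoder, StringIndexer

def one_hot_encode(df, col):
    # Index the string column, then append its one-hot encoded vector.
    indexed = (StringIndexer(inputCol=col, outputCol=col + "Index")
               .fit(df).transform(df))
    return OneHotEncoder(inputCol=col + "Index",
                         outputCol=col + "Vector").transform(indexed)

# Call it once per categorical column.
for c in ["category", "newone"]:
    df1 = one_hot_encode(df1, c)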


Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nicholas Sharkey
I did get *some* help from Databricks in terms of programmatically grabbing
the categorical variables, but I can't figure out where to go from here:

# Get all string cols/categorical cols
stringColList = [i[0] for i in df.dtypes if i[1] == 'string']

# Generate OHE stages for every col in stringColList
OHEstages = [OneHotEncoder(inputCol=categoricalCol,
                           outputCol=categoricalCol + "Vector")
             for categoricalCol in stringColList]



On Fri, Nov 11, 2016 at 2:00 PM, Nick Pentreath 
wrote:

> For now OHE supports a single column, so you have to have 1000 OHEs in a
> pipeline. However, you can add them programmatically, so it is not too bad. If
> the cardinality of each feature is quite low, it should be workable.
>
> After that, use VectorAssembler to stitch the vectors together (it
> accepts multiple input columns).
>
> The other approach is - if your features are all categorical - to encode
> the features as "feature_name=feature_value" strings. This can
> unfortunately only be done with RDD ops since a UDF can't accept multiple
> columns as input at this time. You can create a new column with all the
> feature name/value pairs as a list of strings ["feature_1=foo",
> "feature_2=bar", ...]. Then use CountVectorizer to create your binary
> vectors. This basically works like the DictVectorizer in scikit-learn.
>
>
>
> On Fri, 11 Nov 2016 at 20:33 nsharkey  wrote:
>
>> I have a dataset in which I need to convert some of the variables to
>> dummy variables. The get_dummies function in Pandas works perfectly on
>> smaller datasets, but since it collects, I'll always be bottlenecked by the
>> master node.
>>
>> I've looked at Spark's OHE feature, and while that would work in theory, I
>> have over a thousand variables to convert, so I don't want to have to
>> do 1000+ OHEs. My project is pretty simple in scope: read in a raw CSV,
>> convert the categorical variables into dummy variables, then save the
>> transformed data back to CSV. That is why I'm so interested in get_dummies,
>> but it's not scalable enough for my data size (500-600GB per file).
>>
>> Thanks in advance.
>>
>> Nick
>>
>>
>
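
A sketch of the CountVectorizer approach described above (assuming Spark 2.x
and that every column of `df` is a categorical string):

from pyspark.ml.feature import CountVectorizer

cols = df.columns

# Build one "feature_name=feature_value" token per column, per row.
tokens = (df.rdd
          .map(lambda row: (["{}={}".format(c, row[c]) for c in cols],))
          .toDF(["tokens"]))

# binary=True yields 0/1 indicators, like scikit-learn's DictVectorizer.
cv = CountVectorizer(inputCol="tokens", outputCol="features", binary=True)
encoded = cv.fit(tokens).transform(tokens)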


Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nicholas Sharkey
I have a dataset in which I need to convert some of the variables to dummy
variables. The get_dummies function in Pandas works perfectly on smaller
datasets, but since it collects, I'll always be bottlenecked by the master
node.

I've looked at Spark's OHE feature, and while that would work in theory, I
have over a thousand variables to convert, so I don't want to have to
do 1000+ OHEs. My project is pretty simple in scope: read in a raw CSV,
convert the categorical variables into dummy variables, then save the
transformed data back to CSV. That is why I'm so interested in get_dummies,
but it's not scalable enough for my data size (500-600GB per file).

Thanks in advance.

Nick