Hi Samir,
either use the *dataframe.na.fill()* method or the *nvl()* SQL function
when selecting features:
val train = sqlContext.sql("SELECT ... nvl(Field, 1.0) AS Field ...
FROM test")
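If you prefer the DataFrame API over SQL, a minimal sketch of the na.fill alternative (the column name and default value here are illustrative, and a SparkSession/DataFrame is assumed to exist):

```scala
// Assumes a DataFrame `df` with a nullable numeric column "Field".
// Replaces nulls in "Field" with 1.0 before selecting features.
val train = df.na.fill(1.0, Seq("Field"))
```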
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Wed, Aug 10, 2016, at 11:19, Yanbo Liang wrote:
> Hi S
f 6 nodes, 16 cores/node, 64 GB RAM/node => gives: 17 executors,
> 19 GB/exec, 5 cores/exec
> No more than 5 cores per executor
> Leave some cores/RAM for the driver
More on the matter here
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
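The arithmetic behind those numbers can be sketched as follows (the 1-core/1-GB per-node reservation and the ~7% memory overhead are assumptions following the conventions in the linked talk):

```scala
// Hypothetical sizing walk-through for 6 nodes x 16 cores x 64 GB RAM each.
val nodes = 6
val usableCoresPerNode = 16 - 1          // reserve 1 core/node for OS & Hadoop daemons
val coresPerExecutor   = 5               // cap at 5 to keep HDFS client throughput
val executorsPerNode   = usableCoresPerNode / coresPerExecutor   // 3
val totalExecutors     = nodes * executorsPerNode - 1            // 17 (one slot for the driver)
val memPerExecutorGb   = ((64 - 1) / executorsPerNode * (1 - 0.07)).toInt // 19 GB after overhead
```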
temporary table, we add a unique, incremented,
thread-safe id (AtomicInteger) to its name so that only
specific, non-shared temporary tables are used for a test.
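A minimal sketch of that naming scheme (the object and method names are illustrative, not the actual test-suite code):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: each test gets its own non-shared temp view name.
object TempTableNames {
  private val counter = new AtomicInteger(0)
  def unique(base: String): String = s"${base}_${counter.incrementAndGet()}"
}

// e.g. df.createOrReplaceTempView(TempTableNames.unique("users"))
```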
--
Bedrytski Aliaksandr
sp...@bedryt.ski
> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
> Hi!
>
> Just
f I'm wrong), if you already have >1 specs
per test, the CPU will already be saturated, so fully parallel execution
of tests will not give additional gains.
Regards
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Sun, Aug 21, 2016, at 18:30, Everett Anderson wrote:
>
>
> On Sun, A
E,'yyyy-MM-dd') >=
> unix_timestamp(demand_timefence_end_date, 'yyyy-MM-dd')
> """)
This is if demand_timefence_end_date has the 'yyyy-MM-dd' date format.
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Wed, Aug 24,
dataframe.
This way it won't hit performance too much.
Regards
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
> Hi,
>
> what is the best way to calculate intermediate column statistics like
> the number of empty values
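A one-pass sketch of such intermediate column statistics (the empty-value definition used here, null or empty string, is an assumption):

```scala
import org.apache.spark.sql.functions.{col, count, when}

// Hypothetical sketch: count empty/null values for every column in a single
// pass, so the underlying data is only scanned once.
val emptyCounts = df.select(df.columns.map { c =>
  count(when(col(c).isNull || col(c) === "", true)).alias(s"${c}_empty")
}: _*)
```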
Hi Mich,
I was wondering what the advantages are of using helper methods instead
of one multiline SQL string.
(I rarely (if ever) use helper methods, but maybe I'm missing something)
Regards
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh
.
Or (if the file is expected to be larger than bash tools can handle) you
could iterate over the resulting WrappedArray and create a case class
for each line.
PS: I wonder where the *meta* object from the json goes.
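A sketch of that iteration (the `Record` fields, the comma-separated layout, and the `wrappedLines` collection are illustrative assumptions, not the poster's actual schema):

```scala
// Hypothetical sketch: map each raw line in the WrappedArray to a case class.
case class Record(id: Long, name: String)

val records: Seq[Record] = wrappedLines.map { line =>
  val Array(id, name) = line.split(",", 2)
  Record(id.trim.toLong, name.trim)
}
```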
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Mon, Aug 29, 2016, at 11:27
don't really matter.
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I install a spark standalone and run the spark cluster(one master and one
> worker) in a windows 2008 server with 16cores and 24GB memory.
>
> I have d
Hi xiefeng,
Even if your RDDs are tiny and reduced to one partition, there is always
orchestration overhead (sending tasks to executors, collecting results,
etc.); these things are not free.
If you need fast, [near] real-time processing, look towards
spark-streaming.
Regards,
--
Bedrytski
s
ambiguity problems.
Regards
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Fri, Sep 9, 2016, at 19:33, xingye wrote:
> Not sure whether this is the right distribution list that I can ask
> questions. If not, can someone give a distribution list that can find
> someone to help?
>
> I
Hi Saurabh,
you may use the BuildInfo[1] sbt plugin to access values defined in
build.sbt
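A minimal build.sbt sketch (assuming the sbt-buildinfo plugin is already on the plugin classpath; the package name is illustrative):

```scala
// build.sbt
enablePlugins(BuildInfoPlugin)

buildInfoKeys    := Seq[BuildInfoKey](name, version, scalaVersion, sbtVersion)
buildInfoPackage := "com.example.build"

// The generated object is then available in application code:
//   com.example.build.BuildInfo.version
```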
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Mon, Sep 19, 2016, at 18:28, Saurabh Malviya (samalviy) wrote:
> Hi,
>
> Is there any way equivalent to profiles in maven in sbt. I want spar
l the executors in one output.
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Thu, Sep 22, 2016, at 06:06, Divya Gehlot wrote:
> Hi,
> I have initialised the logging in my spark App
> /* Initialize Logging */
> val log = Logger.getLogger(getClass.getName)
>
> Logger
how to read it as a table (by transforming it to a
DataFrame)
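A minimal sketch of that transformation (the case class, data, and view name are illustrative):

```scala
// Assumes a SparkSession `spark`; the import brings in the .toDF() implicit.
import spark.implicits._

case class Person(name: String, age: Int)

val df = Seq(Person("alice", 30), Person("bob", 25)).toDF()
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()
```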
Regards
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote:
> after having gotten used to have case classes represent complex
> structures in Datasets, i am surprised to find out tha
'Nan'
> """)
This query filters out rows containing NaN for a table with 3 columns.
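Spelled out, a hypothetical version of that query for columns c1..c3 could look like the following (the string comparison assumes the columns were read as strings; for numeric columns, the built-in isnan() would be the idiomatic test):

```scala
// Hypothetical reconstruction: drop rows where any of the three columns is NaN.
val clean = spark.sql("""
  SELECT * FROM t
  WHERE c1 != 'NaN' AND c2 != 'NaN' AND c3 != 'NaN'
""")
```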
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Mon, Sep 26, 2016, at 09:30, muhammet pakyürek wrote:
>
> is there any way to do this directly. if its not, is there any todo
> this indirectly using another datastrcutures of spark
>
Hi Muhammet,
Python also supports SQL queries:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Mon, Sep 26, 2016, at 10:01, muhammet pakyürek wrote:
>
>
>
> but my requst i
lines")
spark.sql("SELECT cast(value as FLOAT) from lines").show()
+-----+
|value|
+-----+
| null|
|  1. |
| null|
| 8.6 |
+-----+
After it you may filter the DataFrame for values containing null.
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Wed, Sep 28, 2016, at 10
y lose the optimisations given by lining up the 3 steps
in one operation).
If there is a second action executed on any of the transformations,
persisting the farthest common transformation would be a good idea.
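A sketch of the idea (the transformations and column names are illustrative):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: two actions share the filter+select lineage,
// so persist the farthest common transformation once.
val common = raw.filter(col("status") === "ok")
                .select(col("key"), col("value"))
                .persist()

val counts = common.groupBy("key").count().collect()  // action 1: computes and caches
val total  = common.count()                           // action 2: reuses the cached data
```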
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski
On Thu, Sep 29, 2016, at 07:09, Shus