Re: How to populate all possible combination values in columns using Spark SQL

2020-05-09 Thread Aakash Basu
> > On Thu, May 7, 2020 at 22:46, Aakash Basu > wrote: > Hi, > > I've updated the SO question with masked data, added year column and other >

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-07 Thread Aakash Basu
ked) sample of > the data? It will be easier to see what you are trying to do if you add the > year column > > Thanks, > Sonal > Nube Technologies <http://www.nubetech.co> > > <http://in.linkedin.com/in/sonalgoyal> > > > > > On Thu, May 7, 2020 at 10:26

How to populate all possible combination values in columns using Spark SQL

2020-05-06 Thread Aakash Basu
Hi, I've described the problem in Stack Overflow with a lot of detailing, can you kindly check and help if possible? https://stackoverflow.com/q/61643910/5536733 I'd be absolutely fine if someone solves it using Spark SQL APIs rather than plain spark SQL query. Thanks, Aakash.

Which SQL flavor does Spark SQL follow?

2020-05-06 Thread Aakash Basu
Hi, Wish to know which type of SQL syntax is followed when we write a plain SQL query inside spark.sql? Is it MySQL or PGSQL? I know it isn't SQL Server or Oracle, as while migrating I had to convert a lot of SQL functions. Also, if you can provide the documentation which clearly states the above

unix_timestamp() equivalent in plain Spark SQL Query

2020-04-02 Thread Aakash Basu
Hi, What is the unix_timestamp() function equivalent in a plain spark SQL query? I want to subtract one timestamp column from another, but in plain SQL I am getting the error "Should be numeric or calendarinterval and not timestamp." But when I did it through the above function inside withColumn, it
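
For reference, a minimal sketch of how the same subtraction can be done inside a plain spark.sql query (the table and column names here are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # unix_timestamp() is available inside spark.sql() just as it is in withColumn();
    # converting both timestamps to epoch seconds makes the subtraction numeric
    diff_df = spark.sql("""
        SELECT unix_timestamp(end_ts) - unix_timestamp(start_ts) AS diff_seconds
        FROM events
    """)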

INTERVAL function not working

2020-04-02 Thread Aakash Basu
Hi, Am unable to solve a comparison between two timestamp fields' difference and a particular interval of time in Spark SQL. I've asked the question here: https://stackoverflow.com/questions/60995744 Thanks, Aakash.

Re: Upsert for hive tables

2019-05-29 Thread Aakash Basu
Don't you have a date/timestamp to handle updates? So, you're talking about CDC? If you've Datestamp you can check if that/those key(s) exists, if exists then check if timestamp matches, if that matches, then ignore, if that doesn't then update. On Thu 30 May, 2019, 7:11 AM Genieliu, wrote: >

Re: Upsert for hive tables

2019-05-29 Thread Aakash Basu
Why don't you simply copy whole of delta data (Table A) into a stage table (temp table in your case) and insert depending on a *WHERE NOT EXISTS* check on primary key/composite key which already exists in the table B? That's faster and does the reconciliation job smoothly enough. Others, any
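
A rough sketch of the stage-table suggestion above, assuming hypothetical tables stage_a (the delta) and table_b (the target) keyed on id, with spark being the active SparkSession:

    # insert only the delta rows whose key is not already present in the target
    spark.sql("""
        INSERT INTO table_b
        SELECT a.*
        FROM stage_a a
        WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.id = a.id)
    """)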

Fetching LinkedIn data into PySpark using OAuth2.0

2019-05-20 Thread Aakash Basu
Hi, Just curious to know if anyone was successful in connecting LinkedIn using OAuth2.0, client ID and client secret to fetch data and process in Python/PySpark. I'm getting stuck at connection establishment. Any help? Thanks, Aakash.

Data growth vs Cluster Size planning

2019-02-11 Thread Aakash Basu
Hi, I ran a dataset of *200 columns and 0.2M records* in a cluster of *1 master 18 GB, 2 slaves 32 GB each, **16 cores/slave*, took around *772 minutes* for a *very large ML tuning based job* (training). Now, my requirement is to run the *same operation on 3M records*. Any idea on how we should

Avoiding collect but use foreach

2019-01-31 Thread Aakash Basu
Hi, This: *to_list = [list(row) for row in df.collect()]* Gives: [[5, 1, 1, 1, 2, 1, 3, 1, 1, 0], [5, 4, 4, 5, 7, 10, 3, 2, 1, 0], [3, 1, 1, 1, 2, 2, 3, 1, 1, 0], [6, 8, 8, 1, 3, 4, 3, 7, 1, 0], [4, 1, 1, 3, 2, 1, 3, 1, 1, 0]] I want to avoid collect operation, but still convert the
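
One way to avoid a single big collect() is to pull rows with toLocalIterator(), which fetches one partition at a time; a sketch using the df from the snippet above (it still brings every row to the driver, just not all at once):

    # rows arrive partition by partition instead of in one collect() call
    to_list = [list(row) for row in df.toLocalIterator()]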

Silly Spark SQL query

2019-01-28 Thread Aakash Basu
Hi, How to do this when the column (malignant and prediction) names are stored in two respective variables? tp = test_transformed[(test_transformed.malignant == 1) & (test_transformed.prediction == 1)].count() Thanks, Aakash.

Re: Silly Spark SQL query

2019-01-28 Thread Aakash Basu
Well, it is done. Using: ma = "malignant" pre = "prediction" tp_test = test_transformed.filter((col(ma) == "1") & (col(pre) == "1")).count() On Mon, Jan 28, 2019 at 5:41 PM Aakash Basu wrote: > Hi, > > How to do this when the column
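
The same answer as a self-contained snippet, with the import that makes col() available (test_transformed is the dataframe from the question):

    from pyspark.sql.functions import col

    ma = "malignant"
    pre = "prediction"
    tp_test = test_transformed.filter((col(ma) == "1") & (col(pre) == "1")).count()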

Re: How to Overwrite a saved PySpark ML Model

2019-01-21 Thread Aakash Basu
d your help. Thanks, Aakash. On Mon, Jan 21, 2019 at 5:14 PM Aakash Basu wrote: > Hi, > > I am trying to overwrite a Spark ML Logistic Regression Model, but it > isn't working. > > Tried: > a) lr_model.write.overwrite().save(input_dict["config"]["save_model_pat

How to Overwrite a saved PySpark ML Model

2019-01-21 Thread Aakash Basu
Hi, I am trying to overwrite a Spark ML Logistic Regression Model, but it isn't working. Tried: a) lr_model.write.overwrite().save(input_dict["config"]["save_model_path"]) and b) lr_model.write.overwrite.save(input_dict["config"]["save_model_path"]) This works (if I do not want to overwrite):
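
For what it's worth, in PySpark ML write is a method, so the call chain that usually works is as below (the save path is a placeholder):

    # note the parentheses after write: write() returns an MLWriter
    lr_model.write().overwrite().save("/tmp/lr_model")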

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Any help, anyone? On Fri, Dec 21, 2018 at 2:21 PM Aakash Basu wrote: > Hey Shuporno, > > With the updated config too, I am getting the same error. While trying to > figure that out, I found this link which says I need aws-java-sdk (which I > already have): > https://github.c

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Choudhury < shuporno.choudh...@gmail.com> wrote: > Hi, > I don't know whether the following config (that you have tried) are > correct: > fs.s3a.awsAccessKeyId > fs.s3a.awsSecretAccessKey > > The correct ones probably are: > fs.s3a.access.key > fs.s3a.secret.key &g
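
A sketch of setting those s3a keys from PySpark via the spark.hadoop.* config prefix (credential values and bucket are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY") \
        .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY") \
        .getOrCreate()

    df = spark.read.csv("s3a://your-bucket/path/file.csv", header=True)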

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
ec 21, 2018 at 12:51 PM Shuporno Choudhury < shuporno.choudh...@gmail.com> wrote: > > > On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury < > shuporno.choudh...@gmail.com> wrote: > >> Hi, >> Your connection config uses 's3n' but your read command uses 's3a'

Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hi, I am trying to connect to AWS S3 and read a csv file (running a POC) from a bucket. I have s3cmd and am able to run ls and other operations from the cli. *Present Configuration:* Python 3.7 Spark 2.3.1 *JARs added:* hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
If it is so, how to update/fix the firewall issue? On Wed, Oct 3, 2018 at 1:14 PM Jörn Franke wrote: > Looks like a firewall issue > > Am 03.10.2018 um 09:34 schrieb Aakash Basu : > > The stac

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
connect(NetUtils.java:495) > at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614) > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Clien

How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
Hi, I have to read data stored in the HDFS of a different machine, and it needs to be accessed through Spark. How to do that? The full HDFS address along with the port doesn't seem to work. Has anyone done it before? Thanks, AB.
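
A minimal sketch of pointing the read at the other cluster's NameNode (host and port are placeholders; 8020 is a common RPC default), with spark being the active SparkSession:

    # full hdfs:// URI with the remote NameNode host and port
    df = spark.read.csv("hdfs://remote-namenode-host:8020/path/to/data.csv", header=True)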

Re: Time-Series Forecasting

2018-09-19 Thread Aakash Basu
Hey, Even though I'm more of a Data Engineer than a Data Scientist, I work closely with the DS guys extensively on Spark ML. It is something which they're still working on, following the scikit-learn trend, but I never saw Spark handling Time-Series problems. Talking about both

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Aakash Basu
Removing support for an API in a major release makes little sense; deprecating is always better. Removal can always be done two or three minor releases later. On Mon 17 Sep, 2018, 6:49 AM Felix Cheung, wrote: > I don’t think we should remove any API even in a major release without > deprecating it

Local vs Cluster

2018-09-14 Thread Aakash Basu
Hi, What is the Spark cluster equivalent of standalone's local[N]? I mean, which parameter in cluster mode takes the value we set as N in local[N]? Thanks, Aakash.

[Help] Set nThread in Spark cluster

2018-09-12 Thread Aakash Basu
Hi, API = ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier val xgbParam = Map("eta" -> 0.1f, | "max_depth" -> 2, | "objective" -> "multi:softprob", | "num_class" -> 3, | "num_round" -> 100, | "num_workers" -> 2) I'm running a job which

Which Py4J version goes with Spark 2.3.1?

2018-08-29 Thread Aakash Basu
Hi, Which Py4J version goes with Spark 2.3.1? I have py4j-0.10.7 but throws an error because of certain compatibility issues with the Spark 2.3.1. Error: [2018-08-29] [06:46:56] [ERROR] - Traceback (most recent call last): File "", line 120, in run File

RDD Collect Issue

2018-08-28 Thread Aakash Basu
Hi, I configured a new system, spark 2.3.0, python 3.6.0, dataframe read and other operations working as expected. But, RDD collect is failing - distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")

How to convert Spark Streaming to Static Dataframe on the fly and pass it to a ML Model as batch

2018-08-14 Thread Aakash Basu
Hi all, The requirement is, to process file using Spark Streaming fed from Kafka Topic and once all the transformations are done, make it a batch of static dataframe and pass it into a Spark ML Model tuning. As of now, I had been doing it in the below fashion - 1) Read the file using Kafka 2)

Accessing a dataframe from another Singleton class (Python)

2018-08-13 Thread Aakash Basu
Hi, I wanted to read a dataframe in one singleton class and use that in another singleton class. Below is my code - Class Singleton - class Singleton(object): _instances = {} def __new__(class_, *args, **kwargs): if class_ not in class_._instances: class_._instances[class_]
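
A cleaned-up sketch of the pattern being described, assuming the goal is to read the dataframe once and hand the same instance to any other caller (class and method names are made up):

    class DataHolder(object):
        # hypothetical singleton that caches a dataframe on first read
        _instances = {}

        def __new__(cls, *args, **kwargs):
            if cls not in cls._instances:
                cls._instances[cls] = super(DataHolder, cls).__new__(cls)
            return cls._instances[cls]

        def get_df(self, spark, path):
            if not hasattr(self, "_df"):
                self._df = spark.read.csv(path, header=True, inferSchema=True)
            return self._df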

Re: How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
FYI The relevant StackOverflow query on the same - https://stackoverflow.com/questions/51610482/how-to-do-pca-with-spark-streaming-dataframe On Tue, Jul 31, 2018 at 3:18 PM, Aakash Basu wrote: > Hi, > > Just curious to know, how can we run a Principal Component Analysis on > st

How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
Hi, Just curious to know, how can we run a Principal Component Analysis on streaming data in distributed mode? If we can, is it mathematically valid enough? Has anyone done that before? Can you guys share your experience with it? Is there any API Spark provides to do the same on Spark Streaming

Re: Query on Profiling Spark Code

2018-07-31 Thread Aakash Basu
Okay, sure! On Tue, Jul 31, 2018 at 1:06 PM, Patil, Prashasth < prashasth.pa...@spglobal.com> wrote: > Hi Aakash, > > On a related note, you may want to try SparkLens for profiling which is > quite helpful in my opinion. > > > > > > -Prash > > &g

Query on Profiling Spark Code

2018-07-17 Thread Aakash Basu
Hi guys, I'm trying to profile my Spark code with cProfiler and check where most of the time is taken. I found the most time is taken by some socket object, and I'm quite clueless as to where it is used. Can anyone shed some light on this?

Re: Inferring Data driven Spark parameters

2018-07-04 Thread Aakash Basu
ontext. > On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu > wrote: > > > > We aren't using Oozie or similar, moreover, the end to end job shall be > exactly the same, but the data will be extremely different (number of > continuous and categorical columns, vertical size,

Re: [G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Aakash Basu
hould be no space > after the colon symbol > On Tue, Jul 3, 2018 at 3:01 AM Aakash Basu > wrote: > > > > Hi, > > > > I used the below in the Spark Submit for using G1GC - > > > > --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" > >

Re: Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
t; > On 3. Jul 2018, at 09:34, Aakash Basu wrote: > > Hi, > > Cluster - 5 node (1 Driver and 4 workers) > Driver Config: 16 cores, 32 GB RAM > Worker Config: 8 cores, 16 GB RAM > > I'm using the below parameters from which I know the first chunk is > cluster depe

Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
Hi, Cluster - 5 node (1 Driver and 4 workers) Driver Config: 16 cores, 32 GB RAM Worker Config: 8 cores, 16 GB RAM I'm using the below parameters from which I know the first chunk is cluster dependent and the second chunk is data/code dependent. --num-executors 4 --executor-cores 5

[G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Aakash Basu
Hi, I used the below in the Spark Submit for using G1GC - --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" Now, I want to use *-XX: -ResizePLAB *of the G1GC to control to avoid performance degradation caused by a large number of thread communications. How to do it? I tried submitting in
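
As the reply above notes, the extra flag goes inside the same quoted string and there should be no space after the colon; a sketch of the equivalent SparkSession config (the same value can also be passed with --conf on spark-submit):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:-ResizePLAB") \
        .getOrCreate()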

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Aakash Basu
oduce this problem? It would be very helpful that the > community will address this problem. > > Best regards, > Kazuaki Ishizaki > > > > From:vaquar khan > To:Eyal Zituny > Cc:Aakash Basu , user < > user@spark.apache.org> > Date:

Way to avoid CollectAsMap in RandomForest

2018-06-20 Thread Aakash Basu
Hi, I'm running the RandomForest model from the Spark ML API on medium-sized data (2.25 million rows and 60 features); most of my time goes in the CollectAsMap of RandomForest, but I've no option to avoid it as it is in the API. Is there a way to cut short my end-to-end runtime? Thanks, Aakash.

G1GC vs ParallelGC

2018-06-20 Thread Aakash Basu
Hi guys, I just wanted to know, why my ParallelGC (*--conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"*) in a very long Spark ML Pipeline works faster than when I set G1GC (*--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"*), even though the Spark community suggests G1GC to be much

Re: Best way to process this dataset

2018-06-19 Thread Aakash Basu
Georg, just asking, can Pandas handle such a big dataset? If that data is further passed into using any of the sklearn modules? On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler wrote: > use pandas or dask > > If you do want to use spark store the dataset as parquet / orc. And then > continue to

Fwd: StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
*Correction, 60C2 * 3* -- Forwarded message -- From: Aakash Basu Date: Mon, Jun 18, 2018 at 4:15 PM Subject: StackOverFlow ERROR - Bulk interaction for many columns fail To: user Hi, When doing bulk interaction on around 60 columns, I want 3 columns to be created out of each

StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
Hi, When doing bulk interaction on around 60 columns, I want 3 columns to be created out of each one of them, since it has a combination of 3, then it becomes 60N2 * 3, which creates a lot of columns. So, for a lesser than 50 - 60 columns, even though it takes time, it still works fine, but, for

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread Aakash Basu
Vaquar khan > > On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu > wrote: > >> Hi guys, >> >> I'm getting an error when I'm feature engineering on 30+ columns to >> create about 200+ columns. It is not failing the job, but the ERROR shows. >> I want to know ho

[Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread Aakash Basu
Hi guys, I'm getting an error when I'm doing feature engineering on 30+ columns to create about 200+ columns. It is not failing the job, but the ERROR shows. I want to know how I can avoid this. Spark - 2.3.1 Python - 3.6 Cluster Config - 1 Master - 32 GB RAM, 16 Cores 4 Slaves - 16 GB RAM, 8 Cores
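
One workaround that is often suggested for this warning (at the cost of losing whole-stage code generation for the affected plans) is to switch it off; a hedged sketch, with spark being the active SparkSession:

    # assumption: the oversized generated method comes from whole-stage codegen,
    # so disabling it makes Spark fall back to the interpreted path for those stages
    spark.conf.set("spark.sql.codegen.wholeStage", "false")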

Understanding Event Timeline of Spark UI

2018-06-15 Thread Aakash Basu
Hi, I've a job running which shows the Event Timeline as follows. I am trying to understand the gaps between these single lines; they seem to be parallel but not immediately sequential with other stages. Any other insight from this, and what is the cluster doing during these gaps? Thanks, Aakash.

Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-14 Thread Aakash Basu
Hi, Downloaded the latest Spark version because of the fix for "ERROR AsyncEventQueue:70 - Dropping event from queue appStatus." After setting environment variables and running the same code in PyCharm, I'm getting this error, for which I can't find a solution. Exception in thread "main"

Re: Using G1GC in Spark

2018-06-14 Thread Aakash Basu
.html#runtime-environment> > . > > --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" > > Regards, > Srinath. > > > On Thu, Jun 14, 2018 at 4:44 PM Aakash Basu > wrote: > >> Hi, >> >> I am trying to spark submit with G1GC for garbage c

Using G1GC in Spark

2018-06-14 Thread Aakash Basu
Hi, I am trying to spark submit with G1GC for garbage collection, but it isn't working. What is the way to deploy a spark job with G1GC? Tried - *spark-submit --master spark://192.168.60.20:7077 --conf -XX:+UseG1GC /appdata/bblite-codebase/test.py* Didn't work.

Crosstab/AproxQuantile Performance on Spark Cluster

2018-06-14 Thread Aakash Basu
Hi all, Is the Event Timeline representing a good shape? I mean at a point, to calculate WoE columns on categorical variables, I am having to do crosstab on each column, and on a cluster of 4 nodes, it is taking time as I've 230+ columns and 60,000 rows. How to make it more performant?

Inferring from Event Timeline

2018-06-13 Thread Aakash Basu
Hi guys, What all can be inferred by closely watching an event time-line in Spark UI? I generally monitor the tasks taking more time and also how much in parallel they're spinning. What else? Eg Event Time-line from Spark UI: Thanks, Aakash.

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
; Regards, > Srinath. > > > On Tue, Jun 12, 2018 at 1:39 PM Aakash Basu > wrote: > >> Yes, but when I did increase my executor memory, the spark job is going >> to halt after running a few steps, even though, the executor isn't dying. >> >> Data -

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
: > Aakash, > > Like Jorn suggested, did you increase your test data set? If so, did you > also update your executor-memory setting? It seems like you might exceeding > the executor memory threshold. > > Thanks > Vamshi Talla > > Sent from my iPhone > > On Jun 11, 20

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
oon as > it gets bigger you will see usage of more nodes. > > Hence increase your testing Dataset . > > On 11. Jun 2018, at 12:22, Aakash Basu wrote: > > Jorn - The code is a series of feature engineering and model tuning > operations. Too big to show. Yes, data volume is to

[Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi, I have submitted a job on a *4 node cluster*, where I see most of the operations happening at one of the worker nodes while the other two are simply chilling out. The picture below sheds light on that - How to properly distribute the load? My cluster conf (4 node cluster [1 driver; 3 slaves]) -

Re: Spark YARN job submission error (code 13)

2018-06-08 Thread Aakash Basu
Fixed by adding 2 configurations in yarn-site.xml. Thanks all! On Fri, Jun 8, 2018 at 2:44 PM, Aakash Basu wrote: > Hi, > > I fixed that problem by putting all the Spark JARS in spark-archive.zip > and putting it in the HDFS (as that problem was happening for that reason) - > &g

Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Aakash Basu
Fixed by adding 2 configurations in yarn-site.xml. Thanks all! On Fri, Jun 8, 2018 at 2:44 PM, Aakash Basu wrote: > Hi, > > I fixed that problem by putting all the Spark JARS in spark-archive.zip > and putting it in the HDFS (as that problem was happening for that reason) -

Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Aakash Basu
ark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. > > > Check once on yarn logs > > Thanks, > Sathish- > > > On Fri, Jun 8, 2018 at 2:22 PM, Jeff Zhang wrote: > >> >> Check the yarn AM l

Re: Spark YARN job submission error (code 13)

2018-06-08 Thread Aakash Basu
y, no clue, anyone faced this problem, any help on this? Thanks, Aakash. On Fri, Jun 8, 2018 at 2:17 PM, Saisai Shao wrote: > In Spark on YARN, error code 13 means SparkContext doesn't initialize in > time. You can check the yarn application log to get more information. > > BTW, did you

Spark YARN Error - triggering spark-shell

2018-06-08 Thread Aakash Basu
Hi, Getting this error when trying to run Spark Shell using YARN - Command: *spark-shell --master yarn --deploy-mode client* 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 2018-06-08 13:39:25

Spark YARN job submission error (code 13)

2018-06-08 Thread Aakash Basu
Hi, I'm trying to run a program on a cluster using YARN. YARN is present there along with HADOOP. Problem I'm running into is as below - Container exited with a non-zero exit code 13 > Failing this attempt. Failing the application. > ApplicationMaster host: N/A > ApplicationMaster

Fundamental Question on Spark's distribution

2018-06-07 Thread Aakash Basu
Hi all, *Query 1)* Need serious help! I'm running feature engineering of different types on a dataset and trying to benchmark by tweaking different types of Spark properties. I don't know where it is going wrong that a single machine is working faster than a 3 node cluster, even though,

[Spark Streaming] Distinct Count on unrelated columns

2018-06-06 Thread Aakash Basu
Hi guys, Posted a question (link) on StackOverflow, any help? Thanks, Aakash.

Re: Append In-Place to S3

2018-06-02 Thread Aakash Basu
As Jay suggested correctly, if you're joining then overwrite otherwise only append as it removes dups. I think, in this scenario, just change it to write.mode('overwrite') because you're already reading the old data and your job would be done. On Sat 2 Jun, 2018, 10:27 PM Benjamin Kim, wrote:

[Spark SQL] Efficiently calculating Weight of Evidence in PySpark

2018-06-01 Thread Aakash Basu
Hi guys, Can anyone please let me know if you've any clue on this problem I posted in StackOverflow - https://stackoverflow.com/questions/50638911/how-to-efficiently-calculate-woe-in-pyspark Thanks, Aakash.

Fwd: [Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
: col_namer.append(column+'_fold_'+str(fold)) df = df.withColumn(column+'_fold_'+str(folds)+'_mean', (sum(df[col] for col in col_namer)/(k_folds-1))) print(col_namer) df.show(1) -- Forwarded message -- From: Aakash Basu Date: Thu, May 31, 2018 at 3:40 PM Subject: [Help] PySpark Dynamic
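
A runnable version of what the flattened snippet above appears to be doing (the column prefixes, fold count and dataframe df are taken from the question below; treat this as a sketch):

    k_folds = 3
    for column in ["a", "b"]:
        for fold in range(k_folds):
            # mean of this column across every fold except the current one
            others = ["{}_fold_{}".format(column, i) for i in range(k_folds) if i != fold]
            df = df.withColumn(
                "{}_fold_{}_mean".format(column, fold),
                sum(df[c] for c in others) / (k_folds - 1),
            )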

[Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
Hi, Using - Python 3.6 Spark 2.3 Original DF - key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2 b_fold_2 1 1 2 3 4 5 6 2 7 5 3 5 2 1 I want to calculate means from the below dataframe as follows (like this for all columns and all folds) - key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2

[Suggestions needed] Weight of Evidence PySpark

2018-05-31 Thread Aakash Basu
Hi guys, I'm trying to calculate WoE on a particular categorical column depending on the target column. But the code is taking a lot of time on very few datapoints (rows). How can I optimize it to make it performant enough? Here's the code (here categorical_col is a python list of columns) -

Spark AsyncEventQueue doubt

2018-05-27 Thread Aakash Basu
Hi, I'm getting the below ERROR and WARN when running a little heavy calculation on a dataset - To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 2018-05-27 12:51:11 ERROR AsyncEventQueue:70 - Dropping event from queue > appStatus. This likely means
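
The dropped event comes from the listener bus queue filling up; one commonly mentioned mitigation is raising its capacity, sketched below (the value is arbitrary and the property name assumes Spark 2.3):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000") \
        .getOrCreate()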

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
comprehend that even though the name of column is same but they come from two different tables, isn't? Well, I'll try out the solution provided above, and see if it works for me. Thanks! On Sat, May 26, 2018 at 9:45 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > You're right. >

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
> Please check if the right attribute(s) are used.; > > > > On Sat, May 26, 2018 at 6:16 PM, Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hi, >> >> This query is based on one step further from the query in this link >> <https://stackoverfl

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Well, it did, meaning, internally a TempTable and a TempView are the same. Thanks buddy! On Sat, May 26, 2018 at 9:23 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > Question is, while registering, using registerTempTable() and while > dropping, using a dropTempView(), would i

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
wrote: > I think it's dropTempView > > On Sat, May 26, 2018, 8:56 PM Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hi all, >> >> I'm trying to use dropTempTable() after the respective Temporary Table's >> use is over (to free up the memory for ne

Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Hi all, I'm trying to use dropTempTable() after the respective Temporary Table's use is over (to free up the memory for next calculations). Newer Spark Session doesn't need sqlContext, so, it is confusing me on how to use the function. 1) Tried, same DF which I used to register a temp table to
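
As the replies above conclude, a temp table registered via createOrReplaceTempView (or the older registerTempTable) is dropped with spark.catalog.dropTempView(); a minimal sketch with a placeholder table name:

    df.createOrReplaceTempView("scratch_table")
    result = spark.sql("SELECT count(*) AS n FROM scratch_table")
    spark.catalog.dropTempView("scratch_table")  # frees the registration once the table is no longer needed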

Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
Hi, This query goes one step further than the query in this link . In this scenario, I add 1 or 2 more columns to be processed, and Spark throws an ERROR by printing the physical plan of queries. It

Spark 2.3 Memory Leak on Executor

2018-05-26 Thread Aakash Basu
Hi, I am getting memory leak warning which ideally was a Spark bug back till 1.6 version and was resolved. Mode: Standalone IDE: PyCharm Spark version: 2.3 Python version: 3.6 Below is the stack trace - 2018-05-25 15:00:05 WARN Executor:66 - Managed memory leak detected; size = 262144 bytes,

[Query] Weight of evidence on Spark

2018-05-25 Thread Aakash Basu
Hi guys, What's the best way to create feature column with Weight of Evidence calculated for categorical columns on target column (both Binary and Multi-Class)? Any insight? Thanks, Aakash.

Fwd: XGBoost on PySpark

2018-05-23 Thread Aakash Basu
Guys any insight on the below? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Sat, May 19, 2018 at 12:21 PM Subject: XGBoost on PySpark To: user <user@spark.apache.org> Hi guys, I need help in implementing XG-Boost in PySp

XGBoost on PySpark

2018-05-19 Thread Aakash Basu
Hi guys, I need help in implementing XG-Boost in PySpark. As the conversation in a popular thread regarding XGB goes, it is available in Scala and Java versions but not Python. But we've to implement a pythonic distributed solution (on Spark), maybe using DMLC or similar, to go ahead with

[How To] Using Spark Session in internal called classes

2018-04-23 Thread Aakash Basu
Hi, I have created my own Model Tuner class which I want to use to tune models and return a Model object if the user expects one. This Model Tuner is in a file which I would ideally import into another file and call the class and use it. Outer file (from where I'd be calling the Model Tuner): I am
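
One common way to handle this is to call SparkSession.builder.getOrCreate() inside the helper class, which returns the session already created by the outer script instead of building a new one; a hedged sketch (class and method names are made up):

    from pyspark.sql import SparkSession

    class ModelTuner(object):
        def __init__(self):
            # picks up the SparkSession created by the calling script
            self.spark = SparkSession.builder.getOrCreate()

        def tune(self, df):
            df.createOrReplaceTempView("training_data")
            return self.spark.sql("SELECT * FROM training_data")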

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
solve the problem. > > -kr, Gerard. > > > > On Mon, Apr 16, 2018 at 10:52 AM, Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hey Jayesh and Others, >> >> Is there then, any other way to come to a solution for this use-case? >> >> Than

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
tured streaming, you can only window by timestamp columns. You cannot > do windows aggregations on integers. > > > > *From: *Aakash Basu <aakash.spark@gmail.com> > *Date: *Monday, April 16, 2018 at 4:52 AM > *To: *"Lalwani, Jayesh" <jayesh.lalw

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
n aggregated streaming data frame. As per the documentation, joining > an aggregated streaming data frame with another streaming data frame is not > supported > > > > > > *From: *spark receiver <spark.recei...@gmail.com> > *Date: *Friday, April 13, 2018 at 11:49 PM &g

Is DLib available for Spark?

2018-04-10 Thread Aakash Basu
Hi team, Is DLib package available for use through Spark? Thanks, Aakash.

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-09 Thread Aakash Basu
Thanks in adv, Aakash. On Fri, Apr 6, 2018 at 9:55 PM, Felix Cheung <felixcheun...@hotmail.com> wrote: > Instead of write to console you need to write to memory for it to be > queryable > > > .format("memory") >.queryName("tableName") > https://spa
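
A sketch of the memory-sink approach suggested in that reply (the aggregated streaming dataframe agg_df and the query name are placeholders):

    query = agg_df.writeStream \
        .outputMode("complete") \
        .format("memory") \
        .queryName("running_avg") \
        .start()

    # the in-memory table can then be queried like any other table
    spark.sql("SELECT * FROM running_avg").show()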

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Aakash Basu
treams.awaitAnyTermination instead (waiting for either query1 or > query2 to terminate). Make sure you do that after the query2.start call. > > I hope this helps. > > Cheers, > Panagiotis > > On Fri, Apr 6, 2018 at 11:23 AM, Aakash Basu <aakash.spark@gmail.com> &

Fwd: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Thu, Apr 5, 2018 at 3:18 PM Subject: [Structured Streaming] More than 1 streaming in a code To: user <user@spark.apache.org> Hi

Fwd: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Mon, Apr 2, 2018 at 1:01 PM Subject: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query To: user

Fwd: [Structured Streaming] How to save entire column aggregation to a file

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Thu, Apr 5, 2018 at 2:28 PM Subject: [Structured Streaming] How to save entire column aggregation to a file To: user <user@spark.a

Fwd: Spark Structured Streaming Inner Queries fails

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Thu, Apr 5, 2018 at 2:50 PM Subject: Spark Structured Streaming Inner Queries fails To: user <user@spark.apache.org> Hi, W

[Structured Streaming] More than 1 streaming in a code

2018-04-05 Thread Aakash Basu
Hi, If I have more than one writeStream in a code, which operates on the same readStream data, why does it produce only the first writeStream? I want the second one to be also printed on the console. How to do that? from pyspark.sql import SparkSession from pyspark.sql.functions import split,
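
Based on the reply above in this thread, the fix is to start both queries and then block on the shared StreamingQueryManager rather than on the first query alone; a sketch with hypothetical dataframes df1 and df2:

    query1 = df1.writeStream.outputMode("append").format("console").start()
    query2 = df2.writeStream.outputMode("append").format("console").start()

    # waits for either query to stop, instead of query1.awaitTermination()
    spark.streams.awaitAnyTermination()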

Spark Structured Streaming Inner Queries fails

2018-04-05 Thread Aakash Basu
Hi, Why are inner queries not allowed in Spark Streaming? Spark assumes the inner query to be a separate stream altogether and expects it to be triggered with a separate writeStream.start(). Why so? Error: pyspark.sql.utils.StreamingQueryException: 'Queries with streaming sources must be

[Structured Streaming] How to save entire column aggregation to a file

2018-04-05 Thread Aakash Basu
Hi, I want to save an aggregate to a file without using any window, watermark or groupBy. So, my aggregation is at entire column level. df = spark.sql("select avg(col1) as aver from ds") Now, the challenge is as follows - 1) If I use outputMode = Append, but "*Append output mode not supported

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
llect from the above code and use it, I get the following error - *pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nkafka'* Any alternative (better) solution to get this job done, would suffice too. Any help shall be greatly acknowledg

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Any help, guys? On Mon, Apr 2, 2018 at 1:01 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > Hi, > > This is a very interesting requirement, where I am getting stuck at a few > places. > > *Requirement* - > > Col1Col2 > 1 10 > 2

[Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Hi, This is a very interesting requirement, where I am getting stuck at a few places. *Requirement* - Col1Col2 1 10 2 11 3 12 4 13 5 14 *I have to calculate avg of col1 and then divide each row of col2 by that avg. And,

[Query] Columnar transformation without Structured Streaming

2018-03-29 Thread Aakash Basu
Hi, I started my Spark Streaming journey from Structured Streaming using Spark 2.3, where I can easily do Spark SQL transformations on streaming data. But I want to know: how can I do columnar transformations (like running aggregations or casting, et al.) using the prior utility of DStreams? Is

Structured Streaming Spark 2.3 Query

2018-03-22 Thread Aakash Basu
Hi, What is the way to stop a Spark Streaming job if there is no data inflow for an arbitrary amount of time (eg: 2 mins)? Thanks, Aakash.
