Best Practices for Spark Join

2016-06-01 Thread Aakash Basu
Hi, Can you please list, in order of importance, one by one, the best practices (necessary or recommended to follow) for doing a Spark join? Thanks, Aakash.

Python to Scala

2016-06-17 Thread Aakash Basu
Hi all, I have some Python code which I want to convert to Scala to use in a Spark program. I'm not well acquainted with Python and am learning Scala now. Any Python + Scala expert here? Can someone help me out with this, please? Thanks & Regards, Aakash.

Re: Python to Scala

2016-06-17 Thread Aakash Basu
xcuse brevity. > On Jun 18, 2016 2:34 PM, "Aakash Basu" <raj2coo...@gmail.com> wrote: > >> Hi all, >> >> I've a python code, which I want to convert to Scala for using it in a >> Spark program. I'm not so well acquainted with python and learning sc

Re: Python to Scala

2016-06-17 Thread Aakash Basu
rk - or find someone else who feels > comfortable to do it. That kind of inquiry would likelybe appropriate on a > job board. > > > > 2016-06-17 21:47 GMT-07:00 Aakash Basu <raj2coo...@gmail.com>: > >> Hey, >> >> Our complete project is in Spark on Scala

Re: spark job automatically killed without rhyme or reason

2016-06-23 Thread Aakash Basu
Hey, I've come across this. There's a command called "yarn application -kill <application-id>", which kills the application with a one-liner 'Killed'. If it were a memory issue, the error would show up in the form of 'GC Overhead' or something of the sort while building up the tree. So, I think someone killed your job that way

Spark JOIN Not working

2016-05-24 Thread Aakash Basu
Hi experts, I'm extremely new to the Spark ecosystem, hence need some help from you guys. While trying to fetch data from CSV files and join-querying them according to the need, when I'm caching the data using registerTempTable and then using a select query to select what I need as per the

Pros and Cons

2016-05-25 Thread Aakash Basu
Hi, I’m new to the Spark Ecosystem, need to understand the *Pros and Cons* of fetching data using *SparkSQL vs Hive in Spark vs Spark API.* *PLEASE HELP!* Thanks, Aakash Basu.

Unsubscribe

2016-08-09 Thread Aakash Basu

Re: Little idea needed

2016-07-20 Thread Aakash Basu

Re: Little idea needed

2016-07-20 Thread Aakash Basu
ing one file out of all the deltas. On 19 Jul 2016, at 21:27, Aakash Basu <raj2coo...@gmail.com> wrote: Hi all, I'm trying to pull a full table from oracle, which is huge with some 10 million records which will be the initial load to HDFS. Then I will do delta loads everyday in the same fo

Little idea needed

2016-07-19 Thread Aakash Basu
Hi all, I'm trying to pull a full table from oracle, which is huge with some 10 million records which will be the initial load to HDFS. Then I will do delta loads everyday in the same folder in HDFS. Now, my query here is, DAY 0 - I did the initial load (full dump). DAY 1 - I'll load only

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Aakash Basu
> +------+------+----+----+
> |  col1|  col2|col3|col4|
> +------+------+----+----+
> | uihgf| Paris|  56|   5|
> |asfsds|   ***|  43|   1|
> |fkwsdf|London|  45|   6|
> |  gddg|  ABCD|  32|   2|
> | grgzg|  *CSD|  35|   3|
> | gsrsn|  ADR*|  22|   4|
> +------+------+----+----+ >

Re: Get S3 Parquet File

2017-02-23 Thread Aakash Basu
Hey, Please recheck the access key and secret key being used to fetch the parquet file. It seems to be a credential error - either a mismatch or a loading issue. If it's a loading issue, first use the keys directly in code and see if the issue resolves; then they can be hidden and read from input params. Thanks, Aakash. On
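A minimal PySpark sketch of that credential check, hardcoding placeholder keys only for the test (bucket, path and key values below are assumptions, not from the thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-credential-check").getOrCreate()

    # Hardcode the keys only while testing; move them back to input params once verified.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder

    df = spark.read.parquet("s3a://your-bucket/path/to/file.parquet")  # placeholder path
    df.show(5)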

Fwd: Need some help

2016-09-01 Thread Aakash Basu
-- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Thu, Aug 25, 2016 at 10:06 PM Subject: Need some help To: user@spark.apache.org Hi all, Aakash here, need a little help in KMeans clustering. This is needed to be done: "Implement Kmeans

Re: Fwd: Need some help

2016-09-01 Thread Aakash Basu
hon + scikit-learn. Or R. If you want to do it with a UI based software, > try Weka or Orange. > > Regards, > > Sivakumaran S > > On 1 Sep 2016 8:42 p.m., Aakash Basu <aakash.spark@gmail.com> wrote: > > > ------ Forwarded message -- > From: *Aa

Re: Fwd: Need some help

2016-09-02 Thread Aakash Basu
g what the KMeans > clustering algorithm is and then looking into how you can use the DataFrame > API to implement the KMeansClustering. > > Thanks, > Shashank > > On Thu, Sep 1, 2016 at 1:05 PM, Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hey Siv

Join Query

2016-11-17 Thread Aakash Basu
Hi, Conceptually I can understand the Spark joins below; when it comes to implementation I don't find much information on Google. Please help me with code/pseudo-code for the joins below using Java Spark or Scala Spark. *Replication Join:* Given two datasets, where one is small
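The question asks for Java/Scala; as one hedged illustration, a replication (map-side) join corresponds to Spark's broadcast join, sketched here in PySpark with placeholder dataframes and key column:

    from pyspark.sql import functions as F

    # small_df is assumed to fit comfortably in each executor's memory
    joined = large_df.join(F.broadcast(small_df), on="key", how="inner")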

HDPCD SPARK Certification Queries

2016-11-17 Thread Aakash Basu
Hi all, I want to know more about this examination - http://hortonworks.com/training/certification/exam-objectives/#hdpcdspark If anyone's there who appeared for the examination, can you kindly help? 1) What are the kind of questions that come, 2) Samples, 3) All the other details. Thanks,

Hortonworks Spark Certification Query

2016-12-14 Thread Aakash Basu
Hi all, Is there anyone here who wrote the HDPCD examination as in the link below? http://hortonworks.com/training/certification/exam-objectives/#hdpcdspark I'm going to sit for this with very little time to prepare; can I please be helped with the questions to expect and their probable solutions?

Re: community feedback on RedShift with Spark

2017-04-24 Thread Aakash Basu
Hey Afshin, Your option 1 is immeasurably faster than the latter. It speeds up even further if you know how to properly use distKey and sortKey on the tables being loaded. Thanks, Aakash. https://www.linkedin.com/in/aakash-basu-5278b363 On 24-Apr-2017 10:37 PM, "Afshin, Bardia" &

Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hi all, I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to fetch data from an excel file using *spark.read.format("com.crealytics.spark.excel")*, but it is inferring double for a date type column. The detailed description is given here (the question I posted) -

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hey all, Forgot to attach the link to the overriding Schema through external package's discussion. https://github.com/crealytics/spark-excel/pull/13 You can see my comment there too. Thanks, Aakash. On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark@gmail.com> wrote: &g

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
is a difference between the actual value in the cell and what Excel formats that cell. You probably want to import that field as a string or not have it as a date format in Excel. Just a thought Thank You, Irving Duran On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu <aakash.spark@gmail.com

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
wrote: > You can use Apache POI DateUtil to convert double to Date ( > https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html). > Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/ > hadoopoffice/wiki), it supports Spark 1.x or Spark 2.0 ds. > > O

Any solution for this?

2017-05-15 Thread Aakash Basu
Hi all, Any solution for this issue - http://stackoverflow.com/q/43921392/7998705 Thanks, Aakash.

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Aakash Basu
Hey, I work on Spark SQL and would pretty much be able to help you in this. Let me know your requirement. Thanks, Aakash. On 12-Jun-2017 11:00 AM, "bo yang" wrote: > Hi Guys, > > I am writing a small open source project > to use

Re: Spark-SQL collect function

2017-05-19 Thread Aakash Basu
Well described, thanks! On 04-May-2017 4:07 AM, "JayeshLalwani" wrote: > In any distributed application, you scale up by splitting execution up on > multiple machines. The way Spark does this is by slicing the data into > partitions and spreading them on multiple

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread Aakash Basu
Hey all, A reply on this would be great! Thanks, A.B. On 17-May-2017 1:43 AM, "Daniel Siegmann" wrote: > When using spark.read on a large number of small files, these are > automatically coalesced into fewer partitions. The only documentation I can > find on

Repartition vs PartitionBy Help/Understanding needed

2017-06-15 Thread Aakash Basu
Hi all, Everybody explains the difference between coalesce and repartition, but nowhere have I found the difference between partitionBy and repartition. My question is: is it better to write a dataset to Parquet partitioned by a column and then read the respective directories to work on that column
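A rough sketch of the two approaches being compared, with placeholder paths and a placeholder partition column:

    df = spark.read.parquet("/data/input")

    # Option 1: write partitioned by the column, then read back and filter;
    # Spark prunes the scan to only the matching directories.
    df.write.partitionBy("country").parquet("/data/by_country")
    fr_df = spark.read.parquet("/data/by_country").where("country = 'FR'")

    # Option 2: keep everything in memory and repartition on the same column before the work.
    repart_df = df.repartition("country")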

Fwd: Repartition vs PartitionBy Help/Understanding needed

2017-06-16 Thread Aakash Basu
Hi all, Can somebody put some light on this pls? Thanks, Aakash. -- Forwarded message -- From: "Aakash Basu" <aakash.spark@gmail.com> Date: 15-Jun-2017 2:57 PM Subject: Repartition vs PartitionBy Help/Understanding needed To: "user" <user@

Help needed in Dividing open close dates column into multiple columns in dataframe

2017-09-19 Thread Aakash Basu
Hi, I've a csv dataset which has a column with all the details of store open and close timings as per dates, but the data is highly variant, as follows - Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm Mon-Sat 10am-8pm, Sun Closed Mon-Sat 10am-8pm, Sun 10am-6pm Mon-Friday 9-8 / Saturday 10-7 /

Efficient Spark-Submit planning

2017-09-11 Thread Aakash Basu
Hi, Can someone please clarify how we should effectively calculate the parameters to be passed with spark-submit - parameters such as cores, num-executors, driver memory, etc.? Is there any generic calculation which can be done over most kinds of clusters with different sizes, from

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
f = lineitem_df.withColumn('shipdate', f.to_date(lineitem_df.shipdate)) > > —— > > You should have first ingested the column as a string; and then leveraged > the DF api to make the conversion to dateType. > > That should work. > > Kind Regar

Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Aakash Basu
Hi, I have a dataset where a few rows of the column F, as shown below, have line breaks in the CSV file. [image: Inline image 1] When Spark reads it, it comes out as below, as a completely new line. [image: Inline image 2] I want my PySpark 2.1.0 to read it by forcefully avoiding the line
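For reference, a hedged sketch of the usual fix: the multiLine CSV option keeps quoted records with embedded newlines together, but it was only added in Spark 2.2, so on 2.1.0 the file would need pre-processing or an upgrade (path is a placeholder):

    df = (spark.read
          .option("header", "true")
          .option("multiLine", "true")   # added in Spark 2.2; not available in 2.1.0
          .option("escape", '"')
          .csv("/data/input.csv"))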

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, Any help? PFB. Thanks, Aakash. On 20-Nov-2017 6:58 PM, "Aakash Basu" <aakash.spark@gmail.com> wrote: > Hi all, > > I have a table which will have 4 columns - > > | Expression|filter_condition| from_clause| > group_by_columns|

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Aakash Basu
pache.org/docs/latest/sql- > programming-guide.html#hive-tables > > Cheers > > On 21 November 2017 at 03:27, Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hi all, >> >> Any help? PFB. >> >> Thanks, >> Aakash. >> >> On 20-Nov-2

Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, I have a table which will have 4 columns - | Expression | filter_condition | from_clause | group_by_columns | This file may have a variable number of rows depending on the number of KPIs I need to calculate. I need to write a SparkSQL program which will have to read this
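A rough sketch of one way to drive SparkSQL from such a 4-column metadata table (the file path and the exact SQL template are assumptions, not from the thread):

    meta = spark.read.option("header", "true").csv("/data/kpi_definitions.csv")

    for row in meta.collect():
        query = "SELECT {expr} FROM {frm} WHERE {flt} GROUP BY {grp}".format(
            expr=row["Expression"],
            frm=row["from_clause"],
            flt=row["filter_condition"],
            grp=row["group_by_columns"])
        spark.sql(query).show()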

Fwd: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hey all, Any help in the below please? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Tue, Oct 31, 2017 at 9:17 PM Subject: Regarding column partitioning IDs and names as per hierarchical level SparkSQL To: user

Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hi all, I have to generate a table with Spark-SQL with the following columns - Level One Id: VARCHAR(20) NULL Level One Name: VARCHAR(50) NOT NULL Level Two Id: VARCHAR(20) NULL Level Two Name: VARCHAR(50) NULL Level Three Id: VARCHAR(20) NULL Level Three Name: VARCHAR(50) NULL Level Four

RE: Split column with dynamic data

2017-10-30 Thread Aakash Basu
ull-stop, and catering for the > possibility of a capital letter). > > > > This is untested, but it should do the trick based on your examples so far: > > > > df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+[A-Z]?\.”, > “”)) > > > > > > *From:

Split column with dynamic data

2017-10-30 Thread Aakash Basu
Hi all, I have a requirement to split a column and fetch only the description, where some rows have numbers prepended before it whereas other rows have only the description - e.g. (Description is the column header) *Description* Inventory Tree Products 1. AT Services 2. Accessories 4.
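A PySpark sketch of the regexp_replace approach suggested in the reply above; the pattern assumes an optional capital letter after the leading number and a full stop before the description:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "description_only",
        F.regexp_replace("Description", r"^\d+[A-Z]?\.\s*", ""))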

XGBoost on PySpark

2018-05-19 Thread Aakash Basu
Hi guys, I need help in implementing XGBoost in PySpark. As per the conversation in a popular thread regarding XGB, it is available in Scala and Java versions but not Python. But, we have to implement a Pythonic distributed solution (on Spark), maybe using DMLC or similar, to go ahead with

Fwd: XGBoost on PySpark

2018-05-23 Thread Aakash Basu
Guys any insight on the below? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Sat, May 19, 2018 at 12:21 PM Subject: XGBoost on PySpark To: user <user@spark.apache.org> Hi guys, I need help in implementing XG-Boost in PySp

[Query] Weight of evidence on Spark

2018-05-25 Thread Aakash Basu
Hi guys, What's the best way to create feature column with Weight of Evidence calculated for categorical columns on target column (both Binary and Multi-Class)? Any insight? Thanks, Aakash.

Spark 2.3 Memory Leak on Executor

2018-05-26 Thread Aakash Basu
Hi, I am getting a memory leak warning which was ideally a Spark bug back until version 1.6 and was resolved. Mode: Standalone IDE: PyCharm Spark version: 2.3 Python version: 3.6 Below is the stack trace - 2018-05-25 15:00:05 WARN Executor:66 - Managed memory leak detected; size = 262144 bytes,

Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
Hi, This query goes one step further than the query in this link. In this scenario, when I add 1 or 2 more columns to be processed, Spark throws an ERROR and prints the physical plan of the queries. It

[Spark Streaming] Distinct Count on unrelated columns

2018-06-06 Thread Aakash Basu
Hi guys, Posted a question (link) on StackOverflow, any help? Thanks, Aakash.

[Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi, I have submitted a job on a *4 node cluster*, where I see most of the operations happening on one of the worker nodes while the other two are simply idling. The picture below sheds light on that - How do I properly distribute the load? My cluster conf (4 node cluster [1 driver; 3 slaves]) -

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
oon as > it gets bigger you will see usage of more nodes. > > Hence increase your testing Dataset . > > On 11. Jun 2018, at 12:22, Aakash Basu wrote: > > Jorn - The code is a series of feature engineering and model tuning > operations. Too big to show. Yes, data volume is to

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
; Regards, > Srinath. > > > On Tue, Jun 12, 2018 at 1:39 PM Aakash Basu > wrote: > >> Yes, but when I did increase my executor memory, the spark job is going >> to halt after running a few steps, even though, the executor isn't dying. >> >> Data -

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
: > Aakash, > > Like Jorn suggested, did you increase your test data set? If so, did you > also update your executor-memory setting? It seems like you might exceeding > the executor memory threshold. > > Thanks > Vamshi Talla > > Sent from my iPhone > > On Jun 11, 20

Spark YARN Error - triggering spark-shell

2018-06-08 Thread Aakash Basu
Hi, Getting this error when trying to run Spark Shell using YARN - Command: *spark-shell --master yarn --deploy-mode client* 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 2018-06-08 13:39:25

Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Aakash Basu
ark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. > > > Check once on yarn logs > > Thanks, > Sathish- > > > On Fri, Jun 8, 2018 at 2:22 PM, Jeff Zhang wrote: > >> >> Check the yarn AM l

Re: Spark YARN job submission error (code 13)

2018-06-08 Thread Aakash Basu
y, no clue, anyone faced this problem, any help on this? Thanks, Aakash. On Fri, Jun 8, 2018 at 2:17 PM, Saisai Shao wrote: > In Spark on YARN, error code 13 means SparkContext doesn't initialize in > time. You can check the yarn application log to get more information. > > BTW, did you

Spark YARN job submission error (code 13)

2018-06-08 Thread Aakash Basu
Hi, I'm trying to run a program on a cluster using YARN. YARN is present there along with HADOOP. Problem I'm running into is as below - Container exited with a non-zero exit code 13 > Failing this attempt. Failing the application. > ApplicationMaster host: N/A > ApplicationMaster

Re: Spark YARN Error - triggering spark-shell

2018-06-08 Thread Aakash Basu
Fixed by adding 2 configurations in yarn-site.xml. Thanks all! On Fri, Jun 8, 2018 at 2:44 PM, Aakash Basu wrote: > Hi, > > I fixed that problem by putting all the Spark JARS in spark-archive.zip > and putting it in the HDFS (as that problem was happening for that reason) -

Re: Spark YARN job submission error (code 13)

2018-06-08 Thread Aakash Basu
Fixed by adding 2 configurations in yarn-site.xml. Thanks all! On Fri, Jun 8, 2018 at 2:44 PM, Aakash Basu wrote: > Hi, > > I fixed that problem by putting all the Spark JARS in spark-archive.zip > and putting it in the HDFS (as that problem was happening for that reason) - > &g

Using G1GC in Spark

2018-06-14 Thread Aakash Basu
Hi, I am trying to spark submit with G1GC for garbage collection, but it isn't working. What is the way to deploy a spark job with G1GC? Tried - *spark-submit --master spark://192.168.60.20:7077 --conf -XX:+UseG1GC /appdata/bblite-codebase/test.py* Didn't work.

StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
Hi, When doing bulk interaction on around 60 columns, I want 3 columns to be created out of each combination of them; since the combinations are taken 2 at a time, it becomes 60N2 * 3, which creates a lot of columns. So, for fewer than 50-60 columns, even though it takes time, it still works fine, but, for

Fwd: StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
*Correction, 60C2 * 3* -- Forwarded message -- From: Aakash Basu Date: Mon, Jun 18, 2018 at 4:15 PM Subject: StackOverFlow ERROR - Bulk interaction for many columns fail To: user Hi, When doing bulk interaction on around 60 columns, I want 3 columns to be created out of each

[Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread Aakash Basu
Hi guys, I'm getting an error when I'm feature engineering on 30+ columns to create about 200+ columns. It is not failing the job, but the ERROR shows. I want to know how can I avoid this. Spark - 2.3.1 Python - 3.6 Cluster Config - 1 Master - 32 GB RAM, 16 Cores 4 Slaves - 16 GB RAM, 8 Cores
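Two commonly suggested mitigations for this codegen limit, offered here only as a hedged sketch rather than a guaranteed fix: fall back from whole-stage code generation, and/or break the lineage periodically so each generated stage stays small.

    # Disable whole-stage codegen for the session.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    # ...and/or checkpoint every few dozen engineered columns to cut the lineage.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder dir
    df = df.checkpoint()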

Inferring from Event Timeline

2018-06-13 Thread Aakash Basu
Hi guys, What all can be inferred by closely watching an event time-line in Spark UI? I generally monitor the tasks taking more time and also how much in parallel they're spinning. What else? Eg Event Time-line from Spark UI: Thanks, Aakash.

Re: Using G1GC in Spark

2018-06-14 Thread Aakash Basu
.html#runtime-environment> > . > > --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" > > Regards, > Srinath. > > > On Thu, Jun 14, 2018 at 4:44 PM Aakash Basu > wrote: > >> Hi, >> >> I am trying to spark submit with G1GC for garbage c

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread Aakash Basu
Vaquar khan > > On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu > wrote: > >> Hi guys, >> >> I'm getting an error when I'm feature engineering on 30+ columns to >> create about 200+ columns. It is not failing the job, but the ERROR shows. >> I want to know ho

Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-14 Thread Aakash Basu
Hi, Downloaded the latest Spark version because of the fix for "ERROR AsyncEventQueue:70 - Dropping event from queue appStatus." After setting environment variables and running the same code in PyCharm, I'm getting this error, which I can't find a solution for. Exception in thread "main"

Crosstab/AproxQuantile Performance on Spark Cluster

2018-06-14 Thread Aakash Basu
Hi all, Does the Event Timeline represent a good shape? I mean, at one point, to calculate WoE columns on categorical variables, I am having to do a crosstab on each column, and on a cluster of 4 nodes it is taking time as I've 230+ columns and 60,000 rows. How do I make it more performant?

Understanding Event Timeline of Spark UI

2018-06-15 Thread Aakash Basu
Hi, I've a job running which shows the Event Timeline as follows. I am trying to understand the gaps between these single lines; they seem to be parallel but not immediately sequential with other stages. Any other insight from this, and what is the cluster doing during these gaps? Thanks, Aakash.

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
wrote: > I think it's dropTempView > > On Sat, May 26, 2018, 8:56 PM Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hi all, >> >> I'm trying to use dropTempTable() after the respective Temporary Table's >> use is over (to free up the memory for ne

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Well, it did, meaning, internally a TempTable and a TempView are the same. Thanks buddy! On Sat, May 26, 2018 at 9:23 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > Question is, while registering, using registerTempTable() and while > dropping, using a dropTempView(), would i

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
> Please check if the right attribute(s) are used.; > > > > On Sat, May 26, 2018 at 6:16 PM, Aakash Basu <aakash.spark@gmail.com> > wrote: > >> Hi, >> >> This query is based on one step further from the query in this link >> <https://stackoverfl

Re: Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
comprehend that even though the name of column is same but they come from two different tables, isn't? Well, I'll try out the solution provided above, and see if it works for me. Thanks! On Sat, May 26, 2018 at 9:45 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > You're right. >

Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Hi all, I'm trying to use dropTempTable() after the respective temporary table's use is over (to free up the memory for the next calculations). The newer SparkSession doesn't need sqlContext, so it is confusing me as to how to use the function. 1) Tried the same DF which I used to register a temp table to
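The replies in this thread converge on the newer view API; a minimal sketch of that pairing (view name is a placeholder):

    df.createOrReplaceTempView("my_temp")
    spark.sql("SELECT COUNT(*) FROM my_temp").show()
    spark.catalog.dropTempView("my_temp")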

Fwd: [Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
: col_namer.append(column+'_fold_'+str(fold)) df = df.withColumn(column+'_fold_'+str(folds)+'_mean', (sum(df[col] for col in col_namer)/(k_folds-1))) print(col_namer) df.show(1) -- Forwarded message -- From: Aakash Basu Date: Thu, May 31, 2018 at 3:40 PM Subject: [Help] PySpark Dynamic

[Help] PySpark Dynamic mean calculation

2018-05-31 Thread Aakash Basu
Hi, Using - Python 3.6, Spark 2.3. Original DF -

key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
1    1         2         3         4         5         6
2    7         5         3         5         2         1

I want to calculate means from this dataframe as follows (like this for all columns and all folds) - key a_fold_0 b_fold_0 a_fold_1 b_fold_1 a_fold_2
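A hedged sketch of computing the per-key mean across the fold columns dynamically (the thread's own snippet divides by k_folds - 1 for a leave-one-fold-out variant; this simpler version averages all folds):

    from pyspark.sql import functions as F

    k_folds = 3
    for base in ["a", "b"]:
        fold_cols = ["{}_fold_{}".format(base, i) for i in range(k_folds)]
        df = df.withColumn(
            base + "_mean",
            sum(F.col(c) for c in fold_cols) / float(len(fold_cols)))
    df.show()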

[Suggestions needed] Weight of Evidence PySpark

2018-05-31 Thread Aakash Basu
Hi guys, I'm trying to calculate WoE on a particular categorical column depending on the target column. But the code is taking a lot of time on very few datapoints (rows). How can I optimize it to make it performant enough? Here's the code (here categorical_col is a python list of columns) -
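A rough Weight-of-Evidence sketch for a single categorical column against a binary target (column names are placeholders; smoothing and zero-count handling are omitted):

    from pyspark.sql import functions as F

    agg = (df.groupBy("category_col")
             .agg(F.sum(F.when(F.col("target") == 1, 1).otherwise(0)).alias("events"),
                  F.sum(F.when(F.col("target") == 0, 1).otherwise(0)).alias("non_events")))

    totals = agg.agg(F.sum("events").alias("e"), F.sum("non_events").alias("ne")).first()

    woe = agg.withColumn(
        "woe",
        F.log((F.col("events") / totals["e"]) / (F.col("non_events") / totals["ne"])))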

Spark AsyncEventQueue doubt

2018-05-27 Thread Aakash Basu
Hi, I'm getting the below ERROR and WARN when running a little heavy calculation on a dataset - To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 2018-05-27 12:51:11 ERROR AsyncEventQueue:70 - Dropping event from queue > appStatus. This likely means

[Spark SQL] Efficiently calculating Weight of Evidence in PySpark

2018-06-01 Thread Aakash Basu
Hi guys, Can anyone please let me know if you've any clue on this problem I posted in StackOverflow - https://stackoverflow.com/questions/50638911/how-to-efficiently-calculate-woe-in-pyspark Thanks, Aakash.

Re: Append In-Place to S3

2018-06-02 Thread Aakash Basu
As Jay correctly suggested, if you're joining then overwrite, otherwise only append, as the join removes dups. I think, in this scenario, just change it to write.mode('overwrite'), because you're already reading the old data, and your job would be done. On Sat 2 Jun, 2018, 10:27 PM Benjamin Kim, wrote:
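In code, that suggestion boils down to something like this sketch (path is a placeholder):

    result.write.mode("overwrite").parquet("s3a://your-bucket/path")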

Fundamental Question on Spark's distribution

2018-06-07 Thread Aakash Basu
Hi all, *Query 1)* Need serious help! I'm running feature engineering of different types on a dataset and trying to benchmark by tweaking different Spark properties. I don't know where it is going wrong such that a single machine is working faster than a 3 node cluster, even though,

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Aakash Basu
oduce this problem? It would be very helpful that the > community will address this problem. > > Best regards, > Kazuaki Ishizaki > > > > From:vaquar khan > To:Eyal Zituny > Cc:Aakash Basu , user < > user@spark.apache.org> > Date:

G1GC vs ParallelGC

2018-06-20 Thread Aakash Basu
Hi guys, I just wanted to know, why my ParallelGC (*--conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"*) in a very long Spark ML Pipeline works faster than when I set G1GC (*--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"*), even though the Spark community suggests G1GC to be much

Way to avoid CollectAsMap in RandomForest

2018-06-20 Thread Aakash Basu
Hi, I'm running the RandomForest model from the Spark ML API on medium-sized data (2.25 million rows and 60 features); most of my time goes into the collectAsMap of RandomForest, but I've no option to avoid it as it is in the API. Is there a way to cut short my end-to-end runtime? Thanks, Aakash.

[G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Aakash Basu
Hi, I used the below in the spark-submit for using G1GC - --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" Now, I want to use *-XX: -ResizePLAB* of G1GC to avoid the performance degradation caused by a large number of thread communications. How do I do it? I tried submitting it in
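For reference, the reply in this thread notes the flag must be written with no space after the colon; a hedged equivalent via the SparkSession builder, combining both options in one string:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:-ResizePLAB")
             .getOrCreate())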

Re: Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
t; > On 3. Jul 2018, at 09:34, Aakash Basu wrote: > > Hi, > > Cluster - 5 node (1 Driver and 4 workers) > Driver Config: 16 cores, 32 GB RAM > Worker Config: 8 cores, 16 GB RAM > > I'm using the below parameters from which I know the first chunk is > cluster depe

Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
Hi, Cluster - 5 node (1 Driver and 4 workers) Driver Config: 16 cores, 32 GB RAM Worker Config: 8 cores, 16 GB RAM I'm using the below parameters from which I know the first chunk is cluster dependent and the second chunk is data/code dependent. --num-executors 4 --executor-cores 5

PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hi all, I have Spark 2.1 installed in my laptop where I used to run all my programs. PySpark wasn't used for around 1 month, and after starting it now, I'm getting this exception (I've tried the solutions I could find on Google, but to no avail). Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7,

Fwd: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hi, Any help please? What can be the issue? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Fri, Oct 20, 2017 at 1:00 PM Subject: PySpark 2.1 Not instantiating properly To: user <user@spark.apache.org> Hi all, I ha

Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
t;> I remember having similar issue...either you have to give write perm to >> your /tmp directory or there's a spark config you need to override >> This error is not 2.1 specific...let me get home and check my configs >> I think I amended my /tmp permissions via xterm inst

Re: [G1GC] -XX: -ResizePLAB How to provide in Spark Submit

2018-07-03 Thread Aakash Basu
hould be no space > after the colon symbol > On Tue, Jul 3, 2018 at 3:01 AM Aakash Basu > wrote: > > > > Hi, > > > > I used the below in the Spark Submit for using G1GC - > > > > --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" > >

Spark MLLib vs. SciKitLearn

2018-01-19 Thread Aakash Basu
Hi all, I am totally new to ML APIs. Trying to get the *ROC_Curve* for model evaluation on both *ScikitLearn* and *PySpark MLlib*. I cannot find any API for ROC_Curve calculation for BinaryClassification in Spark MLlib. The codes below have a wrapper function which is creating the respective
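As a hedged pointer: the area under the ROC curve is exposed directly, while the full curve is not returned by the PySpark evaluator, so it is commonly computed locally from collected scores. A minimal sketch (column names are the Spark ML defaults):

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(
        rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")
    auc = evaluator.evaluate(predictions)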

Re: Spark MLLib vs. SciKitLearn

2018-01-20 Thread Aakash Basu
Any help on the below? On 19-Jan-2018 7:12 PM, "Aakash Basu" <aakash.spark@gmail.com> wrote: > Hi all, > > I am totally new to ML APIs. Trying to get the *ROC_Curve* for Model > Evaluation on both *ScikitLearn* and *PySpark MLLib*. I do not find any >

Is there any Spark ML or MLLib API for GINI for Model Evaluation? Please help! [EOM]

2018-01-21 Thread Aakash Basu

[Help] Converting a Python Numpy code into Spark using RDD

2018-01-21 Thread Aakash Basu
Hi, How can I convert this Python NumPy code into Spark RDD operations so that they leverage the Spark distributed architecture for big data? The code is as follows - def gini(array): """Calculate the Gini coefficient of a numpy array.""" array = array.flatten() #all values are treated
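One hedged way to push the same Gini computation onto an RDD: sort the values, rank them, and combine the two sums the usual formula needs. values_rdd is an assumed RDD of non-negative numbers, not something from the thread:

    values = values_rdd.map(float).sortBy(lambda x: x)
    n = values.count()
    total = values.sum()

    # sum of value * 1-based rank over the sorted values
    weighted_sum = (values.zipWithIndex()
                          .map(lambda vi: vi[0] * (vi[1] + 1))
                          .sum())

    gini = (2.0 * weighted_sum) / (n * total) - (n + 1.0) / n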

[Doubt] GridSearch for Hyperparameter Tuning in Spark

2018-01-30 Thread Aakash Basu
Hi, Is there any available pyspark ML or MLLib API for Grid Search similar to GridSearchCV from model_selection of sklearn? I found this - https://spark.apache.org/docs/2.2.0/ml-tuning.html, but it has cross-validation and train-validation for hp-tuning and not pure grid search. Any help?
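Spark's built-in grid search is always paired with an evaluator via CrossValidator or TrainValidationSplit; the closest equivalent to GridSearchCV is sketched below (estimator and parameter values are placeholders):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
            .build())
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    model = cv.fit(train_df)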

Re: Passing an array of more than 22 elements in a UDF

2017-12-25 Thread Aakash Basu
..@gmail.com> > *Sent:* Friday, December 22, 2017 3:15:14 AM > *To:* Aakash Basu > *Cc:* user > *Subject:* Re: Passing an array of more than 22 elements in a UDF > > Hi I think you are in correct track. You can stuff all your param in a > suitable data structure like array or dic

Passing an array of more than 22 elements in a UDF

2017-12-22 Thread Aakash Basu
Hi, I am using Spark 2.2 with Java; can anyone please suggest how to take more than 22 parameters in a UDF? I mean, what if I want to pass all the parameters as an array of integers? Thanks, Aakash.
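The thread's suggestion is to pack everything into a single array column; the question is about the Java API, but the same idea is sketched here in PySpark for illustration (column names are placeholders):

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    sum_all = F.udf(lambda xs: sum(xs), IntegerType())

    cols = ["c{}".format(i) for i in range(25)]          # e.g. 25 integer columns
    df = df.withColumn("total", sum_all(F.array(*cols)))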

Re: Best way to process this dataset

2018-06-19 Thread Aakash Basu
Georg, just asking, can Pandas handle such a big dataset, if that data is further passed into any of the sklearn modules? On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler wrote: > use pandas or dask > > If you do want to use spark store the dataset as parquet / orc. And then > continue to

Re: Query on Profiling Spark Code

2018-07-31 Thread Aakash Basu
Okay, sure! On Tue, Jul 31, 2018 at 1:06 PM, Patil, Prashasth < prashasth.pa...@spglobal.com> wrote: > Hi Aakash, > > On a related note, you may want to try SparkLens for profiling which is > quite helpful in my opinion. > > > > > > -Prash > > &g

How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
Hi, Just curious to know, how can we run a Principal Component Analysis on streaming data in distributed mode? If we can, is it mathematically valid enough? Have anyone done that before? Can you guys share your experience over it? Is there any API Spark provides to do the same on Spark Streaming

Re: How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
FYI The relevant StackOverflow query on the same - https://stackoverflow.com/questions/51610482/how-to-do-pca-with-spark-streaming-dataframe On Tue, Jul 31, 2018 at 3:18 PM, Aakash Basu wrote: > Hi, > > Just curious to know, how can we run a Principal Component Analysis on > st
