Reading CSV and Transforming to Parquet Issue

2021-09-02 Thread R Nair
All, this is very surprising, and I am sure I must be doing something wrong. The issue is that the following code takes 8 hours. It reads a CSV file, takes the phone number column, extracts the first four digits, then partitions on those four digits (phoneseries) and writes to Parquet. Any
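Without seeing the code, the prefix-partitioning step itself can be sketched in plain Python (no Spark; the column name `phone` is an assumption, and the grouping mimics what `partitionBy` does at write time):

```python
from collections import defaultdict

def phone_series(phone: str) -> str:
    """First four digits of the phone number: the 'phoneseries' partition key."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[:4]

def partition_rows(rows):
    """Group rows by phoneseries, mimicking partitionBy at write time."""
    parts = defaultdict(list)
    for row in rows:
        parts[phone_series(row["phone"])].append(row)
    return dict(parts)
```

In Spark itself, a run this slow usually means many tasks each writing a few rows into every phoneseries directory; repartitioning on the partition column before `write.partitionBy(...)` so each task writes whole partitions is a common fix.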

Re: Query about Spark

2020-09-07 Thread R Nair
Please read this as well, thanks Disclaimer: it's my article. https://medium.com/@ravishankar.nair/online-and-batch-based-ml-execution-from-same-python-code-preserving-pre-and-post-transformation-ea7ebc27f50f?sk=c33bcf1d6c28b562b7bd36fa39809294 Best, Ravion On Mon, Sep 7, 2020, 8:29 AM Enrico

Re: Query about Spark

2020-09-06 Thread R Nair
Or use MLflow's PySpark UDF. First create an mlflow.pyfunc model. Best, Ravion On Sun, Sep 6, 2020, 9:43 AM ☼ R Nair wrote: > Question is not clear; use accumulators, if I understood it correctly. > > Best, Ravion > > On Sun, Sep 6, 2020, 9:41 AM Ankur Das wrote: > >> >> G
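The shape of such a wrapper can be sketched in plain Python without MLflow installed; the real version would subclass `mlflow.pyfunc.PythonModel` and be turned into a Spark UDF with `mlflow.pyfunc.spark_udf`, and the doubling "model" here is purely illustrative:

```python
class PyFuncLikeModel:
    """Illustrative stand-in for mlflow.pyfunc.PythonModel: the same
    predict(context, model_input) shape, with no MLflow dependency."""

    def load_context(self, context):
        # In real mlflow.pyfunc this is where artifacts are loaded
        # from context.artifacts; here we just set a toy parameter.
        self.scale = 2

    def predict(self, context, model_input):
        # model_input would normally be a pandas DataFrame;
        # a plain list keeps this sketch runnable anywhere.
        return [x * self.scale for x in model_input]

model = PyFuncLikeModel()
model.load_context(None)
print(model.predict(None, [1, 2, 3]))  # -> [2, 4, 6]
```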

Re: Query about Spark

2020-09-06 Thread R Nair
Question is not clear; use accumulators, if I understood it correctly. Best, Ravion On Sun, Sep 6, 2020, 9:41 AM Ankur Das wrote: > > Good Evening Sir/Madam, > Hope you are doing well. I am experimenting with some ML techniques where I > need to test them on a distributed environment. > For example a

Partitioning query

2019-09-13 Thread R Nair
Hi, We are running Spark JDBC code to pull data from Oracle, with some 200 partitions. Sometimes we see that some tasks fail or do not move forward. Is there any way we can see/find the queries responsible for each partition or task? How can we enable this? Thanks Best, Ravion
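Each JDBC partition runs the base query with its own WHERE range, so the queries can be reconstructed from the partitioning options. A sketch of the stride arithmetic (mirroring, not calling, Spark's numeric `partitionColumn` logic; the column name is hypothetical):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses Spark generates for a
    numeric partitionColumn; the first and last partitions are unbounded
    on one side so no rows are dropped."""
    stride = (upper - lower) // num_partitions
    preds = []
    current = lower + stride
    for i in range(num_partitions):
        if i == 0:
            preds.append(f"{column} < {current} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {current - stride}")
        else:
            preds.append(f"{column} >= {current - stride} AND {column} < {current}")
        current += stride
    return preds
```

Each predicate arrives at Oracle as a separate session query, so matching a stuck task to its range via the SQL tab of the Spark UI, or via v$sql on the Oracle side, is one way to find the responsible query.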

Re: JDK11 Support in Apache Spark

2019-08-24 Thread R Nair
Finally!!! Congrats On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun wrote: > Hi, All. > > Thanks to your many many contributions, > Apache Spark master branch starts to pass on JDK11 as of today. > (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6) > > >

Re: Testing Apache Spark applications

2018-11-15 Thread R Nair
Sparklens from Qubole is a good resource. Other tests are to be handled by the developer. Best, Ravi On Thu, Nov 15, 2018, 12:45 PM Hi all, > > > > How are you testing your Spark applications? > > We are writing features using Cucumber. This tests the behaviours. > Is this called functional

DB2 Sequence - Error while invoking

2018-11-07 Thread R Nair
Hi all, We are trying to call a DB2 sequence through Spark and assign that value to one of the columns (PK) in a table. We are getting the below issue: SEQ: CITI_VENDOR_UNITED_LIST_TARGET_SEQ Table: CITI_VENDOR_UNITED_LIST_TARGET DB: CITIVENDORS Host: CIT_XX Port: 42194 Schema: MINE DB2 SQL
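Without the full error it is hard to say, but one common pitfall is that a sequence cannot be referenced from the Spark side directly; it has to be fetched with DB2's own syntax (`NEXT VALUE FOR ... FROM SYSIBM.SYSDUMMY1`), e.g. passed through as a JDBC query. A sketch that only builds the SQL string, using the sequence and schema names from the message above:

```python
def next_value_query(sequence: str, schema: str = None) -> str:
    """Standard DB2 syntax for fetching the next value of a sequence;
    SYSIBM.SYSDUMMY1 is DB2's built-in one-row dummy table."""
    name = f"{schema}.{sequence}" if schema else sequence
    return f"SELECT NEXT VALUE FOR {name} AS SEQ_VAL FROM SYSIBM.SYSDUMMY1"
```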

Re: Spark In Memory Shuffle

2018-10-18 Thread R Nair
> directories (with date on directory name) on your ramdisk > > Sent using Zoho Mail <https://www.zoho.com/mail/> > > On Wed, 17 Oct 2018 18:57:14 +0330 *☼ R Nair > >* wrote > > What are the steps to configure this? Thanks > > On Wed, Oct 17, 2

Re: Spark In Memory Shuffle

2018-10-17 Thread R Nair
What are the steps to configure this? Thanks On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester wrote: > Hi, > I failed to configure Spark for in-memory shuffle, so currently I am just > using a Linux memory-mapped directory (tmpfs) as the working directory of Spark, > so everything is fast > > Sent using
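A sketch of the tmpfs approach the reply describes: mount a RAM-backed filesystem and point Spark's scratch space at it (the mount point and size here are examples, not values from the thread):

```shell
# Create a RAM-backed directory (size is an example; requires root)
sudo mkdir -p /mnt/spark-ramdisk
sudo mount -t tmpfs -o size=32g tmpfs /mnt/spark-ramdisk

# Then point Spark's shuffle/spill scratch space at it,
# e.g. in spark-defaults.conf:
#   spark.local.dir  /mnt/spark-ramdisk
```

Note the trade-off: shuffle files on tmpfs count against physical RAM and disappear on unmount or reboot.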

getBytes : save as pdf

2018-10-10 Thread R Nair
All, I am reading a zipped file into an RDD and getting rdd._1 as the name and rdd._2.getBytes() as the content. How can I save the latter as a PDF? In fact the zipped file is a set of PDFs. I tried saveAsObjectFile and saveAsTextFile, but cannot read the PDF back. Any clue please? Best,