Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
Try EXPLAIN CODEGEN on your DF and then parse the resulting string.
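
For example, a rough sketch of what parsing that string could look like (Spark
3.x assumed; the table name and the regex are illustrative, not something from
the original thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wscg-ids").getOrCreate()

// EXPLAIN CODEGEN returns the generated-code dump as ordinary rows, so it
// can be collected and parsed instead of being read off the UI.
val planText = spark.sql("EXPLAIN CODEGEN SELECT * FROM my_table")
  .collect()
  .map(_.getString(0))
  .mkString("\n")

// WholeStageCodegen stages appear as "*(<id>)" prefixes in the plan text.
val wholeStageCodegenIds = """\*\((\d+)\)""".r
  .findAllMatchIn(planText)
  .map(_.group(1).toInt)
  .toSeq
  .distinct

println(wholeStageCodegenIds)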

On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu,  wrote:

> Hi,
>
> The detailed stage page shows the involved WholeStageCodegen ids in its
> DAG visualization in the Spark UI when running a SparkSQL query (e.g., under
> the link
> node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0).
>
> However, I have trouble extracting the WholeStageCodegen ids from the DAG
> visualization via the REST APIs. Is there any other way to get the
> WholeStageCodegen ids for each stage automatically?
>
> Cheers,
> Chenghao
>


Re: Non string type partitions

2023-04-11 Thread Chitral Verma
Because the name of the partition directory cannot be an arbitrary object; it
has to be a string in order to create partitioned dirs like "date=2023-04-10".
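
As a possible workaround, here is a hedged sketch (assuming the error comes
from the Hive metastore rejecting partition-filter pushdown for non-string
partition keys; the table and column names are the ones from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-filter-workaround")
  .enableHiveSupport()
  .getOrCreate()

// Workaround 1: don't push partition filters down to the metastore at all.
// Spark then fetches all partition metadata and prunes client-side, which
// works for non-string partition keys but can be slow on tables with a very
// large number of partitions.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "false")
spark.sql("SELECT * FROM my_table WHERE partition_col = date '2023-04-11'").show()

// Workaround 2 (schema-level): declare the partition column as STRING in the
// table DDL and store the date in its string form ("2023-04-11"), so the
// metastore-side filter is a plain string comparison.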

On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh,  wrote:

>
> Hi Team,
>
> We are running into the below error when we are trying to run a simple
> query on a partitioned table in Spark.
>
> *MetaException(message: Filtering is supported only on partition keys of type string)*
>
>
> Our partition column has been set to type *date* instead of string, and the
> query is a very simple SQL as shown below.
>
> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>
> Any idea why Spark mandates partition columns to be of type string? Is
> there a recommended workaround for this issue?
>
>
>


Fwd: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Chitral Verma
Hi All,
I worked on this idea a few years back as a pet project to bridge *SparkSQL*
and *SparkML* and empower anyone to implement production-grade, distributed
machine learning on Apache Spark, as long as they have SQL skills.

In principle the idea works exactly like Google's BigQuery ML, but with a much
wider scope and no vendor lock-in, covering basically every source that's
supported by Spark, in the cloud or on-prem.

*Training* an ML model can look like,

FIT 'LogisticRegression' ESTIMATOR WITH PARAMS(maxIter = 3) TO (
SELECT * FROM mlDataset) AND OVERWRITE AT LOCATION '/path/to/lr-model';

*Prediction* with an ML model can look like,

PREDICT FOR (SELECT * FROM mlTestDataset) USING MODEL STORED AT
LOCATION '/path/to/lr-model'

*Feature Preprocessing* can look like,

TRANSFORM (SELECT * FROM dataset) using 'StopWordsRemover' TRANSFORMER WITH
PARAMS (inputCol='raw', outputCol='filtered') AND WRITE AT LOCATION
'/path/to/test-transformer'


But a lot more can be done with this library.
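
For reference, here is a rough sketch of what the statements above would map
to in plain spark.ml Scala code (the table names and model path are taken from
the examples; the feature/label columns are assumed to already exist in the
datasets, and this is not the library's actual internal implementation):

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sparksql-ml-sketch").getOrCreate()

// FIT 'LogisticRegression' ESTIMATOR WITH PARAMS(maxIter = 3) TO (...) AND
// OVERWRITE AT LOCATION '/path/to/lr-model'
// (assumes mlDataset already has "features" and "label" columns)
val lr = new LogisticRegression().setMaxIter(3)
val model = lr.fit(spark.table("mlDataset"))
model.write.overwrite().save("/path/to/lr-model")

// PREDICT FOR (SELECT * FROM mlTestDataset) USING MODEL STORED AT LOCATION ...
val predictions = LogisticRegressionModel
  .load("/path/to/lr-model")
  .transform(spark.table("mlTestDataset"))
predictions.show()

// TRANSFORM (...) USING 'StopWordsRemover' TRANSFORMER WITH PARAMS (...)
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
remover.transform(spark.table("dataset")).write.save("/path/to/test-transformer")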

I was wondering if any of you find this interesting and would like to
contribute to the project here,

https://github.com/chitralverma/sparksql-ml


Regards,
Chitral Verma


Re: Profiling data quality with Spark

2022-12-29 Thread Chitral Verma
Hi Rajat,
I have worked for years in democratizing data quality for some of the top
organizations and I'm also an Apache Griffin Contributor and PMC - so I
know a lot about this space. :)

Coming back to your original question, there are a lot of data quality
options available in the market today. Below I'm listing some of my top
recommendations with additional comments,

*Proprietary Solutions*

   - MonteCarlo
      - Pros: State-of-the-art DQ solution with multiple deployment models,
        lots of connectors, SOC-2 compliant, and handles the complete DQ
        lifecycle including monitoring and alerting.
      - Cons: Not open source; cannot be a "completely on-prem" solution.
   - Anomalo
      - Pros: One of the best UIs for DQ management and operations.
      - Cons: Same as Monte Carlo - not open source, cannot be a
        "completely on-prem" solution.
   - Collibra
      - Pros: Predominantly a data cataloging solution, Collibra now offers
        full data governance with its DQ offerings.
      - Cons: In my opinion, connectors can be a little pricey over time
        with usage. The same cons as Monte Carlo apply to Collibra as well.
   - IBM Solutions
      - Pros: Lots of offerings in the DQ space, comes with a UI, has
        profiling and other features built in. It's a solution for complete
        DQ management.
      - Cons: Proprietary solution, which can result in vendor lock-in.
        Customizations and extensions may be difficult.
   - Informatica Data Quality tool
      - Pros: Comes with a UI, has profiling and other features built in.
        It's a solution for complete DQ management.
      - Cons: Proprietary solution, which can result in vendor lock-in.
        Customizations and extensions may be difficult.

*Open Source Solutions*

   - Great Expectations
      - Pros: Built for technical users who want to code DQ as per their
        requirements; easy to extend via code, and lots of connectors and
        "expectations" (checks) are available out of the box. Fits nicely in
        a Python environment with or without PySpark. Can be made to fit in
        most stacks.
      - Cons: No UI, no alerting or monitoring. However, see the
        recommendation section below for more info on how to get around this.
      - Note: They are coming up with a Cloud offering as well in 2023.
   - Amazon Deequ (a minimal sketch follows after this list)
      - Pros: Actively maintained project that allows technical users to
        code checks using this project as a base library. Contains a
        profiler, anomaly detection, etc. Runs checks using Spark. PyDeequ
        is available for Python users.
      - Cons: Like Great Expectations, it's a library, not a whole
        end-to-end DQ platform.
   - Apache Griffin
      - Pros: Aims to be a complete open source DQ platform with support
        for lots of streaming and batch datasets. Runs checks using Spark.
      - Cons: Not actively maintained these days due to lack of
        contributors.
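
To make the Deequ option above concrete, here is a minimal sketch of coding
checks with it (assuming Deequ is on the classpath; the toy DataFrame and
column names are purely illustrative):

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("deequ-sketch").getOrCreate()
import spark.implicits._

// Toy data standing in for whatever dataset you want to profile and check.
val df = Seq((1, 10.0), (2, 25.5), (3, 7.2)).toDF("id", "amount")

val result = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "basic data quality checks")
      .isComplete("id")          // no nulls in id
      .isUnique("id")            // id is a unique key
      .isNonNegative("amount")   // no negative amounts
      .hasSize(_ > 0))           // dataset is not empty
  .run()

if (result.status != CheckStatus.Success) {
  println("Data quality checks failed")
}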

*Recommendation*

   - Make some choices like the ones below to narrow down the offerings:
      - Buy or build the solution?
      - Cloud dominant, mostly on-prem, or hybrid?
      - For technical users, non-technical users, or a mix of both?
      - Automated workflows or manual custom workflows?
   - For a "Buy + Cloud dominant + hybrid users + Automation" kind of choice,
     my recommendation would be to go with Monte Carlo or Anomalo. Otherwise,
     one of the open source offerings.
   - For Great Expectations, there is a guide available to push DQ results
     to the open source Datahub catalog. This combination vastly extends the
     reach of Great Expectations as a tool: you get a UI, and for the missing
     pieces you can connect with other solutions. This Great Expectations +
     Datahub combination delivers solid value and is basically equivalent to
     a lot of proprietary offerings like Collibra. However, this requires
     some engineering.

*Other Notable mentions*

   - https://www.bigeye.com/
   - https://www.soda.io/

Hope this long note clarifies things for you. :)

On Thu, 29 Dec 2022 at 10:03, infa elance  wrote:

> You can also look at Informatica Data Quality, which runs on Spark. Of
> course it’s not free, but you can sign up for a 30-day free trial. They have
> both profiling and prebuilt data quality rules and accelerators.
>
> Sent from my iPhone
>
> On Dec 28, 2022, at 10:02 PM, vaquar khan  wrote:
>
> 
> @Gourav Sengupta, why are you sending unnecessary emails? If you think
> Snowflake is good, please use it; the question here was different and you
> are talking about a totally different topic.
>
> Please respect the group guidelines.
>
>
> Regards,
> Vaquar khan
>
> On Wed, Dec 28, 2022, 

[Spark SQL]: DataFrame schema resulting in NullPointerException

2017-11-19 Thread Chitral Verma
Hey,

I'm working on this use case that involves converting DStreams to
Dataframes after some transformations. I've simplified my code into the
following snippet so as to reproduce the error. Also, I've mentioned below
my environment settings.

*Environment:*

Spark Version: 2.2.0
Java: 1.8
Execution mode: local/ IntelliJ


*Code:*

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object Tests {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ...
    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")

    df.rdd
      .map(x => x.toSeq)
      // df.schema is referenced inside the map closure -- this is the line
      // that triggers the NullPointerException
      .map(x => new GenericRowWithSchema(x.toArray, df.schema))
      .foreach(println)
  }
}


This results in NullPointerException as I'm directly using df.schema in
map().

What I don't understand is that if I use the following code (basically
storing the schema as a value before transforming), it works just fine.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object Tests {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ...
    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")

    // schema captured in a local val before the transformation
    val sc = df.schema

    df.rdd
      .map(x => x.toSeq)
      .map(x => new GenericRowWithSchema(x.toArray, sc))
      .foreach(println)
  }
}


I wonder why this is happening, as *df.rdd* is not an action and there is no
visible change in the state of the dataframe just yet. What are your thoughts
on this?

Regards,
Chitral Verma