how to get assertDataFrameEquals ignore nullable

2017-05-05 Thread A Shaikh
As part of TDD I am using com.holdenkarau.spark.testing.DatasetSuiteBase to
assert that two DataFrames' values are equal using


assertDataFrameEquals(dataframe1, dataframe2)

Although the values are the same, the assertion fails because the nullable
property does not match for some columns. Is there a way to get
assertDataFrameEquals to ignore the nullable property?

Can we also extend that to ignore datatypes as well and just match
the values?


Thanks,
Afzal
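
One workaround, sketched below on the assumption that only top-level fields matter, is to copy each DataFrame onto a schema whose fields are all forced to nullable = true before comparing; the helper name is made up and is not part of spark-testing-base.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Relax nullability on every top-level field so the flag no longer
// affects the comparison. Nested struct fields are not touched here.
def withRelaxedNullability(df: DataFrame, spark: SparkSession): DataFrame = {
  val relaxed = StructType(df.schema.map(_.copy(nullable = true)))
  spark.createDataFrame(df.rdd, relaxed)
}

// assertDataFrameEquals(withRelaxedNullability(dataframe1, spark),
//                       withRelaxedNullability(dataframe2, spark))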


DML in Spark ETL

2017-01-26 Thread A Shaikh
In the past we used an ETL tool whose tasks could update, insert and delete
rows in target database tables (Oracle/Netezza). Spark's Dataset (and RDD)
has .save* methods, which can insert rows.

How do we delete or update records in a database table from Spark?
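
Spark's writers append or overwrite; they do not issue UPDATE or DELETE statements. A common pattern is to open a plain JDBC connection per partition and run the DML yourself. The sketch below assumes the target's JDBC driver is on the classpath; the URL, credentials and the table/column names are placeholders.

import java.sql.DriverManager

import org.apache.spark.sql.Dataset

// Delete target rows by key, one JDBC connection and one batch per partition.
// jdbcUrl, user, password, target_table and id are placeholders.
def deleteByKey(keys: Dataset[Long], jdbcUrl: String, user: String, password: String): Unit = {
  keys.foreachPartition { (ids: Iterator[Long]) =>
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      val stmt = conn.prepareStatement("DELETE FROM target_table WHERE id = ?")
      ids.foreach { id =>
        stmt.setLong(1, id)
        stmt.addBatch()
      }
      stmt.executeBatch()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}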


Re: Saving from Dataset to Bigquery Table

2017-01-20 Thread A Shaikh
Thanks for responding, Jörn. Currently I upload the jar to Google Cloud and
run my job there, which is not ideal for development. Do you know if we can
run this from our local machine, given that all the required jars are
downloaded by SBT anyway?

On 20 January 2017 at 11:22, Jörn Franke <jornfra...@gmail.com> wrote:

> It is only available on PairRDD
>
> On 20 Jan 2017, at 11:54, A Shaikh <shaikh.af...@gmail.com> wrote:
>
> Has anyone experience saving Dataset to Bigquery Table?
>
> I am loading into BigQuery using the following example
> <https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example> successfully.
> This uses RDD.saveAsNewAPIHadoopDataset method to save data.
> I am using Dataset (or DataFrame) and am looking for the saveAsNewAPIHadoopDataset
> method, but am unable to find it.
>
>
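
Since saveAsNewAPIHadoopDataset lives on pair RDDs, one workaround is to drop from the Dataset/DataFrame to its rdd, map each Row to a (key, JsonObject) pair as in the connector example linked above, and call the method there. A rough sketch, with placeholder column names, assuming hadoopConf has already been configured for BigQuery output exactly as in that example:

import com.google.gson.JsonObject
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.DataFrame

// Convert each Row to the JSON the BigQuery connector expects, then use the
// pair-RDD API. "name" and "age" are placeholder columns; hadoopConf must
// carry the BigQuery project/dataset/table and output-format settings.
def saveToBigQuery(df: DataFrame, hadoopConf: Configuration): Unit = {
  val jsonPairs = df.rdd.map { row =>
    val json = new JsonObject()
    json.addProperty("name", row.getAs[String]("name"))
    json.addProperty("age", row.getAs[Long]("age"))
    (null, json)
  }
  jsonPairs.saveAsNewAPIHadoopDataset(hadoopConf)
}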


Saving from Dataset to Bigquery Table

2017-01-20 Thread A Shaikh
Has anyone experience saving Dataset to Bigquery Table?

I am loading into BigQuery successfully using the following example.
This uses the RDD.saveAsNewAPIHadoopDataset method to save data.
I am using Dataset (or DataFrame) and am looking for the saveAsNewAPIHadoopDataset
method, but am unable to find it.


Re: TDD in Spark

2017-01-20 Thread A Shaikh
Thanks for all the suggestion. Very Helpful.

On 17 January 2017 at 22:04, Lars Albertsson <la...@mapflat.com> wrote:

> My advice, short version:
> * Start by testing one job per test.
> * Use Scalatest or a standard framework.
> * Generate input datasets with Spark routines, write to local file.
> * Run job with local master.
> * Read output with Spark routines, validate only the fields you care
> about for the test case at hand.
> * Focus on building a functional regression test suite with small test
> cases before testing with large input datasets. The former improves
> productivity more.
>
> Avoid:
> * Test frameworks coupled to your processing technology - they will
> make it difficult to switch.
> * Spending much effort on small unit tests. Internal interfaces in
> Spark tend to be volatile, and testing against them results in high
> maintenance costs.
> * Input files checked in to version control. They are difficult to
> maintain. Generate input files with code instead.
> * Expected output files checked in to VC. Same reason. Validate
> selected fields instead.
>
> For a longer answer, please search for my previous posts to the user
> list, or watch this presentation: https://vimeo.com/192429554
>
> Slides at http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458
>
>
> Regards,
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>
>
> On Sun, Jan 15, 2017 at 7:14 PM, A Shaikh <shaikh.af...@gmail.com> wrote:
> > What's the most popular testing approach for a Spark app? I am looking for
> > something along the lines of TDD.
>
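
A minimal sketch of the shape Lars describes (ScalaTest, local master, input generated in code, only selected fields validated); MyJobSpec and the filter are stand-ins for whatever job is actually under test:

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

// Minimal shape of the advice above: ScalaTest, local master, input
// generated in code, only the fields we care about asserted. The filter
// below is a placeholder for the real job under test.
class MyJobSpec extends FunSuite {

  test("job keeps only adults") {
    val spark = SparkSession.builder().master("local[2]").appName("MyJobSpec").getOrCreate()
    import spark.implicits._

    val input  = Seq(("abc", 22L), ("xyz", 17L)).toDF("name", "age")
    val output = input.filter($"age" >= 18)

    assert(output.select("name").as[String].collect().toSet === Set("abc"))
    spark.stop()
  }
}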


TDD in Spark

2017-01-15 Thread A Shaikh
What's the most popular testing approach for a Spark app? I am looking for
something along the lines of TDD.


Re: Running a spark code using submit job in google cloud platform

2017-01-12 Thread A Shaikh
You may have tested this code against the Spark version on your local
machine, which may be different from what is deployed in Google Cloud.
You need to select the appropriate Spark version when you submit your job.

On 12 January 2017 at 15:51, Anahita Talebi 
wrote:

> Dear all,
>
> I am trying to run a .jar file as a job using submit job in google cloud
> console.
> https://cloud.google.com/dataproc/docs/guides/submit-job
>
> I actually ran the Spark code on my local computer to generate a .jar
> file. Then, in the Arguments field, I give the values of the arguments that I
> used in the Spark code. One of the arguments is the training dataset, which I
> put in the same bucket where I saved my .jar file. The bucket contains only the
> .jar file, the training dataset and the testing dataset.
>
> Main class or jar
> gs://Anahita/test.jar
>
> Arguments
>
> --lambda=.001
> --eta=1.0
> --trainFile=gs://Anahita/small_train.dat
> --testFile=gs://Anahita/small_test.dat
>
> The problem is that when I run the job I get the following error, and it
> cannot read my training and testing datasets.
>
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.spark.rdd.RDD.coalesce(IZLscala/math/Ordering;)Lorg/apache/spark/rdd/RDD;
>
> Can anyone help me how I can solve this problem?
>
> Thanks,
>
> Anahita
>
>
>
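
A NoSuchMethodError on a core RDD method such as coalesce usually means the jar was compiled against a different Spark version than the cluster runs. One way to guard against that, sketched here as an sbt fragment with placeholder versions, is to pin the build to the cluster's Spark release and mark it "provided":

// build.sbt sketch: compile against the Spark version the Dataproc image
// ships, and mark it "provided" so the cluster's own jars are used at runtime.
// "2.0.2" and the Scala version are placeholders; match them to your cluster.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)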


Re: Handling null in dataset

2017-01-11 Thread A Shaikh
I tried the DataFrame option below; I am not sure what it is for, but it
doesn't seem to work.


   - nullValue: specifies a string that indicates a null value, nulls in
   the DataFrame will be written as this string.


On 11 January 2017 at 17:11, A Shaikh <shaikh.af...@gmail.com> wrote:

>
>
> How does Spark handle null values?
>
> case class AvroSource(name: String, age: Integer, sal: Long, col_float:
> Float, col_double: Double, col_bytes: String, col_bool: Boolean )
>
>
> val userDS = 
> spark.read.format("com.databricks.spark.avro").option("nullValue",
> "x").load("./users.avro")//.as[AvroSource]
> userDS.printSchema()
> userDS.show()
> userDS.createOrReplaceTempView("user")
> spark.sql("select * from user where xdouble is not null ").show()
>
>
>
> [image: Inline images 2]
>
>
> Adding the following lines to the code returns an error, which seems to
> contradict the schema, which says nullable = true. How do I handle null here?
>
> val filteredDS = userDS.filter(_.age > 30)
> filteredDS.show(10)
>
> java.lang.RuntimeException: Null value appeared in non-nullable field:
> - field (class: "scala.Double", name: "col_double")
> - root class: "com.model.AvroSource"
> If the schema is inferred from a Scala tuple/case class, or a Java bean,
> please try to use scala.Option[_] or other nullable types (e.g.
> java.lang.Integer instead of int/scala.Int).
>
>


Handling null in dataset

2017-01-11 Thread A Shaikh
How does Spark handle null values?

case class AvroSource(name: String, age: Integer, sal: Long, col_float:
Float, col_double: Double, col_bytes: String, col_bool: Boolean )


val userDS =
spark.read.format("com.databricks.spark.avro").option("nullValue",
"x").load("./users.avro")//.as[AvroSource]
userDS.printSchema()
userDS.show()
userDS.createOrReplaceTempView("user")
spark.sql("select * from user where xdouble is not null ").show()



[image: Inline images 2]


Adding the following lines to the code returns an error, which seems to
contradict the schema, which says nullable = true. How do I handle null here?

val filteredDS = userDS.filter(_.age > 30)
filteredDS.show(10)

java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Double", name: "col_double")
- root class: "com.model.AvroSource"
If the schema is inferred from a Scala tuple/case class, or a Java bean,
please try to use scala.Option[_] or other nullable types (e.g.
java.lang.Integer instead of int/scala.Int).
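
The error message itself points at the usual fix: wrap the primitive fields that can be null in Option so the encoder accepts null values and the filter handles them explicitly. A sketch, reusing the field names from above:

// Nullable primitives become Option fields; reference types such as String
// can already hold null.
case class AvroSource(
  name: String,
  age: Option[Int],
  sal: Option[Long],
  col_float: Option[Float],
  col_double: Option[Double],
  col_bytes: String,
  col_bool: Option[Boolean]
)

// val userDS = spark.read.format("com.databricks.spark.avro")
//   .load("./users.avro").as[AvroSource]
// userDS.filter(_.age.exists(_ > 30)).show(10)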


Dataset Type safety

2017-01-10 Thread A Shaikh
I have a simple people.csv and the following SimpleApp.


people.csv
--
name,age
abc,22
xyz,32


Working Code

object SimpleApp {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkFactory.getSparkSession("PIPE2Dataset")
    import spark.implicits._

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .csv("/people.csv")
      .as[Person]
  }
}




Fails for data with no header

Removing the header record "name,age" AND switching the header option off
with .option("header", "false") returns the error *cannot resolve '`name`' given
input columns: [_c0, _c1]*:

val peopleDS = spark.read.option("inferSchema", "true").option("header",
  "false").option("delimiter", ",").csv("/people.csv").as[Person]

Shouldn't this just assign the header from the Person class?



Invalid data

As I've specified *.as[Person]*, which does schema inference, is
*option("inferSchema","true")* redundant and not needed?


And lastly, does .as[Person] check that column values match the data type,
i.e. would "age: Long" fail if it gets a non-numeric value? The input file
could be millions of rows, which could make such a check very time consuming.
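
For the header-less case, one option, sketched below, is to derive the schema from the Person case class (or rename the columns) before .as[Person]; with an explicit schema the inferSchema option is indeed unnecessary.

import org.apache.spark.sql.Encoders

// With no header row Spark names the columns _c0, _c1, so supply the names
// and types yourself. Here the schema is derived from the Person case class.
val personSchema = Encoders.product[Person].schema

val peopleDS = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema(personSchema)
  .csv("/people.csv")
  .as[Person]

// Alternative: keep inferSchema=true and rename the columns instead:
//   spark.read.option("inferSchema", "true").option("header", "false")
//     .csv("/people.csv").toDF("name", "age").as[Person]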


Re: Spark Read from Google store and save in AWS s3

2017-01-10 Thread A Shaikh
This should help
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example

On 8 January 2017 at 03:49, neil90  wrote:

> Here is how you would read from Google Cloud Storage(note you need to
> create
> a service account key) ->
>
> os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars
> /home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession, SQLContext
>
> conf = SparkConf()\
> .setMaster("local[8]")\
> .setAppName("GS")
>
> sc = SparkContext(conf=conf)
>
> sc._jsc.hadoopConfiguration().set("fs.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
> sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
> sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "PUT UR GOOGLE
> PROJECT
> ID HERE")
>
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email",
> "testa...@sparkgcs.iam.gserviceaccount.com")
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable",
> "true")
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.keyfile",
> "sparkgcs-96bd21691c29.p12")
>
> spark = SparkSession.builder\
> .config(conf=sc.getConf())\
> .getOrCreate()
>
> dfTermRaw = spark.read.format("csv")\
> .option("header", "true")\
> .option("delimiter" ,"\t")\
> .option("inferSchema", "true")\
> .load("gs://bucket_test/sample.tsv")
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Read-from-Google-store-and-save-in-AWS-s3-tp28278p28286.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
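
For the other half of the thread title (saving to AWS S3), the same Hadoop-configuration pattern applies. A Scala sketch, assuming the hadoop-aws connector and AWS SDK jars are on the classpath; the keys, bucket and paths are placeholders:

import org.apache.spark.sql.SparkSession

// Read the data (e.g. the GCS file loaded above), then write to S3 via the
// s3a filesystem. Credentials and bucket names are placeholders.
val spark = SparkSession.builder().appName("gcs-to-s3").getOrCreate()

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val df = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("gs://bucket_test/sample.tsv")

df.write.mode("overwrite").parquet("s3a://my-target-bucket/sample_parquet")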


Custom delimiter file load

2016-12-31 Thread A Shaikh
In PySpark 2, loading a file with any delimiter into a DataFrame is pretty
straightforward:
spark.read.csv(file, schema=, sep='|')

Is there something similar in Spark 2 in Scala, e.g. spark.read.csv(path,
sep='|')?
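
A minimal Scala sketch of the equivalent, assuming a SparkSession named spark is in scope; the separator is passed as an option on the reader rather than as a named argument:

// Same idea in Scala: the separator is passed with option("sep", ...)
// ("delimiter" also works); a schema can be supplied with .schema(...) if needed.
val df = spark.read
  .option("sep", "|")
  .option("header", "true")
  .csv("/path/to/file.txt")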


Re: Spark SQL Syntax

2016-12-19 Thread A Shaikh
I use PySpark on Spark 2.
I used Oracle and Postgres syntax, only to get back an "unhappy response".
I do get some of it resolved after some searching, but that consumes a
lot of my time; having a platform to test my SQL syntax and its results
would be very helpful.


On 19 December 2016 at 14:00, Ramesh Krishnan <ramesh.154...@gmail.com>
wrote:

> What is the version of spark you are using . If it is less than 2.0 ,
> consider using dataset API's to validate compile time checks on syntax.
>
> Thanks,
> Ramesh
>
> On Mon, Dec 19, 2016 at 6:36 PM, A Shaikh <shaikh.af...@gmail.com> wrote:
>
>> HI,
>>
>> I keep getting "invalid syntax" errors from Spark SQL, especially for
>> date/timestamp manipulation.
>> What's the best way to test that the SQL syntax I use on a Spark DataFrame is valid?
>> Is there any online site to test or run demo SQL?
>>
>> Thanks,
>> Afzal
>>
>
>


Spark SQL Syntax

2016-12-19 Thread A Shaikh
HI,

I keep getting "invalid syntax" errors from Spark SQL, especially for
date/timestamp manipulation.
What's the best way to test that the SQL syntax I use on a Spark DataFrame is valid?
Is there any online site to test or run demo SQL?

Thanks,
Afzal
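
One low-effort harness is a tiny DataFrame plus spark.sql in the spark-shell (or in a unit test); the sample below uses made-up data just to exercise a few date/timestamp functions:

// Run in spark-shell, where spark and spark.implicits._ are already available.
val df = Seq(("2016-12-19", "2016-12-19 14:00:00")).toDF("d", "ts")
df.createOrReplaceTempView("t")

spark.sql("""
  SELECT to_date(d)              AS d_parsed,
         date_add(to_date(d), 7) AS plus_week,
         unix_timestamp(ts)      AS epoch_secs
  FROM t
""").show()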


Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Shaikh Riyaz
Hi Akhil,

Thanks for your reply.

We are using CDH 5.1.3 and the Spark configuration is taken care of by
Cloudera. Please let me know if you would like to review the
configuration.

Regards,
Riyaz

On Wed, Oct 1, 2014 at 10:10 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Looks like a configuration issue, can you paste your spark-env.sh on the
 worker?

 Thanks
 Best Regards

 On Wed, Oct 1, 2014 at 8:27 AM, Tathagata Das tathagata.das1...@gmail.com
  wrote:

 It would help to turn on debug level logging in log4j and see the logs.
 Just looking at the error logs is not giving me any sense. :(

 TD

 On Tue, Sep 30, 2014 at 4:30 PM, Shaikh Riyaz shaikh@gmail.com
 wrote:

 Hi TD,

 Thanks for your reply.

 Attachment in previous email was from Master.

 Below is the log message from one of the worker.

 ---
 2014-10-01 01:49:22,348 ERROR akka.remote.EndpointWriter:
 AssociationError [akka.tcp://sparkWorker@*worker4*:7078] -
 [akka.tcp://sparkExecutor@*worker4*:34010]: Error [Association failed
 with [akka.tcp://sparkExecutor@*worker4*:34010]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkExecutor@*worker4*:34010]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: *worker4*:34010
 ]
 2014-10-01 02:14:54,868 ERROR akka.remote.EndpointWriter:
 AssociationError [akka.tcp://sparkWorker@*worker4*:7078] -
 [akka.tcp://sparkExecutor@*worker4*:33184]: Error [Association failed
 with [akka.tcp://sparkExecutor@*worker4*:33184]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkExecutor@*worker4*:33184]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: *worker4*:33184
 ]
 2014-10-01 02:14:54,878 ERROR akka.remote.EndpointWriter:
 AssociationError [akka.tcp://sparkWorker@*worker4*:7078] -
 [akka.tcp://sparkExecutor@*worker4*:33184]: Error [Association failed
 with [akka.tcp://sparkExecutor@*worker4*:33184]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkExecutor@*worker4*:33184]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: *worker4*:33184
 ]
 2014-10-01 02:14:54,887 ERROR akka.remote.EndpointWriter:
 AssociationError [akka.tcp://sparkWorker@*worker4*:7078] -
 [akka.tcp://sparkExecutor@*worker4*:33184]: Error [Association failed
 with [akka.tcp://sparkExecutor@*worker4*:33184]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkExecutor@*worker4*:33184]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: *worker4*:33184
 ]

 -

 Your support will be highly appreciated.

 Regards,
 Riyaz

 On Wed, Oct 1, 2014 at 1:16 AM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 Is this the logs of the worker where the failure occurs? I think issues
 similar to these have since been solved in later versions of Spark.

 TD

 On Tue, Sep 30, 2014 at 11:33 AM, Shaikh Riyaz shaikh@gmail.com
 wrote:

 Dear All,

 We are using Spark streaming version 1.0.0 in our Cloudera Hadoop
 cluster CDH 5.1.3.

 Spark streaming is reading messages from Kafka using
 https://github.com/dibbhatt/kafka-spark-consumer.

 We have allocated 4 GB of memory to each executor and 5 GB to each worker. We
 have a total of 6 workers spread across 6 machines.

 Please find the attached log file for detailed error messages.


 Thanks in advance.

 --
 Regards,

 Riyaz



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 Regards,

 Riyaz






-- 
Regards,

Riyaz


Re: Multiple exceptions in Spark Streaming

2014-09-30 Thread Shaikh Riyaz
Hi TD,

Thanks for your reply.

Attachment in previous email was from Master.

Below is the log message from one of the worker.
---
2014-10-01 01:49:22,348 ERROR akka.remote.EndpointWriter: AssociationError
[akka.tcp://sparkWorker@*worker4*:7078] - [akka.tcp://sparkExecutor@
*worker4*:34010]: Error [Association failed with
[akka.tcp://sparkExecutor@*worker4*:34010]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@*worker4*:34010]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: *worker4*:34010
]
2014-10-01 02:14:54,868 ERROR akka.remote.EndpointWriter: AssociationError
[akka.tcp://sparkWorker@*worker4*:7078] - [akka.tcp://sparkExecutor@
*worker4*:33184]: Error [Association failed with
[akka.tcp://sparkExecutor@*worker4*:33184]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@*worker4*:33184]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: *worker4*:33184
]
2014-10-01 02:14:54,878 ERROR akka.remote.EndpointWriter: AssociationError
[akka.tcp://sparkWorker@*worker4*:7078] - [akka.tcp://sparkExecutor@
*worker4*:33184]: Error [Association failed with
[akka.tcp://sparkExecutor@*worker4*:33184]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@*worker4*:33184]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: *worker4*:33184
]
2014-10-01 02:14:54,887 ERROR akka.remote.EndpointWriter: AssociationError
[akka.tcp://sparkWorker@*worker4*:7078] - [akka.tcp://sparkExecutor@
*worker4*:33184]: Error [Association failed with
[akka.tcp://sparkExecutor@*worker4*:33184]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@*worker4*:33184]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: *worker4*:33184
]
-

Your support will be highly appreciated.

Regards,
Riyaz

On Wed, Oct 1, 2014 at 1:16 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Is this the logs of the worker where the failure occurs? I think issues
 similar to these have since been solved in later versions of Spark.

 TD

 On Tue, Sep 30, 2014 at 11:33 AM, Shaikh Riyaz shaikh@gmail.com
 wrote:

 Dear All,

 We are using Spark streaming version 1.0.0 in our Cloudera Hadoop cluster
 CDH 5.1.3.

 Spark streaming is reading messages from Kafka using
 https://github.com/dibbhatt/kafka-spark-consumer.

 We have allocated 4 GB of memory to each executor and 5 GB to each worker. We have
 a total of 6 workers spread across 6 machines.

 Please find the attached log file for detailed error messages.


 Thanks in advance.

 --
 Regards,

 Riyaz



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
Regards,

Riyaz


Data loading to Parquet using spark

2014-07-06 Thread Shaikh Riyaz
Hi,

We are planning to use Spark to load data into Parquet, and this data will be
queried by Impala for visualization through Tableau.

Can we achieve this flow? How do we load data into Parquet from Spark? Will
Impala be able to access the data loaded by Spark?

I would greatly appreciate it if someone could help with an example to achieve
this goal.

Thanks in advance.

-- 
Regards,

Riyaz
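
A sketch of the Spark side of that flow, using the DataFrame API of later Spark releases (the 1.0-era API used saveAsParquetFile on SchemaRDD instead); the paths and table layout are placeholders, and on the Impala side the table typically needs a REFRESH or INVALIDATE METADATA before the new files become visible:

import org.apache.spark.sql.SparkSession

// Write the source data out as Parquet under a directory that an Impala
// external table points at. Paths and names are placeholders.
val spark = SparkSession.builder().appName("load-to-parquet").getOrCreate()

val source = spark.read.option("header", "true").csv("hdfs:///staging/input.csv")

source.write
  .mode("overwrite")
  .parquet("hdfs:///warehouse/mydb.db/events_parquet")

// Impala side (outside Spark): CREATE EXTERNAL TABLE ... STORED AS PARQUET
// LOCATION 'hdfs:///warehouse/mydb.db/events_parquet', then REFRESH the table.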