Re: Can this use-case be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Aayush Ranaut
What exactly is your requirement? 
Is the read before write mandatory?
Are you maintaining states in Cassandra?


Regards
Prathmesh Ranaut
https://linkedin.com/in/prathmeshranaut


> On Aug 29, 2019, at 3:35 PM, Shyam P  wrote:
> 
> 
> Thanks Aayush. For every record, do I need to get the data from the Cassandra
> table and update it? Otherwise it may not update the existing record.
> 
> What is this datastax-spark-connector? Is that not a "Cassandra connector
> library written for spark"? If not, how do we write one ourselves?
> Where and how do we start? Can you please guide me (a minimal setup sketch
> follows at the end of this message).
> 
> 
> 
> Thank you.
> Shyam
> 
> 
> 
> 
> On Thu, Aug 29, 2019 at 5:03 PM Aayush Ranaut wrote:
> 
>> Cassandra writes are upserts, so you should be able to do what you need with
>> a single statement unless you’re looking to maintain counters.
>> 
>> I’m not sure whether there is a Cassandra connector library written for Spark
>> Streaming; we wrote one ourselves when we wanted to do the same.
>> 
>> Regards
>> Prathmesh Ranaut
>> https://linkedin.com/in/prathmeshranaut
>> 
>> 
>> On Aug 29, 2019, at 7:21 AM, Shyam P wrote:
>> 
>> 
>>> Hi,
>>> I need to do a PoC for a business use-case.
>>> 
>>> Use case: update a record in a Cassandra table if it exists.
>>> 
>>> Does Spark streaming support comparing each record and updating the
>>> existing Cassandra record?
>>> 
>>> For each record received from the Kafka topic, I want to check whether it is
>>> already in Cassandra: if yes, update the record, else insert a new record.
>>> 
>>> How can this be done using Spark Structured Streaming and Cassandra? Any
>>> snippet or sample would be appreciated.
>>> 
>>> Thank you,
>>> 
>>> Shyam
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
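On where to start: the DataStax spark-cassandra-connector is exactly such a library for
reading and writing Cassandra tables from Spark. Below is a minimal batch-write sketch;
the connector version, contact point, keyspace, and table names are illustrative
assumptions, not values from this thread.

// build.sbt (version is an assumption; match it to your Spark release):
// libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.3"

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-upsert-poc")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra
  .getOrCreate()

val df = spark.read.json("input.json") // placeholder source

// A plain write is an upsert in Cassandra: rows whose primary key already
// exists are overwritten, new keys are inserted.
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // placeholders
  .mode("append")
  .save()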


Re: Can this use-case be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Aayush Ranaut
Cassandra writes are upserts, so you should be able to do what you need with a
single statement unless you’re looking to maintain counters.

I’m not sure whether there is a Cassandra connector library written for Spark
Streaming; we wrote one ourselves when we wanted to do the same.

Regards
Prathmesh Ranaut
https://linkedin.com/in/prathmeshranaut

> On Aug 29, 2019, at 7:21 AM, Shyam P  wrote:
> 
> Hi,
> I need to do a PoC for a business use-case.
> 
> Use case: update a record in a Cassandra table if it exists.
> 
> Does Spark streaming support comparing each record and updating the
> existing Cassandra record?
> 
> For each record received from the Kafka topic, I want to check whether it is
> already in Cassandra: if yes, update the record, else insert a new record.
> 
> How can this be done using Spark Structured Streaming and Cassandra? Any
> snippet or sample would be appreciated (see the sketch after this message).
> 
> Thank you,
> 
> Shyam
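
A minimal sketch of the Kafka-to-Cassandra path with Structured Streaming, relying on
Cassandra's upsert semantics rather than a read-before-write. It assumes Spark 2.4+ (for
foreachBatch) and the spark-cassandra-connector on the classpath; the broker, topic,
keyspace, table, and message schema are placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("kafka-to-cassandra").getOrCreate()

// Assumed message schema; adjust to the real payload.
val schema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "my_topic")                     // placeholder
  .load()
  .select(from_json(col("value").cast("string"), schema).as("rec"))
  .select("rec.*")

// foreachBatch writes each micro-batch with the batch Cassandra source.
// Because Cassandra writes are upserts keyed on the primary key,
// no explicit "does the row exist" check is needed.
val query = parsed.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // placeholders
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/kafka-to-cassandra") // placeholder
  .start()

query.awaitTermination()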


Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Aayush Ranaut
This is the job of the ContextCleaner. There are a few properties that you can
tweak to see if they help:
spark.cleaner.periodicGC.interval

spark.cleaner.referenceTracking

spark.cleaner.referenceTracking.blocking.shuffle
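
A minimal sketch of setting these when the application builds its SparkSession; the
values below are illustrative assumptions, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-running-app")
  // Run the periodic GC that drives context cleanup more often (default 30min).
  .config("spark.cleaner.periodicGC.interval", "15min")
  // Keep reference tracking on so unreachable shuffles/RDDs/broadcasts get cleaned (default true).
  .config("spark.cleaner.referenceTracking", "true")
  // Make shuffle cleanup blocking so it is not skipped under load (default false).
  .config("spark.cleaner.referenceTracking.blocking.shuffle", "true")
  .getOrCreate()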



Regards

Prathmesh Ranaut

> On Jul 21, 2019, at 11:36 AM, Prathmesh Ranaut Gmail 
>  wrote:
> 
> 
> This is the job of the ContextCleaner. There are a few properties that you can
> tweak to see if they help:
> spark.cleaner.periodicGC.interval
> 
> spark.cleaner.referenceTracking
> 
> spark.cleaner.referenceTracking.blocking.shuffle
> 
> 
> 
> Regards
> 
> Prathmesh Ranaut
>> On Jul 21, 2019, at 11:31 AM, Alex Landa  wrote:
>> 
>> 
>> Hi,
>> 
>> We are running a long-running Spark application (which executes lots of
>> quick jobs using our scheduler) on a Spark standalone 2.4.0 cluster.
>> We see that old shuffle files (a week old, for example) are not deleted
>> during the execution of the application, which leads to out-of-disk-space
>> errors on the executor.
>> If we re-deploy the application, the Spark cluster takes care of the cleaning
>> and deletes the old shuffle data (since we have
>> -Dspark.worker.cleanup.enabled=true in the worker config).
>> I don't want to re-deploy our app every week or two; I want to be able to
>> configure Spark to clean old shuffle data (as it should).
>> 
>> How can I configure Spark to delete old shuffle data during the lifetime of
>> the application (not after)?
>> 
>> 
>> Thanks,
>> Alex


Re: Spark Write method not ignoring double quotes in the csv file

2019-07-11 Thread Aayush Ranaut
Question 2:

You would be creating a DataFrame while reading the parquet file, so you can
trim the trailing whitespace on select:

import org.apache.spark.sql.functions.{col, rtrim}

val df = spark.read.load("file.parquet")
df.select(rtrim(col("columnName")))
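
To cover every string column at once, a variant of the same idea (selecting
columns by type is an assumption about the intent):

import org.apache.spark.sql.types.StringType

// Trim trailing whitespace from every string column in one pass.
val trimmed = df.schema.fields
  .collect { case f if f.dataType == StringType => f.name }
  .foldLeft(df) { (acc, name) => acc.withColumn(name, rtrim(col(name))) }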

Regards
Prathmesh Ranaut
https://linkedin.com/in/prathmeshranaut

> On Jul 12, 2019, at 9:15 AM, anbutech  wrote:
> 
> Hello all, could you please help me fix the below questions?
> 
> Question 1:
> 
> I have tried the below options while writing the final data to a csv file to
> ignore the double quotes in that csv file; nothing worked. I'm using Spark
> version 2.2 and Scala version 2.11.
> 
> option("quote", "\"")
> 
> .option("escape", ":")
> 
> .option("escape", "")
> 
> .option("quote", "\u")
> 
> Code:
> 
> finaldataset
>   .repartition(numberofpartitions)
>   .write
>   .mode(SaveMode.Overwrite)
>   .option("delimiter", "|")
>   .option("header", "true")
>   .csv("path")
> 
> output_data.csv
> 
> field|field2|""|field4|field5|""|field6|""|field7
> 
> I want to remove the double quotes in the csv file while writing with the
> Spark write method. Are there any options available?
> 
> Question 2: Is there any way to remove the trailing white spaces in the
> fields while reading the parquet file?
> 
> Thanks Anbu
> 
> 
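For question 1, a sketch of one possible workaround (an assumption, not an answer given
in this thread: it presumes the quoted "" tokens come from empty-string values). The CSV
writer emits null values as empty, unquoted fields, so converting empty strings to null
before writing drops the quotes; the variable names are reused from the question above.

import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

val base = finaldataset.toDF() // the dataset from the question

// Turn empty strings into nulls; `when` without `otherwise` yields null when the
// condition is false, and nulls are written without quotes by the CSV writer.
val cleaned = base.schema.fields
  .collect { case f if f.dataType == StringType => f.name }
  .foldLeft(base) { (acc, name) =>
    acc.withColumn(name, when(col(name) =!= "", col(name)))
  }

cleaned
  .repartition(numberofpartitions)
  .write
  .mode("overwrite")
  .option("delimiter", "|")
  .option("header", "true")
  .csv("path")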
