Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Nuthan Reddy
Hi Chetan,

You can use

spark-submit showDF.py | hadoop fs -put - showDF.txt

showDF.py:

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("Write stdout").getOrCreate()

# Keep Spark's own logging out of stdout so only the table output is piped.
spark.sparkContext.setLogLevel("OFF")

# show() prints to stdout on the driver; the pipe above streams it into HDFS.
spark.table("").show(100, truncate=False)

But is there any specific reason you want to write it to HDFS? Is this for
human consumption?

Regards,
Nuthan

On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri wrote:

> Hello Users,
>
> In Spark, when I have a DataFrame and call .show(100), I want to save the
> printed output as-is to a text file in HDFS.
>
> How can I do this?
>
> Thanks
>


-- 
Nuthan Reddy
Sigmoid Analytics



Re: writing into oracle database is very slow

2019-04-13 Thread Yeikel
Are you sure you only need 10 partitions?  Do you get the same performance
writing to HDFS with 10 partitions? 
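
For reference, a rough sketch of the comparison I have in mind, assuming a
DataFrame df; the path, JDBC URL, table and credentials below are only
placeholders:

// Baseline: time the same 10 partitions going to HDFS.
df.repartition(10)
  .write
  .mode("overwrite")
  .parquet("hdfs:///tmp/oracle_write_baseline")   // placeholder path

// The Oracle write via JDBC; batchsize and numPartitions are usually the
// options that matter most for JDBC write throughput.
df.repartition(10)
  .write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // placeholder
  .option("dbtable", "MY_SCHEMA.MY_TABLE")                   // placeholder
  .option("user", "user")
  .option("password", "password")
  .option("batchsize", "10000")     // rows per JDBC batch insert
  .option("numPartitions", "10")    // parallel connections used for the write
  .mode("append")
  .save()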






Re: Best Practice for Writing data into a Hive table

2019-04-13 Thread Yeikel
Writing to CSV is very slow. 

From what I've seen, this is the preferred way to write to Hive:

// Register the DataFrame as a temporary view, then create the table via SQL.
myDf.createOrReplaceTempView("mytempTable")
sqlContext.sql("create table mytable as select * from mytempTable")


Source :
https://stackoverflow.com/questions/30664008/how-to-save-dataframe-directly-to-hive
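
On more recent Spark versions, the DataFrameWriter can also create the table
directly. A minimal sketch, assuming the SparkSession was built with
enableHiveSupport(); the table name is a placeholder:

import org.apache.spark.sql.SaveMode

// Writes the DataFrame as a metastore-managed table.
myDf.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("mytable")   // placeholder table name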






RE: Question about relationship between number of files and initial tasks(partitions)

2019-04-13 Thread email
Before we conclude that the issue is skewed data, let's confirm it:

import org.apache.spark.sql.functions.spark_partition_id

// Count the records that ended up in each partition of the DataFrame.
df.groupBy(spark_partition_id()).count().show()

This should give the number of records you have in each partition.
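
It is also worth checking how many partitions the scan actually produced. A
small sketch; the config values below are examples only, and for file-based
sources they influence how input files are packed into partitions:

// Number of partitions (= tasks) in the scanned DataFrame.
println(df.rdd.getNumPartitions)

// Knobs for file-based sources; larger values generally mean fewer, larger
// input partitions. Set them before the read.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024) // 256 MB
spark.conf.set("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)     // 8 MB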

 

 

From: Sagar Grover  
Sent: Thursday, April 11, 2019 8:23 AM
To: yeikel valdes 
Cc: jasonnerot...@gmail.com; arthur...@flipp.com; user @spark/'user 
@spark'/spark users/user@spark 
Subject: Re: Question about relationship between number of files and initial 
tasks(partitions)

 

Extending Arthur's question,

I am facing the same problem (the number of partitions was huge: 960 cores,
16,000 partitions). I tried to decrease the number of partitions with coalesce,
but the problem is unbalanced data: after using coalesce I get a Java heap
space error, while there was no heap error without coalesce. I am guessing the
error is due to uneven data and some heavy partitions getting merged together.
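
To make my guess concrete, a rough sketch of the two options I am weighing;
df and the target partition count are placeholders:

// coalesce only merges existing partitions (no shuffle), so already-heavy
// partitions can end up combined into even heavier ones.
val merged = df.coalesce(200)

// repartition does a full shuffle and spreads rows roughly evenly,
// at the cost of moving all of the data.
val balanced = df.repartition(200)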

Let me know if you have any pointers on how to handle this.

 

On Wed, Apr 10, 2019 at 11:21 PM yeikel valdes <em...@yeikel.com> wrote:

If you need to reduce the number of partitions you could also try 

df.coalesce


On Thu, 04 Apr 2019 06:52:26 -0700, jasonnerot...@gmail.com wrote:

Have you tried something like this?

 

spark.conf.set("spark.sql.shuffle.partitions", "5")

 

 

 

On Wed, Apr 3, 2019 at 8:37 PM Arthur Li <arthur...@flipp.com> wrote:

Hi Sparkers,

 

I noticed that in my Spark application, the number of tasks in the first stage
is equal to the number of files read by the application (at least for Avro) if
the number of CPU cores is less than the number of files. If there are more
CPU cores than files, it is usually equal to the default parallelism number.
Why does it behave like this? Does this require a lot of resources from the
driver? Is there any way to decrease the number of tasks (partitions) in the
first stage without merging files before loading?

 

Thanks,

Arthur 

 



 

 

-- 

Thanks,

Jason

 



Best Practice for Writing data into a Hive table

2019-04-13 Thread Debabrata Ghosh
Hi,
Could you please let me know which of the following options would be a best
practice for writing data into a Hive table:

Option 1:
outputDataFrame.write
  .mode(SaveMode.Overwrite)
  .format("csv")
  .save("hdfs_path")

Option 2: Take the data from the DataFrame and insert it into the Hive table
using Spark SQL.
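
For clarity, Option 2 as I picture it would look roughly like this; the view
and table names are placeholders:

// Register the DataFrame as a temporary view, then insert through Spark SQL.
outputDataFrame.createOrReplaceTempView("output_tmp")
spark.sql("INSERT OVERWRITE TABLE my_hive_table SELECT * FROM output_tmp")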

Thanks for your help in advance!

Cheers,
Debu


ApacheCon NA 2019 Call For Proposal and help promoting Spark project

2019-04-13 Thread Felix Cheung
Hi Spark community!

As you know, ApacheCon NA 2019 is coming this September and its CFP is now open!
This is an important milestone as we celebrate 20 years of ASF. We have tracks 
like Big Data and Machine Learning among many others. Please submit your 
talks/thoughts/challenges/learnings here:
https://www.apachecon.com/acna19/cfp.html

Second, as a community I think it'd be great if we had a post on the
http://spark.apache.org/ website to promote this event as well. We already have
a logo link up, and perhaps we could add a post covering: what the Spark
project is, what you might learn, a few suggested talk topics, why speak at
ApacheCon, and so on. This would then be linked from the official ApacheCon
website. Any volunteer from the community?

Third, Twitter. I'm not sure who has access to the ApacheSpark Twitter account,
but it'd be great to promote this there. Use the hashtags #ApacheCon and #ACNA19.
Mention @Apachecon. Please use
https://www.apachecon.com/acna19/cfp.html to promote the CFP, and
https://www.apachecon.com/acna19 to promote the event as a whole.



Offline state manipulation tool for structured streaming query

2019-04-13 Thread Jungtaek Lim
Hi Spark users, especially Structured Streaming users who are dealing with
stateful queries,

I'm pleased to introduce Spark State Tools, which enables offline state
manipulation for Structured Streaming queries.

Basically, the tool exposes state as a batch source and sink, so you can read
state, transform it, and even write it back. With the full power of Spark SQL
batch queries, you can achieve what you've just imagined with your state,
including rescaling state (repartitioning) and schema evolution.

Summarized features are below:

- Show state information which you'll need to provide to enjoy features
  - state operator information, state schema
- Create savepoint from existing checkpoint of Structured Streaming query
- Read state as batch source of Spark SQL
- Write DataFrame to state as batch sink of Spark SQL
- Migrate state format from old to new
  - migrating Streaming Aggregation from ver 1 to 2
  - migrating FlatMapGroupsWithState from ver 1 to 2

And here's the GitHub repository of this tool:
https://github.com/HeartSaVioR/spark-state-tools

Artifacts are also published to Maven Central, so you can just pull the
artifact into your app.

I'd be happy to hear new ideas for improvements, and contributions are much
appreciated!

Enjoy!

Thanks,
Jungtaek Lim (HeartSaVioR)


How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Chetan Khatri
Hello Users,

In Spark, when I have a DataFrame and call .show(100), I want to save the
printed output as-is to a text file in HDFS.

How can I do this?

Thanks