Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread Bryan Jeffrey
You can just call rdd.flatMap(_._2).collect.
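
A minimal sketch of that suggestion (runnable in spark-shell, where sc is predefined; the data is made up):

    // flatMap(_._2) assumes each value is itself a collection;
    // for scalar values use map(_._2) or simply .values instead.
    val pairs = sc.parallelize(Seq(("a", Seq(1, 2)), ("b", Seq(3))))
    val values: Array[Int] = pairs.flatMap(_._2).collect()  // Array(1, 2, 3)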

High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
Hi, I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data consists of HBase files in AWS S3, read via EMRFS, but there is no HBase running on the Spark cluster itself. The job restores the HBase snapshot into files on disk in another S3 folder used for temporary storage, then

Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread klrmowse
Okie, well... I'm working with a pair RDD. I need to extract the values and store them somehow (maybe a simple Array?), which I later parallelize and reuse. Since adding to a list is a no-no, what, if any, are the other options? (Java Spark, btw.) Thanks.
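
One option along these lines, sketched in Scala for brevity even though the thread is about Java Spark (collect is only safe when the values fit in driver memory; names are placeholders):

    // Runnable in spark-shell. Pull the values to the driver, then
    // re-distribute them for reuse. For large data, keep the values
    // as an RDD (pairs.values) and cache it instead of collecting.
    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    val stored: Array[String] = pairs.values.collect()
    val reusable = sc.parallelize(stored)
    reusable.cache()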

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Jörn Franke
As far as I know, TableSnapshotInputFormat relies on a temporary folder: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html Unfortunately, some input formats need a (local) tmp directory. Sometimes this cannot be avoided. See also the source:
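
For reference, a hedged sketch of how that input format is typically wired up in Spark (snapshot name and restore path are placeholders; setInput restores the snapshot into the given directory, which is the temporary space discussed in this thread):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
    import org.apache.hadoop.mapreduce.Job

    val hbaseConf = HBaseConfiguration.create()
    val job = Job.getInstance(hbaseConf)
    // Restores the snapshot into the restore directory before reading.
    TableSnapshotInputFormat.setInput(job, "my_snapshot",
      new Path("s3://my-bucket/tmp/snapshot-restore"))

    val snapshotRdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])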

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
Thanks. I checked, and it is using another S3 folder for the temporary restore space. The underlying code insists on the snapshot and the restore directory being on the same filesystem, so it is using EMRFS for both. So unless EMRFS is under the covers using some local disk space, it doesn't seem

Re: spark2.3 on kubernetes

2018-04-07 Thread lk_spark
Resolved: I needed to add "kubernetes.default.svc" to the k8s API server TLS config.
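
The thread does not say how the cluster was provisioned, but with kubeadm the missing SAN could be added roughly like this (a hypothetical ClusterConfiguration fragment, to be adapted to your setup):

    # Adds kubernetes.default.svc as a SAN on the API server's serving cert.
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterConfiguration
    apiServer:
      certSANs:
        - "kubernetes.default.svc"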

[Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread klrmowse
It gives a null pointer exception... Is there a workaround for adding to an ArrayList during .foreach of an RDD? Thank you.
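
One workaround is Spark's built-in CollectionAccumulator, a sketch of which follows (this is a swapped-in technique not mentioned in the thread; it exists because a driver-side list mutated inside foreach is only changed in per-executor copies and never makes it back):

    // Runnable in spark-shell. The accumulator is merged on the driver.
    val acc = sc.collectionAccumulator[String]("collected-values")
    sc.parallelize(Seq("a", "b", "c")).foreach(v => acc.add(v))
    println(acc.value)  // java.util.List on the driver: [a, b, c]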

Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread Jörn Franke
What are you trying to achieve? You should not use global variables in a Spark application, and especially not add to a list: in most cases that makes no sense. If you want to put everything into a single file, then you should repartition to 1.
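
A minimal sketch of the repartition suggestion (output path and data are placeholders; a single partition yields a single part file):

    // Runnable in spark-shell.
    val nums = sc.parallelize(1 to 100)
    nums.repartition(1).saveAsTextFile("/tmp/single-file-output")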

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
I have been trying to monitor this while the job is running. I think I forgot to account for the 3-way HDFS replication, so right there the output is more like 21 TB instead of my claimed 7 TB. But it still looks like HDFS is losing more disk space than can be accounted for by just the output, going

Re: How to delete empty columns in df when writing to parquet?

2018-04-07 Thread Junfeng Chen
Hi, Thanks for explaining! Regards, Junfeng Chen On Wed, Apr 4, 2018 at 7:43 PM, Gourav Sengupta wrote: > Hi, > > I do not think that in a columnar database it makes much of a difference. > The amount of data that you will be parsing will not be much anyway.

spark2.3 on kubernetes

2018-04-07 Thread lk_spark
Hi all: I am trying Spark on k8s with the Pi sample. I got an error in the driver: 2018-04-08 03:08:40 INFO SparkContext:54 - Successfully stopped SparkContext Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated at

Re: How to delete empty columns in df when writing to parquet?

2018-04-07 Thread Gourav Sengupta
Hi Junfeng, you are welcome. If users are extremely adamant about seeing only a few columns, see if you can create a view on just the selected columns and give that to them, in case you are using a Hive metastore. Regards, Gourav
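
A hedged sketch of the view idea via Spark SQL (table and column names are made up; assumes a SparkSession with Hive support enabled):

    // The view exposes only the selected columns of the underlying table.
    spark.sql("""
      CREATE VIEW IF NOT EXISTS sales_slim AS
      SELECT order_id, amount
      FROM sales
    """)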

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Gourav Sengupta
Hi Saad, May I ask which EMR version and cluster size you are using? Usually, if you are using c4.4xlarge systems, they have high disk space, as I understand it. The other thing you can do is attach more disk space to the nodes, an option which is available in the advanced cluster start

Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread Gourav Sengupta
Hi, in case the key-value store is large, can you give Redis a try? Spark does work quite well with Redis. Regards, Gourav Sengupta
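
A sketch of one way to pair Spark with Redis (assumes the Jedis client is on the classpath; host, port, and data are placeholders, as the thread names no specific client):

    import redis.clients.jedis.Jedis

    // One connection per partition; executors write directly to Redis.
    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    pairs.foreachPartition { iter =>
      val jedis = new Jedis("redis-host", 6379)
      try iter.foreach { case (k, v) => jedis.set(k, v) }
      finally jedis.close()
    }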