Re: Setting log level to DEBUG while keeping httpclient.wire on WARN

2018-06-30 Thread yujhe.li
Daniel Haviv wrote:
> Hi,
> I'm trying to debug an issue with Spark so I've set the log level to DEBUG, but
> at the same time I'd like to avoid httpclient.wire's verbose output by
> setting it to WARN.
>
> I tried the following log4j.properties config but I'm still getting DEBUG
> outputs for [...]
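A config along these lines should keep the root logger at DEBUG while silencing wire-level HTTP logging. This is a sketch: `httpclient.wire` is the logger name for the older commons-httpclient 3.x, while HttpClient 4.x logs under `org.apache.http.wire`, so setting both covers either version.

```properties
# log4j.properties — root logger at DEBUG for everything else
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quiet the verbose wire-level HTTP loggers
log4j.logger.httpclient.wire=WARN
log4j.logger.org.apache.http.wire=WARN
log4j.logger.org.apache.http.headers=WARN
```

Note that child loggers such as `httpclient.wire.content` inherit the WARN level from `httpclient.wire`, so the two parent entries are usually enough.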

Re: Repartition not working on a csv file

2018-06-30 Thread yujhe.li
Abdeali Kothari wrote:
> I am using Spark 2.3.0 and trying to read a CSV file which has 500 records.
> When I try to read it, Spark says that it has two stages: 10, 11 and then
> they join into stage 12.

What's your CSV size per file? I think the Spark optimizer may put many files into one task when [...]
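Spark's file scan packs small files into shared tasks based on `spark.sql.files.maxPartitionBytes` (128 MB by default) and `spark.sql.files.openCostInBytes`, so a tiny CSV ends up in a single partition no matter how many records it holds. A sketch of forcing more partitions after the read (the path and partition count are placeholders):

```scala
val df = spark.read
  .option("header", "true")
  .csv("/data/input/*.csv")   // placeholder path

// repartition() triggers a shuffle into the requested number of partitions,
// regardless of how few input splits the file scan produced
val spread = df.repartition(16)
```

`coalesce()` would not help here: it can only reduce the partition count, never increase it.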

Re: Repartition not working on a csv file

2018-06-30 Thread yujhe.li
Abdeali Kothari wrote:
> My entire CSV is less than 20 KB.
> Somewhere in between, I do a broadcast join with 3500 records in another file.
> After the broadcast join I have a lot of processing to do. Overall, the
> time to process a single record goes up to 5 mins on 1 executor.
>
> I'm [...]
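When the per-record work dominates, the usual pattern is to broadcast the small side explicitly and repartition before the expensive stage, so the slow processing runs in parallel instead of on one task. A sketch with hypothetical DataFrame and column names:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcast the small (3500-record) side so the join needs no shuffle,
// then spread the rows out before the expensive per-record processing
val joined = csvDf.join(broadcast(lookupDf), "key")  // names are placeholders
val spread = joined.repartition(200)
```

The repartition cost is negligible for 500 rows, while the downstream 5-minute-per-record work now fans out across 200 tasks (bounded by the cluster's available cores).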

Re: Pass config file through spark-submit

2018-08-16 Thread yujhe.li
So can you read the file on the executor side? I think a file passed with --files my.app.conf is added to the classpath, and you can use it directly.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
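A sketch of the submission side. Strictly speaking, `--files` ships the file to each executor's working directory rather than the classpath, and it can then be located with `SparkFiles.get` (the class and jar names below are placeholders):

```shell
# Ship the config file alongside the job; it lands in each
# executor's working directory under its base name
spark-submit \
  --files my.app.conf \
  --class com.example.MyApp \
  my-app.jar
```

Inside a task, `org.apache.spark.SparkFiles.get("my.app.conf")` returns the local path to the shipped copy.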

Re: read snappy compressed files in spark

2018-09-01 Thread yujhe.li
What's your Spark version? Have you added the Hadoop native library to your path? For example, "spark.executor.extraJavaOptions -Djava.library.path=/hadoop-native/" in spark-defaults.conf.
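A sketch of the corresponding spark-defaults.conf entries, assuming the native libraries (including libsnappy) live under /hadoop-native/ — adjust the path for your installation, and set the driver side too if the driver also reads compressed files:

```properties
# spark-defaults.conf — point both driver and executors at the native libs
spark.driver.extraJavaOptions   -Djava.library.path=/hadoop-native/
spark.executor.extraJavaOptions -Djava.library.path=/hadoop-native/
```

Running `hadoop checknative -a` on a worker is a quick way to confirm whether the snappy native library is actually being picked up.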

Re: Reading RDD by (key, data) from s3

2019-04-16 Thread yujhe.li
You can't: SparkContext is a singleton object. You have to use the Hadoop library or an AWS client to read files on S3.
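A sketch of reading an object directly through the Hadoop FileSystem API from executor code, without touching SparkContext. The bucket and key are placeholders, and the `s3a://` scheme assumes the hadoop-aws module and credentials are configured:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Obtain a FileSystem handle for the bucket and open the object as a stream
val conf = new Configuration()
val fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)  // placeholder bucket
val in = fs.open(new Path("s3a://my-bucket/data/part-00000"))
try {
  // consume the stream, e.g. wrap it in an InputStreamReader
} finally {
  in.close()
}
```

The same idea works with the AWS SDK's S3 client; the Hadoop API has the advantage of reusing the credentials already configured for Spark.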

Is Spark rdd.toDF() thread-safe?

2021-03-17 Thread yujhe.li
Hi, I have an application that runs on a Spark 2.4.4 cluster and transforms two RDDs to DataFrames with `rdd.toDF()`, then writes them out to files. To optimize slave resource usage, the application executes the jobs in multiple threads. The code snippet looks like this: [...] And I found that [...]
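The pattern described above can be sketched as below. This is an assumption about what the (truncated) snippet looked like — RDD names, schemas, and output paths are all hypothetical. Submitting independent jobs from multiple driver threads is a supported Spark usage; the question in the thread is whether the `toDF()` implicit conversion itself is safe under that concurrency:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import spark.implicits._  // spark is the shared SparkSession

// Run the two independent output jobs concurrently from the driver,
// so the second job's tasks can fill executor slots the first leaves idle
val jobs = Seq(
  Future { rdd1.toDF("a", "b").write.parquet("/out/one") },  // placeholder paths
  Future { rdd2.toDF("c", "d").write.parquet("/out/two") }
)
jobs.foreach(Await.result(_, Duration.Inf))
```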