Re: Tune hive query launched thru spark-yarn job.

2019-09-05 Thread Sathi Chowdhury
What I can immediately think of is: since you are using IN in the WHERE clause over a series of timestamps, consider breaking them up. For each epoch timestamp, you can load the results into an intermediate staging table, and then do a final aggregate from that table, keeping the GROUP BY.
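The batching idea above can be sketched in plain Python: split the IN-list of epoch timestamps into one query per timestamp that each appends to a staging table, then aggregate once at the end. The table and column names below are illustrative assumptions, not from the thread.

```python
# Sketch of the staging-table approach: one INSERT per epoch timestamp,
# then a single final aggregate. Table/column names are hypothetical.
def staged_queries(timestamps, source="events", staging="events_staging"):
    inserts = [
        f"INSERT INTO {staging} "
        f"SELECT key, SUM(val) AS val FROM {source} "
        f"WHERE ts = {ts} GROUP BY key"
        for ts in timestamps
    ]
    final = f"SELECT key, SUM(val) AS val FROM {staging} GROUP BY key"
    return inserts, final

inserts, final = staged_queries([1567641600, 1567728000])
```

Each per-timestamp insert touches a much smaller slice of the input, so the shuffle for any one query stays bounded; the final aggregate then runs over the pre-reduced staging table.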

Tune hive query launched thru spark-yarn job.

2019-09-05 Thread Himali Patel
Hello all, We have a use-case where we are aggregating billions of rows, and it does a huge shuffle. For example, per the 'Job' tab in the YARN UI, when the input size is around 350 GB, the shuffle size is >3 TB. This pushes non-DFS usage beyond the warning limit and thus affects the entire cluster. It seems we need
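One knob worth checking with a >3 TB shuffle is spark.sql.shuffle.partitions. A rough sizing rule (an assumption, not something stated in the thread) is shuffle bytes divided by a target partition size of about 128 MB:

```python
# Back-of-envelope sizing for spark.sql.shuffle.partitions. The 3 TB
# figure is from the thread's YARN UI numbers; the ~128 MB target per
# shuffle partition is a common rule of thumb, assumed here.
def suggested_shuffle_partitions(shuffle_bytes, target_bytes=128 * 1024**2):
    return max(1, shuffle_bytes // target_bytes)

parts = suggested_shuffle_partitions(3 * 1024**4)  # 3 TB shuffle
```

With too few shuffle partitions, each one spills large files to local disk (non-DFS usage), which is consistent with the warning the thread describes.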

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Gabor Somogyi
Hi, Let me share the relevant part of the Spark 3.0 documentation (it covers Structured Streaming rather than the DStreams you mentioned, but is still relevant): kafka.group.id (string, default: none, for streaming and batch): The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query
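The option Gabor quotes would be handed to the Kafka source roughly as below; kafka.group.id is the documented option name in Spark 3.0+, while the broker address and topic are placeholder assumptions.

```python
# Options one would pass to spark.readStream.format("kafka").options(**opts)
# in Spark 3.0+. Broker and topic are placeholders; "kafka.group.id" is the
# documented option for pinning a fixed consumer group ("use with caution").
opts = {
    "kafka.bootstrap.servers": "broker1:9092",  # placeholder
    "subscribe": "events",                      # placeholder topic
    "kafka.group.id": "my-authorized-group",    # fixed, pre-authorized group
}
```

This matters for secured clusters that only authorize specific consumer groups, which is the scenario the original poster describes.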

Collecting large dataset

2019-09-05 Thread Rishikesh Gawade
Hi. I have been trying to collect a large dataset (about 2 GB in size, 30 columns, more than a million rows) onto the driver side. I am aware that collecting such a huge dataset isn't advisable; however, the application within which the Spark driver is running requires that data. While collecting
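If the driver really must see every row, DataFrame.toLocalIterator() pulls one partition at a time instead of materializing everything at once the way collect() does. The contrast can be sketched without Spark (fake_partitions stands in for a partitioned DataFrame):

```python
# collect() materializes all rows on the driver at once; an iterator-style
# pull (what DataFrame.toLocalIterator() does, partition by partition)
# keeps peak driver memory to one partition. Plain-Python sketch.
def fake_partitions():
    for p in range(3):
        yield [(p, i) for i in range(2)]  # one "partition" of rows

# collect(): everything in memory simultaneously
collected = [row for part in fake_partitions() for row in part]

def local_iterator():
    for part in fake_partitions():        # one partition resident at a time
        yield from part

streamed = list(local_iterator())
```

Both paths yield the same rows; the difference is peak driver memory, which is exactly what fails when collecting a multi-GB dataset.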

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Sethupathi T
Gabor, Thanks for the quick response and for sharing about Spark 3.0. We need to use Spark Streaming (KafkaUtils.createDirectStream) rather than Structured Streaming, following this document https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html, and to re-iterate the issue for

Re: Start point to read source codes

2019-09-05 Thread Hichame El Khalfi
Hey David, You can find the source code on GitHub: https://github.com/apache/spark Hope this helps, Hichame From: zhou10...@gmail.com Sent: September 5, 2019 4:11 PM To: user@spark.apache.org Subject: Start point to read source codes Hi, I want to read the source codes. Is there any doc, wiki or

Re: Tune hive query launched thru spark-yarn job.

2019-09-05 Thread Himali Patel
Hi Sathi, Thanks for the quick reply. This (the list of epoch times in the IN clause) was already part of a 30-day aggregation. Given our input-to-output aggregation ratio, our cardinality is too high, so we need some query tuning. As we can't assign additional resources for this

Start point to read source codes

2019-09-05 Thread da zhou
Hi, I want to read the source code. Is there any doc, wiki or book which introduces it? Thanks in advance. David

Re: Collecting large dataset

2019-09-05 Thread Marcin Tustin
Stop using collect for this purpose. Either continue your further processing in spark (maybe you need to use streaming), or sink the data to something that can accept the data (gcs/s3/azure storage/redshift/elasticsearch/whatever), and have further processing read from that sink. On Thu, Sep 5,

Re: read image or binary files / spark 2.3

2019-09-05 Thread Peter Liu
Hello experts, I have a quick question: which API allows me to read image files or binary files (for SparkSession.readStream) from a local/Hadoop file system in Spark 2.3? I have been browsing the following documentation and googling for it and didn't find a good example/documentation:

Re: Start point to read source codes

2019-09-05 Thread David Zhou
Hi Hichame, Thanks a lot. I forked it. There is a lot of code, though; I need documents to guide me on which part I should start from. On Thu, Sep 5, 2019 at 1:30 PM Hichame El Khalfi wrote: > Hey David, > > You can find the source code on GitHub: > https://github.com/apache/spark > > Hope this helps, > >

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product

2019-09-05 Thread kyunam
A "left join" fails and tells me I need to turn on "spark.sql.crossJoin.enabled=true", but when I persist one dataframe, it runs fine. Why do you have to "persist"? org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans SELECT * FROM

How to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-05 Thread Shyam P
Hi, I am using Spark SQL 2.4.1 for streaming in my PoC. How do I refresh the dataframe loaded from an HDFS/Cassandra table every time a new batch of the stream is processed? What is the general practice for handling this kind of scenario? Below is the SOF link for more details .
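A common pattern for this in Structured Streaming 2.4+ is foreachBatch: re-read the static table inside the per-batch callback so every micro-batch sees fresh data. The control flow, minus Spark, looks like this (load_lookup stands in for a spark.read of the HDFS/Cassandra table):

```python
# Shape of the foreachBatch refresh pattern: the static lookup source is
# re-loaded inside the per-batch function, so each micro-batch joins
# against fresh data. load_lookup() stands in for re-reading the
# HDFS/Cassandra table; the counter proves it runs once per batch.
reload_count = 0

def load_lookup():
    global reload_count
    reload_count += 1
    return {"k1": "v1"}          # stands in for the refreshed dataframe

def process_batch(batch_rows, batch_id):
    lookup = load_lookup()       # refreshed on every micro-batch
    return [(row, lookup.get(row)) for row in batch_rows]

results = [process_batch(rows, i)
           for i, rows in enumerate([["k1"], ["k1", "k2"]])]
```

In real Spark code the same shape would be passed to writeStream.foreachBatch; the key point is that the static load happens inside the callback, not once at startup.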

How to query StructField's metadata in spark sql?

2019-09-05 Thread kyunam
Using SQL, is it possible to query a column's metadata? Thanks, Kyunam

Test mail

2019-09-05 Thread Himali Patel

[Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Sethupathi T
Hi Team, We have a secured Kafka cluster (which only allows consuming from the pre-configured, authorized consumer group). There is a scenario where we want to use Spark Streaming to consume from the secured Kafka, so we have decided to use spark-streaming-kafka-0-10
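For context on the subject line: in spark-streaming-kafka-0-10, executors consume under a group id derived from the driver's by prepending "spark-executor-", which is exactly what breaks against a cluster that only authorizes one pre-configured group:

```python
# The 0-10 integration derives the executor-side consumer group by
# prefixing the configured group id. A secured cluster that whitelists
# only the original group then rejects the executor-side consumers.
def executor_group_id(group_id):
    return "spark-executor-" + group_id

eg = executor_group_id("authorized-group")
```

So with an ACL that permits only "authorized-group", the executors' "spark-executor-authorized-group" consumers are denied, matching the scenario described in this thread.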