I'm having trouble getting dynamic resource allocation to properly
terminate idle executors when using FSx Lustre for shuffle persistence on
EMR 7.8 (Spark 3.5.4) on EKS. I'm trying this strategy out to battle cost
driven by very severe data skew (I don't really care if a couple of nodes run for
hours while
Dear Spark Development Community,
Our team is using PySpark (versions 3.5.x, currently testing 3.5.5) and we
integrate Static Application Security Testing (SAST/SCA) using tools like
Checkmarx into our CI/CD pipelines for our Python projects.
We've observed that a significant number of Critical
D, psf.concat(psf.lit(PREFIX_ORG),
psf.sha2(df.descr, 256)))
return df
Hope this email finds someone running into a similar issue in the future.
Kind regards,
Jelle
From: Mich Talebzadeh
Sent: Wednesday, May 1, 2024 11:56 AM
To: Stephen Coy
Cc: Nijland,
.bindAddress", "localhost"
).set("spark.driver.host", "127.0.0.1"
# ).set("spark.driver.port", "0"
).set("spark.ui.port", "4041"
).set("spark.executor.instances", "1"
).set("spark.executor.cores", "50"
)
___
From: Mich Talebzadeh
Sent: Wednesday, April 24, 2024 4:40 PM
To: Nijland, J.G.W. (Jelle, Student M-CS)
Cc: user@spark.apache.org
Subject: Re: [spark-graphframes]: Generating incorrect edges
OK, a few observations:
1) ID Generation Method: How are you generating unique IDs (UUIDs, seque
tags: pyspark,spark-graphframes
Hello,
I am running pyspark in a podman container and I have issues with incorrect
edges when I build my graph.
I start with loading a source dataframe from a parquet directory on my server.
The source dataframe has the following columns:
+-+---+-
Greetings,
tl;dr there must have been a regression in spark *connect*'s ability to
retrieve data, more details in linked issues
https://issues.apache.org/jira/browse/SPARK-45598
https://issues.apache.org/jira/browse/SPARK-45769
we have projects that depend on spark connect 3.5 and we'd apprec
Hey vaquar,
The link doesn't explain the crucial detail we're interested in: does the executor
re-use the data that exists on a node from a previous executor, and if not, how
can we configure it to do so?
We are not running on kubernetes, so EKS/Kubernetes-specific advice isn't
very relevant.
We are ru
Is the Gigabyte GeForce RTX 3080 GPU supported for running machine learning in
Spark?
you suggested right?
> But int to long / bigint seems to be a reasonable evolution (correct me if
> I'm wrong). Is it possible to reopen the JIRA I mentioned earlier? Any
> reason for it getting closed?
>
>
> Regards,
> Naresh
>
>
> On Mon, Nov 7, 2022, 16:55 Evy
Hi Naresh,
Have you tried any of the following in order to resolve your issue:
1. Reading the Parquet files (directly, not via Hive [i.e.,
spark.read.parquet()]), casting to LongType and creating the Hive
table based on this dataframe? Hive's BigInt and Spark's Long should have
the sam
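For readers hitting the same int-vs-bigint mismatch, a minimal sketch of option 1
above; the path, column and table names are placeholders, not the poster's actual
schema:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Read the Parquet files directly, bypassing the Hive schema
val df = spark.read.parquet("/data/events")
  .withColumn("id", col("id").cast(LongType))   // promote int to long / bigint

// Create the Hive table from this dataframe; Hive BigInt maps to Spark LongType
df.write.mode("overwrite").saveAsTable("db.events_bigint")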
Hi:
In Apache Spark we can read JSON using the following:
spark.read.json("path").
There is support to convert a JSON string in a dataframe into a structured element
using
using
(https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.
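As a quick illustration of from_json (a sketch; the column name and schema below
are made up for the example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// a dataframe holding raw JSON strings in a column
val raw = Seq("""{"name":"a","age":1}""").toDF("json")

val schema = new StructType().add("name", StringType).add("age", IntegerType)

// parse the string column into a struct and pull out its fields
val structured = raw.withColumn("data", from_json(col("json"), schema))
structured.select("data.name", "data.age").show()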
Hi,
We have a structured streaming application, and we face a memory leak while
caching in the foreachBatch block.
We do unpersist every iteration, and we also verify via
"spark.sparkContext.getPersistentRDDs" that we don't have unnecessary
cached data.
We also noted in the profiler that many sp
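For context, a minimal sketch of the cache-then-unpersist pattern being described
inside foreachBatch; the source, sink and paths here are placeholders:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val stream = spark.readStream.format("rate").load()   // placeholder source

val query = stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()                                  // cache: reused below
    batchDF.count()                                    // first action
    batchDF.write.mode("append").parquet("/tmp/out")   // second action
    batchDF.unpersist()                                // release every micro-batch
  }
  .start()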
I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
When designing the application I was sure Spark would perform a global
commit once the job is committed, but what it really does is commit on
each task, meaning *once a task completes writing, it moves files from temp to
target storage*. So
Connection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
Any help would be much appreciated
--
Live every day as if it were your last, because one of these days, it will
be.
Regards,
Prasanth M Sasidharan
Hi,
What is the expected behavior if the streaming is stopped after the write
commit and before the read commit? Should I expect data duplication?
Thanks.
Hi,
I'm developing a new Spark connector using data source v2 API (spark 3.1.1).
I noticed that the planInputPartitions method (in MicroBatchStream) is
called twice every micro-batch.
What is the motivation/reason for this?
Thanks,
Kineret
hi
Thank you. The suggestion is very good; there is no need to use
"repartitionByRange".
However, there is a small doubt: if the output file is required to be
globally ordered, "repartition" will disrupt the order of the data, and the
result of using "coalesce"
ange(5,column("v")).sortWithinPartitions("v").
write.parquet(outputPath)
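(For reference, the truncated line above appears to be the repartitionByRange call;
a self-contained sketch of the pattern with a placeholder column v and output path.
repartitionByRange gives each partition a non-overlapping range of v, and
sortWithinPartitions orders the rows inside each partition, so the files come out
globally ordered when read back in partition order.)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.column

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = (1 to 1000).toDF("v")
val outputPath = "/tmp/ordered-output"   // placeholder

df.repartitionByRange(5, column("v"))    // non-overlapping ranges of v per partition
  .sortWithinPartitions("v")             // sort rows inside each partition
  .write.parquet(outputPath)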
Best Regards,
m li
Ivan Petrov wrote
> Ah... makes sense, thank you. I tried sortWithinPartitions before and
> replaced it with sort. It was a mistake.
>
> Thu, 25 Feb 2021 at 15:25, Pietro Gentil
Hi,
I have read in many blogs that the Spark framework is itself a compiler.
It generates the DAG, optimizes it and executes it. The DAG is generated from
the user-submitted code (be it in Java, Scala, Python or R). So when we submit
a JAR file (it has the list of compiled classes), in the first s
Hi All,
I am trying to submit my application using spark-submit in YARN mode,
but it is failing because of an unknown queue "default"; we specified the queue
name in spark-defaults.conf as spark.yarn.queue SecondaryQueue.
It is failing for one application, but for another application I don't know the
reason.
p
Hi,
I would like to support data locality in Spark data source v2. How can I
provide Spark the ability to read and process data on the same node?
I didn't find any interface that supports 'getPreferredLocations' (or
equivalent).
Thanks!
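A sketch against the Spark 3.x DataSource V2 API (assuming that is the version in
play): InputPartition exposes preferredLocations(), which plays the role of
getPreferredLocations; the host names and partition fields below are placeholders.
import org.apache.spark.sql.connector.read.InputPartition

// a partition that tells Spark which hosts hold its data
case class MyInputPartition(start: Long, end: Long, hosts: Array[String])
  extends InputPartition {
  override def preferredLocations(): Array[String] = hosts
}

val p = MyInputPartition(0L, 1000L, Array("datanode-1.example.com"))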
spark version 2.1.0
Regards,
Sbm
On Mon, 16 Dec, 2019, 10:04 HARSH TAKKAR, wrote:
> Please share the spark version you are using .
>
> On Fri, 13 Dec, 2019, 4:02 PM SB M, wrote:
>
>> Hi All,
>>Am trying to create a dynamic partition with external table on hive
>
Hi All,
I am trying to create a dynamic partition with an external table on the Hive
metastore using Spark SQL.
When I try to create a partition column with data type bigint, the partition
does not work, even after I tried repair table; data is not shown when I run a
sample query select * from table.
but i
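For readers landing on this thread, a sketch of the kind of setup being described;
the database, table, location and partition column below are placeholders, and the
reported problem is precisely that a BIGINT partition column is not being picked up:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS db.events (id STRING, value DOUBLE)
  PARTITIONED BY (event_day BIGINT)
  STORED AS PARQUET
  LOCATION '/data/events'
""")
// register partitions added outside of Spark
spark.sql("MSCK REPAIR TABLE db.events")
spark.sql("SELECT * FROM db.events LIMIT 10").show()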
f the places I have seen logging done by log4j properties,
>> but nowhere have I seen any solution where logs are being
>> compressed.
>>
>> Is there any way I can compress the logs, so that those logs can then
>> be shipped to S3?
>>
>> --
>> Raman Gugnani
>>
>
--
Girish bhat m
;33554432")` to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the
Hi All,
I get a 'No plan for EventTimeWatermark' error while doing a query with
column pruning using Structured Streaming with a custom data source that
implements Spark data source v2.
My data source implementation that handles the schemas includes the
following:
class MyDataSourceReader extends
Hi,
I have implemented a custom partitioning algorithm to partition graphs in
GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
files in the output folder, with the number of files equal to the number of
partitions.
However, reading back the edges creates a number of partiti
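A sketch, assuming the edges were written with saveAsObjectFile; on read-back the
partition count follows the HDFS splits unless a minimum is requested explicitly
(the path and edge type are placeholders):
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Edge

val sc = SparkContext.getOrCreate()

// ask for at least 128 partitions instead of accepting the HDFS split count
val edges = sc.objectFile[Edge[Int]]("/data/graph/edges", minPartitions = 128)
println(edges.getNumPartitions)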
i
>
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
&
Hi,
I want to observe the log messages from DAGScheduler in Apache Spark. Which
log files do I need to check?
I have tried observing the driver logs and worker stderr logs but I can't
find any messages that are from that class.
I am using Spark 3.0.0 snapshot in standalone mode.
Thanks.
Regard
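One way to surface those messages (a sketch; Spark 3.0.0 ships log4j 1.x, and
DAGScheduler runs inside the driver, so the output lands in the driver log rather
than the worker stderr):
import org.apache.log4j.{Level, Logger}

// raise the level for just this class; equivalent to adding
// log4j.logger.org.apache.spark.scheduler.DAGScheduler=DEBUG to conf/log4j.properties
Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.DEBUG)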
I am writing a Spark data source v2 in Spark 2.3 and I want to support
writeStream. What should I do in order to do so?
My DefaultSource class:
class MyDefaultSource extends DataSourceV2 with ReadSupport with
WriteSupport with MicroBatchReadSupport { ..
Which interface is missing?
I try to read a stream using my custom data source (v2, using spark 2.3),
and it fails *in the second iteration* with the following exception while
reading pruned columns: Query [id=xxx, runId=yyy] terminated with exception:
assertion failed: Invalid batch: a#660,b#661L,c#662,d#663,,... 26 more
field
I have the same problem as described in the following question in
StackOverflow (but nobody has answered to it).
https://stackoverflow.com/questions/51103634/spark-streaming-schema-mismatch-using-microbatchreader-with-columns-pruning
Any idea of how to solve it (using Spark 2.3)?
Thanks,
Kineret
Hi, I am using the code below to submit a Spark 2.3 application on a Kubernetes
cluster, in Scala, using the Play framework.
I have also tried it as a simple Scala program without using the Play framework.
I am trying to spark-submit programmatically, as mentioned below:
https://spark.apache.org/docs/latest/running-on
10, 2018, 7:49:42 AM PDT, Daniel Hinojosa
wrote:
This looks more like a Spark issue than a Kafka one, judging by the
stack trace. Are you using Spark Structured Streaming with Kafka
integration, by chance?
On Mon, Apr 9, 2018 at 8:47 AM, M Singh
wrote:
> Hi Folks:
> Just wanted
ob to a k8s cluster by running spark-submit programmatically, or
> some example Scala application that is to run on the cluster?
>
> On Wed, Apr 4, 2018 at 4:45 AM, Kittu M wrote:
>
>> Hi,
>>
>> I’m looking for a Scala program to spark submit a Scala application
>>
Hi,
I'm looking for a Scala program to spark-submit a Scala application (Spark
2.3 job) on a k8s cluster.
Any help would be much appreciated. Thanks
Hi:
I am using Apache Spark Structured Streaming (2.2.1) to implement custom
sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (based on user id), which stores the state of the user
and emits events every minute until an expire event is received.
2. The next step i
Hi:
I am using Spark Structured Streaming 2.2.1 and am using flatMapGroupsWithState
and a groupBy count operator.
In the StreamExecution logs I see two entries for stateOperators:
"stateOperators" : [ {
"numRowsTotal" : 1617339,
"numRowsUpdated" : 9647
}, {
"numRowsTotal" : 1326355,
Hi:
I am working on spark structured streaming (2.2.1) with kafka and want 100
executors to be alive. I set spark.executor.instances to be 100. The process
starts running with 100 executors but after some time only a few remain, which
causes a backlog of events from Kafka.
I thought I saw a sett
Hi:
I am working on a realtime application using spark structured streaming (v
2.2.1). The application reads data from kafka and if there is a failure, I
would like to ignore the checkpoint. Is there any configuration to just read
from last kafka offset after a failure and ignore any offset che
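For reference, a sketch of the Kafka source options involved (broker and topic are
placeholders); note that offsets stored in an existing checkpoint take precedence
over startingOffsets, so ignoring them generally means pointing the restarted query
at a fresh checkpointLocation:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "latest")   // only honoured on a fresh start
  .load()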
Hi:
I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR, and for the last
few days, after running the application for 30-60 minutes, I get the exception
from the Kafka consumer included below.
The structured streaming application is processing 1 minute worth of data from
kafka topic. So I've tried
Hi Vijay:
I am using spark-shell because I am still prototyping the steps involved.
Regarding executors: I have 280 executors and the UI only shows a few straggler
tasks on each trigger. The UI does not show too much time spent on GC. I
suspect the delay is because of getting data from Kafka. The num
Hi:
I am working with spark structured streaming (2.2.1) reading data from Kafka
(0.11).
I need to aggregate data ingested every minute and I am using spark-shell at
the moment. The message ingestion rate is approx 500k/second. During
some trigger intervals (1 minute) especially when t
helpful to
answer some of them.
For example: inputRowsPerSecond = numRecords / inputTimeSec,
processedRowsPerSecond = numRecords / processingTimeSec. This explains the
difference between the two rowsPerSecond values.
On Feb 10, 2018, at 8:42 PM, M Singh wrote:
Hi:
I am working with spark 2.2.0 and am looking at
Just checking if anyone has any pointers for dynamically updating query state
in structured streaming.
Thanks
On Thursday, February 8, 2018 2:58 PM, M Singh
wrote:
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to
update the state
Hi:
I am working with spark 2.2.0 and am looking at the query status console
output.
My application reads from kafka - performs flatMapGroupsWithState and then
aggregates the elements for two group counts. The output is sent to the console
sink. I see the following output (with my questions in
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to
update the state periodically.
Here is the scenario:
1. I have a udf with a variable with a default value (e.g. 1). This value is
applied to a column (e.g. subtract the variable from the column value). 2.
s://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Mon, Feb 5, 2018 at 8:11 PM, M Singh wrote:
Just checking if anyone has more details on how watermark works in cases where
event time is earlier than processing time stamp.
On Friday, February 2, 2018
Just checking if anyone has more details on how watermark works in cases where
event time is earlier than processing time stamp.
On Friday, February 2, 2018 8:47 AM, M Singh wrote:
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is
Hi TD:
Just wondering if you have any insight for me or need more info.
Thanks
On Thursday, February 1, 2018 7:43 AM, M Singh
wrote:
Hi TD:
Here is the updated code with explain and the full stack trace.
Please let me know what could be the issue and what to look for in the explain
output
don't want to process it, you could do a filter based on its
EventTime field, but I guess you will have to compare it with the processing
time since there is no API for the user to access the watermark.
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote:
Hi:
I am trying to filter out re
tion.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
On Wednesday, January 31, 2018 3:46 PM, Tathagata Das
wrote:
Could you give the full stack trace of the exception?
Also, can you do `dataframe2.explain(true)` and show us the
Hi Folks:
I have to add a column to a structured streaming dataframe but when I do that
(using select or withColumn) I get an exception. I can add a column to a
non-streaming structured dataframe. I could not find any
documentation on how to do this in the following doc
[https://spar
Hi:
I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time.
Is the watermark API applicable to this scenario (i.e., filtering lagging
records), or is it only applicable with aggregation? I could not get a clear
understanding from the documen
8:36 PM, "M Singh" wrote:
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at some of the methods available in the SparkSession but these are
internal to the sql package
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at some of the methods available in the SparkSession but these are
internal to the sql package:
private[sql] def internalCreateDataFrame
ring-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Thu, Jan 4, 2018 at 10:49 PM, M Singh wrote:
Thanks Tathagata for your answer.
The reason I was asking
lated note, these APIs are subject to change. In fact in the upcoming release
2.3, we are adding a DataSource V2 API for
batch/microbatch-streaming/continuous-streaming sources and sinks.
On Wed, Jan 3, 2018 at 11:23 PM, M Singh wrote:
Hi:
The documentation for Sink.addBatch is as follows:
/
Hi:
The documentation for Sink.addBatch is as follows:
/**
 * Adds a batch of data to this sink. The data for a given `batchId` is
 * deterministic and if this method is called more than once with the same
 * batchId (which will happen in the case of failures), then `data` should
 * only be ad
Hi Jeroen:
I am not sure if I missed it, but can you let us know what your input
source and output sink are?
In some cases, I found that saving to S3 was a problem. In this case I started
saving the output to the EMR HDFS and later copied to S3 using s3-dist-cp which
solved our issue.
Mans
Hi:
I am working with DataSets so that I can use mapGroupsWithState for business
logic and then use dropDuplicates over a set of fields. I would like to use
withWatermark so that I can restrict how much state is stored.
From the API it looks like withWatermark takes a string - timesta
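For readers, a minimal sketch of that signature: withWatermark takes the event-time
column name and a delay threshold, both as strings (the column and source below are
placeholders):
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Event(userId: String, eventTime: Timestamp)

val deduped = spark.readStream
  .format("rate").load()                                    // placeholder source
  .selectExpr("cast(value as string) as userId", "timestamp as eventTime")
  .as[Event]
  .withWatermark("eventTime", "10 minutes")                 // column name, delay
  .dropDuplicates("userId", "eventTime")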
n external system (like kafka)
Eyal
On Tue, Dec 26, 2017 at 10:37 PM, M Singh wrote:
Thanks Diogo. My question is how to gracefully call the stop method while the
streaming application is running in a cluster.
On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira
wrote:
Hi M Singh
Thanks Diogo. My question is how to gracefully call the stop method while the
streaming application is running in a cluster.
On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira
wrote:
Hi M Singh! Here I'm using query.stop()
On 25 Dec 2017 19:19, "M Singh"
Hi: I would like to use a window function on a DataSet stream (Spark 2.2.0). The
window function requires a Column as argument and can be used with DataFrames by
passing the column. Is there an analogous window function, or pointers to how a
window function can be used with DataSets?
Thanks
Hi:
I am using Spark Structured Streaming (v 2.2.0) to read data from files. I have
configured a checkpoint location. On stopping and restarting the application, it
looks like it is reading the previously ingested files. Is that expected
behavior?
Is there any way to prevent reading files that
Hi: Are there any patterns/recommendations for gracefully stopping a structured
streaming application? Thanks
Is it possible to concisely create a dataset from a dataframe with missing
columns? Specifically, suppose I create a dataframe with:
val df: DataFrame = Seq(("v1"),("v2")).toDF("f1")
Then, I have a case class for a dataset defined as:
case class CC(f1: String, f2: Option[String] = None)
I’d lik
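One way to do this (a sketch): add the missing optional column explicitly before
calling .as[CC]; the lit(null) cast keeps the schema compatible with Option[String].
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class CC(f1: String, f2: Option[String] = None)

val df = Seq("v1", "v2").toDF("f1")
val ds = df.withColumn("f2", lit(null).cast(StringType)).as[CC]   // f2 becomes None
ds.show()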
I am trying to understand if I should be concerned about this warning:
"WARN Utils:66 - Truncated the string representation of a plan since it
was too large. This behavior can be adjusted by setting
'spark.debug.maxToStringFields' in SparkEnv.conf"
It occurs while writing a data frame to parquet
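For reference, a sketch of raising the threshold; older releases (matching the
warning text) read spark.debug.maxToStringFields from the SparkConf, while newer
ones expose it as the SQL conf spark.sql.debug.maxToStringFields (version-dependent,
so treat the exact names as assumptions to verify against your release):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// older releases: plain SparkConf entry, as the warning suggests
val conf = new SparkConf().set("spark.debug.maxToStringFields", "200")
val spark = SparkSession.builder().config(conf).getOrCreate()

// newer releases: SQL conf
spark.conf.set("spark.sql.debug.maxToStringFields", "200")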
explore Apache Storm and
Apache Flink.
I suggest doing a POC in each of them and then deciding on
what works best for you.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
-Original Message-
From: Gaurav1809 [mailto:gauravhpan...@gmail.com
jars are of
version 1.2.1
I tried building spark from source and as spark uses hive 1.2.1
by default, I get the same set of jars.
How can we make Spark 2.1.0 work with Hive 2.1.1?
Thanks in advance!
Best regards / Mit freundlichen Grüßen / Sincères salutations
M
Hi all,
I am trying to trap the UI kill event of a Spark application from the driver.
Somehow the exception thrown is not propagated to the driver main
program. See for example using spark-shell below.
Is there a way to get hold of this event and shutdown the driver program?
Regards,
Noorul
spark@spa
Sending a plain text mail to test whether my mail appears in the list.
--
A better forum would be
https://groups.google.com/forum/#!forum/spark-jobserver
or
https://gitter.im/spark-jobserver/spark-jobserver
Regards,
Noorul
Madabhattula Rajesh Kumar writes:
> Hi,
>
> I am getting the below exception when I start the job-server:
>
> ./server_start.sh: line 41: kill:
> When initial jobs have not accepted any resources, then what all can be
> wrong? Going through Stack Overflow and various blogs does not help. Maybe
> we need better logging for this? Adding dev
>
Did you take a look at the spark UI to see your resource availability?
Thanks and Regards
Noorul
Hi all,
I have a streaming application with a batch interval of 10 seconds.
val sparkConf = new SparkConf().setAppName("RMQWordCount")
.set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))
I also use reduceByKeyAndWindow() API f
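For context, a sketch of the windowed reduce mentioned above; with a 10-second
batch interval the window and slide durations must be multiples of 10 seconds
(the source and durations here are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RMQWordCount")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)        // placeholder source
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
counts.print()

ssc.start()
ssc.awaitTermination()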
Hi,
I am new to Spark. I would like to learn Spark.
I think I should learn version 2.0.2.
Or should I still go for version 1.6.x and then come to version 2.0.2?
Please advise.
Thanks in advance.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M
Reza zade writes:
> Hi
>
> I have set up a cloudera cluster and work with spark. I want to install
> spark-jobserver on it. What should I do?
Maybe you should send this to spark-jobserver mailing list.
https://github.com/spark-jobserver/spark-jobserver#contact
Thanks and Regards
Noorul
--
Hi,
No, currently you can't change the setting.
// maropu
2016/08/27 11:40, message from Vadim Semenov:
> Hi spark users,
>
> I wonder if it's possible to change executor settings on-the-fly.
> I have the following use-case: I have a lot of non-splittable skewed files in
> a custom format that
kalkimann writes:
> Hi,
> spark 1.6.2 is the latest brew package I can find.
> The spark 2.0.x brew package is missing, as best I know.
>
> Is there a schedule when spark-2.0 will be available for "brew install"?
>
Did you do a 'brew update' before searching? I installed spark-2.0 this
week.
Regards
I was using the 1.1 driver. I upgraded that library to 2.1 and it resolved my
problem.
--
I'm attempting to access a dataframe from JDBC:
However, this temp table is not accessible from beeline when connected to
this instance of HiveServer2.
--
How are you calling registerTempTable from hiveContext? It appears to be a
private method.
--
I am running HiveServer2 as well and when I connect with beeline I get the
following:
org.apache.spark.sql.internal.SessionState cannot be cast to
org.apache.spark.sql.hive.HiveSessionState
Do you know how to resolve this?
--
Hi all,
I was trying to test the --supervise flag of spark-submit.
The documentation [1] says that the flag helps in restarting your
application automatically if it exited with a non-zero exit code.
I am looking for some clarification on that documentation. In this
context, does application mean th
Spark version: 1.6.1
Cluster Manager: Standalone
I am experimenting with cluster mode deployment along with supervise for
high availability of streaming applications.
1. Submit a streaming job in cluster mode with supervise
2. Say that driver is scheduled on worker1. The app started
successfu
Hi,
The aws CLI already has your access key ID and secret access
key from when you initially configured it.
Is your S3 bucket without any access restrictions?
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Ashic Mahtab
ache: 25600K
NUMA node0 CPU(s): 0
Thanks
From: Mich Talebzadeh
Sent: Thursday, June 2, 2016 5:00 PM
To: Andres M Jimenez T
Cc: user@spark.apache.org
Subject: Re: how to increase threads per executor
What are passing as parameters to Spark-
Hi,
I am working with Spark 1.6.1, using Kafka direct connect for streaming data,
the Spark scheduler, and 3 slaves.
The Kafka topic is partitioned with a value of 10.
The problem I have is that there is only one thread per executor running my
function (logic implementation).
Can anybody tell me
Hi
Can you look at Apache Drill as a SQL engine on Hive?
Lohith
Sent from my Sony Xperia™ smartphone
Tapan Upadhyay wrote
Thank you everyone for the guidance.
Jorn, our motivation is to move the bulk of ad-hoc queries to Hadoop so that we
have enough bandwidth on our DB for important batch/queries.
either).
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
-Original Message-
From: kramer2...@126.com [mailto:kramer2...@126.com]
Sent: Monday, April 11, 2016 16.18
To: user@spark.apache.org
Subject: Why Spark having OutOfMemory Exception?
I use spark to do
have one more question: if I want to launch a spark application in a
>> production environment, is there any other way for multiple users to
>> submit their jobs without having the Hadoop configuration?
>>
>> Regards
>> Prateek
>>
>>
>> On Fri, Ma
If all SQL results have the same set of columns, you could UNION all the dataframes:
create an empty df and union all, then reassign the new df to the original df
before the next union all (see the sketch below).
Not sure if it is a good idea, but it works.
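A sketch of that pattern (the schema and queries below are placeholders):
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

val schema = StructType(Seq(StructField("id", StringType), StructField("value", StringType)))
var acc: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// every result must expose the same set of columns for union to line up
val queries = Seq("SELECT id, value FROM t1", "SELECT id, value FROM t2")
for (q <- queries) {
  acc = acc.union(spark.sql(q))
}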
Lohith
Sent from my Sony Xperia™ smartphone
Divya Gehlot wrote
Hi,
Hi,
If you can also format the condition file as a csv file similar
to the main file, then you can join the two dataframes and select only required
columns.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Divya Gehlot [mailto:divya.htco
Hi Arun,
You can do df.agg(max(,,), min(..)).
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
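Filled in with a concrete (placeholder) column, the suggestion looks like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.5, 7.2).toDF("x")
df.agg(max($"x"), min($"x")).show()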
From: Arunkumar Pillai [mailto:arunkumar1...@gmail.com]
Sent: Thursday, February 04, 2016 14.53
To: user@spark.apache.org
Subject: Need to user univariate
Hi all,
I am trying to copy data from one Cassandra cluster to another using
Spark + the Cassandra connector. At the source I have around 200 GB of data,
but while running, the Spark stage shows output as 406 GB and the data is
still getting copied. I wonder why it is showing this high a number.
Envir