Technical Guidance: Dynamic Resource Allocation + External Shuffle Storage

2025-06-26 Thread Andrew M.
I'm having trouble getting dynamic resource allocation to properly terminate idle executors when using FSx Lustre for shuffle persistence on EMR 7.8 (Spark 3.5.4) on EKS. I'm trying this strategy to battle the cost of very severe data skew (I don't really care if a couple of nodes run for hours while

Aligning pom.xml in Bundled PySpark JARs with Effective Runtime Dependencies for SCA Tools

2025-05-22 Thread Guzarevich, M. (Mikalai)
Dear Spark Development Community, Our team is using PySpark (versions 3.5.x, currently testing 3.5.5) and we integrate Static Application Security Testing (SAST/SCA) using tools like Checkmarx into our CI/CD pipelines for our Python projects. We've observed that a significant number of Critical

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
D, psf.concat(psf.lit(PREFIX_ORG), psf.sha2(df.descr, 256))) return df Hope this email finds someone running into a similar issue in the future. Kind regards, Jelle From: Mich Talebzadeh Sent: Wednesday, May 1, 2024 11:56 AM To: Stephen Coy Cc: Nijland,
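
The fix quoted above derives a stable vertex ID by hashing a descriptive field and prefixing it per entity type. A minimal Scala sketch of the same idea (df, the descr column, and the "org_" prefix are stand-ins for the thread's PySpark code):

    import org.apache.spark.sql.functions.{col, concat, lit, sha2}

    // Deterministic ID: hash the description and prefix it per entity type,
    // so regenerated IDs stay stable across runs.
    val withId = df.withColumn("id", concat(lit("org_"), sha2(col("descr"), 256)))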

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
.bindAddress", "localhost" ).set("spark.driver.host", "127.0.0.1" # ).set("spark.driver.port", "0" ).set("spark.ui.port", "4041" ).set("spark.executor.instances", "1" ).set("spark.executor.cores", "50" )

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
___ From: Mich Talebzadeh Sent: Wednesday, April 24, 2024 4:40 PM To: Nijland, J.G.W. (Jelle, Student M-CS) Cc: user@spark.apache.org Subject: Re: [spark-graphframes]: Generating incorrect edges OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, seque

[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes Hello, I am running pyspark in a podman container and I have issues with incorrect edges when I build my graph. I start with loading a source dataframe from a parquet directory on my server. The source dataframe has the following columns: +-+---+-

help needed with SPARK-45598 and SPARK-45769

2023-11-09 Thread Maksym M
Greetings, tl;dr there must have been a regression in spark *connect*'s ability to retrieve data, more details in linked issues https://issues.apache.org/jira/browse/SPARK-45598 https://issues.apache.org/jira/browse/SPARK-45769 we have projects that depend on spark connect 3.5 and we'd apprec

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
Hey vaquar, The link doesn't explain the crucial detail we're interested in - does the executor re-use the data that exists on a node from a previous executor and, if not, how can we configure it to do so? We are not running on Kubernetes, so EKS/Kubernetes-specific advice isn't very relevant. We are ru

GPU Support

2023-01-05 Thread K B M Kaala Subhikshan
Does Spark support the Gigabyte GeForce RTX 3080 GPU for running machine learning?

Re: ClassCastException while reading parquet data via Hive metastore

2022-11-07 Thread Evy M
you suggested right? > But int to long / bigint seems to be a reasonable evolution (correct me if > I'm wrong). Is it possible to reopen the jira i mentioned earlier? Any > reason for that getting closed? > > > Regards, > Naresh > > > On Mon, Nov 7, 2022, 16:55 Evy

Re: ClassCastException while reading parquet data via Hive metastore

2022-11-07 Thread Evy M
Hi Naresh, Have you tried any of the following in order to resolve your issue: 1. Reading the Parquet files (directly, not via Hive [i.e, spark.read.parquet()]), casting to LongType and creating the hive table based on this dataframe? Hive's BigInt and Spark's Long should have the sam
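
A minimal Scala sketch of suggestion 1, assuming a spark-shell session and hypothetical path, column, and table names:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.LongType

    // Read the Parquet files directly, bypassing the Hive metastore schema,
    // and widen the int column to long (Hive BIGINT) before recreating the table.
    val df = spark.read.parquet("/data/events")
      .withColumn("id", col("id").cast(LongType))
    df.write.mode("overwrite").saveAsTable("db.events_bigint")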

Apache Spark - How to convert DataFrame json string to structured element using schema_of_json

2022-09-05 Thread M Singh
Hi: In apache spark we can read json using the following: spark.read.json("path"). There is support to convert json string in a dataframe into structured element using (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.
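
For reference, a minimal sketch of combining schema_of_json and from_json (Spark 2.4+, Scala; the payload column is hypothetical). Note that schema_of_json needs a foldable literal, not an arbitrary column, so the schema here is derived from one representative record:

    import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}
    import spark.implicits._

    val df = Seq("""{"a": 1, "b": "x"}""").toDF("payload")

    // Sample one record, infer its schema, then parse every row with it.
    val sample = df.head().getString(0)
    val parsed = df.withColumn("json",
      from_json(col("payload"), schema_of_json(lit(sample))))
    parsed.select("json.a", "json.b").show()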

Memory leak while caching in foreachBatch block

2022-08-10 Thread kineret M
Hi, We have a structured streaming application, and we face a memory leak while caching in the foreachBatch block. We do unpersist every iteration, and we also verify via "spark.sparkContext.getPersistentRDDs" that we don't have unnecessary cached data. We also noted in the profiler that many sp
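
For context, a minimal sketch of the persist/unpersist-per-batch pattern being described (Spark 2.4+; the source, output paths, and checkpoint location are hypothetical):

    val query = spark.readStream.format("rate").load()
      .writeStream
      .option("checkpointLocation", "/tmp/chk")
      .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
        batch.persist()                    // reuse the batch across several sinks
        batch.write.mode("append").parquet("/tmp/out1")
        batch.write.mode("append").parquet("/tmp/out2")
        batch.unpersist()                  // release the cache before the next trigger
      }
      .start()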

Partial data with ADLS Gen2

2022-07-24 Thread kineret M
I have a Spark batch application writing to ADLS Gen2 (hierarchical). When designing the application I was sure Spark would perform a global commit once the job is committed, but what it really does is commit on each task, meaning *once a task completes writing, it moves from temp to target storage*. So

java.lang.NoSuchMethodError: org.apache.hadoop.hive.common.FileUtils.mkdir --> Spark to Hive

2022-05-26 Thread Prasanth M Sasidharan
Connection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:748) Any help would be much appreciated -- Live every day as if it were your last, because one of these days, it will be. Regards, Prasanth M Sasidharan

Stopping streaming after the write commit and before the read commit?

2022-05-18 Thread kineret M
Hi, What is the expected behavior if the streaming is stopped after the write commit and before the read commit? Should I expect data duplication? Thanks.

Why planInputPartitions is called multiple times in a micro-batch?

2021-07-12 Thread kineret M
Hi, I'm developing a new Spark connector using the data source v2 API (Spark 3.1.1). I noticed that the planInputPartitions method (in MicroBatchStream) is called twice every micro-batch. What is the motivation/reason? Thanks, Kineret

Unsubscribe

2021-07-06 Thread sids m
Unsubscribe

Re: How to control count / size of output files for

2021-03-10 Thread m li
hi, Thank you. The suggestion is very good; there is no need to use "repartitionByRange". However, there is a little doubt: if the output file is required to be globally ordered, "repartition" will disrupt the order of the data, and the result of using "coalesce"

Re: How to control count / size of output files for

2021-03-08 Thread m li
repartitionByRange(5, column("v")).sortWithinPartitions("v").write.parquet(outputPath) Best Regards, m li Ivan Petrov wrote > Ah... makes sense, thank you. I tried sortWithinPartitions before and > replaced it with sort. It was a mistake. > > Thu, 25 Feb 2021 at 15:25, Pietro Gentil
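
Reassembled, the write under discussion looks like this (Scala; column name and path as in the snippet):

    import org.apache.spark.sql.functions.col

    // repartitionByRange makes the partitions themselves ordered (partition k
    // holds smaller values than partition k+1); sortWithinPartitions then
    // orders the rows inside each file, giving a globally ordered result.
    df.repartitionByRange(5, col("v"))
      .sortWithinPartitions("v")
      .write.parquet(outputPath)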

How Spark Framework works as a Compiler

2021-01-03 Thread Renganathan M
Hi, I have read in many blogs that Spark framework is a compiler itself. It generates the DAG; optimizes it and executes it. The DAG is generated from the user submitted code ( be it in Java, Scala, Python or R). So when we submit a JAR file (it has the list of compiled classes), in the first s

Spark Submit through yarn is failing with Default queue.

2020-03-10 Thread SB M
Hi All, I'm trying to submit my application using spark-submit in YARN mode, but it's failing because of an unknown queue "default". We specified the queue name in spark-defaults.conf as spark.yarn.queue SecondaryQueue. It fails for one application but not for another; I don't know the reason. p

How to implement "getPreferredLocations" in Data source v2?

2020-01-18 Thread kineret M
Hi, I would like to support data locality in Spark data source v2. How can I provide Spark the ability to read and process data on the same node? I didn't find any interface that supports 'getPreferredLocations' (or equivalent). Thanks!

Re: Hive External Table Partiton Data Type.

2019-12-15 Thread SB M
spark version 2.1.0 Regards, Sbm On Mon, 16 Dec, 2019, 10:04 HARSH TAKKAR, wrote: > Please share the spark version you are using . > > On Fri, 13 Dec, 2019, 4:02 PM SB M, wrote: > >> Hi All, >>Am trying to create a dynamic partition with external table on hive >

Hive External Table Partiton Data Type.

2019-12-13 Thread SB M
Hi All, I'm trying to create a dynamic partition with an external table on the Hive metastore using Spark SQL. When I create a partition column with data type bigint, partitioning is not working even though I tried repair table. Data is not shown when I run the sample query select * from table, but I

Re: How to get logging right for Spark applications in the YARN ecosystem

2019-08-02 Thread Girish bhat m
f the places I have seen logging done by log4j properties, >> but no where people I have seen any solution where logs are being >> compressed. >> >> Is there anyway I can compress the logs, So that further those logs can >> be shipped to S3. >> >> -- >> Raman Gugnani >> > -- Girish bhat m

Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
;33554432")` to tune the partition size when reading from HDFS. > > Thanks, > Manu Zhang > > On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote: > >> Hi, >> >> I have implemented a custom partitioning algorithm to partition graphs in >> GraphX. Saving the

'No plan for EventTimeWatermark' error while using structured streaming with column pruning (spark 2.3.1)

2019-04-24 Thread kineret M
Hi All, I get a 'No plan for EventTimeWatermark' error while doing a query with column pruning, using structured streaming with a custom data source that implements Spark data source v2. My data source implementation that handles the schemas includes the following: class MyDataSourceReader extends

[GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread M Bilal
Hi, I have implemented a custom partitioning algorithm to partition graphs in GraphX. Saving the partitioned graph (the edges) to HDFS creates separate files in the output folder, with the number of files equal to the number of partitions. However, reading back the edges creates a number of partiti

Re: Observing DAGScheduler Log Messages

2019-04-07 Thread M Bilal
i > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com/jaceklaskowski > &

Observing DAGScheduler Log Messages

2019-04-07 Thread M Bilal
Hi, I want to observe the log messages from DAGScheduler in Apache Spark. Which log files do I need to check? I have tried observing the driver logs and worker stderr logs but I can't find any messages that are from that class. I am using a Spark 3.0.0 snapshot in standalone mode. Thanks. Regard

How to support writeStream in data source v2 (spark 2.3.1)?

2019-03-24 Thread kineret M
I'm writing a Spark data source v2 in Spark 2.3 and I want to support writeStream. What should I do in order to do so? My DefaultSource class: class MyDefaultSource extends DataSourceV2 with ReadSupport with WriteSupport with MicroBatchReadSupport { .. Which interface is missing?

Spark streaming error - Query terminated with exception: assertion failed: Invalid batch: a#660,b#661L,c#662,d#663,,… 26 more fields != b#1291L

2019-03-21 Thread kineret M
I try to read a stream using my custom data source (v2, using Spark 2.3), and it fails *in the second iteration* with the following exception while reading pruned columns: Query [id=xxx, runId=yyy] terminated with exception: assertion failed: Invalid batch: a#660,b#661L,c#662,d#663,,... 26 more field

Spark Streaming: schema mismatch using MicroBatchReader with column pruning

2019-03-16 Thread kineret M
I have the same problem as described in the following question in StackOverflow (but nobody has answered to it). https://stackoverflow.com/questions/51103634/spark-streaming-schema-mismatch-using-microbatchreader-with-columns-pruning Any idea of how to solve it (using Spark 2.3)? Thanks, Kineret

Spark launcher listener not getting invoked k8s Spark 2.3

2018-04-30 Thread purna m
Hi, I'm using the below code to submit a Spark 2.3 application on a Kubernetes cluster, in Scala using the Play framework. I have also tried it as a simple Scala program without the Play framework. I'm trying to spark-submit programmatically, as mentioned below: https://spark.apache.org/docs/latest/running-on

Re: Apache Kafka / Spark Integration - Exception - The server disconnected before a response was received.

2018-04-10 Thread M Singh
10, 2018, 7:49:42 AM PDT, Daniel Hinojosa wrote: This looks more like a spark issue than it does a Kafka judging by the stack trace, are you using Spark structured streaming with Kafka integration by chance? On Mon, Apr 9, 2018 at 8:47 AM, M Singh wrote: > Hi Folks: > Just wanted

Re: Scala program to spark-submit on k8 cluster

2018-04-06 Thread Kittu M
ob to a k8s cluster by running spark-submit programmatically, or > some example Scala application that is to run on the cluster? > > On Wed, Apr 4, 2018 at 4:45 AM, Kittu M wrote: > >> Hi, >> >> I’m looking for a Scala program to spark submit a Scala application >>

Scala program to spark-submit on k8 cluster

2018-04-04 Thread Kittu M
Hi, I’m looking for a Scala program to spark-submit a Scala application (a Spark 2.3 job) on a k8s cluster. Any help would be much appreciated. Thanks

Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi: I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events. The processing is in two steps: 1. flatMapGroupsWithState (keyed by user id), which stores the state of the user and emits events every minute until an expire event is received. 2. The next step i
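
A minimal sketch of step 1 (Scala, Spark 2.2-era API); the Event/SessionState/SessionUpdate shapes and the "expire" convention are hypothetical stand-ins for the actual job:

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
    import spark.implicits._

    case class Event(userId: String, action: String)
    case class SessionState(count: Long)
    case class SessionUpdate(userId: String, count: Long, expired: Boolean)

    val updates = events.groupByKey(_.userId)      // events: Dataset[Event]
      .flatMapGroupsWithState[SessionState, SessionUpdate](
          OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
        (userId, batch, state) =>
          val evs = batch.toSeq
          val count = state.getOption.map(_.count).getOrElse(0L) + evs.size
          if (evs.exists(_.action == "expire")) {
            state.remove()                         // emit a final record, drop state
            Iterator(SessionUpdate(userId, count, expired = true))
          } else {
            state.update(SessionState(count))
            state.setTimeoutDuration("1 minute")   // re-fire roughly every minute
            Iterator(SessionUpdate(userId, count, expired = false))
          }
      }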

Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi: I am using Spark Structured Streaming 2.2.1 with flatMapGroupsWithState and a groupBy count operator. In the StreamExecution logs I see two entries for stateOperators: "stateOperators" : [ {     "numRowsTotal" : 1617339,     "numRowsUpdated" : 9647   }, {     "numRowsTotal" : 1326355,

Apache Spark Structured Streaming - How to keep executor alive.

2018-03-23 Thread M Singh
Hi: I am working on Spark Structured Streaming (2.2.1) with Kafka and want 100 executors to stay alive. I set spark.executor.instances to 100. The process starts running with 100 executors, but after some time only a few remain, which causes a backlog of events from Kafka. I thought I saw a sett

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a realtime application using Spark Structured Streaming (v 2.2.1). The application reads data from Kafka and, if there is a failure, I would like to ignore the checkpoint. Is there any configuration to just read from the last Kafka offset after a failure and ignore any offset che
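
There is no switch that makes an existing checkpoint skip its saved offsets, but a common workaround is to point the restarted query at a fresh checkpoint directory, since startingOffsets is only honored when the checkpoint is empty. A minimal sketch (broker, topic, and paths are hypothetical):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")   // honored only on a fresh checkpoint
      .option("failOnDataLoss", "false")     // tolerate offsets aged out of Kafka
      .load()

    stream.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/chk/run-2")  // new dir => old offsets ignored
      .start()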

Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
Hi: I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the last few days, after running the application for 30-60 minutes, I get the exception from the Kafka consumer included below. The structured streaming application is processing 1 minute worth of data from a Kafka topic. So I've tried

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay: I am using spark-shell because I am still prototyping the steps involved. Regarding executors - I have 280 executors and the UI only shows a few straggler tasks on each trigger. The UI does not show too much time spent on GC. I suspect the delay is because of getting data from Kafka. The num

Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi: I am working with Spark Structured Streaming (2.2.1) reading data from Kafka (0.11). I need to aggregate data ingested every minute and I am using spark-shell at the moment. The message ingestion rate is approx 500k/second. During some trigger intervals (1 minute), especially when t

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec. This explains the difference between the two rowsPerSec values. On Feb 10, 2018, at 8:42 PM, M Singh wrote: Hi: I am working with spark 2.2.0 and am looking at

Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state in structured streaming. Thanks On Thursday, February 8, 2018 2:58 PM, M Singh wrote: Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update the state

Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
Hi: I am working with Spark 2.2.0 and am looking at the query status console output. My application reads from Kafka, performs flatMapGroupsWithState and then aggregates the elements for two group counts. The output is sent to the console sink. I see the following output (with my questions in

Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts: I am trying to use a stateful udf with Spark Structured Streaming that needs to update its state periodically. Here is the scenario: 1. I have a udf with a variable with a default value (e.g., 1). This value is applied to a column (e.g., subtract the variable from the column value). 2.

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
s://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Feb 5, 2018 at 8:11 PM, M Singh wrote: Just checking if anyone has more details on how watermark works in cases where event time is earlier than processing time stamp. On Friday, February 2, 2018

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how watermark works in cases where event time is earlier than processing time stamp. On Friday, February 2, 2018 8:47 AM, M Singh wrote: Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-05 Thread M Singh
Hi TD: Just wondering if you have any insight for me or need more info. Thanks On Thursday, February 1, 2018 7:43 AM, M Singh wrote: Hi TD: Here is the udpated code with explain and full stack trace. Please let me know what could be the issue and what to look for in the explain output

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
don't want to process it, you could do a filter based on its EventTime field, but I guess you will have to compare it with the processing time since there is no API to access Watermark by the user.  -Vishnu On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote: Hi: I am trying to filter out re

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-01 Thread M Singh
tion.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 1 more On Wednesday, January 31, 2018 3:46 PM, Tathagata Das wrote: Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true)` and show us the

Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread M Singh
Hi Folks: I have to add a column to a structured streaming dataframe but when I do that (using select or withColumn) I get an exception. I can add a column to a non-streaming structured dataframe. I could not find any documentation on how to do this in the following doc [https://spar

Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi: I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time. Is the watermark api applicable to this scenario (i.e., filtering lagging records) or is it only applicable with aggregation? I could not get a clear understanding from the documen
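
As the replies in this thread note, the watermark only affects stateful operators (aggregation, dropDuplicates, etc.) and there is no user API to read the current watermark, so a plain event-time filter against processing time is the usual workaround. A minimal sketch (column name and threshold are hypothetical):

    import org.apache.spark.sql.functions.{col, current_timestamp, expr}

    // Keep only records whose event time is within 10 minutes of "now".
    val fresh = stream.filter(
      col("eventTime") >= current_timestamp() - expr("INTERVAL 10 MINUTES"))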

Re: Apache Spark - Custom structured streaming data source

2018-01-26 Thread M Singh
8:36 PM, "M Singh" wrote: Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at the some methods available in the SparkSession but these are internal to the sql package

Apache Spark - Custom structured streaming data source

2018-01-25 Thread M Singh
Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at the some methods available in the SparkSession but these are internal to the sql package:   private[sql] def internalCreateDataFrame

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
ring-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Thu, Jan 4, 2018 at 10:49 PM, M Singh wrote: Thanks Tathagata for your answer. The reason I was asking

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
lated note, these APIs are subject to change. In fact in the upcoming release 2.3, we are adding a DataSource V2 API for batch/microbatch-streaming/continuous-streaming sources and sinks. On Wed, Jan 3, 2018 at 11:23 PM, M Singh wrote: Hi: The documentation for Sink.addBatch is as follows:   /

Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread M Singh
Hi: The documentation for Sink.addBatch is as follows:   /**   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if   * this method is called more than once with the same batchId (which will happen in the case of   * failures), then `data` should only be ad

Re: Spark on EMR suddenly stalling

2018-01-01 Thread M Singh
Hi Jeroen: I am not sure if I missed it - but can you let us know what your input source and output sink are? In some cases, I found that saving to S3 was a problem. In this case I started saving the output to the EMR HDFS and later copied it to S3 using s3-dist-cp, which solved our issue. Mans

Apache Spark - Using withWatermark for DataSets

2017-12-30 Thread M Singh
Hi: I am working with DataSets so that I can use mapGroupsWithState for business logic and then use dropDuplicates over a set of fields. I would like to use withWatermark so that I can restrict how much state is stored. From the API it looks like withWatermark takes a string - timesta
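
withWatermark takes the event-time column name (a string) plus a delay string, and it returns a Dataset of the same element type, so it composes directly with mapGroupsWithState and dropDuplicates. A minimal sketch with a hypothetical case class:

    import spark.implicits._

    case class Click(userId: String, ts: java.sql.Timestamp)

    val deduped = clicks                     // clicks: Dataset[Click]
      .withWatermark("ts", "10 minutes")     // bounds the dropDuplicates state
      .dropDuplicates("userId", "ts")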

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
n external system (like kafka) Eyal On Tue, Dec 26, 2017 at 10:37 PM, M Singh wrote: Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-26 Thread M Singh
Thanks Diogo. My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh! Here I'm using query.stop() On Dec 25, 2017 at 19:19, "M Singh"

Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread M Singh
Hi: I would like to use a window function on a DataSet stream (Spark 2.2.0). The window function requires a Column as argument and can be used with DataFrames by passing the column. Is there any analogous window function, or pointers to how a window function can be used with DataSets? Thanks
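
As far as I know there is no typed overload, but since window() is an ordinary Column function it can still be applied to a Dataset through the untyped groupBy (at the cost of getting a DataFrame back). A minimal sketch, reusing a hypothetical Dataset[Click] with a timestamp column ts:

    import org.apache.spark.sql.functions.{col, window}

    // Tumbling 5-minute windows keyed by user.
    val counts = clicks
      .groupBy(window(col("ts"), "5 minutes"), col("userId"))
      .count()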

Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread M Singh
Hi: I am using spark structured streaming (v 2.2.0) to read data from files. I have configured a checkpoint location. On stopping and restarting the application, it looks like it is reading the previously ingested files. Is that expected behavior? Is there any way to prevent reading files that

Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread M Singh
Hi: Are there any patterns/recommendations for gracefully stopping a structured streaming application? Thanks

Create dataset from dataframe with missing columns

2017-06-14 Thread Tokayer, Jason M.
Is it possible to concisely create a dataset from a dataframe with missing columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d lik
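
A minimal sketch of one workaround: as[CC] resolves fields by column name and does not consult the case class default, so the missing column can be added as a typed null before converting:

    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    case class CC(f1: String, f2: Option[String] = None)

    val df = Seq("v1", "v2").toDF("f1")
    val ds = df
      .withColumn("f2", lit(null).cast("string"))  // typed null satisfies the encoder
      .as[CC]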

Deciphering spark warning "Truncated the string representation of a plan since it was too large."

2017-06-12 Thread Henry M
I am trying to understand if I should be concerned about this warning: "WARN Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf" It occurs while writing a data frame to parquet

RE: IOT in Spark

2017-05-18 Thread Lohith Samaga M
explore Apache Storm and Apache Flink. I suggest it is better to do a POC in each of them and then decide on what works best for you. Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga -Original Message- From: Gaurav1809 [mailto:gauravhpan...@gmail.com

Spark 2.1.0 with Hive 2.1.1?

2017-05-08 Thread Lohith Samaga M
jars are of version 1.2.1 I tried building spark from source and as spark uses hive 1.2.1 by default, I get the same set of jars. How can we make Spark 2.1.0 work with Hive 2.1.1? Thanks in advance! Best regards / Mit freundlichen Grüßen / Sincères salutations M

Spark 2.1.0 and Hive 2.1.1

2017-05-03 Thread Lohith Samaga M
jars are of version 1.2.1 I tried building spark from source and as spark uses hive 1.2.1 by default, I get the same set of jars. How can we make Spark 2.1.0 work with Hive 2.1.1? Thanks in advance! Best regards / Mit freundlichen Grüßen / Sincères salutations M

Application kill from UI do not propagate exception

2017-03-27 Thread Noorul Islam K M
Hi all, I am trying to trap UI kill event of a spark application from driver. Some how the exception thrown is not propagated to the driver main program. See for example using spark-shell below. Is there a way to get hold of this event and shutdown the driver program? Regards, Noorul spark@spa

This is a test mail, please ignore!

2017-03-27 Thread Noorul Islam K M
Sending plain text mail to test whether my mail appears in the list.

Re: spark jobserver

2017-03-05 Thread Noorul Islam K M
A better forum would be https://groups.google.com/forum/#!forum/spark-jobserver or https://gitter.im/spark-jobserver/spark-jobserver Regards, Noorul Madabhattula Rajesh Kumar writes: > Hi, > > I am getting below an exception when I start the job-server > > ./server_start.sh: line 41: kill:

Re: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-03-03 Thread Noorul Islam K M
> When Initial jobs have not accepted any resources then what all can be > wrong? Going through stackoverflow and various blogs does not help. Maybe > need better logging for this? Adding dev > Did you take a look at the spark UI to see your resource availability? Thanks and Regards Noorul

Message loss in streaming even with graceful shutdown

2017-02-20 Thread Noorul Islam K M
Hi all, I have a streaming application with batch interval 10 seconds. val sparkConf = new SparkConf().setAppName("RMQWordCount") .set("spark.streaming.stopGracefullyOnShutdown", "true") val ssc = new StreamingContext(sparkConf, Seconds(10)) I also use reduceByKeyAndWindow() API f
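
An explicit variant of the graceful stop (rather than relying only on the shutdown hook), polling a hypothetical marker file between checks:

    ssc.start()
    var stopped = false
    while (!stopped) {
      // awaitTerminationOrTimeout returns true once the context has terminated.
      stopped = ssc.awaitTerminationOrTimeout(10000L)
      if (!stopped && new java.io.File("/tmp/stop-streaming").exists()) {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }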

Spark 2 or Spark 1.6.x?

2016-12-11 Thread Lohith Samaga M
Hi, I am new to Spark. I would like to learn Spark. I think I should learn version 2.0.2. Or should I still go for version 1.6.x and then come to version 2.0.2? Please advise. Thanks in advance. Best regards / Mit freundlichen Grüßen / Sincères salutations M

Re: installing spark-jobserver on cdh 5.7 and yarn

2016-11-09 Thread Noorul Islam K M
Reza zade writes: > Hi > > I have set up a cloudera cluster and work with spark. I want to install > spark-jobserver on it. What should I do? Maybe you should send this to spark-jobserver mailing list. https://github.com/spark-jobserver/spark-jobserver#contact Thanks and Regards Noorul --

Re: Dynamically change executors settings

2016-08-26 Thread linguin . m . s
Hi, No, currently you can't change the setting. // maropu Message from Vadim Semenov, 2016/08/27 at 11:40: > Hi spark users, > > I wonder if it's possible to change executor settings on-the-fly. > I have the following use-case: I have a lot of non-splittable skewed files in > a custom format that

Re: spark 2.0 home brew package missing

2016-08-26 Thread Noorul Islam K M
kalkimann writes: > Hi, > spark 1.6.2 is the latest brew package i can find. > spark 2.0.x brew package is missing, best i know. > > Is there a schedule when spark-2.0 will be available for "brew install"? Did you do a 'brew update' before searching? I installed spark-2.0 this week. Regards

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-19 Thread Richard M
I was using the 1.1 driver. I upgraded that library to 2.1 and it resolved my problem.

Spark 1.6.2 HiveServer2 cannot access temp tables

2016-08-11 Thread Richard M
I'm attempting to access a dataframe from JDBC. However, this temp table is not accessible from beeline when connected to this instance of HiveServer2.

Re: Table registered using registerTempTable not found in HiveContext

2016-08-11 Thread Richard M
How are you calling registerTempTable from hiveContext? It appears to be a private method.

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-08-11 Thread Richard M
I am running HiveServer2 as well and when I connect with beeline I get the following: org.apache.spark.sql.internal.SessionState cannot be cast to org.apache.spark.sql.hive.HiveSessionState Do you know how to resolve this?

Testing --supervise flag

2016-08-01 Thread Noorul Islam K M
Hi all, I was trying to test --supervise flag of spark-submit. The documentation [1] says that, the flag helps in restarting your application automatically if it exited with non-zero exit code. I am looking for some clarification on that documentation. In this context, does application means th

When worker is killed driver continues to run causing issues in supervise mode

2016-07-13 Thread Noorul Islam K M
Spark version: 1.6.1 Cluster Manager: Standalone I am experimenting with cluster mode deployment along with supervise for high availability of streaming applications. 1. Submit a streaming job in cluster mode with supervise 2. Say that driver is scheduled on worker1. The app started successfu

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Lohith Samaga M
Hi, The aws CLI already has your access key id and secret access key from when you initially configured it. Is your s3 bucket without any access restrictions? Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab

Re: how to increase threads per executor

2016-06-02 Thread Andres M Jimenez T
ache: 25600K NUMA node0 CPU(s): 0 Thanks From: Mich Talebzadeh Sent: Thursday, June 2, 2016 5:00 PM To: Andres M Jimenez T Cc: user@spark.apache.org Subject: Re: how to increase threads per executor What are passing as parameters to Spark-

how to increase threads per executor

2016-06-02 Thread Andres M Jimenez T
Hi, I am working with Spark 1.6.1, using Kafka direct connect for streaming data, with the Spark scheduler and 3 slaves. The Kafka topic is partitioned with a value of 10. The problem I have is that there is only one thread per executor running my function (logic implementation). Can anybody tell me

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Lohith Samaga M
Hi, can you look at Apache Drill as a SQL engine on Hive? Lohith Sent from my Sony Xperia™ smartphone Tapan Upadhyay wrote: Thank you everyone for the guidance. Jorn, our motivation is to move the bulk of adhoc queries to Hadoop so that we have enough bandwidth on our DB for imp batch/queries.

RE: Why Spark having OutOfMemory Exception?

2016-04-11 Thread Lohith Samaga M
either). Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga -Original Message- From: kramer2...@126.com [mailto:kramer2...@126.com] Sent: Monday, April 11, 2016 16.18 To: user@spark.apache.org Subject: Why Spark having OutOfMemory Exception? I use spark to do

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread sunil m
ve one more question .. if i want to launch a spark application in >> production environment so is there any other way so multiple users can >> submit there job without having hadoop configuration . >> >> Regards >> Prateek >> >> >> On Fri, Ma

RE: append rows to dataframe

2016-03-14 Thread Lohith Samaga M
If all sql results have the same set of columns you could UNION all the dataframes: create an empty df and union all, then reassign the new df to the original df before the next union all. Not sure if it is a good idea, but it works. Lohith Sent from my Sony Xperia™ smartphone Divya Gehlot wrote Hi,
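
The same idea in Scala, reducing over the list instead of seeding with an empty df (the frames are hypothetical and must share one schema):

    // union on Spark 2.x; unionAll on Spark 1.x
    val frames = Seq(df1, df2, df3)        // results of the individual SQL queries
    val combined = frames.reduce(_ union _)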

RE: pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-05 Thread Lohith Samaga M
Hi, If you can also format the condition file as a csv file similar to the main file, then you can join the two dataframes and select only required columns. Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Divya Gehlot [mailto:divya.htco

RE: Need to use univariate summary stats

2016-02-04 Thread Lohith Samaga M
Hi Arun, You can do df.agg(max(...), min(...)). Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Arunkumar Pillai [mailto:arunkumar1...@gmail.com] Sent: Thursday, February 04, 2016 14.53 To: user@spark.apache.org Subject: Need to use univariate
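
Spelled out with hypothetical columns, the suggestion looks like:

    import org.apache.spark.sql.functions.{max, mean, min}

    val stats = df.agg(min("price"), max("price"), mean("price"))
    stats.show()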

Stage shows incorrect output size

2016-01-26 Thread Noorul Islam K M
Hi all, I am trying to copy data from one Cassandra cluster to another using Spark + the Cassandra connector. At the source I have around 200 GB of data, but while running, the Spark stage shows the output as 406 GB and the data is still getting copied. I wonder why it is showing this high a number. Envir
