Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-18 Thread Sean Owen
, too; that's a link to master. On Fri, Nov 18, 2022 at 5:50 AM Ramakrishna Rayudu < ramakrishna560.ray...@gmail.com> wrote: > Hi Sean, > > Can you please let me know what is query spark internally fires for > getting count on dataframe. > > Long count=dataframe.count();

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Weird, does Teradata not support LIMIT n? looking at the Spark source code suggests it won't. The syntax is "SELECT TOP"? I wonder if that's why the generic query that seems to test existence loses the LIMIT. But, that "SELECT 1" test seems to be used for MySQL, Postgres, s

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure nothing else is generating or changing those queries? On Thu, Nov 17, 2022 at 11:20 AM Ramakrishna Rayudu < ramakrishna560.ray...@gmail.com> wrote: > We are using spark 2.4.4 version. > I can see two types of querie

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Ramakrishna Rayudu
We are using Spark version 2.4.4. I can see two types of queries in the DB logs: SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0 and SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0. The `SELECT *` query ends with `WHERE 1=0`, but the query starting with `SELECT 1` has no WHERE condition

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
Hm, actually that doesn't look like the queries that Spark uses to test existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0" depending on the dialect. What version, and are you sure something else is not sending those queries? On Thu, Nov 17, 2022 at

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
his. > > <https://stackoverflow.com/> > >1. > > > <https://stackoverflow.com/posts/74477662/timeline> > > We are connecting Tera data from spark SQL with below API > > Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, > connectionPropertie

[Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Ramakrishna Rayudu
Hi Team, I am facing one issue. Can you please help me on this. <https://stackoverflow.com/> 1. <https://stackoverflow.com/posts/74477662/timeline> We are connecting to Teradata from Spark SQL with the below API Dataset jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connecti
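
For reference, a minimal Scala sketch of the read path described above, with the query wrapped as a derived table. The URL, credentials and query text below are illustrative placeholders, not values from the thread:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("teradata-read").getOrCreate()

    // Illustrative connection details only
    val connectionUrl = "jdbc:teradata://db-host/DATABASE=mydb"
    val connectionProperties = new Properties()
    connectionProperties.setProperty("user", "<user>")
    connectionProperties.setProperty("password", "<password>")

    // Spark wraps this in its own subquery alias (e.g. "... FROM (<query>) SPARK_GEN_SUB_0")
    val tableQuery = "(SELECT col1, col2 FROM some_table) t"
    val jdbcDF = spark.read.jdbc(connectionUrl, tableQuery, connectionProperties)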

Spark Structured Streaming - unable to change max.poll.records (showing as 1)

2022-09-07 Thread karan alang
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer clientId=consumer-spark-kafka-source-4e7e7f32-19ab-44d5-99f5-59fb5a462af2-594190416-driver-0-1, groupId=spark-kafka-source-4e7e7f32-19ab-44d5-99f5-59fb5a462af2-594190416-driver-0] Member consumer-spark-kafka-source-4e7e7f32-19ab-44d5-99f5-59fb5a462af2-594190416-driver-0-1
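
For context, Kafka consumer properties are normally forwarded to the Structured Streaming source by prefixing them with `kafka.`; a minimal sketch follows. The broker and topic are placeholders, and whether the source actually honors this particular property is exactly what the thread is asking:

    // Assumes an active SparkSession named `spark`; broker and topic are placeholders.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "my-topic")
      .option("kafka.max.poll.records", "500")  // forwarded to the underlying consumer
      .load()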

approx_count_distinct in spark always returns 1

2022-06-02 Thread marc nicole
But even when I have duplicate column values I still get 1 in the "freq" column. Also, when I specify the rsd param to be 0, I get an ArrayIndexOutOfBounds kind of error. Why?
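
A usage sketch for reference, assuming the intent is an approximate distinct count per group. The rsd argument must be a small positive value (0 is outside the supported range); 0.05 below is just an illustrative choice:

    import org.apache.spark.sql.functions.approx_count_distinct
    import spark.implicits._  // assumes an active SparkSession named `spark`

    val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
    df.groupBy("key")
      .agg(approx_count_distinct("value", 0.05).alias("freq"))
      .show()
    // expected: key "a" -> 2, key "b" -> 1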

Re: GCP Dataproc - Failed to construct kafka consumer, Failed to load SSL keystore dataproc-versa-sase-p12-1.jks of type JKS

2022-02-02 Thread karan alang
or > Failed to construct kafka consumer, Failed to load SSL keystore > dataproc-versa-sase-p12-1.jks of type JKS > > Details in stackoverflow - > https://stackoverflow.com/questions/70964198/gcp-dataproc-failed-to-construct-kafka-consumer-failed-to-load-ssl-keystore-d > > From my loc

GCP Dataproc - Failed to construct kafka consumer, Failed to load SSL keystore dataproc-versa-sase-p12-1.jks of type JKS

2022-02-02 Thread karan alang
in the current working directory. The truststore and keystores are passed onto the Kafka Consumer/Producer. However - i'm getting an error Failed to construct kafka consumer, Failed to load SSL keystore dataproc-versa-sase-p12-1.jks of type JKS Details in stackoverflow - https://stackoverflow.com/questions
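
A sketch of the option names involved, assuming the keystore is shipped to every executor's working directory (for example via spark-submit --files) and referenced by its basename; the broker address and password are placeholders:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9093")
      .option("subscribe", "my-topic")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.keystore.type", "JKS")
      .option("kafka.ssl.keystore.location", "dataproc-versa-sase-p12-1.jks")
      .option("kafka.ssl.keystore.password", "<keystore-password>")
      .load()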

Re: Why are most of my executors idle in 1 stage: are tasks within a stage dependent on each other?

2021-09-10 Thread Joris Billen
at my other nodes (8 of them and > over 45 healthy executors) are idle for over 3 hours. >I notice in the logs that all tasks are run at "NODE_LOCAL" > >I wonder what is causing this and if I can do something to make the idle > executors also do work. 2 options: &

Re: Why are most of my executors idle in 1 stage: are tasks within a stage dependent on each other?

2021-09-10 Thread Lalwani, Jayesh
er 3 hours. I notice in the logs that all tasks are run at "NODE_LOCAL" I wonder what is causing this and if I can do something to make the idle executors also do work. 2 options: 1)It is just the way it is: at some point in this stage, there are dependencies of the further tasks.

Why are most of my executors idle in 1 stage: are tasks within a stage dependent on each other?

2021-09-10 Thread Joris Billen
that means that my other nodes (8 of them and over 45 healthy executors) are idle for over 3 hours. I notice in the logs that all tasks are run at "NODE_LOCAL" I wonder what is causing this and if I can do something to make the idle executors also do work. 2 options: 1)It is just the way it i

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Silvio Fiorito
As I suggested, you need to use repartition(1) in place of coalesce(1) That will give you a single file output without losing parallelization for the rest of the job. From: James Yu Date: Wednesday, February 3, 2021 at 2:19 PM To: Silvio Fiorito , user Subject: Re: Poor performance caused

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Gourav Sengupta
tage boundary"? > > Thanks > -- > *From:* Silvio Fiorito > *Sent:* Wednesday, February 3, 2021 11:05 AM > *To:* James Yu ; user > *Subject:* Re: Poor performance caused by coalesce to 1 > > > Coalesce is reducing the parallelization o

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Mich Talebzadeh
icitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 3 Feb 2021 at 19:08, Sean Owen wrote: > Probably could also be because that coalesce can cause some upstream > transformations to also have parallelism o

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
rito Sent: Wednesday, February 3, 2021 11:05 AM To: James Yu ; user Subject: Re: Poor performance caused by coalesce to 1 Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absol

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Sean Owen
Probably could also be because that coalesce can cause some upstream transformations to also have parallelism of 1. I think (?) an OK solution is to cache the result, then coalesce and write. Or combine the files after the fact. or do what Silvio said. On Wed, Feb 3, 2021 at 12:55 PM James Yu

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Stéphane Verlet
w to improve it: > >We have a particular dataset which we aggregate from other datasets and >like to write out to one single file (because it is small enough). We >found that after a series of transformations (GROUP BYs, FLATMAPs), we >coalesced the final RDD to 1 part

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Silvio Fiorito
Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absolutely need a single file output, you can instead add a stage boundary and use repartition(1). This will give your query
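
A minimal sketch of the suggestion, with placeholder paths; `repartition(1)` adds a shuffle so the preceding stage keeps its parallelism, while `coalesce(1)` can collapse that stage to a single task:

    val out = "/tmp/single-file-output"                      // placeholder path
    val result = spark.read.parquet("/tmp/input")            // placeholder input
      .groupBy("key").count()                                // runs with normal parallelism

    // result.coalesce(1).write.parquet(out)                 // may pull upstream work into 1 task
    result.repartition(1).write.mode("overwrite").parquet(out)  // single file, parallel upstream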

Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations. We hope

Re: Elastic Search sink showing -1 for numOutputRows

2020-09-07 Thread jainshasha
Thanks Jungtaek Lim-2 for replying. May i knw the reference of the API version for sink for both types (DSv1 and DSv2) in code ? Where could i see it ? Under what module of spark code ? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Elastic Search sink showing -1 for numOutputRows

2020-09-07 Thread Jungtaek Lim
> Using structured spark streaming and sink the data into ElasticSearch. > In the stats emit for each batch the "numOutputRows" showing -1 for > ElasticSearch sink always > whereas when i see other sinks like Kafka it shows either 0 or some values > when it emit data.

Re: Elastic Search sink showing -1 for numOutputRows

2020-09-07 Thread jainshasha
Hi, Using structured spark streaming and sink the data into ElasticSearch. In the stats emit for each batch the "numOutputRows" showing -1 for ElasticSearch sink always whereas when i see other sinks like Kafka it shows either 0 or some values when it emit data. What could be

Elastic Search sink showing -1 for numOutputRows

2020-09-07 Thread jainshasha
Hi, Using structured spark streaming and sink the data into ElasticSearch. In the stats emit for each batch the "numOutputRows" showing -1 for ElasticSearch sink always whereas when i see other sinks like Kafka it shows either 0 or some values when it emit data. What could be

Re: Spark Streaming on Compact Kafka topic - consumes 1 message per partition per batch

2020-04-08 Thread Hrishikesh Mishra
here a > reason you chose to start reading again from the beginning by using a new > consumer group rather then sticking to the same consumer group? > > In your application, are you manually committing offsets to Kafka? > > Regards, > > Waleed > > On Wed, Apr 1, 202

Re: Spark Streaming on Compact Kafka topic - consumes 1 message per partition per batch

2020-04-01 Thread Waleed Fateem
our application, are you manually committing offsets to Kafka? Regards, Waleed On Wed, Apr 1, 2020 at 1:31 AM Hrishikesh Mishra wrote: > Hi > > Our Spark streaming job was working fine as expected (the number of events > to process in a batch). But due to some reasons, we added compaction o

Spark Streaming on Compact Kafka topic - consumes 1 message per partition per batch

2020-04-01 Thread Hrishikesh Mishra
is 1M records and consumer has huge lag. Driver log which fetches 1 message per partition. 20/03/31 18:25:55 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4] Resetting offset for partition demandIngestion.SLTarget-45 to offset 211951. 20/03/31 18:26:00 INFO Fetcher: [groupId=pc-nfr-loop-31-march

[Spark MicroBatchExecution] Error fetching kafka/checkpoint/state/0/0/1.delta does not exist

2020-03-12 Thread Miguel Silvestre
Hi community, I'm having this error in some kafka streams: Caused by: java.io.FileNotFoundException: File file:/efs/.../kafka/checkpoint/state/0/0/1.delta does not exist Because of this I have some streams down. How can I fix this? Thank you. -- Miguel Silvestre
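
One common way to get such a query running again, sketched below with placeholder paths, is to restart it against a fresh, empty checkpoint directory; note this discards the old state and may cause reprocessing or duplicates depending on the source and sink semantics:

    // `df` is the streaming DataFrame of the affected query; paths are placeholders.
    df.writeStream
      .format("parquet")
      .option("path", "/path/to/output")
      .option("checkpointLocation", "/path/to/new-checkpoint")
      .start()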

join with just 1 record causes all data to go to a single node

2019-11-21 Thread Marcelo Valle
contact the sender immediately upon receipt. KTech Services Ltd is registered in England as company number 10704940. Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE, United Kingdom
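
The preview above preserves only the sender's footer, but the subject describes joining a large dataset against a single record; assuming the single join key is what funnels all rows to one node, a broadcast join is the usual sketch of a fix:

    import org.apache.spark.sql.functions.broadcast

    // largeDF, tinyDF and "id" are placeholders; broadcasting the one-row side avoids
    // shuffling every row of the large table to the node that owns the single key.
    val joined = largeDF.join(broadcast(tinyDF), Seq("id"))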

Re: Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-29 Thread Akshay Bhardwaj
Hi, A few thoughts to add to Nicholas' apt reply. We were loading multiple files from AWS S3 in our Spark application. When the spark step of load files is called, the driver spends significant time fetching the exact path of files from AWS s3. Especially because we specified S3 paths like regex

Re: 1 task per executor

2019-05-28 Thread Arnaud LARROQUE
am using spark 2.2 > I have enabled spark dynamic allocation with executor cores 4, driver > cores 4 and executor memory 12GB driver memory 10GB. > > In Spark UI, I see only 1 task is launched per executor. > > Could anyone please help on this? > > Kind Regards, > Sachit Murarka >

1 task per executor

2019-05-28 Thread Sachit Murarka
Hi All, I am using spark 2.2 I have enabled spark dynamic allocation with executor cores 4, driver cores 4 and executor memory 12GB driver memory 10GB. In Spark UI, I see only 1 task is launched per executor. Could anyone please help on this? Kind Regards, Sachit Murarka
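
One possible cause (an assumption, not the thread's confirmed answer) is simply that the stage has fewer partitions than the available task slots; a sketch of raising parallelism with illustrative numbers:

    val input = spark.read.parquet("/path/to/input").repartition(200)  // placeholder path
    spark.conf.set("spark.sql.shuffle.partitions", "200")              // partitions after shuffles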

Re: Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-23 Thread Nicholas Hakobian
One potential case that can cause this is the optimizer being a little overzealous with determining if a table can be broadcasted or not. Have you checked the UI or query plan to see if any steps include a BroadcastHashJoin? Its possible that the optimizer thinks that it should be able to fit the
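
A quick way to check the suggestion above is to inspect the physical plan for broadcast operators; a one-line sketch, where `resultDF` is a placeholder for the slow query's DataFrame:

    resultDF.explain(true)   // look for BroadcastExchange / BroadcastHashJoin operators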

Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-23 Thread Ashic Mahtab
Hi, We have a quite long winded Spark application we inherited with many stages. When we run on our spark cluster, things start off well enough. Workers are busy, lots of progress made, etc. etc. However, 30 minutes into processing, we see CPU usage of the workers drop drastically. At this

CfP VHPC19: HPC Virtualization-Containers: Paper due May 1, 2019 (extended)

2019-04-03 Thread VHPC 19
. (Springer LNCS Proceedings) Date: June 20, 2019 Workshop URL: http://vhpc.org Paper Submission Deadline: May 1, 2019 (extended) Springer LNCS, rolling abstract submission Abstract/Paper Submission Link: https://edas.info

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-12 Thread bsikander
Forgot to add the link https://jira.apache.org/jira/browse/KAFKA-5649 -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-09 Thread bsikander
Could you please give some feedback. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
Actually, our job runs fine for 17-18 hours and this behavior just suddenly starts happening after that. We found the following ticket which is exactly what is happening in our Kafka cluster also. WARN Failed to send SSL Close message (org.apache.kafka.common.network.SslTransportLayer) You

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread Biplob Biswas
will end up having a lot of scheduling delay. Maybe see, why does it take 1 min to process 100 records and fix the logic. Also, I see you have higher number of events which takes some time lower amount of processing time. Fix the code logic and this should be fixed. Thanks & Regards Biplob Bi

[Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
We are facing an issue with very long scheduling delays in Spark (upto 1+ hours). We are using Spark-standalone. The data is being pulled from Kafka. Any help would be much appreciated. I have attached the screenshots. <http://apache-spark-user-list.1001560.n3.nabble.com/file/t8018/1-stats.

1

2018-10-24 Thread twinmegami
1

Databricks 1/2 day certification course at Spark Summit

2018-05-25 Thread Sumona Routh
Hi all, My company just now approved for some of us to go to Spark Summit in SF this year. Unfortunately, the day long workshops on Monday are sold out now. We are considering what we might do instead. Have others done the 1/2 day certification course before? Is it worth considering? Does

Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to have less than 1 core for AM, this is due to yarn not spark. The number of AM comparing to the number of executors should be small and acceptable. If you do want to save more resources, I would suggest you to use yarn cluster mode where driver and AM run

Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-18 Thread peay
status from YARN. Is that correct? spark.yarn.am.cores is 1, and that AM gets one full vCore on the cluster. Because I am using DominantResourceCalculator to take vCores into account for scheduling, this results in a lot of unused CPU capacity overall because all those AMs each block one full vCore

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
Hi Gerard, "If your actual source is Kafka, the original solution of using `spark.streams.awaitAnyTermination` should solve the problem." I tried literally everything, nothing worked out. 1) Tried NC from two different ports for two diff streams, still nothing worked. 2) Tried
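
For reference, a minimal Scala sketch of the `awaitAnyTermination` pattern mentioned above; the two sources and the console sink are placeholders:

    val q1 = stream1.writeStream.format("console").start()
    val q2 = stream2.writeStream.format("console").start()
    spark.streams.awaitAnyTermination()   // blocks until any active query stops or fails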

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Gerard Maas
; *Date: *Friday, April 13, 2018 at 11:49 PM >> *To: *Aakash Basu <aakash.spark@gmail.com> >> *Cc: *Panagiotis Garefalakis <panga...@gmail.com>, user < >> user@spark.apache.org> >> *Subject: *Re: [Structured Streaming] More than 1 streaming in a code &g

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Lalwani, Jayesh
panga...@gmail.com>, user <user@spark.apache.org> Subject: Re: [Structured Streaming] More than 1 streaming in a code If I use timestamp based windowing, then my average will not be global average but grouped by timestamp, which is not my requirement. I want to recalculate the avg of enti

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
...@capitalone.com> > *Cc: *spark receiver <spark.recei...@gmail.com>, Panagiotis Garefalakis < > panga...@gmail.com>, user <user@spark.apache.org> > > *Subject: *Re: [Structured Streaming] More than 1 streaming in a code > > > > Hey Jayesh and Others,

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Lalwani, Jayesh
ark@gmail.com> Date: Monday, April 16, 2018 at 4:52 AM To: "Lalwani, Jayesh" <jayesh.lalw...@capitalone.com> Cc: spark receiver <spark.recei...@gmail.com>, Panagiotis Garefalakis <panga...@gmail.com>, user <user@spark.apache.org> Subject: Re: [Structured St

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
> *To: *Aakash Basu <aakash.spark@gmail.com> > *Cc: *Panagiotis Garefalakis <panga...@gmail.com>, user < > user@spark.apache.org> > *Subject: *Re: [Structured Streaming] More than 1 streaming in a code > > > > Hi Panagiotis , > > > > Wonder

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-15 Thread Lalwani, Jayesh
Friday, April 13, 2018 at 11:49 PM To: Aakash Basu <aakash.spark@gmail.com> Cc: Panagiotis Garefalakis <panga...@gmail.com>, user <user@spark.apache.org> Subject: Re: [Structured Streaming] More than 1 streaming in a code Hi Panagiotis , Wondering you solved the problem or

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-13 Thread spark receiver
: 0 > --- > ++ > |aver| > ++ > | 3.0| > ++ > > --- > Batch: 1 > --- > ++ > |aver| > ++ > | 4.0| > ++ >

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Aakash Basu
--- ++ |aver| ++ | 3.0| ++ --- Batch: 1 --- ++ |aver| ++ | 4.0| ++ *Updated Code -* from pyspark.sql import SparkSession from pyspark.sql.functions import split spark = SparkSession \ .builder

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Panagiotis Garefalakis
please clarify the doubt? > > -- Forwarded message -- > From: Aakash Basu <aakash.spark@gmail.com> > Date: Thu, Apr 5, 2018 at 3:18 PM > Subject: [Structured Streaming] More than 1 streaming in a code > To: user <user@spark.apache.org> > > > Hi, &g

Fwd: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Thu, Apr 5, 2018 at 3:18 PM Subject: [Structured Streaming] More than 1 streaming in a code To: user <user@spark.apache.org> Hi

[Structured Streaming] More than 1 streaming in a code

2018-04-05 Thread Aakash Basu
servers", "localhost:9092") \ .option("subscribe", "test1") \ .load() ID = data.select('value') \ .withColumn('value', data.value.cast("string")) \ .withColumn("Col1", split(col("value"

Re: 1 Executor per partition

2018-04-04 Thread utkarsh_deep
You are correct. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: 1 Executor per partition

2018-04-04 Thread Gourav Sengupta
; > Hello list! > > I am trying to familiarize with Apache Spark. I would like to ask > something about partitioning and executors. > > Can I have e.g: 500 partitions but launch only one executor that will run > operations in only 1 partition of the 500? And the

1 Executor per partition

2018-04-04 Thread Thodoris Zois
Hello list! I am trying to familiarize with Apache Spark. I would like to ask something about partitioning and executors. Can I have e.g: 500 partitions but launch only one executor that will run operations in only 1 partition of the 500? And then I would like my job to die. Is there any
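
A sketch of one way to approximate this: cap the application at a single executor through the usual submit settings (for example disabling dynamic allocation and requesting one executor), then touch only one partition of the 500; the numbers are illustrative:

    // Assumes an active SparkSession `spark`; 500 partitions as in the question.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 500)
    val firstPartitionOnly = rdd.mapPartitionsWithIndex { (idx, it) =>
      if (idx == 0) it else Iterator.empty   // keep only partition 0
    }
    println(firstPartitionOnly.count())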

Fwd: Spark 1.x - End of life

2017-10-24 Thread Ismaël Mejía
Hi Ismael, > > It depends on what you mean by “support”. In general, there won’t be new > feature releases for 1.X (e.g. Spark 1.7) because all the new features are > being added to the master branch. However, there is always room for bug fix > releases if there is a catastrophic bug, and c

Re: Spark 1.x - End of life

2017-10-19 Thread Matei Zaharia
Hi Ismael, It depends on what you mean by “support”. In general, there won’t be new feature releases for 1.X (e.g. Spark 1.7) because all the new features are being added to the master branch. However, there is always room for bug fix releases if there is a catastrophic bug, and committers can

Spark 1.x - End of life

2017-10-19 Thread Ismaël Mejía
Hello, I noticed that some of the (Big Data / Cloud Managed) Hadoop distributions are starting to (phase out / deprecate) Spark 1.x and I was wondering if the Spark community has already decided when will it end the support for Spark 1.x. I ask this also considering that the latest release

Apache Spark GraphX: java.lang.ArrayIndexOutOfBoundsException: -1

2017-10-16 Thread Andy Long
We have hit a bug with GraphX when calling the connectedComponents function, where it fails with the following error java.lang.ArrayIndexOutOfBoundsException: -1 I've found this bug report: https://issues.apache.org/jira/browse/SPARK-5480 Has anyone else hit this issue, and if so how did you

Why is spark library using outputs(i+1) in MultilayerPerceptron for previous Delta Calculations

2017-10-06 Thread shadow
Looking at this <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala#L541> code for (i <- (L - 2) to (0, -1)) { layerModels(i + 1).computePrevDelta(deltas(i + 1), outputs(i + 1), deltas(i)) } I want to understand why are we passin

Container exited with a non-zero exit code 1

2017-06-23 Thread Link Qian
status: 1) 17/06/22 15:18:44 INFO yarn.YarnAllocator: Container marked as failed: container_1498115278902_0001_02_13. Exit status: 1. Diagnostics: Exception from container-launch. Container id: container_1498115278902_0001_02_13 Exit code: 1 Stack trace: ExitCodeException exitCode=1

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread 萝卜丝炒饭
--- From: "Sean Owen"<so...@cloudera.com> Date: 2017/6/15 16:13:11 To: "user"<user@spark.apache.org>;"dev"<d...@spark.apache.org>;"萝卜丝炒饭"<1427357...@qq.com>; Subject: Re: the dependence length of RDD, can its size be greater

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Sean Owen
Yes. Imagine an RDD that results from a union of other RDDs. On Thu, Jun 15, 2017, 09:11 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue
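
A small sketch that makes this concrete; `dependencies` is public on RDD, and a union of two parents carries one dependency per parent:

    val a = spark.sparkContext.parallelize(1 to 10)
    val b = spark.sparkContext.parallelize(11 to 20)
    val u = a.union(b)
    println(u.dependencies.length)   // expected: 2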

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Reynold Xin
A join? On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue about this.

the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread 萝卜丝炒饭
Hi all, The RDD code keeps a member as below: dependencies_ : seq[Dependency[_]] It is a seq, that means it can keep more than one dependency. I have an issue about this. Is it possible that its size is greater than one please? If yes, how to produce it please? Would you like show me some

Checkpointing for reduceByKeyAndWindow with a window size of 1 hour and 24 hours

2017-05-30 Thread SRK
Hi, What happens if I dont specify checkpointing on a DStream that has reduceByKeyAndWindow with no inverse function? Would it cause the memory to be overflown? My window sizes are 1 hour and 24 hours. I cannot provide an inserse function for this as it is based on HyperLogLog. My code looks

Re: Reading ASN.1 files in Spark

2017-04-06 Thread Yong Zhang
<http://awcoleman.blogspot.com/2014/07/processing-asn1-call-detail-records.html> Processing ASN.1 Call Detail Records with Hadoop (awcoleman.blogspot.com)

Re: Reading ASN.1 files in Spark

2017-04-06 Thread vincent gromakowski
I would also be interested... 2017-04-06 11:09 GMT+02:00 Hamza HACHANI <hamza.hach...@supcom.tn>: > Does any body have a spark code example where he is reading ASN.1 files ? > Thx > > Best regards > Hamza >

Reading ASN.1 files in Spark

2017-04-06 Thread Hamza HACHANI
Does any body have a spark code example where he is reading ASN.1 files ? Thx Best regards Hamza

Re: Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Shashank Mandil
I may have found my problem. We have a scala wrapper on top of spark-submit to run the shell command through scala. We were kind of eating the exit code from spark-submit in that wrapper. When I looked at what the actual exit code was stripping away the wrapper I got 1. So I think spark-submit

Re: Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Jacek Laskowski
Hi, ➜ spark git:(master) ✗ ./bin/spark-submit whatever || echo $? Error: Cannot load main class from JAR file:/Users/jacek/dev/oss/spark/whatever Run with --help for usage help or --verbose for debug output 1 I see 1 and there are other cases for 1 too. Pozdrawiam, Jacek Laskowski https

Re: Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Ali Gouta
Hello, +1, i have exactly the same issue. I need the exit code to make a decision on oozie executing actions. Spark-submit always returns 0 when catching the exception. From spark 1.5 to 1.6.x, i still have the same issue... It would be great to fix it or to know if there is some work around

Re: Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Jacek Laskowski
Hi, An interesting case. You don't use Spark resources whatsoever. Creating a SparkConf does not use YARN...yet. I think any run mode would have the same effect. So, although spark-submit could have returned exit code 1, the use case touches Spark very little. What version is that? Do you see

Spark submit on yarn does not return with exit code 1 on exception

2017-02-03 Thread Shashank Mandil
println("all done!") } catch { case e: RuntimeException => { println("There is an exception in the script exiting with status 1") System.exit(1) } } } When I run this code using spark-submit I am expecting to get an exit code of 1, however I keep gett

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
vide the results yourself. > I don't think it will be back-ported because the the behavior was intended > in 1.x, just wrongly documented, and we don't want to change the behavior > in 1.x. The results are still correctly ordered anyway. > > On Thu, Dec 29, 2016 at 10:11 PM Manish

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
e back-ported because the the behavior was intended > in 1.x, just wrongly documented, and we don't want to change the behavior > in 1.x. The results are still correctly ordered anyway. > > On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com> > wrote: > >&

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
was intended in 1.x, just wrongly documented, and we don't want to change the behavior in 1.x. The results are still correctly ordered anyway. On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com> wrote: > Sean, > > Thanks for answer. I am using Spark 1.6 so are you saying

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If we don't normalize it would have a norm in the denominator so output is same. But I understand you are saying in Spark 1.x, one vector was not normalized. If that is the case then it makes sense. Any idea how to fix this (get the right

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
word2vec algorithm of spark to compute documents vector of a text. > > I then used the findSynonyms function of the model object to get synonyms > of few words. > > I see something like this: > > > ​ > > I do not understand why the cosine similarity is being calculated as

Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used a word2vec algorithm of spark to compute documents vector of a text. I then used the findSynonyms function of the model object to get synonyms of few words. I see something like this: ​ I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity
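
For cross-checking, cosine similarity can be recomputed by hand from the raw vectors; for non-zero vectors the result is always in [-1, 1], so values above 1 suggest the returned scores follow some other definition. This is a plain recomputation, not the library's internal code:

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot   = a.zip(b).map { case (x, y) => x * y }.sum
      val normA = math.sqrt(a.map(x => x * x).sum)
      val normB = math.sqrt(b.map(x => x * x).sum)
      dot / (normA * normB)
    }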

TaskSetManager stalls for 1 min in the middle of a job

2016-12-14 Thread Oleg Mazurov
:35,276 (dag-scheduler-event-loop) DEBUG [o.a.s.s.TaskSetManager] - Valid locality levels for TaskSet 1.0: PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY 22:32:35,288 (dispatcher-event-loop-20) INFO [o.a.s.s.TaskSetManager] - Starting task 1.0 in stage 1.0 (TID 37, localhost, partition 1, PROCESS_LOCAL

spark ml - ngram - how to preserve single word (1-gram)

2016-11-08 Thread Nirav Patel
uot;, "I heard", "heard about", "about Spark") Currently if I want to do it I will have to manually transform column first using current ngram implementation then join 1-gram tokens to each column value. basically I have to do this outside of pipeline. --
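
A sketch of that manual merge, using the ML Tokenizer and NGram transformers and a small UDF to append the unigrams to the bigrams; column names are illustrative and `df` is assumed to have a `text` column:

    import org.apache.spark.ml.feature.{NGram, Tokenizer}
    import org.apache.spark.sql.functions.{col, udf}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val bigram    = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams")

    val words  = tokenizer.transform(df)
    val grams  = bigram.transform(words)
    val merge  = udf((uni: Seq[String], bi: Seq[String]) => uni ++ bi)
    val result = grams.withColumn("all_grams", merge(col("words"), col("bigrams")))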

WARN 1 block locks were not released with MLlib ALS

2016-11-04 Thread Mikael Ståldal
I get a few warnings like this in Spark 2.0.1 when using org.apache.spark.mllib.recommendation.ALS: WARN org.apache.spark.executor.Executor - 1 block locks were not released by TID = 1448: [rdd_239_0] What can be the reason for that? -- *Mikael Ståldal* Senior software

Spark Sql - "broadcast-exchange-1" java.lang.OutOfMemoryError: Java heap space

2016-10-25 Thread Selvam Raman
on sfa.snum = sf1.snum " + " join ann at on at.anum = sfa.anum AND at.atypenum = 11 " + " join data dr on r.rnum = dr.rnum " + " join cit cd on dr.dnum = cd.dnum " + " join cit on cd.cnum = ci.cnum " +
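
With a query chaining this many joins, one common mitigation to try (an assumption about the cause, not a confirmed fix) is lowering or disabling the automatic broadcast threshold so Spark stops materializing join sides as in-memory broadcast tables:

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // -1 disables automatic broadcast joins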

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
y by default yarn does not > honor cpu cores as resource, so you will always see vcore is 1 no matter > what number of cores you set in spark. > > On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna > <satyajit.apas...@gmail.com> wrote: >> >> Hi All, >> >>

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
y by default yarn does not > honor cpu cores as resource, so you will always see vcore is 1 no matter > what number of cores you set in spark. > > On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna > <satyajit.apas...@gmail.com> wrote: >> >> Hi All, >> >> I am t

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Use dominant resource calculator instead of default resource calculator will get the expected vcores as you wanted. Basically by default yarn does not honor cpu cores as resource, so you will always see vcore is 1 no matter what number of cores you set in spark. On Wed, Aug 3, 2016 at 12:11 PM

Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-02 Thread satyajit vegesna
Hi All, I am trying to run a spark job using yarn, and i specify --executor-cores value as 20. But when i go check the "nodes of the cluster" page in http://hostname:8088/cluster/nodes then i see 4 containers getting created on each of the node in cluster. But can only see 1 vco

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
am concerned that this will reduce concurrency Thanks Andy From: Ted Yu <yuzhih...@gmail.com> Date: Friday, July 22, 2016 at 2:54 PM To: Andrew Davidson <a...@santacruzintegration.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: Exception in thr

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Ted Yu
constituentDFS = getDataFrames(constituentDataSets) > > results = ["{} {}".format(name, constituentDFS[name].count()) for name > in constituentDFS] > > print(results) > > return results > > > %timeit -n 1 -r 1 results = work() > > > in (.0)

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
node has 6G. Any suggestions would be greatly appreciated Andy def work(): constituentDFS = getDataFrames(constituentDataSets) results = ["{} {}".format(name, constituentDFS[name].count()) for name in constituentDFS] print(results) return results %timeit -n 1 -r

Re: Task not serializable: java.io.NotSerializableException: org.json4s.Serialization$$anon$1

2016-07-19 Thread RK Aduri
Did you check this: case class Example(name : String, age ; Int) there is a semicolon. should have been (age : Int) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-not-serializable-java-io-NotSerializableException-org-json4s-Serialization-anon-1

Re: Task not serializable: java.io.NotSerializableException: org.json4s.Serialization$$anon$1

2016-07-19 Thread joshuata
r-list.1001560.n3.nabble.com/Task-not-serializable-java-io-NotSerializableException-org-json4s-Serialization-anon-1-tp8233p27359.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail
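
A commonly suggested workaround for this particular json4s error, sketched below: build the Formats inside the closure so the non-serializable serialization object is never captured from the driver. `sc` is an existing SparkContext and the case class mirrors the one in the thread:

    import org.json4s.DefaultFormats
    import org.json4s.jackson.Serialization.write

    case class Example(name: String, age: Int)

    val rdd = sc.parallelize(Seq(Example("a", 1), Example("b", 2)))
    val json = rdd.mapPartitions { it =>
      implicit val formats: DefaultFormats.type = DefaultFormats  // created on the executor, not captured
      it.map(e => write(e))
    }
    json.collect().foreach(println)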

Re: 'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-07-03 Thread Sean Owen
Metrics > import org.apache.spark.{SparkConf, SparkContext} > > /** > * Created by sneha.shukla on 17/06/16. > */ > > object TestCode { > > def main(args: Array[String]): Unit = { > > val sparkConf = new > SparkConf().setAppName("HBaseRead
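
For reference, the constructor under discussion, sketched with toy scores and labels; `numBins` down-samples the returned curves, and how that interacts with the RDD's partitioning is what the thread is about:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // (score, label) pairs; `sc` is an existing SparkContext.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.8, 1.0), (0.4, 0.0), (0.2, 0.0), (0.1, 0.0)))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 4)
    metrics.roc().collect().foreach(println)   // (FPR, TPR) points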
