Hi Team,
Could I get some review on the patch here? Would love to hear suggestions
on it. I had to reopen SPARK-20168 because of this bug.
https://github.com/apache/spark/pull/21541
https://issues.apache.org/jira/browse/SPARK-20168
Cheers,
Yash
Hi Team,
I have been using Structured Streaming with the S3 data source but I am
seeing it duplicate the data intermittently. A new run seems to fix it, but
the duplication happens ~10% of the time, and the ratio increases with the
number of files in the source. Investigating more, I see this is clearly an i
Hi All,
Could I request a review on this patch on Spark-Kinesis streaming? It has
been sitting there for a few months looking for some love. Please help.
The patch proposes resuming Kinesis data from a specified timestamp,
similar to Kafka, and improves Kinesis crash recovery by avoiding scanning
of un
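For reviewers, a minimal sketch of what resuming from a timestamp could look like from the user side. This is written against the Kinesis DStream builder API; AtTimestamp and the exact method names are assumptions based on the patch discussion, not a final API.
{code}
import java.util.Date
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val ssc = new StreamingContext(sc, Seconds(10))              // assumes an existing SparkContext sc
val startMillis = System.currentTimeMillis() - 3600 * 1000L  // e.g. resume from one hour ago

// Resume from a point in time instead of LATEST / TRIM_HORIZON, so a
// restarted job does not have to re-read the whole stream.
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("my-kinesis-stream")                           // hypothetical stream name
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .checkpointAppName("my-app")
  .initialPosition(new KinesisInitialPositions.AtTimestamp(new Date(startMillis)))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
{code}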
Hi Team,
Could I please pull some attention towards the pull request on
Spark-Kinesis operability?
We have iterated over the patch for the past few months, and it would be
great to have a final review. I think it's very close now. I would love to
work on improvements, if any.
This patch
Hi Fellow Spark developers/ PMC Members,
I am a new member of the community and have started my tiny contributions
to the Spark-Kinesis integration. I am trying to fill in the gaps in making
Spark operate with Kinesis as nicely as it does with Kafka.
I am writing this mail to highlight an issue with the kinesis mo
Hi All,
I've been working on a pull request [1] to allow Spark to read from a
specific timestamp in Kinesis. I have iterated on the patch with the help
of other contributors and we think that it's in a good state now.
This patch would save hours of crash recovery time for Spark while reading
off Kinesi
Hi Fellow Devs,
I have noticed the Spark parquet reader behaves very differently in two
scenarios over the same data set:
1. passing a single parent path to the data, vs
2. passing all the files individually to parquet(paths: String*)
The parent path has about ~50K files. The first option
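To make the two scenarios concrete, a small sketch assuming a SparkSession named spark; the paths and the file-listing helper are placeholders. One known source of divergence is partition discovery, which only happens for option 1 unless a basePath is supplied.
{code}
// Option 1: point the reader at the parent directory; Spark discovers the
// ~50K files and infers partition columns from the directory layout.
val byParent = spark.read.parquet("s3://bucket/events/")

// Option 2: enumerate every leaf file and pass them all to
// parquet(paths: String*). Partition columns are NOT inferred this way
// unless basePath is set, so the resulting schema can differ.
val files: Seq[String] = listLeafFiles("s3://bucket/events/")  // hypothetical helper
val byFiles = spark.read
  .option("basePath", "s3://bucket/events/")  // aligns partition inference with option 1
  .parquet(files: _*)
{code}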
Hi Fellow Devs,
Please share your thoughts on the pull request that allows Spark to retry
more gracefully with Kinesis streaming.
The patch removes simple hard-coded values and allows users to pass the
values in config. This will help users cope with Kinesis throttling errors
and
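For context, roughly what this looks like from the user side. The config key names and value formats below are the ones proposed in the patch, so treat them as assumptions until it is merged.
{code}
import org.apache.spark.SparkConf

// Replace the hard-coded retry wait / attempt count with user-tunable values
// (key names as proposed in the patch; values are illustrative):
val conf = new SparkConf()
  .set("spark.streaming.kinesis.retry.waitTime", "500ms")
  .set("spark.streaming.kinesis.retry.maxAttempts", "5")
{code}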
You're probably interested in the S3PartitionedOutputCommitter.
>
> rb
>
> On Thu, Apr 6, 2017 at 10:08 PM, Yash Sharma wrote:
>
> Hi All,
> This is another issue that I was facing with the Spark-S3 operability
> and wanted to ask the broader community if it's faced by anyone else.
Hi All,
This is another issue that I was facing with the Spark-S3 operability and
wanted to ask the broader community if it's faced by anyone else.
I have a rather simple aggregation query with a basic transformation. The
output however has a lot of output partitions (20K partitions). The spark
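For anyone with a similarly shaped job, a sketch of keeping the output file count under control before a partitioned write. The DataFrame, column names, and path are all hypothetical; it also assumes a SparkSession named spark.
{code}
import spark.implicits._

// Repartition by the partition columns first, so each output partition is
// written by one task instead of getting a small fragment from every task.
aggregatedDF
  .repartition($"year", $"month", $"date")  // hypothetical partition columns
  .write
  .partitionBy("year", "month", "date")
  .mode("overwrite")
  .parquet("s3://bucket/output/")           // placeholder path
{code}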
Hi fellow Spark Devs,
If anyone here has some experience with Spark-Kinesis streaming, would it
be possible to provide your thoughts on this pull request [1]?
Some info:
The patch removes two important hard-coded values for Kinesis retries and
will make Kinesis recovery from crashes more reliable.
Hello fellow spark devs, hope you are doing fabulous,
Dropping a brain dump here about the Spark-Kinesis integration. I am able
to get Spark-Kinesis to work perfectly under ideal conditions, but see a
lot of open ends when things are not so ideal. I feel these open ends are
specific
Sorry for the spam, used the wrong email address.
On Wed, 22 Mar 2017 at 12:01 Yash Sharma wrote:
> subscribe to spark dev list
>
subscribe to spark dev list
files you are trying to read? Number of
> executors are very high
> On 24 Sep 2016 10:28, "Yash Sharma" wrote:
>
>> Have been playing around with configs to crack this. Adding them here
>> where it would be helpful to others :)
>> Number of executors and timeout se
memory. This can be around 48 assuming 12 nodes x 4 cores each. You could
> start with processing a subset of your data and see if you are able to get
> a decent performance. Then gradually increase the maximum # of execs for
> dynamic allocation and process the remaining data.
>
>
:27 AM, Yash Sharma wrote:
> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> be 12 executors for testing and let me know the status.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma"
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact
shows it as 168510, which is on the very high side. Try reducing your executors.
>
>
> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>
>> Hi All,
>> I have a spark job which runs over a huge bulk of data with Dynamic
>> allocation enabled.
>> The job takes
Hi All,
I have a spark job which runs over a huge bulk of data with Dynamic
allocation enabled.
The job takes some 15 minutes to start up and fails as soon as it starts.
Is there anything I can check to debug this problem? There is not a lot of
information in the logs about the exact cause, but here is
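Not a fix, but for later readers: the replies above converge on the dynamic-allocation executor target being absurdly high (168510), so capping it is the first knob to try. The values below are illustrative only.
{code}
import org.apache.spark.SparkConf

// Cap dynamic allocation so startup does not request an enormous number of
// executors; tune the numbers to the actual cluster (these are guesses).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "12")
  .set("spark.dynamicAllocation.maxExecutors", "48")  // e.g. 12 nodes x 4 cores
{code}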
Hi All,
While writing a partitioned data frame as partitioned text files, I see that
Spark deletes all existing partitions while writing a few new partitions.
dataDF.write.partitionBy("year", "month",
> "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")
Is this expected behavior?
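That is what SaveMode.Overwrite does to the whole output path by default. For readers on newer Spark (2.3+), there is a dynamic partition-overwrite setting aimed at exactly this; a minimal sketch, assuming the same dataDF and path and a SparkSession named spark:
{code}
import org.apache.spark.sql.SaveMode

// Spark 2.3+: overwrite only the partitions present in the incoming
// DataFrame instead of truncating the entire output directory first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Overwrite)
  .text("s3://data/test2/events/")
{code}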
"p2", "p3", "p4",
> "p5").text(dir.getCanonicalPath)
> val newDF = spark.read.text(dir.getCanonicalPath)
> newDF.show()
>
> df.write.partitionBy("p1", "p2", "p3", "p4", "p5")
> .mode(SaveMo
Hi All,
I have been using the parquet append mode for writes, which works just
fine. Just wanted to check if the same is supported for the plain text
format. The code below blows up with an error saying the file already exists.
{code}
userEventsDF.write.mode("append").partitionBy("year", "month",
"date")
Tb data of around 400 Megs gz files. The
workload is a scan/filter/reduceBy which needs to scan the entire data.
On Sat, May 21, 2016 at 11:07 AM, Yash Sharma wrote:
> The median GC time is 1.3 mins for a median duration of 41 mins. What
> parameters can I tune for controlling GC?
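A hedged starting point for the GC question, not a recommendation: the flags below are standard Spark/JVM knobs, and the values are guesses to experiment with.
{code}
import org.apache.spark.SparkConf

// Try G1GC on the executors and surface GC logs so the time is attributable;
// lowering spark.memory.fraction trades cache space for less GC pressure.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails")
  .set("spark.memory.fraction", "0.5")  // default is 0.6; illustrative value
{code}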
ynold Xin" wrote:
> It's probably due to GC.
>
> On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote:
>
>> Hi All,
>> I am here to get some expert advice on a use case I am working on.
>>
>> Cluster & job details below -
>>
>> Data - 6
Hi All,
I am here to get some expert advice on a use case I am working on.
Cluster & job details below -
Data - 6 Tb
Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps)
Parameters-
--executor-memory 10G \
--executor-cores 6 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.d
log? Most of the time, it
> shows more details. We are using CDH; the log is at:
>
>
>
> [yucai@sr483 container_1457699919227_0094_01_14]$ pwd
>
>
> /mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_14
>
> [yucai@sr483 con
Hi All,
I am trying Spark SQL on a ~16 TB dataset with a large number of files
(~50K). Each file is roughly 400-500 MB.
I am issuing a fairly simple Hive query on the dataset with just filters
(no groupBys or joins), and the job is very slow. It runs for 7-8 hrs
and processes about 80-100 Gig
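One assumption worth checking for this kind of throughput (not a confirmed diagnosis): if the files are gzip-compressed, they are not splittable, so each 400-500 MB file is decompressed by a single task. Repartitioning right after the scan at least spreads the downstream filter work, at the cost of a shuffle. Sketch assumes a SparkSession named spark and a placeholder path.
{code}
// Gzip is not splittable: one task per file regardless of file size.
// Repartition after reading so later stages can use the whole cluster.
val raw = spark.read.text("s3://bucket/dataset/")  // placeholder path/format
val spread = raw.repartition(2000)                 // illustrative parallelism
{code}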
{ // Doesn't work !!
>   rdd =>
>     println(rdd.count)
>     println("rdd isempty:" + rdd.isEmpty)
> }*/
unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => { // Works, Yeah !!
>   println(rdd.count)
>   println("rdd isempty:" + rdd.isEmpty)
>
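For later readers, the likely reason the explicitly-typed closure behaves differently: DStream.foreachRDD is overloaded, and Scala does not infer lambda parameter types across overloads, so annotating the types is what selects the two-argument variant. A self-contained sketch, with unionStreams assumed to be the DStream[Array[Byte]] from the thread:
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// DStream exposes two overloads (paraphrased from the Spark Streaming API):
//   def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
//   def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit
// Explicit parameter types pick the (RDD, Time) overload unambiguously:
unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  if (!rdd.isEmpty()) println(s"batch at $time: ${rdd.count()} records")
})
{code}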
incompatibilities, either
> due to protobuf or jackson. That may be your culprit. The problem is that
> all failures by the Kinesis Client Lib are silent, therefore they don't
> show up in the logs. It's very hard to debug those buggers.
>
> Best,
> Burak
>
> On Sat, Jan 30, 2016
> w.r.t. protobuf-java version mismatch, I wonder if you can rebuild Spark
> with the following change (using maven):
>
> http://pastebin.com/fVQAYWHM
>
> Cheers
>
> On Sat, Jan 30, 2016 at 12:49 AM, Yash Sharma wrote:
>
>> Hi All,
>> I have a quick question if an
Hi All,
I have a quick question if anyone has experienced this here.
I have been trying to get Spark to read events from Kinesis recently but am
having problems receiving the events. While Spark is able to connect to
Kinesis and get metadata from it, it's not able to get
events from
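Besides the protobuf/jackson mismatch suggested above, one assumption worth ruling out is the initial position: with LATEST, a consumer started after the producer sees nothing until new records arrive. A sketch against the Spark 1.x-era receiver API; ssc, the app name, and the stream name are assumed.
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// TRIM_HORIZON replays records already in the stream; LATEST only delivers
// records produced after the receiver starts.
val stream = KinesisUtils.createStream(
  ssc, "my-app", "my-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON,
  Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
{code}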