Hi Team,
Could I get some review on the patch here? Would love to hear suggestions
on it. I had to reopen SPARK-20168 because of this bug.
https://github.com/apache/spark/pull/21541
https://issues.apache.org/jira/browse/SPARK-20168
Cheers,
Yash
Hi Team,
I have been using Structured Streaming with the S3 data source but I am
seeing it duplicate the data intermittently. A new run seems to fix it, but
the duplication happens ~10% of the time, and the ratio increases with the
number of files in the source. Investigating more, I see this is clearly an i
Hi All,
Could I request a review on this patch on Spark-Kinesis streaming? It has
been sitting there for a few months looking for some love. Please help.
The patch proposes resuming Kinesis data from a specified timestamp,
similar to Kafka, and improves Kinesis crash recovery by avoiding scanning
of un
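For reviewers, a minimal sketch of what resuming from a timestamp could look like from the user side. This is written against the Kinesis DStream builder API; AtTimestamp and the exact method names are assumptions based on the patch discussion, not a final API.
{code}
import java.util.Date
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val ssc = new StreamingContext(sc, Seconds(10))              // assumes an existing SparkContext sc
val startMillis = System.currentTimeMillis() - 3600 * 1000L  // e.g. resume from one hour ago

// Resume from a point in time instead of LATEST / TRIM_HORIZON, so a
// restarted job does not have to re-read the whole stream.
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("my-kinesis-stream")                           // hypothetical stream name
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .checkpointAppName("my-app")
  .initialPosition(new KinesisInitialPositions.AtTimestamp(new Date(startMillis)))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
{code}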
Hi Team,
Could I please pull some attention towards the pull request on
Spark-Kinesis operability?
We have iterated over the patch for the past few months, and it would be
great to have a final review. I think it's very close now. I would love to
work on improvements, if any.
This patch
Hi Fellow Spark developers/ PMC Members,
I am a new member of the community and have started my tiny contributions
to the Spark-Kinesis integration. I am trying to fill in the gaps in making
Spark operate with Kinesis as nicely as it does with Kafka.
I am writing this mail to highlight an issue with the kinesis mo
Hi All,
I've been working on a pull request [1] to allow Spark to read from a
specific timestamp in Kinesis. I have iterated on the patch with the help
of other contributors and we think that it's in a good state now.
This patch would save hours of crash recovery time for Spark while reading
off Kinesi
Hi Fellow Devs,
I have noticed the Spark parquet reader behaves very differently in two
scenarios over the same data set:
1. passing a single parent path to the data, vs
2. passing all the files individually to parquet(paths: String*)
The parent path has about ~50K files. The first option
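To make the two scenarios concrete, a small sketch assuming a SparkSession named spark; the paths and the file-listing helper are placeholders. One known source of divergence is partition discovery, which only happens for option 1 unless a basePath is supplied.
{code}
// Option 1: point the reader at the parent directory; Spark discovers the
// ~50K files and infers partition columns from the directory layout.
val byParent = spark.read.parquet("s3://bucket/events/")

// Option 2: enumerate every leaf file and pass them all to
// parquet(paths: String*). Partition columns are NOT inferred this way
// unless basePath is set, so the resulting schema can differ.
val files: Seq[String] = listLeafFiles("s3://bucket/events/")  // hypothetical helper
val byFiles = spark.read
  .option("basePath", "s3://bucket/events/")  // aligns partition inference with option 1
  .parquet(files: _*)
{code}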
Hi Fellow Devs,
Please share your thoughts on the pull request that allows Spark to retry
more gracefully with Kinesis streaming.
The patch removes simple hard-coded values and allows users to pass the
values in config. This will help users cope with Kinesis throttling errors
and
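For context, roughly what this looks like from the user side. The config key names and value formats below are the ones proposed in the patch, so treat them as assumptions until it is merged.
{code}
import org.apache.spark.SparkConf

// Replace the hard-coded retry wait / attempt count with user-tunable values
// (key names as proposed in the patch; values are illustrative):
val conf = new SparkConf()
  .set("spark.streaming.kinesis.retry.waitTime", "500ms")
  .set("spark.streaming.kinesis.retry.maxAttempts", "5")
{code}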
You're probably interested in the S3PartitionedOutputCommitter.
>
> rb
>
> On Thu, Apr 6, 2017 at 10:08 PM, Yash Sharma wrote:
>
> Hi All,
> This is another issue that I was facing with the Spark-S3 operability
> and wanted to ask the broader community if it's faced by anyone else.
Hi All,
This is another issue that I was facing with the Spark-S3 operability and
wanted to ask the broader community if it's faced by anyone else.
I have a rather simple aggregation query with a basic transformation. The
output however has a lot of output partitions (20K partitions). The spark
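For anyone with a similarly shaped job, a sketch of keeping the output file count under control before a partitioned write. The DataFrame, column names, and path are all hypothetical; it also assumes a SparkSession named spark.
{code}
import spark.implicits._

// Repartition by the partition columns first, so each output partition is
// written by one task instead of getting a small fragment from every task.
aggregatedDF
  .repartition($"year", $"month", $"date")  // hypothetical partition columns
  .write
  .partitionBy("year", "month", "date")
  .mode("overwrite")
  .parquet("s3://bucket/output/")           // placeholder path
{code}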
Hi fellow Spark Devs,
If anyone here has some experience with Spark-Kinesis streaming, would it
be possible to provide your thoughts on this pull request [1]?
Some info:
The patch removes two important hard-coded values for Kinesis retries and
will make Kinesis recovery from crashes more reliable.
Hello fellow spark devs, hope you are doing fabulous,
Dropping a brain dump here about the Spark-Kinesis integration. I am able
to get Spark-Kinesis to work perfectly under ideal conditions, but see a
lot of open ends when things are not so ideal. I feel these open ends are
specific
Sorry for the spam, used the wrong email address.
On Wed, 22 Mar 2017 at 12:01 Yash Sharma wrote:
> subscribe to spark dev list
>
subscribe to spark dev list
files you are trying to read? Number of
> executors are very high
> On 24 Sep 2016 10:28, "Yash Sharma" wrote:
>
>> Have been playing around with configs to crack this. Adding them here
>> where it would be helpful to others :)
>> Number of executors and timeout se
memory. This can be around 48 assuming 12 nodes x 4 cores each. You could
> start with processing a subset of your data and see if you are able to get
> a decent performance. Then gradually increase the maximum # of execs for
> dynamic allocation and process the remaining data.
>
>
:27 AM, Yash Sharma wrote:
> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> be 12 executors for testing and let me know the status.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma"
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact
shows it as 168510, which is on the very high side. Try reducing your executors.
>
>
> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>
>> Hi All,
>> I have a spark job which runs over a huge bulk of data with Dynamic
>> allocation enabled.
>> The job takes
Hi All,
I have a spark job which runs over a huge bulk of data with Dynamic
allocation enabled.
The job takes some 15 minutes to start up and fails as soon as it starts.
Is there anything I can check to debug this problem? There is not a lot of
information in the logs about the exact cause, but here is
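Not a fix, but for later readers: the replies above converge on the dynamic-allocation executor target being absurdly high (168510), so capping it is the first knob to try. The values below are illustrative only.
{code}
import org.apache.spark.SparkConf

// Cap dynamic allocation so startup does not request an enormous number of
// executors; tune the numbers to the actual cluster (these are guesses).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "12")
  .set("spark.dynamicAllocation.maxExecutors", "48")  // e.g. 12 nodes x 4 cores
{code}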
Hi All,
While writing a partitioned data frame as partitioned text files, I see that
Spark deletes all existing partitions while writing a few new partitions.
dataDF.write.partitionBy("year", "month",
> "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")
Is this expected behavior?
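That is what SaveMode.Overwrite does to the whole output path by default. For readers on newer Spark (2.3+), there is a dynamic partition-overwrite setting aimed at exactly this; a minimal sketch, assuming the same dataDF and path and a SparkSession named spark:
{code}
import org.apache.spark.sql.SaveMode

// Spark 2.3+: overwrite only the partitions present in the incoming
// DataFrame instead of truncating the entire output directory first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Overwrite)
  .text("s3://data/test2/events/")
{code}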
"p2", "p3", "p4",
> "p5").text(dir.getCanonicalPath)
> val newDF = spark.read.text(dir.getCanonicalPath)
> newDF.show()
>
> df.write.partitionBy("p1", "p2", "p3", "p4", "p5")
> .mode(SaveMo
Hi All,
I have been using the parquet append mode for writes, which works just
fine. Just wanted to check if the same is supported for the plain text
format. The code below blows up with an error saying the file already exists.
{code}
userEventsDF.write.mode("append").partitionBy("year", "month",
"date")
Tb data of around 400 Megs gz files. The
workload is a scan/filter/reduceBy which needs to scan the entire data.
On Sat, May 21, 2016 at 11:07 AM, Yash Sharma wrote:
> The median GC time is 1.3 mins for a median duration of 41 mins. What
> parameters can I tune for controlling GC?
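A hedged starting point for the GC question, not a recommendation: the flags below are standard Spark/JVM knobs, and the values are guesses to experiment with.
{code}
import org.apache.spark.SparkConf

// Try G1GC on the executors and surface GC logs so the time is attributable;
// lowering spark.memory.fraction trades cache space for less GC pressure.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails")
  .set("spark.memory.fraction", "0.5")  // default is 0.6; illustrative value
{code}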
ynold Xin" wrote:
> It's probably due to GC.
>
> On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote:
>
>> Hi All,
>> I am here to get some expert advice on a use case I am working on.
>>
>> Cluster & job details below -
>>
>> Data - 6
Hi All,
I am here to get some expert advice on a use case I am working on.
Cluster & job details below -
Data - 6 Tb
Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps)
Parameters-
--executor-memory 10G \
--executor-cores 6 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.d
log? Most of the time, it
> shows more details. We are using CDH; the log is at:
>
>
>
> [yucai@sr483 container_1457699919227_0094_01_14]$ pwd
>
>
> /mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_14
>
> [yucai@sr483 con
Hi All,
I am trying Spark SQL on a ~16 TB dataset with a large number of files
(~50K). Each file is roughly 400-500 MB.
I am issuing a fairly simple Hive query on the dataset with just filters
(no groupBys or joins), and the job is very slow. It runs for 7-8 hrs
and processes about 80-100 Gig
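One assumption worth checking for this kind of throughput (not a confirmed diagnosis): if the files are gzip-compressed, they are not splittable, so each 400-500 MB file is decompressed by a single task. Repartitioning right after the scan at least spreads the downstream filter work, at the cost of a shuffle. Sketch assumes a SparkSession named spark and a placeholder path.
{code}
// Gzip is not splittable: one task per file regardless of file size.
// Repartition after reading so later stages can use the whole cluster.
val raw = spark.read.text("s3://bucket/dataset/")  // placeholder path/format
val spread = raw.repartition(2000)                 // illustrative parallelism
{code}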
{ // Doesn't work !!
>   rdd =>
>     println(rdd.count)
>     println("rdd isempty:" + rdd.isEmpty)
> }*/
unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => { // Works, Yeah !!
>   println(rdd.count)
>   println("rdd isempty:" + rdd.isEmpty)
>
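For later readers, the likely reason the explicitly-typed closure behaves differently: DStream.foreachRDD is overloaded, and Scala does not infer lambda parameter types across overloads, so annotating the types is what selects the two-argument variant. A self-contained sketch, with unionStreams assumed to be the DStream[Array[Byte]] from the thread:
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// DStream exposes two overloads (paraphrased from the Spark Streaming API):
//   def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
//   def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit
// Explicit parameter types pick the (RDD, Time) overload unambiguously:
unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  if (!rdd.isEmpty()) println(s"batch at $time: ${rdd.count()} records")
})
{code}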
incompatibilities, either
> due to protobuf or jackson. That may be your culprit. The problem is that
> all failures by the Kinesis Client Lib are silent, therefore they don't
> show up in the logs. It's very hard to debug those buggers.
>
> Best,
> Burak
>
> On Sat, Jan 30, 2016
> w.r.t. protobuf-java version mismatch, I wonder if you can rebuild Spark
> with the following change (using maven):
>
> http://pastebin.com/fVQAYWHM
>
> Cheers
>
> On Sat, Jan 30, 2016 at 12:49 AM, Yash Sharma wrote:
>
>> Hi All,
>> I have a quick question if an
Hi All,
I have a quick question if anyone has experienced this here.
I have been trying to get Spark to read events from Kinesis recently but am
having problems receiving the events. While Spark is able to connect to
Kinesis and get metadata from it, it's not able to get
events from
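Besides the protobuf/jackson mismatch suggested above, one assumption worth ruling out is the initial position: with LATEST, a consumer started after the producer sees nothing until new records arrive. A sketch against the Spark 1.x-era receiver API; ssc, the app name, and the stream name are assumed.
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// TRIM_HORIZON replays records already in the stream; LATEST only delivers
// records produced after the receiver starts.
val stream = KinesisUtils.createStream(
  ssc, "my-app", "my-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON,
  Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
{code}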