Hi Enrico, Using Spark version 3.1.3 and turning AQE off seems to fix the
sorting. I'm looking into why; do you have any thoughts?
Thanks, Swetha
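For reference, a minimal sketch of how AQE is typically toggled for this kind of experiment; the config key is Spark's standard one, the session setup is my own assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-toggle-sketch").getOrCreate()
// Disable Adaptive Query Execution for this session to compare sort behaviour.
spark.conf.set("spark.sql.adaptive.enabled", "false")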
On Sat, Sep 17, 2022 at 1:58 PM Enrico Minack
wrote:
> Hi,
>
> from a quick glance over your transformations, sortCol should be sorted.
>
>
7960|
|1207876|581|4990757154529|4990796737202|
|     10|581|4990806212169|4990751997961|
|1207876|581|4990803020856|4990796737203|
+-------+---+-------------+-------------+
Hi!
We expected the order of sorted partitions to be preserved after a
dataframe write. We use the following code to write out one file per
partition, with the rows sorted by a column.
df.repartition($"col1").sortWithinPartitions("col1", "col2")
.write.partitionBy("col1").cs
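A fuller, self-contained sketch of that write pattern (the output format and path are assumptions, since the snippet above is cut off at ".cs"):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-partition-write").getOrCreate()
import spark.implicits._

// Toy data standing in for the real dataframe.
val df = Seq((1, 5L), (1, 3L), (2, 9L)).toDF("col1", "col2")

df.repartition($"col1")                   // hash-partition so each col1 value lands in a single partition
  .sortWithinPartitions("col1", "col2")   // sort the rows inside each partition
  .write
  .partitionBy("col1")                    // one output directory per col1 value
  .csv("/tmp/sorted-partition-write")     // hypothetical path; ".cs" above presumably continues as .csv(...)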
type=U, then I have to store all of the row's data in
a separate table called Table3.
Can anyone help me with how to read the data row by row, split the columns,
apply the condition based on the indicator type, and store the column data in
the respective tables?
Thanks,
Swetha
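A minimal sketch of the kind of split being asked about; the column name, the second indicator value, and the other table name are hypothetical, since the original message is truncated:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("indicator-split-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: each row carries an indicator column named "type".
val rows = Seq(("U", "a", 1), ("I", "b", 2), ("U", "c", 3)).toDF("type", "key", "value")

// Rows with type=U go to one table, the rest to another; no per-row loop is needed.
rows.filter(col("type") === "U")
  .write.mode(SaveMode.Append).saveAsTable("Table3")      // table name from the message
rows.filter(col("type") =!= "U")
  .write.mode(SaveMode.Append).saveAsTable("other_rows")  // hypothetical table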
Glad to help!
On Sat, Jul 13, 2019 at 12:17 PM Gourav Sengupta
wrote:
> Hi Swetha,
> I always look into the source code a lot, but it never occurred to me to
> look into the test suite - thanks a ton for the tip. It definitely gives
> quite a few ideas.
>
>
.
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Regards
Swetha
> On Jul 11,
If you are using Spark 2.4.0, I think you can try something like this:
.option("quote", "\u0000")
.option("emptyValue", "")
.option("nullValue", null)
Regards
Swetha
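A sketch of how those options might be applied end-to-end (my own example, not from the thread; the NUL character "\u0000" is the usual reading of the garbled quote option above, and the path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-options-sketch").getOrCreate()

// Reading a CSV this way is the usual motivation for touching quote/emptyValue/nullValue
// together: keep empty strings distinct from missing values.
val df = spark.read
  .option("quote", "\u0000")       // NUL quote char, effectively disabling quote handling
  .option("emptyValue", "")        // treat "" as an empty string rather than null
  .option("nullValue", null)       // mirrors the suggestion above
  .csv("/tmp/input.csv")           // hypothetical input path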
> On Jul 11, 2019, at 1:45 PM, Anil Kulkarni wrote:
>
> Hi Spark users,
>
Thanks TD, but the SQL plan does not seem to provide any information on
which stage is taking longer or where the bottlenecks are across the
various stages. Spark Kafka Direct used to provide information about the
various stages in a micro-batch and the time taken by each stage. Is there
a way to fin
-> "test1"
)
val hubbleStream = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)
val kafkaStreamRdd = kafkaStream.transform { rdd =>
rdd.map(c
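A self-contained sketch of the kafka-0-10 direct-stream setup the fragment above is using; the broker address, group id, and batch interval are assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("direct-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",         // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sketch-group",                    // hypothetical group id
  "auto.offset.reset" -> "latest"
)
val topicsSet = Set("test1")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)

// Pull just the record values out of each micro-batch.
val values = stream.transform { rdd => rdd.map(_.value()) }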
There is no difference in performance even with Cache being enabled.
On Mon, Aug 28, 2017 at 11:27 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> There is no difference in performance even with Cache being disabled.
>
> On Mon, Aug 28, 2017 at 7:43 AM, Cody Koe
nable enough. If preferred locations is
> behaving correctly you shouldn't need cached consumers for all 96
> partitions on any one executor, so that maxCapacity setting is
> probably unnecessary.
>
> On Fri, Aug 25, 2017 at 7:04 PM, swetha kasireddy
> wrote:
> > Becaus
er.poll.ms" -> Integer.*valueOf*(1024),
http://markmail.org/message/n4cdxwurlhf44q5x
https://issues.apache.org/jira/browse/SPARK-19185
Also, I have a batch interval of 60 seconds. What do you suggest the following
be set to?
session.timeout.ms, heartbeat.interval.ms
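For what it's worth, a sketch of where those two settings sit in the consumer params; the values are illustrative assumptions (Kafka's general guidance is to keep heartbeat.interval.ms at roughly a third of session.timeout.ms), not a recommendation from this thread:

// Hypothetical values for a 60-second batch; tune against the broker's
// group.min.session.timeout.ms / group.max.session.timeout.ms limits.
val timeoutParams = Map[String, Object](
  "session.timeout.ms" -> Integer.valueOf(30000),
  "heartbeat.interval.ms" -> Integer.valueOf(10000)   // ~1/3 of the session timeout
)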
On Fri, Aug 25, 2017 at 5:04 PM, sw
Because I saw some posts saying that enabling the consumer cache leads to a
ConcurrentModificationException with reduceByKeyAndWindow. I see those
errors as well after running for some time with the cache enabled, so I
had to disable it. Please see the tickets below. We have 96 partitions. So
if I
Hi Cody,
I think Assign is used if we want it to start from a specified offset.
What if we want it to start from the latest offset, the way
"auto.offset.reset" -> "latest" does?
Thanks!
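A sketch of the two starting-point styles being contrasted (the topic, offsets, broker, and params are assumptions): ConsumerStrategies.Assign pins explicit per-partition offsets, while Subscribe plus "auto.offset.reset" -> "latest" starts from the latest available offsets.

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // assumed broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sketch-group",                     // hypothetical group id
  "auto.offset.reset" -> "latest"
)

// Start from explicit offsets (hypothetical values):
val fromOffsets = Map(
  new TopicPartition("test1", 0) -> 12345L,
  new TopicPartition("test1", 1) -> 67890L
)
val assign = ConsumerStrategies.Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)

// Or simply subscribe and let auto.offset.reset pick the latest offset:
val subscribe = ConsumerStrategies.Subscribe[String, String](Set("test1"), kafkaParams)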
On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger wrote:
> Yes, you can start from
beyond 2
minutes when trying to recover from checkpoint. Any suggestions on this
would be of great help.
sparkConf.set("spark.streaming.minRememberDuration","120s")
sparkConf.set("spark.streaming.fileStream.minRememberDuration","120s")
Thanks,
l 13, 2017 at 1:01 PM, SRK wrote:
>
>> Hi,
>>
>> Do we need to specify checkpointing for mapWithState just like we do for
>> updateStateByKey?
>>
>> Thanks,
>> Swetha
>>
>>
>>
Yes, the Spark UI has some information, but it's not that helpful for finding
out which particular stage is taking time.
On Wed, Jun 28, 2017 at 12:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> You can find the information from the spark UI
>
> ---Original---
> *From:* "SRK"
> *Date:* 2017/6/28 02:36:37
at 8:43 AM, swetha kasireddy
wrote:
> I changed the data structure to scala.collection.immutable.Set and I still
> see the same issue. My key is a String. I do the following in my reduce
> and invReduce.
>
> visitorSet1 ++visitorSet2.toTraversable
>
>
> visitorSet1 --v
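A self-contained sketch of the reduce/inverse-reduce pattern those ++/-- fragments come from; the window and slide durations are assumptions, and the inverse-reduce form requires checkpointing to be enabled:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def windowedVisitors(visitorsPerKey: DStream[(String, Set[String])]): DStream[(String, Set[String])] =
  visitorsPerKey.reduceByKeyAndWindow(
    (set1: Set[String], set2: Set[String]) => set1 ++ set2,   // reduce: union of the two visitor sets
    (set1: Set[String], set2: Set[String]) => set1 -- set2,   // inverse reduce: drop the departing batch
    Seconds(600),                                             // window length (assumed)
    Seconds(60)                                               // slide interval (assumed)
  )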
OK. Can we use Spark Kafka Direct with Structured Streaming?
On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy
wrote:
> OK. Can we use Spark Kafka Direct as part of Structured Streaming?
>
> On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das > wrote:
>
>> YES. At Databrick
tabricks.com/blog/2017/06/06/simple-super-fast-
> streaming-engine-apache-spark.html
>
> On Thu, Jun 8, 2017 at 3:03 PM, SRK wrote:
>
>> Hi,
>>
>> Is structured streaming ready for production usage in Spark 2.2?
>>
>> Thanks,
>> Swetha
>>
>>
:
> Yes, and in general any mutable data structure. You have to use immutable data
> structures whose hashCode and equals are consistent enough for them to be put in
> a set.
>
> On Jun 6, 2017 4:50 PM, "swetha kasireddy"
> wrote:
>
>> Are you suggesting against the
Are you suggesting against the usage of HashSet?
On Tue, Jun 6, 2017 at 3:36 PM, Tathagata Das
wrote:
> This may be because HashSet is a mutable data structure, and it seems
> you are actually mutating it in "set1 ++set2". I suggest creating a new
> HashMap in the function (and add both maps
Would even the Hive configurations like the following work with this?
sqlContext.setConf("hive.default.fileformat", "Orc")
sqlContext.setConf("hive.exec.orc.memory.pool", "1.0")
sqlContext.setConf("hive.optimize.sort.dynamic.partition", "true")
sqlContext.setConf("hive.exec.reducer
ing consultant
> www.mapflat.com
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS
>
>
>
> On Thu, Jun 30, 2016 at 8:19 PM, SRK wrote:
> > Hi,
> >
> > I need to do integration tests using Spark Streaming. My idea is to spin
> > up kafka using docker locally and use it to feed the stream to my Streaming
> > Job. Any suggestions on how to do this would be of great help.
> >
> > Thanks,
> > Swetha
> >
sampleMap is populated from inside a method that is getting called from
updateStateByKey
On Thu, Jun 23, 2016 at 1:13 PM, Ted Yu wrote:
> Can you illustrate how sampleMap is populated ?
>
> Thanks
>
> On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote:
>
>> Hi,
>>
>> I keep getting the following error
Hi Mich,
No, I have not tried that. My requirement is to insert the data from an hourly
Spark batch job. How would it be different to insert with the Hive CLI or
beeline?
Thanks,
Swetha
On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh wrote:
> Hi Swetha,
>
> Have you actually tried d
Hi Bijay,
This approach might not work for me, as I have to do partial
inserts/overwrites in a given table, and data_frame.write.partitionBy will
overwrite the entire table.
Thanks,
Swetha
On Mon, Jun 13, 2016 at 9:25 PM, Bijay Pathak
wrote:
> Hi Swetha,
>
> One option is to use Hive
rId, ps.userRecord,
ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
""".stripMargin)
On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Hi Bijay,
>
> If I am hitting this issue,
> https://issues.ap
Hi Bijay,
If I am hitting this issue,
https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done?
Is upgrading to a higher version of Hive the only solution?
Thanks!
On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Hi,
>
>
No, I am reading the data from HDFS, transforming it, registering it as a
temp table using registerTempTable, and then doing an insert overwrite
using Spark SQL's hiveContext.
On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh
wrote:
> how are you doing the insert? from an existing table?
>
> Dr
400 cores are assigned to this job.
On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch wrote:
> How many workers (/cpu cores) are assigned to this job?
>
> 2016-06-09 13:01 GMT-07:00 SRK :
>
>> Hi,
>>
>> How to insert data into 2000 partitions(directories) of ORC/parquet at a
>> time using Spark SQ
> On 22 May 2016 at 20:38, swetha kasireddy
> wrote:
>
>> Around 14000 partitions need to be loaded every hour. Yes, I tested this
>> and it's taking a lot
The data is not very big, say 1 MB-10 MB at the max per partition. What is
the best way to insert these 14k partitions with decent performance?
On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh wrote:
> the acid question is how many rows are you going to insert in a batch
> session? btw if this is
> On 22 May 2016 at 19:11, swetha kasireddy
> wrote:
>
>> I am looking at ORC. I insert the data using the following query.
>>
>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records
I am looking at ORC. I insert the data using the following query.
sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
stored as ORC LOCATION '/user/users' ")
sqlContext.sql(" orc.compress= SNAPPY"
Also, the Spark SQL insert seems to take only two tasks per stage. That
might be the reason why it does not have sufficient memory. Is there a way
to increase the number of tasks when doing the sql insert?
(Spark UI stage table header: Stage Id, Description, Submitted, Duration, Tasks: Succeeded/Total, Input, Output, Shuffle Read, Shuffle Write)
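A sketch of the two knobs that usually control this; the values are placeholders, and sqlContext is the one from the thread:

import org.apache.spark.sql.DataFrame

// Raise the number of post-shuffle partitions used by Spark SQL (the default is 200):
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// Or repartition the data that feeds the insert so the write stage gets more tasks:
def widen(toInsert: DataFrame): DataFrame = toInsert.repartition(400)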
Hi Lars,
Do you have any examples for the methods that you described for Spark batch
and Streaming?
Thanks!
On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson wrote:
> Thanks!
>
> It is on my backlog to write a couple of blog posts on the topic, and
> eventually some example code, but I am curre
> is why I'm saying if you're losing leaders, you should look at Kafka.
>
> On Fri, Apr 29, 2016 at 11:21 AM, swetha kasireddy
> wrote:
> > OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a
> > default for rebalancing and they say that ref
refresh.leader.backoff.ms and then retry
again depending on the number of retries?
On Fri, Apr 29, 2016 at 8:14 AM, swetha kasireddy wrote:
> OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a
> default for rebalancing and they say that refresh.leader.backoff.ms of
> 200 t
OK. So on the Kafka side they use a rebalance.backoff.ms of 2000, which is the
default for rebalancing, and they say that a refresh.leader.backoff.ms of 200
for refreshing the leader is very aggressive; they suggested we increase it to
2000. Even after increasing it to 2500 I still get Leader Lost errors.
Is refresh.
OK. I did take a look at them. So once I have an accumulator for a HashSet,
how can I check if a particular key is already present in the HashSet
accumulator? I don't see any .contains method there. My requirement is that
I need to keep accumulating the keys in the HashSet across all the tasks in
v
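A sketch using the newer AccumulatorV2 API (not necessarily what was being used in this thread) of a set accumulator whose value can be checked with .contains on the driver once the job has run; inside tasks you can only add, since accumulator values are not readable there:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val underlying = mutable.Set.empty[String]
  override def isZero: Boolean = underlying.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.underlying ++= underlying
    acc
  }
  override def reset(): Unit = underlying.clear()
  override def add(v: String): Unit = underlying += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit =
    underlying ++= other.value
  override def value: Set[String] = underlying.toSet
}

// Usage (driver side):
// val keys = new StringSetAccumulator
// sc.register(keys, "seenKeys")
// rdd.foreach(k => keys.add(k))
// val alreadySeen = keys.value.contains("someKey")   // hypothetical key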
Thanks. I tried this yesterday and it seems to be working.
On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton wrote:
> Hi,
>
> Based on the behaviour I've seen using parquet, the number of partitions
> in the DataFrame will determine the number of files in each parquet
> partition.
>
> I.e. when yo
It seems to be failing when I do something like the following in both
sqlContext and hiveContext:
sqlContext.sql("SELECT ssd.savedDate from saveSessionDatesRecs ssd
where ssd.partitioner in (SELECT sr1.partitioner from
sparkSessionRecords1 sr1)")
On Tue, Feb 23, 2016 at 5:57 PM, swetha
These tables are stored in HDFS as Parquet. Can sqlContext be used for
the subqueries?
On Tue, Feb 23, 2016 at 5:31 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Assuming these are all in Hive, you can either use spark-sql or
> spark-shell.
>
> HiveContext has r
of
a number of small files and also to be able to scan faster.
Something like ...
df.write.format("parquet").partitionBy("userIdHash", "userId").mode(SaveMode.Append).save("userRecords")
On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy wrote:
> So
Can you describe what you are trying to accomplish? What would the custom
> partitioner be?
>
> On Tue, Feb 16, 2016 at 1:21 PM, SRK wrote:
>
>> Hi,
>>
>> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>>
>>
>> Thanks,
>&
How do I use a custom partitioner hashed by userId inside saveAsTable with a
dataframe?
On Mon, Feb 15, 2016 at 11:24 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> How about saving the dataframe as a table partitioned by userId? My User
> records have userId, number of ses
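DataFrame writes don't take an RDD-style custom partitioner; a sketch (my own workaround, using Spark 2.x column functions and made-up names) of approximating one by partitioning on a hash of userId:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val spark = SparkSession.builder().appName("hash-bucket-write").getOrCreate()

// Toy stand-in for the user records described above.
val userRecords = spark.range(100).selectExpr("cast(id as string) as userId", "id % 7 as numSessions")

userRecords
  .withColumn("userIdHash", pmod(hash(col("userId")), lit(32)))   // 32 buckets is arbitrary
  .repartition(col("userIdHash"))                                  // co-locate each bucket before writing
  .write
  .mode(SaveMode.Append)
  .partitionBy("userIdHash")
  .saveAsTable("user_records_by_hash")                             // hypothetical table name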
AS TotalSales
>
> FROM t_s, t_t, t_c
>
> WHERE t_s.TIME_ID = t_t.TIME_ID
>
> AND t_s.CHANNEL_ID = t_c.CHANNEL_ID
>
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 10
>
>
>
OK. Would it only query for the records that I want in Hive, as per the filter,
or would it load the entire table? My user table will have millions of records,
and I do not want to cause OOM errors by loading the entire table into memory.
On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh
wrote:
> Also worthw
Hi,
I want to edit/delete a message posted in Spark User List. How do I do that?
Thanks!
r in the qa job restarted the job
> automatically and the application UI was up. But, in the prod job, the
> driver did not restart the application. Any idea as to why the prod driver
> was not able to restart the job, with everything being the same in qa/prod including
> the --supervise opt
OK. What should the table be? Suppose I have a bunch of parquet files, do I
just specify the directory as the table?
On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY
wrote:
> Ok, so whats wrong in using :
>
> var df=HiveContext.sql("Select * from table where id = ")
> //filtered data frame
> df.
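A sketch (paths and the id value are assumptions, and sqlContext is the one from the thread) of pointing the reader straight at the Parquet directory instead of needing a pre-existing Hive table, using the 1.x-era API that appears elsewhere in these messages:

// Read the directory of parquet files directly and filter:
val users = sqlContext.read.parquet("/user/users/records")   // hypothetical directory
val one = users.filter(users("id") === "someId")             // hypothetical id value

// Or register it and query with SQL:
users.registerTempTable("userRecords")
val viaSql = sqlContext.sql("SELECT * FROM userRecords WHERE id = 'someId'")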
rtition/foreachPartition in a stage. Any idea as to why this is
>> happening?
>>
>> Thanks,
>> Swetha
>>
>>
>>
Hi,
How to verify whether the GangliaSink directory got created?
Thanks,
Swetha
On Mon, Dec 15, 2014 at 11:29 AM, danilopds wrote:
> Thanks tsingfu,
>
> I used this configuration based on your post (with ganglia unicast mode):
> # Enable GangliaSink for all instances
> *.sin
?
>
> Thanks,
> Swetha
>
>
>
Any documentation/sample code on how to use Ganglia with Spark?
On Sat, Dec 5, 2015 at 10:29 PM, manasdebashiskar
wrote:
> spark has capability to report to ganglia, graphite or jmx.
> If none of that works for you you can register your own spark extra
> listener
> that does your bidding.
>
> ..
2015 at 3:40 PM, Cody Koeninger wrote:
> KafkaRDD.scala , handleFetchErr
>
> On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> How to look at Option 2(see the following)? Which portion of the code i
Following is the Option 2 that I was talking about:
2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated in the
backlog (if there is one)?
On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy
wrote:
> Hi
Hi Cody,
How do I look at Option 2 (see the following)? Which portion of the code in
Spark Kafka Direct should I look at to handle this issue specific to our
requirements?
2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated
t;
> The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can
> try adjusting that to get a longer sleep before retrying the task.
>
> On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>>
heckpoint folder to help the job
>> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>>
>> -adrian
>>
>> From: swetha kasireddy
>> Date: Monday, November 9, 2015 at 10:40 PM
>> To: Cody Koeninger
>> Cc: "user@spark.apache.org&qu
>
> Also in the application, is the main thread waiting on
> streamingContext.awaitTermination()? That is designed to catch exceptions
> in running job and throw it in the main thread, so that the java program
> exits with an exception and non-zero exit code.
>
>
>
>
gt;> automatically. This is in a cluster mode. Any suggestion on how to make
>> Automatic Driver Restart work would be of great help.
>>
>> --supervise
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
's failing for all on the same leader?
> Have there been any leader rebalances?
> Do you have enough log retention?
> If you log the offset for each message as it's processed, when do you see
> the problem?
>
> On Tue, Nov 24, 2015 at 10:28 AM, swetha kasireddy <
>
:31 AM, Cody Koeninger wrote:
> No, the direct stream only communicates with Kafka brokers, not Zookeeper
> directly. It asks the leader for each topicpartition what the highest
> available offsets are, using the Kafka offset api.
>
> On Mon, Nov 23, 2015 at 11:36 PM, s
eported the ending offset was 221572238, but during processing, kafka
> stopped returning messages before reaching that ending offset.
>
> That probably means something got screwed up with Kafka - e.g. you lost a
> leader and lost messages in the process.
>
> On Mon, Nov 23, 2015 at
intln(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
...
}
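For completeness, a sketch of the full offset-logging pattern that fragment belongs to; stream here is the direct stream created earlier, and the import path shown is the kafka-0-10 one (the 0.8 integration exposes the same HasOffsetRanges under org.apache.spark.streaming.kafka):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The cast must be done on the RDD returned directly by the direct stream.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  // ... process the batch here ...
}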
On Mon, Nov 23, 2015 at 6:31 PM, swetha kasireddy wrote:
> Also, does Kafka direct query the offsets from the zookeeper directly?
> From where does it get the offsets? There is data in those offsets,
Also, does Kafka direct query the offsets from the zookeeper directly? From
where does it get the offsets? There is data in those offsets, but somehow
Kafka Direct does not seem to pick it up?
On Mon, Nov 23, 2015 at 6:18 PM, swetha kasireddy wrote:
> I mean to show the Spark Kafka Dir
mer reporting?
>
> I'd log the offsets in your spark job and try running
>
> kafka-simple-consumer-shell.sh --partition $yourbadpartition
> --print-offsets
>
> at the same time your spark job is running
>
> On Mon, Nov 23, 2015 at 7:37 PM, swetha wrote:
>
>>
Hi,
We see a bunch of issues like the following in our Spark Kafka Direct jobs. Any
idea how to make the Kafka Direct consumers show up in Kafka consumer
reporting so we can debug this issue?
Job aborted due to stage failure: Task 47 in stage 336.0 failed 4 times,
most recent failure: Lost task 47.3 in stag
):
java.lang.AssertionError: assertion failed: Ran out of messages before
reaching ending offset 221572238 for topic hubble_stream partition 88 start
221563725. This should not happen, and indicates that messages may have been
lost
Thanks,
Swetha
at the end.
>
> Cheers
>
> On Wed, Nov 18, 2015 at 7:28 PM, swetha wrote:
>
>> Hi,
>>
>> We have a lot of temp files that get created due to shuffles caused by
>> group by. How do we clear the files that get created by intermediate
>> operations in group by
Hi,
Has anybody used a FastUtil equivalent of HashSet for Strings in Spark? An
example would be of great help.
Thanks,
Swetha
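Not something from this thread, but a sketch of the kind of usage being asked about: fastutil's ObjectOpenHashSet used inside mapPartitions to dedupe String keys; it assumes the fastutil jar is shipped with the job:

import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet
import org.apache.spark.rdd.RDD

def dedupeKeys(keys: RDD[String]): RDD[String] =
  keys.mapPartitions { iter =>
    val seen = new ObjectOpenHashSet[String]()
    iter.filter(key => seen.add(key))   // add() returns false when the key is already present
  }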
That was actually an issue with our Mesos.
On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das wrote:
> If possible, could you give us the root cause and solution for future
> readers of this thread.
>
> On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy <
> swethakasire...@gmail.c
Hi,
We have a lot of temp files that get created due to shuffles caused by
group by. How do we clear the files that get created by intermediate
operations in group by?
Thanks,
Swetha
It works fine after some changes.
-Thanks,
Swetha
On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das wrote:
> Can you verify that the cluster is running the correct version of Spark.
> 1.5.2.
>
> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
> swethakasire...@gmail.com&
Looks like I can use mapPartitions, but can it be done using
foreachPartition?
On Tue, Nov 17, 2015 at 11:51 PM, swetha wrote:
> Hi,
>
> How to return an RDD of key/value pairs from an RDD that has
> foreachPartition applied. I have my code something like the following. It
> lo
Hi,
How to return an RDD of key/value pairs from an RDD that has
foreachPartition applied? I have my code looking something like the following.
It seems an RDD that has foreachPartition applied can only have Unit as the
return type. How do I apply foreachPartition, do a save, and at the same time return a
pair
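A sketch of the usual workaround (the names are made up): use mapPartitions, which can perform the per-partition save as a side effect and still return the key/value pairs, whereas foreachPartition returns Unit:

import org.apache.spark.rdd.RDD

def saveToStore(batch: Seq[String]): Unit = {
  // placeholder: write the batch to the external store of your choice
}

def saveAndEmit(records: RDD[String]): RDD[(String, Int)] =
  records.mapPartitions { iter =>
    val batch = iter.toList
    saveToStore(batch)                          // hypothetical side-effecting save, once per partition
    batch.map(r => (r, r.length)).iterator      // still return key/value pairs
  }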
at 7:19 PM, swetha kasireddy wrote:
> Hi TD,
>
> Basically, I see two issues. With provided, the job does
> not start locally. It does start in cluster mode, but it seems no data is getting
> processed.
>
> Thanks,
> Swetha
>
> On Tue, Nov 17, 2015 at 7:04 PM, Tim Barthram
&g
Hi TD,
Basically, I see two issues. With provided, the job does not
start locally. It does start in cluster mode, but it seems no data is getting
processed.
Thanks,
Swetha
On Tue, Nov 17, 2015 at 7:04 PM, Tim Barthram
wrote:
> If you are running a local context, could it be that you should
I see this error locally.
On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das wrote:
> Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>
> On Tue, Nov 17, 2015 at 5:34 PM, swetha wrote:
>
>>
>>
>> Hi,
>>
>> I see java.lang.NoClassDefF
Hi,
I see java.lang.NoClassDefFoundError after changing the Streaming job
version to 1.5.2. Any idea as to why this is happening? Following are my
dependencies and the error that I get.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${sparkVersion}</version>
  <scope>provided</scope>
</dependency>
Hi ,
What is the appropriate dependency to include for Spark IndexedRDD? I get a
compilation error if I include 0.3 as the version, as shown below:
<dependency>
  <groupId>amplab</groupId>
  <artifactId>spark-indexedrdd</artifactId>
  <version>0.3</version>
</dependency>
Thanks,
Swetha
inger wrote:
> Without knowing more about what's being stored in your checkpoint
> directory / what the log output is, it's hard to say. But either way, just
> deleting the checkpoint directory probably isn't sufficient to restart the
> job...
>
> On Mon, No
ory is a good way to restart
> the streaming job, you should stop the spark context or at the very least
> kill the driver process, then restart.
>
> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>>
ay that if you've regularly got problems with kafka falling over
> for half an hour, I'd look at fixing that before worrying about spark
> monitoring...
>
>
> On Mon, Nov 9, 2015 at 12:26 PM, swetha wrote:
>
>> Hi,
>>
>> How to recover Kafka Direct
when the broker fails for some time, say 30 minutes? What kind of monitors
should be in place to recover the job?
Thanks,
Swetha
I am using the following:
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>
On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu
wrote:
> Which Spark version used?
>
> It was fixed in Parquet-1.7x, so Spark-1.5.x will be work.
>
>
>
>
> > On Nov 9, 2015, at 3:43 PM, swetha w
Hi,
I see an unwanted warning when I try to save a Parquet file to HDFS from Spark.
Please find below the code and the warning message. Any idea how to
avoid it?
activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],
classOf[
Hi,
I see a lot of unwanted sysouts when I try to save an RDD as a Parquet file.
Following are the code and the
sysouts. Any idea how to avoid them?
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, ActiveSess
I think they are roughly of equal size.
On Fri, Nov 6, 2015 at 3:45 PM, Ted Yu wrote:
> Can you tell us a bit more about your use case ?
>
> Are the two RDDs expected to be of roughly equal size or, to be of vastly
> different sizes ?
>
> Thanks
>
> On Fri, Nov 6, 2015 a
Hi,
What is an efficient way to join two RDDs? Would converting both RDDs
to IndexedRDDs be of any help?
Thanks,
Swetha
IndexedRDD for a certain set of data
and then get those keys that are present in the IndexedRDD but not present
in some other RDD.
How would an IndexedRDD support such a use case in an efficient manner?
Thanks,
Swetha
On Mon, Nov 2, 2015 at 9:56 PM, Deenar Toraskar
wrote:
> Swetha
>
>
IndexedRDD for a certain set of data
and then get those keys that are present in the IndexedRDD but not present
in some other RDD.
How would an IndexedRDD support such a use case in an efficient manner?
Thanks,
Swetha
On Wed, Jul 15, 2015 at 2:46 AM, Jem Tucker wrote:
> This is v
I read about IndexedRDD. Is a join between an IndexedRDD and another RDD that
is not an IndexedRDD efficient?
On Mon, Nov 2, 2015 at 9:56 PM, Deenar Toraskar
wrote:
> Swetha
>
> Currently IndexedRDD is an external library and not part of Spark Core.
> You can use it by adding a depende
t[T]])
  sc.newAPIHadoopFile(
    parquetFile,
    classOf[ParquetInputFormat[T]],
    classOf[Void],
    tag.runtimeClass.asInstanceOf[Class[T]],
    jobConf)
    .map(_._2.asInstanceOf[T])
}
On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy
wrote:
> No scala. Suppose I read the Parquet file as sh