it can also be done with repartition + sortWithinPartitions + mapPartitions.
perhaps not as convenient but it does not rely on undocumented behavior.
i used this approach in spark-sorted. see here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/Group
I agreed that to make sure this work, you might need to know the Spark
internal implementation for APIs such as `groupBy`.
But without any more changes to current Spark implementation, I think this
is the one possible way to achieve the required function to aggregate on
sorted data per key.
Hello all
I've tried to add some task metrics in
org.apache.spark.executor.ShuffleReadMetrics.scala in Spark 2.0.2, following
the format of other existing metrics, but when submitting applications, I
got these errors:
ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry
it.
j
i think this works but it relies on groupBy and agg respecting the sorting.
the api provides no such guarantee, so this could break in future versions.
i would not rely on this i think...
On Dec 20, 2016 18:58, "Liang-Chi Hsieh" wrote:
Hi,
Can you try the combination of `repartition` + `sortWit
Your sample codes first select distinct zipcodes, and then save the rows of
each distinct zipcode into a parquet file.
So I think you can simply partition your data by using
`DataFrameWriter.partitionBy` API, e.g.,
df.repartition("zip_code").write.partitionBy("zip_code").parquet(.)
-
If you run the main driver and other Spark jobs in client mode, you can make
sure they (I meant all the drivers) are running at the same node. Of course
all drivers now consume the resources at the same node.
If you run the main driver in client mode, but run other Spark jobs in
cluster mode, the
I am not familiar of any problem with that.
Anyway, If you run spark applicaction you would have multiple jobs, which makes
sense that it is not a problem.
Thanks David.
From: Naveen [mailto:hadoopst...@gmail.com]
Sent: Wednesday, December 21, 2016 9:18 AM
To: dev@spark.apache.org; u...@spark.ap
Is there any reason you need a context on the application launching the
jobs?
You can use SparkLauncher in a normal app and just listen for state
transitions
On Wed, 21 Dec 2016, 11:44 Naveen, wrote:
> Hi Team,
>
> Thanks for your responses.
> Let me give more details in a picture of how I am tr
> On Dec 21, 2016, at 5:15 AM, Al Pivonka wrote:
>
>
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Hi,
val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd
def comp(zipcode:String):Unit={
val zipval = "SELECT * FROM df WHERE
zip_code='$zipvalrepl'".replace("$zipvalrepl",
zipcode)
val data = spark.sql(zipval)
data.write.parquet(
Incremental load traditionally means generating hfiles and
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
data into hbase.
For your use case, the producer needs to find rows where the flag is 0 or 1.
After such rows are obtained, it is up to you how the result of process
Ok, Sure will ask.
But what would be generic best practice solution for Incremental load from
HBASE.
On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote:
> I haven't used Gobblin.
> You can consider asking Gobblin mailing list of the first option.
>
> The second option would work.
>
>
> On Wed, Dec 2
I haven't used Gobblin.
You can consider asking Gobblin mailing list of the first option.
The second option would work.
On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri
wrote:
> Hello Guys,
>
> I would like to understand different approach for Distributed Incremental
> load from HBase, Is there
Thanks Liang!
I get your point. It would mean that when launching spark jobs, mode needs
to be specified as client for all spark jobs.
However, my concern is to know if driver's memory(which is launching spark
jobs) will be used completely by the Future's(sparkcontext's) or these
spawned sparkconte
Hi Sebastian,
Yes, for fetching the details from Hive and HBase, I would want to use
Spark's HiveContext etc.
However, based on your point, I might have to check if JDBC based driver
connection could be used to do the same.
Main reason for this is to avoid a client-server architecture design.
If
OK.
I think it is little unusual use pattern, but it should work.
As I said before, if you want those Spark applications to share cluster
resources, proper configs is needed for Spark.
If you submit the main driver and all other Spark applications in client
mode under yarn, you should make sure
Hi Team,
Thanks for your responses.
Let me give more details in a picture of how I am trying to launch jobs.
Main spark job will launch other spark-job similar to calling multiple
spark-submit within a Spark driver program.
These spawned threads for new jobs will be totally different components,
Hello Guys,
I would like to understand different approach for Distributed Incremental
load from HBase, Is there any *tool / incubactor tool* which satisfy
requirement ?
*Approach 1:*
Write Kafka Producer and maintain manually column flag for events and
ingest it with Linkedin Gobblin to HDFS / S
Hi,
As you launch multiple Spark jobs through `SparkLauncher`, I think it
actually works like you run multiple Spark applications with `spark-submit`.
By default each application will try to use all available nodes. If your
purpose is to share cluster resources across those Spark jobs/application
21 matches
Mail list logo