Works like a charm. Thanks a lot!
-jan
On 22 Dec 2015, at 23:08, Michael Armbrust <mich...@databricks.com> wrote:
You need to say .mode("append") if you want to append to existing data.
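For the example in this thread, that would look roughly like this (a sketch; it assumes spark-avro and the implicits needed for toDF are in scope, as in Jan's snippet below):

val df2 = Seq((2013, "Batman")).toDF("year", "title")
// "append" adds the new partition directory under the existing base path
// instead of failing with 'already exists'.
df2.write.mode("append").partitionBy("year").avro("/tmp/data")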
On Tue, Dec 22, 2015 at 6:48 AM, Yash Sharma <yash...@gmail.com> wrote:
Well, you are right. Having a quick glance at the source[1], I see that the
path creation does not consider the partitions.
It tries to create the path before looking for the partition columns.
Not sure what would be the best way to incorporate it. Probably you can
file a JIRA and experienced contributors can comment on it.
In my example the directories were distinct.
So if I would like to have two distinct directories, e.g.
/tmp/data/year=2012
/tmp/data/year=2013
it does not work with
val df = Seq((2012, "Batman")).toDF("year", "title")
df.write.partitionBy("year").avro("/tmp/data")
val df2 = Seq((2013, "Batman")).toDF("year", "title")
df2.write.partitionBy("year").avro("/tmp/data")
Well, this will indeed hit the error if the next run has the same year and
month, and writing would not be possible.
You can try working around it by introducing a runCount in the partition or
in the output path.
Something like:
/tmp/data/year/month/01
/tmp/data/year/month/02
Or,
/tmp/data/01/year/month
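As a rough sketch (runCount is a hypothetical per-run counter, e.g. a streaming batch id, and df is the DataFrame from your example):

// Keep each run under its own base path so the paths never collide.
val runCount = 1
val outputPath = f"/tmp/data/$runCount%02d"
df.write.partitionBy("year").avro(outputPath)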
Hi Yash,
the error is caused by the fact that the first run creates the base
directory, i.e. "/tmp/data", and the second batch stumbles on the existing
base directory. I understand that the existing base directory is a
challenge, but I do not understand how to make this work in a streaming
example where every batch writes under the same base path.
Hi Jan,
Is the error because a past run of the job has already written to the
location?
In that case you can add more granularity with 'time' along with year and
month. That should give you a distinct path for every run.
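Roughly like this sketch (the timestamp format is only an illustration; df is the DataFrame from your example):

import java.text.SimpleDateFormat
import java.util.Date

// A per-run timestamp component gives every run a distinct output path.
val runTime = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date)
df.write.partitionBy("year").avro(s"/tmp/data/$runTime")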
Let us know if it helps or if I missed anything.
Good luck.
- Thanks, via mobile
Hi,
I'm stuck with writing partitioned data to HDFS. The example below ends up
with an 'already exists' error.
I'm wondering how to handle the streaming use case.
What is the intended way to write streaming data to HDFS? What am I missing?
cheers,
-jan
import com.databricks.spark.avro._
import org.apache.spark.sql._

val df = Seq((2012, "Batman")).toDF("year", "title")
df.write.partitionBy("year").avro("/tmp/data")
val df2 = Seq((2013, "Batman")).toDF("year", "title")
df2.write.partitionBy("year").avro("/tmp/data")