The avro-mapred module includes a Seekable implementation that works with HDFS called FsInput:
http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/FsInput.html

With this, your example can be made considerably smaller.

Doug

On Thu, Feb 7, 2013 at 8:28 AM, Harsh J <ha...@cloudera.com> wrote:
> I assume by non-trivial you meant the extra Seekable stuff I needed to
> wrap around the DFS output streams to let Avro take it as append-able?
> I don't think it's possible for Avro to carry it since Avro (core) does
> not reverse-depend on Hadoop. Should we document it somewhere though?
> Do you have any ideas on the best place to do that?
>
> On Thu, Feb 7, 2013 at 6:12 AM, Michael Malak <michaelma...@yahoo.com> wrote:
>> Thanks so much for the code -- it works great!
>>
>> Since it is a non-trivial amount of code required to achieve append, I
>> suggest attaching that code to AVRO-1035, in the hopes that someone will
>> come up with an interface that requires just one line of user code to
>> achieve append.
>>
>> --- On Wed, 2/6/13, Harsh J <ha...@cloudera.com> wrote:
>>
>>> From: Harsh J <ha...@cloudera.com>
>>> Subject: Re: Is it possible to append to an already existing avro file
>>> To: user@avro.apache.org
>>> Date: Wednesday, February 6, 2013, 11:17 AM
>>> Hey Michael,
>>>
>>> It does implement the regular Java OutputStream interface, as seen in
>>> the API:
>>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataOutputStream.html.
>>>
>>> Here's a sample program that works on Hadoop 2.x in my tests:
>>> https://gist.github.com/QwertyManiac/4724582
>>>
>>> On Wed, Feb 6, 2013 at 9:00 AM, Michael Malak <michaelma...@yahoo.com> wrote:
>>> > I don't believe a Hadoop FileSystem is a Java OutputStream?
>>> >
>>> > --- On Tue, 2/5/13, Doug Cutting <cutt...@apache.org> wrote:
>>> >
>>> >> From: Doug Cutting <cutt...@apache.org>
>>> >> Subject: Re: Is it possible to append to an already existing avro file
>>> >> To: user@avro.apache.org
>>> >> Date: Tuesday, February 5, 2013, 5:27 PM
>>> >> It will work on an OutputStream that supports append.
>>> >>
>>> >> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(org.apache.avro.file.SeekableInput, java.io.OutputStream)
>>> >>
>>> >> So it depends on how well HDFS implements FileSystem#append(), not on
>>> >> any changes in Avro.
>>> >>
>>> >> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path)
>>> >>
>>> >> I have no recent personal experience with append in HDFS. Does anyone
>>> >> else here?
>>> >>
>>> >> Doug
>>> >>
>>> >> On Tue, Feb 5, 2013 at 4:10 PM, Michael Malak <michaelma...@yahoo.com> wrote:
>>> >> > My understanding is that will append to a file on the
>>> >> > local filesystem, but not to a file on HDFS.
>>> >> >
>>> >> > --- On Tue, 2/5/13, Doug Cutting <cutt...@apache.org> wrote:
>>> >> >
>>> >> >> From: Doug Cutting <cutt...@apache.org>
>>> >> >> Subject: Re: Is it possible to append to an already existing avro file
>>> >> >> To: user@avro.apache.org
>>> >> >> Date: Tuesday, February 5, 2013, 5:08 PM
>>> >> >> The Jira is:
>>> >> >>
>>> >> >> https://issues.apache.org/jira/browse/AVRO-1035
>>> >> >>
>>> >> >> It is possible to append to an existing Avro file:
>>> >> >>
>>> >> >> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)
>>> >> >>
>>> >> >> Should we close that issue as "fixed"?
>>> >> >>
>>> >> >> Doug
>>> >> >>
>>> >> >> On Fri, Feb 1, 2013 at 11:32 AM, Michael Malak <michaelma...@yahoo.com> wrote:
>>> >> >> > Was a JIRA ticket ever created regarding appending to
>>> >> >> > an existing Avro file on HDFS?
>>> >> >> >
>>> >> >> > What is the status of such a capability, a year out
>>> >> >> > from when the issue below was raised?
>>> >> >> >
>>> >> >> > On Wed, 22 Feb 2012 10:57:48 +0100, "Vyacheslav Zholudev" <vyacheslav.zholu...@gmail.com> wrote:
>>> >> >> >
>>> >> >> >> Thanks for your reply, I suspected this.
>>> >> >> >>
>>> >> >> >> I will create a JIRA ticket.
>>> >> >> >>
>>> >> >> >> Vyacheslav
>>> >> >> >>
>>> >> >> >> On Feb 21, 2012, at 6:02 PM, Scott Carey wrote:
>>> >> >> >>
>>> >> >> >>>
>>> >> >> >>> On 2/21/12 7:29 AM, "Vyacheslav Zholudev" <vyacheslav.zholu...@gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>>
>>> >> >> >>>> Yep, I saw that method as well as the Stack Overflow post. However, I'm
>>> >> >> >>>> interested in how to append to a file on an arbitrary file system, not
>>> >> >> >>>> only on the local one.
>>> >> >> >>>>
>>> >> >> >>>> I want to get an OutputStream based on the Path and the FileSystem
>>> >> >> >>>> implementation and then pass it to the Avro methods for appending.
>>> >> >> >>>>
>>> >> >> >>>> Is that possible?
>>> >> >> >>>
>>> >> >> >>> It is not possible without modifying DataFileWriter. Please open a JIRA
>>> >> >> >>> ticket.
>>> >> >> >>>
>>> >> >> >>> It could not simply append to an OutputStream, since it must either:
>>> >> >> >>> * Seek to the start to validate that the schemas match and find the sync
>>> >> >> >>> marker, or
>>> >> >> >>> * Trust that the schemas match and find the sync marker from the last
>>> >> >> >>> block
>>> >> >> >>>
>>> >> >> >>> DataFileWriter cannot refer to Hadoop classes such as FileSystem, but we
>>> >> >> >>> could add something to the mapred module that takes a Path and
>>> >> >> >>> FileSystem and returns something that implements an interface that
>>> >> >> >>> DataFileWriter can append to. This would be something that is both a
>>> >> >> >>> SeekableInput
>>> >> >> >>> (http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/SeekableInput.html)
>>> >> >> >>> and an OutputStream, or has both an InputStream from the start of the
>>> >> >> >>> existing file and an OutputStream at the end.
>>> >> >> >>>
>>> >> >> >>>> Thanks,
>>> >> >> >>>> Vyacheslav
>>> >> >> >>>>
>>> >> >> >>>> On Feb 21, 2012, at 5:29 AM, Harsh J wrote:
>>> >> >> >>>>
>>> >> >> >>>>> Hi,
>>> >> >> >>>>>
>>> >> >> >>>>> Use the appendTo feature of the DataFileWriter. See
>>> >> >> >>>>>
>>> >> >> >>>>> http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)
>>> >> >> >>>>>
>>> >> >> >>>>> For a quick setup example, read also:
>>> >> >> >>>>>
>>> >> >> >>>>> http://stackoverflow.com/questions/8806689/can-you-append-data-to-an-existing-avro-data-file
>>> >> >> >>>>>
>>> >> >> >>>>> On Tue, Feb 21, 2012 at 3:15 AM, Vyacheslav Zholudev
>>> >> >> >>>>> <vyacheslav.zholu...@gmail.com> wrote:
>>> >> >> >>>>>> Hi,
>>> >> >> >>>>>>
>>> >> >> >>>>>> is it possible to append to an already existing avro file when it was
>>> >> >> >>>>>> written and closed before?
>>> >> >> >>>>>>
>>> >> >> >>>>>> If I use
>>> >> >> >>>>>> outputStream = fs.append(avroFilePath);
>>> >> >> >>>>>>
>>> >> >> >>>>>> then later on I get: java.io.IOException: Invalid sync!
>>> >> >> >>>>>>
>>> >> >> >>>>>> Probably because the schema is written twice and some other issues.
>>> >> >> >>>>>>
>>> >> >> >>>>>> If I use outputStream = fs.create(avroFilePath); then the avro file
>>> >> >> >>>>>> gets overwritten.
>>> >> >> >>>>>>
>>> >> >> >>>>>> Thanks,
>>> >> >> >>>>>> Vyacheslav
>>> >> >> >>>>>
>>> >> >> >>>>> --
>>> >> >> >>>>> Harsh J
>>> >> >> >>>>> Customer Ops. Engineer
>>> >> >> >>>>> Cloudera | http://tiny.cloudera.com/about
>>>
>>> --
>>> Harsh J
>
> --
> Harsh J
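
Scott Carey's point in the quoted thread (a writer cannot blindly append to an OutputStream, because it first has to recover the sync marker from the existing file) can be illustrated with a toy model. The sketch below is not Avro's real file format or API: the class ToyContainer, its byte layout, and all of its method names are hypothetical, chosen only to show why writing a second header after existing data breaks reading, and why re-reading the first header avoids that.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy container, NOT Avro's real format (hypothetical layout for illustration):
//   header = MAGIC (4 bytes) + sync marker (8 bytes)
//   block  = payload length (4-byte int) + payload + sync marker (8 bytes)
final class ToyContainer {
    static final byte[] MAGIC = {'T', 'O', 'Y', '1'};
    static final int SYNC_LEN = 8;

    // Write a fresh file: header first, then one block per record.
    static byte[] create(byte[] sync, String... records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC, 0, MAGIC.length);
        out.write(sync, 0, SYNC_LEN);
        writeBlocks(out, sync, records);
        return out.toByteArray();
    }

    // Naive append: a brand-new writer emits its own header and its own
    // sync marker after the existing bytes.
    static byte[] naiveAppend(byte[] existing, byte[] newSync, String... records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(existing, 0, existing.length);
        byte[] second = create(newSync, records);
        out.write(second, 0, second.length);
        return out.toByteArray();
    }

    // Header-aware append: read the sync marker back from the existing
    // header, then write blocks only (no second header).
    static byte[] appendTo(byte[] existing, String... records) throws IOException {
        byte[] sync = readHeader(ByteBuffer.wrap(existing));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(existing, 0, existing.length);
        writeBlocks(out, sync, records);
        return out.toByteArray();
    }

    // Read every record, checking each block's trailing marker against
    // the marker stored in the header.
    static List<String> readAll(byte[] file) throws IOException {
        ByteBuffer in = ByteBuffer.wrap(file);
        byte[] sync = readHeader(in);
        List<String> records = new ArrayList<>();
        while (in.hasRemaining()) {
            if (in.remaining() < 4) throw new IOException("Truncated block");
            int len = in.getInt();
            if (len < 0 || len > in.remaining() - SYNC_LEN) {
                throw new IOException("Truncated block");
            }
            byte[] payload = new byte[len];
            in.get(payload);
            byte[] marker = new byte[SYNC_LEN];
            in.get(marker);
            if (!Arrays.equals(marker, sync)) throw new IOException("Invalid sync!");
            records.add(new String(payload, StandardCharsets.UTF_8));
        }
        return records;
    }

    private static byte[] readHeader(ByteBuffer in) throws IOException {
        if (in.remaining() < MAGIC.length + SYNC_LEN) {
            throw new IOException("Truncated header");
        }
        byte[] magic = new byte[MAGIC.length];
        in.get(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("Not a toy container");
        byte[] sync = new byte[SYNC_LEN];
        in.get(sync);
        return sync;
    }

    private static void writeBlocks(ByteArrayOutputStream out, byte[] sync, String... records) {
        for (String record : records) {
            byte[] payload = record.getBytes(StandardCharsets.UTF_8);
            out.write(ByteBuffer.allocate(4).putInt(payload.length).array(), 0, 4);
            out.write(payload, 0, payload.length);
            out.write(sync, 0, SYNC_LEN);
        }
    }
}
```

In this toy, readAll accepts the output of appendTo but throws an IOException on the output of naiveAppend, because the reader tries to interpret the second header as a block. In Avro itself, the header-aware path discussed in this thread is DataFileWriter#appendTo(SeekableInput, OutputStream), with FsInput from avro-mapred supplying the SeekableInput and FileSystem#append(Path) supplying the OutputStream. As Doug notes, whether that works in practice depends on how well the HDFS version in use implements append.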