Re: TextIO binary file

2017-02-05 Thread Aviem Zur
AvroIO would be great for POJOs. But for use cases with more complex
serializable objects, or objects which are compatible with some coder, it
falls short.

Also, expecting less savvy users to know they need to use AvroIO might be a
stretch.
Some simpler API along the lines of ObjectFile might be more user
friendly (even if for optimization it uses Avro under the hood for POJOs).
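
For context, the byte[] route suggested in the quoted message below could look
roughly like this. It is only a sketch using plain Java serialization;
`JavaSerializedBytes` is a hypothetical helper, not a Beam or Avro class, and
the step that hands the resulting bytes to AvroIO is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper (assumption, not part of Beam): turns any Serializable
// object into a byte[] and back using standard Java serialization.
final class JavaSerializedBytes {

  // Serialize one object into a standalone byte array.
  static byte[] toBytes(Serializable value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The stream is closed (and therefore flushed) before the bytes are read.
    return bytes.toByteArray();
  }

  // Restore an object from bytes produced by toBytes.
  static Object fromBytes(byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The resulting arrays are opaque payloads: whichever sink stores them only has
to preserve the bytes, while the reader must have the same classes available.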

On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov 
wrote:

> OK, I see what you mean; however I still think this can be solved without
> introducing a new "Beam object file" (or whatever) file format, and without
> thereby introducing additional use cases and compatibility constraints on
> coders.
>
> I asked before in the thread why not just use AvroIO (it can serialize
> arbitrary POJOs using reflection); I skimmed the thread and it doesn't seem
> like that got answered properly. I also like Dan's suggestion to use AvroIO
> to serialize byte[] arrays and you can do whatever you want with them (e.g.
> use another serialization library, say, Kryo, or Java serialization, etc.)
>
> On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur  wrote:
>
> > I agree that these files will serve no use outside of Beam pipelines.
> >
> > The rationale was that you might want to have one pipeline write output
> to
> > files and then have a different pipeline that uses those files as inputs.
> >
> > Say one team in your organization creates a pipeline and a different team
> > utilizes those files as input for a different pipeline. The contract
> > between them is the file, in a Beam-readable format.
> > This is similar to Spark's `saveAsObjectFile`:
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> >
> > The merit for something like this in my eyes is to not burden the user
> with
> > writing a custom IO
> >
> > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> >  wrote:
> >
> > +1 to Robert. Either this will be a Beam-specific file format (and then
> > nothing except Beam will be able to read it - which I doubt is what you
> > want), or it is an existing well-known file format and then we should
> just
> > develop an IO for it.
> > Note that any file format that involves encoding elements with a Coder is
> > Beam-specific, because wire format of coders is Beam-specific.
> >
> > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> >  wrote:
> >
> > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur 
> wrote:
> > > > +1 on what Stas said.
> > > > I think there is value in not having the user write a custom IO for a
> > > > protocol they use which is not covered by Beam IOs. Plus having them
> > deal
> > > > with not only the encoding but also the IO part is not ideal.
> > > > I think having a basic FileIO that can write to the Filesystems
> > supported
> > > > by Beam (GS/HDFS/Local/...) which you can use any coder with,
> including
> > > > your own custom coder, can be beneficial.
> > >
> > > What would the format of the file be? Just the concatenation of the
> > > elements encoded according to the coder? Or is there a delimiter
> > > needed to separate records. In which case how does one ensure the
> > > delimiter does not also appear in the middle of an encoded element? At
> > > this point you're developing a file format, and might as well stick
> > > with one of the standard ones. https://xkcd.com/927
> > >
> > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin 
> > wrote:
> > > >
> > > > I believe the motivation is to have an abstraction that allows one to
> > > write
> > > > stuff to a file in a way that is agnostic to the coder.
> > > > If one needs to write a non-Avro protocol to a file, and this
> > particular
> > > > protocol does not meet the assumption made by TextIO, one might need
> to
> > > > duplicate the file IO related code from AvroIO.
> > > >
> > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > >  wrote:
> > > >
> > > >> Could you clarify why it would be useful to write objects to files
> > using
> > > >> Beam coders, as opposed to just using e.g. AvroIO?
> > > >>
> > > >> Coders (should) make no promise as to what their wire format is, so
> > such
> > > >> files could be read back only by other Beam pipelines using the same
> > IO.
> > > >>
> > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur 
> wrote:
> > > >>
> > > >> > So If I understand the general agreement is that TextIO should not
> > > >> support
> > > >> > anything but lines from files as strings.
> > > >> > I'll go ahead and file a ticket that says the Javadoc should be
> > > changed
> > > >> to
> > > >> > reflect this and `withCoder` method should be removed.
> > > >> >
> > > >> > Is there merit for Beam to supply an IO 

Re: TextIO binary file

2017-02-05 Thread Aviem Zur
I agree that these files will serve no use outside of Beam pipelines.

The rationale was that you might want to have one pipeline write output to
files and then have a different pipeline that uses those files as inputs.

Say one team in your organization creates a pipeline and a different team
utilizes those files as input for a different pipeline. The contract
between them is the file, in a Beam-readable format.
This is similar to Spark's `saveAsObjectFile`:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512

The merit for something like this in my eyes is to not burden the user with
writing a custom IO

On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
 wrote:

+1 to Robert. Either this will be a Beam-specific file format (and then
nothing except Beam will be able to read it - which I doubt is what you
want), or it is an existing well-known file format and then we should just
develop an IO for it.
Note that any file format that involves encoding elements with a Coder is
Beam-specific, because wire format of coders is Beam-specific.

On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
 wrote:

> On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur  wrote:
> > +1 on what Stas said.
> > I think there is value in not having the user write a custom IO for a
> > protocol they use which is not covered by Beam IOs. Plus having them
deal
> > with not only the encoding but also the IO part is not ideal.
> > I think having a basic FileIO that can write to the Filesystems
supported
> > by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> > your own custom coder, can be beneficial.
>
> What would the format of the file be? Just the concatenation of the
> elements encoded according to the coder? Or is there a delimiter
> needed to separate records. In which case how does one ensure the
> delimiter does not also appear in the middle of an encoded element? At
> this point you're developing a file format, and might as well stick
> with one of the standard ones. https://xkcd.com/927
>
> > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin  wrote:
> >
> > I believe the motivation is to have an abstraction that allows one to
> write
> > stuff to a file in a way that is agnostic to the coder.
> > If one needs to write a non-Avro protocol to a file, and this particular
> > protocol does not meet the assumption made by TextIO, one might need to
> > duplicate the file IO related code from AvroIO.
> >
> > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> >  wrote:
> >
> >> Could you clarify why it would be useful to write objects to files
using
> >> Beam coders, as opposed to just using e.g. AvroIO?
> >>
> >> Coders (should) make no promise as to what their wire format is, so
such
> >> files could be read back only by other Beam pipelines using the same
IO.
> >>
> >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:
> >>
> >> > So If I understand the general agreement is that TextIO should not
> >> support
> >> > anything but lines from files as strings.
> >> > I'll go ahead and file a ticket that says the Javadoc should be
> changed
> >> to
> >> > reflect this and `withCoder` method should be removed.
> >> >
> >> > Is there merit for Beam to supply an IO which does allow writing
> objects
> >> to
> >> > a file using Beam coders and Beam FS (To write these files to
> >> > GS/Hadoop/Local)?
> >> >
> >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> >> >  wrote:
> >> >
> >> > P.S. Note that this point (about coders) is also mentioned in the
> >> > now-being-reviewed PTransform Style Guide
> >> > https://github.com/apache/beam-site/pull/134
> >> > currently staged at
> >> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >> >
> >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <
> chamik...@apache.org
> >> >
> >> > wrote:
> >> >
> >> > > +1 to what Eugene said.
> >> > >
> >> > > I've seen a number of Python SDK users incorrectly assuming that
> >> > > coder.decode() is needed when developing their own file-based
> sources
> >> > > (since many users usually refer to text source first). Probably
> coder
> >> > > parameter should not be configurable for text source/sink and they
> >> should
> >> > > be updated to only read/write UTF-8 encoded strings.
> >> > >
> >> > > - Cham
> >> > >
> >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> >> > >  wrote:
> >> > >
> >> > > > The use of Coder in TextIO is a long standing design issue
because
> >> > coders
> >> > > > are not intended to be used for general purpose converting things
> >> from
> >> > > and
> >> > > > to bytes, their only proper use is letting the runner materialize
> > and
> >> > > > restore objects if 

Re: TextIO binary file

2017-01-31 Thread Robert Bradshaw
On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur  wrote:
> +1 on what Stas said.
> I think there is value in not having the user write a custom IO for a
> protocol they use which is not covered by Beam IOs. Plus having them deal
> with not only the encoding but also the IO part is not ideal.
> I think having a basic FileIO that can write to the Filesystems supported
> by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> your own custom coder, can be beneficial.

What would the format of the file be? Just the concatenation of the
elements encoded according to the coder? Or is a delimiter needed to
separate records? In that case, how does one ensure the delimiter does
not also appear in the middle of an encoded element? At this point
you're developing a file format, and might as well stick with one of
the standard ones. https://xkcd.com/927
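
To make the framing question concrete, here is a minimal sketch of one common
alternative to delimiters: prefixing each record with its length. This is not
an existing Beam format; `LengthPrefixedRecords` and its layout (4-byte
big-endian length, then the payload) are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (assumption, not part of Beam): each record is stored
// as a 4-byte big-endian length followed by exactly that many payload bytes.
final class LengthPrefixedRecords {

  // Concatenate all records into one byte array, each with a length prefix.
  static byte[] write(List<byte[]> records) {
    int total = 0;
    for (byte[] record : records) {
      total += Integer.BYTES + record.length;
    }
    ByteBuffer buffer = ByteBuffer.allocate(total); // big-endian by default
    for (byte[] record : records) {
      buffer.putInt(record.length); // length prefix
      buffer.put(record);           // payload, may contain any byte value
    }
    return buffer.array();
  }

  // Split the bytes back into records by following the length prefixes.
  // A truncated or corrupt input is not handled gracefully in this sketch.
  static List<byte[]> read(byte[] file) {
    List<byte[]> records = new ArrayList<>();
    ByteBuffer buffer = ByteBuffer.wrap(file);
    while (buffer.remaining() >= Integer.BYTES) {
      byte[] record = new byte[buffer.getInt()];
      buffer.get(record);
      records.add(record);
    }
    return records;
  }
}
```

Because the reader never searches the payload for a separator, a record that
happens to contain '\n' round-trips unchanged. The cost is the one noted
above: this is already a small custom file format, and it cannot be split at
an arbitrary offset without extra sync markers.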

> On Tue, Jan 31, 2017 at 7:56 PM Stas Levin  wrote:
>
> I believe the motivation is to have an abstraction that allows one to write
> stuff to a file in a way that is agnostic to the coder.
> If one needs to write a non-Avro protocol to a file, and this particular
> protocol does not meet the assumption made by TextIO, one might need to
> duplicate the file IO related code from AvroIO.
>
> On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
>  wrote:
>
>> Could you clarify why it would be useful to write objects to files using
>> Beam coders, as opposed to just using e.g. AvroIO?
>>
>> Coders (should) make no promise as to what their wire format is, so such
>> files could be read back only by other Beam pipelines using the same IO.
>>
>> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:
>>
>> > So If I understand the general agreement is that TextIO should not
>> support
>> > anything but lines from files as strings.
>> > I'll go ahead and file a ticket that says the Javadoc should be changed
>> to
>> > reflect this and `withCoder` method should be removed.
>> >
>> > Is there merit for Beam to supply an IO which does allow writing objects
>> to
>> > a file using Beam coders and Beam FS (To write these files to
>> > GS/Hadoop/Local)?
>> >
>> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
>> >  wrote:
>> >
>> > P.S. Note that this point (about coders) is also mentioned in the
>> > now-being-reviewed PTransform Style Guide
>> > https://github.com/apache/beam-site/pull/134
>> > currently staged at
>> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>> >
>> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath > >
>> > wrote:
>> >
>> > > +1 to what Eugene said.
>> > >
>> > > I've seen a number of Python SDK users incorrectly assuming that
>> > > coder.decode() is needed when developing their own file-based sources
>> > > (since many users usually refer to text source first). Probably coder
>> > > parameter should not be configurable for text source/sink and they
>> should
>> > > be updated to only read/write UTF-8 encoded strings.
>> > >
>> > > - Cham
>> > >
>> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>> > >  wrote:
>> > >
>> > > > The use of Coder in TextIO is a long standing design issue because
>> > coders
>> > > > are not intended to be used for general purpose converting things
>> from
>> > > and
>> > > > to bytes, their only proper use is letting the runner materialize
> and
>> > > > restore objects if the runner thinks it's necessary. IMO it should
>> have
>> > > > been called LineIO, document that it reads lines of text as String,
>> and
>> > > not
>> > > > have a withCoder parameter at all.
>> > > >
>> > > > The proper way to address your use case is to write a custom
>> > > > FileBasedSource.
>> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur 
>> wrote:
>> > > >
>> > > > > The Javadoc of TextIO states:
>> > > > >
>> > > > > * By default, {@link TextIO.Read} returns a {@link PCollection}
>> of
>> > > > > {@link String Strings},
>> > > > >  * each corresponding to one line of an input UTF-8 text file. To
>> > > convert
>> > > > > directly from the raw
>> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
>> > > another
>> > > > > object of type {@code T},
>> > > > >  * supply a {@code Coder} using {@link
>> > > TextIO.Read#withCoder(Coder)}.
>> > > > >
>> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
>> > > > probably
>> > > > > won't work given the hard-coded '\n' delimiter.
>> > > > >
>> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
>> > j...@nanthrax.net
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Aviem,
>> > > > > >
>> > > > > > TextIO is not designed to write/read binary file: it's pure
> Text,
>> > so
>> > > > > > String.
>> > > > > >
>> > > > > > Regards
>> > > > > > JB

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
+1 on what Stas said.
I think there is value in not having the user write a custom IO for a
protocol they use which is not covered by Beam IOs. Plus having them deal
with not only the encoding but also the IO part is not ideal.
I think having a basic FileIO that can write to the Filesystems supported
by Beam (GS/HDFS/Local/...) which you can use any coder with, including
your own custom coder, can be beneficial.

On Tue, Jan 31, 2017 at 7:56 PM Stas Levin  wrote:

I believe the motivation is to have an abstraction that allows one to write
stuff to a file in a way that is agnostic to the coder.
If one needs to write a non-Avro protocol to a file, and this particular
protocol does not meet the assumption made by TextIO, one might need to
duplicate the file IO related code from AvroIO.

On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
 wrote:

> Could you clarify why it would be useful to write objects to files using
> Beam coders, as opposed to just using e.g. AvroIO?
>
> Coders (should) make no promise as to what their wire format is, so such
> files could be read back only by other Beam pipelines using the same IO.
>
> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:
>
> > So If I understand the general agreement is that TextIO should not
> support
> > anything but lines from files as strings.
> > I'll go ahead and file a ticket that says the Javadoc should be changed
> to
> > reflect this and `withCoder` method should be removed.
> >
> > Is there merit for Beam to supply an IO which does allow writing objects
> to
> > a file using Beam coders and Beam FS (To write these files to
> > GS/Hadoop/Local)?
> >
> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> >  wrote:
> >
> > P.S. Note that this point (about coders) is also mentioned in the
> > now-being-reviewed PTransform Style Guide
> > https://github.com/apache/beam-site/pull/134
> > currently staged at
> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >
> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath  >
> > wrote:
> >
> > > +1 to what Eugene said.
> > >
> > > I've seen a number of Python SDK users incorrectly assuming that
> > > coder.decode() is needed when developing their own file-based sources
> > > (since many users usually refer to text source first). Probably coder
> > > parameter should not be configurable for text source/sink and they
> should
> > > be updated to only read/write UTF-8 encoded strings.
> > >
> > > - Cham
> > >
> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > >  wrote:
> > >
> > > > The use of Coder in TextIO is a long standing design issue because
> > coders
> > > > are not intended to be used for general purpose converting things
> from
> > > and
> > > > to bytes, their only proper use is letting the runner materialize
and
> > > > restore objects if the runner thinks it's necessary. IMO it should
> have
> > > > been called LineIO, document that it reads lines of text as String,
> and
> > > not
> > > > have a withCoder parameter at all.
> > > >
> > > > The proper way to address your use case is to write a custom
> > > > FileBasedSource.
> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur 
> wrote:
> > > >
> > > > > The Javadoc of TextIO states:
> > > > >
> > > > > * By default, {@link TextIO.Read} returns a {@link PCollection}
> of
> > > > > {@link String Strings},
> > > > >  * each corresponding to one line of an input UTF-8 text file. To
> > > convert
> > > > > directly from the raw
> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > > another
> > > > > object of type {@code T},
> > > > >  * supply a {@code Coder} using {@link
> > > TextIO.Read#withCoder(Coder)}.
> > > > >
> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > probably
> > > > > won't work given the hard-coded '\n' delimiter.
> > > > >
> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Aviem,
> > > > > >
> > > > > > TextIO is not designed to write/read binary file: it's pure
Text,
> > so
> > > > > > String.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > While trying to use TextIO to write/read a binary file rather
> > than
> > > > > String
> > > > > > > lines from a textual file I ran into an issue - the delimiter
> > > TextIO
> > > > > uses
> > > > > > > seems to be hardcoded '\n'.
> > > > > > > > See `findSeparatorBounds` -
> > > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > > > >
> > > > > > > The use case 

Re: TextIO binary file

2017-01-31 Thread Eugene Kirpichov
Could you clarify why it would be useful to write objects to files using
Beam coders, as opposed to just using e.g. AvroIO?

Coders (should) make no promise as to what their wire format is, so such
files could be read back only by other Beam pipelines using the same IO.

On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:

> So If I understand the general agreement is that TextIO should not support
> anything but lines from files as strings.
> I'll go ahead and file a ticket that says the Javadoc should be changed to
> reflect this and `withCoder` method should be removed.
>
> Is there merit for Beam to supply an IO which does allow writing objects to
> a file using Beam coders and Beam FS (To write these files to
> GS/Hadoop/Local)?
>
> On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
>  wrote:
>
> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>
>
> On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath 
> wrote:
>
> > +1 to what Eugene said.
> >
> > I've seen a number of Python SDK users incorrectly assuming that
> > coder.decode() is needed when developing their own file-based sources
> > (since many users usually refer to text source first). Probably coder
> > parameter should not be configurable for text source/sink and they should
> > be updated to only read/write UTF-8 encoded strings.
> >
> > - Cham
> >
> > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> >  wrote:
> >
> > > The use of Coder in TextIO is a long standing design issue because
> coders
> > > are not intended to be used for general purpose converting things from
> > and
> > > to bytes, their only proper use is letting the runner materialize and
> > > restore objects if the runner thinks it's necessary. IMO it should have
> > > been called LineIO, document that it reads lines of text as String, and
> > not
> > > have a withCoder parameter at all.
> > >
> > > The proper way to address your use case is to write a custom
> > > FileBasedSource.
> > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:
> > >
> > > > The Javadoc of TextIO states:
> > > >
> > > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > > {@link String Strings},
> > > >  * each corresponding to one line of an input UTF-8 text file. To
> > convert
> > > > directly from the raw
> > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > another
> > > > object of type {@code T},
> > > >  * supply a {@code Coder} using {@link
> > TextIO.Read#withCoder(Coder)}.
> > > >
> > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > probably
> > > > won't work given the hard-coded '\n' delimiter.
> > > >
> > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi Aviem,
> > > > >
> > > > > TextIO is not designed to write/read binary file: it's pure Text,
> so
> > > > > String.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > Hi,
> > > > > >
> > > > > > While trying to use TextIO to write/read a binary file rather
> than
> > > > String
> > > > > > lines from a textual file I ran into an issue - the delimiter
> > TextIO
> > > > uses
> > > > > > seems to be hardcoded '\n'.
> > > > > > > See `findSeparatorBounds` -
> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > > >
> > > > > > The use case is to have a file of objects, encoded into bytes
> > using a
> > > > > > coder. However, '\n' is not a good delimiter here, as you can
> > > imagine.
> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > where
> > > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > > >
> > > > > > I did not find any unit tests which use TextIO to read anything
> > other
> > > > > than
> > > > > > Strings.
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>


Re: TextIO binary file

2017-01-31 Thread Aviem Zur
So if I understand correctly, the general agreement is that TextIO should not
support anything but lines from files as strings.
I'll go ahead and file a ticket that says the Javadoc should be changed to
reflect this and the `withCoder` method should be removed.

Is there merit for Beam to supply an IO which does allow writing objects to
a file using Beam coders and Beam FS (To write these files to
GS/Hadoop/Local)?

On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
 wrote:

P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath 
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>  wrote:
>
> > The use of Coder in TextIO is a long standing design issue because
coders
> > are not intended to be used for general purpose converting things from
> and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and
> not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than
> > > String
> > > > > lines from a textual file I ran into an issue - the delimiter
> TextIO
> > > uses
> > > > > seems to be hardcoded '\n'.
> > > > > > See `findSeparatorBounds` -
> > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >
> > > > > The use case is to have a file of objects, encoded into bytes
> using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can
> > imagine.
> > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath 
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>  wrote:
>
> > The use of Coder in TextIO is a long standing design issue because coders
> > are not intended to be used for general purpose converting things from
> and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and
> not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than
> > > String
> > > > > lines from a textual file I ran into an issue - the delimiter
> TextIO
> > > uses
> > > > > seems to be hardcoded '\n'.
> > > > > > See `findSeparatorBounds` -
> > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >
> > > > > The use case is to have a file of objects, encoded into bytes
> using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can
> > imagine.
> > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
The use of Coder in TextIO is a long-standing design issue: coders are not
intended for general-purpose conversion of things to and from bytes. Their
only proper use is letting the runner materialize and restore objects if the
runner thinks it's necessary. IMO it should have been called LineIO, been
documented as reading lines of text as String, and not have had a withCoder
parameter at all.

The proper way to address your use case is to write a custom
FileBasedSource.
On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur  wrote:

> The Javadoc of TextIO states:
>
>  * By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
>  * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than
> > > String lines from a textual file I ran into an issue - the delimiter
> > > TextIO uses seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other
> > > than Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: TextIO binary file

2017-01-30 Thread Dan Halperin
Stas' comment is the right one. The "canonical" use of TextIO with a coder
is something like a TextualIntegerCoder, but that should almost certainly be
replaced with TextIO.Read | ParDo.of(Parse integer). The `withCoder`
functions need to be removed or replaced.
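
To illustrate the "read lines, then parse" shape, here is a minimal
plain-Java sketch. It does not use the Beam API: `ParseLines`, `readLines`
and `parseIntegers` are hypothetical names, with `readLines` standing in for
a reader that only ever yields String lines and `parseIntegers` standing in
for the separate parsing step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from Beam.
final class ParseLines {

    // Step 1: the reader knows nothing about the element type and
    // simply returns the text split into lines.
    static List<String> readLines(String fileContents) {
        return Arrays.asList(fileContents.split("\n"));
    }

    // Step 2: parsing lives in its own step, applied after reading.
    static List<Integer> parseIntegers(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parseIntegers(readLines("1\n22\n333\n")));
    }
}
```

Keeping the reader String-only means a parse failure surfaces in the
parsing step rather than inside the IO.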

For "holding a file of arbitrary records", simply producing a
delimiter-separated file with TextIO is probably not a good choice.
Specifically, splitting is broken when the delimiter might appear in the
output (e.g., when using almost any coder). A better option is to design a
file format to hold arbitrary records, e.g., an Avro file where each record
is just a byte[].
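
As a rough sketch of what "a file format to hold arbitrary records" can
mean, the plain-Java example below frames each record with a 4-byte length
prefix instead of a delimiter. This is not Avro and not a Beam API;
`LengthPrefixedRecords`, `write` and `read` are hypothetical names, and the
sketch does not address splitting a file across workers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: length-prefixed framing of byte[] records.
final class LengthPrefixedRecords {

    // Writes each record as a 4-byte big-endian length followed by its bytes.
    static byte[] write(List<byte[]> records) {
        int size = 0;
        for (byte[] record : records) {
            size += 4 + record.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(size);
        for (byte[] record : records) {
            buffer.putInt(record.length);
            buffer.put(record);
        }
        return buffer.array();
    }

    // Reads records back by trusting the length prefix, never the contents.
    static List<byte[]> read(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        List<byte[]> records = new ArrayList<>();
        while (buffer.hasRemaining()) {
            byte[] record = new byte[buffer.getInt()];
            buffer.get(record);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Records that contain '\n' bytes survive the round trip unchanged.
        List<byte[]> records = Arrays.asList(new byte[] {1, '\n', 2}, new byte[] {'\n'});
        List<byte[]> restored = read(write(records));
        System.out.println("records restored: " + restored.size());
    }
}
```

Because record boundaries come from the length prefix, the record bytes can
contain any value, including '\n'.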

Dan

On Mon, Jan 30, 2017 at 2:52 AM, Aviem Zur  wrote:

> The Javadoc of TextIO states:
>
>  * By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
>  * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than
> > > String lines from a textual file I ran into an issue - the delimiter
> > > TextIO uses seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other
> > > than Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: TextIO binary file

2017-01-30 Thread Aviem Zur
The Javadoc of TextIO states:

 * By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
 * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
 * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
 * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.

However, as I stated, `withCoder` doesn't seem to have tests, and probably
won't work given the hard-coded '\n' delimiter.
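
To make the delimiter problem concrete, here is a small plain-Java sketch.
It does not call TextIO or any Beam coder: it assumes a fixed-width
big-endian int encoding (via `java.nio.ByteBuffer`) as a stand-in for "some
binary coder", and `NewlineDelimiterDemo` and its methods are hypothetical
names.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; the int encoding stands in for a binary coder.
final class NewlineDelimiterDemo {

    // Encodes each value as 4 big-endian bytes and terminates it with '\n'.
    static byte[] encodeWithNewlines(int... values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * 5);
        for (int value : values) {
            buffer.putInt(value);
            buffer.put((byte) '\n');
        }
        return buffer.array();
    }

    // Counts the chunks a reader would see if it split on every '\n' byte.
    static int countNewlineTerminatedChunks(byte[] data) {
        int chunks = 0;
        for (byte b : data) {
            if (b == '\n') {
                chunks++;
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10 is 0x0000000A and 266 is 0x0000010A: both contain the byte 0x0A.
        byte[] data = encodeWithNewlines(1, 10, 266);
        System.out.println("records written: 3, chunks found: "
                + countNewlineTerminatedChunks(data));
    }
}
```

Any encoded value that happens to contain the byte 0x0A is cut in the
middle, so the reader sees more (and shorter) chunks than records written.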

On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré 
wrote:

> Hi Aviem,
>
> TextIO is not designed to write/read binary file: it's pure Text, so
> String.
>
> Regards
> JB
>
> On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > Hi,
> >
> > While trying to use TextIO to write/read a binary file rather than String
> > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > seems to be hardcoded '\n'.
> > See `findSeparatorBounds` -
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> >
> > The use case is to have a file of objects, encoded into bytes using a
> > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > A similar pattern is found in Spark's `saveAsObjectFile`
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > where they use a more appropriate delimiter, to avoid such issues.
> >
> > I did not find any unit tests which use TextIO to read anything other
> > than Strings.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>