[jira] [Resolved] (AVRO-1708) Memory leak with WeakIdentityHashMap?

2016-04-17 Thread Zoltan Farkas (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Farkas resolved AVRO-1708.
-
Resolution: Won't Fix

I am not sure the issue was related to the WeakIdentityHashMap implementation; 
it seems to have more to do with the high GC overhead of weak references.
In any case it would be useful to review the uses of these caches.

> Memory leak with WeakIdentityHashMap?
> -
>
> Key: AVRO-1708
> URL: https://issues.apache.org/jira/browse/AVRO-1708
> Project: Avro
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Zoltan Farkas
>
> The WeakIdentityHashMap used in GenericDatumReader has only weak keys; it 
> seems to grow, and values remain in the map, which looks like a memory leak.
> The java.util.WeakHashMap has weak entries, which allows the GC to collect an 
> entire entry and thus prevents leaks.
> The javadoc of this class claims: "Implements a combination of WeakHashMap 
> and IdentityHashMap.", which is not really the case.
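For reference, a minimal sketch of the java.util.WeakHashMap behaviour referred to 
above (not from the ticket; GC timing is not guaranteed, so the final size is only 
typically zero):

{code}
// Sketch: WeakHashMap holds weak *entries*, so once the key is collected the
// whole entry (including the strongly-held value) becomes reclaimable.
import java.util.Map;
import java.util.WeakHashMap;

public class WeakEntryDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, byte[]> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, new byte[1024 * 1024]);
        key = null;                        // drop the last strong reference to the key
        System.gc();                       // request collection (best effort)
        Thread.sleep(100);                 // give reference processing a moment
        System.out.println(cache.size());  // typically prints 0: entry and value are gone
    }
}
{code}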



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1707) Java serialization readers/writers in generated Java classes

2016-04-17 Thread Zoltan Farkas (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245015#comment-15245015
 ] 

Zoltan Farkas commented on AVRO-1707:
-

No problem with the practice, provided the readers are fixed to stop keeping a 
reference to a thread (which was the cause of significant memory waste in our 
apps).

In our use cases Java serialization is not common, so it seemed wasteful to have 
these readers and writers initialized without ever being used. It is also simple 
to make them lazy, so they are initialized only when needed.
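A minimal sketch of that lazy-initialization idea (illustrative only; {{MyRecord}} 
stands in for a generated class, and this is not the actual code-generation 
template):

{code}
// Sketch: the holder class is only loaded (and the reader/writer only created)
// the first time Java serialization is actually used.
public class MyRecord /* extends SpecificRecordBase ... */ {
    public static final org.apache.avro.Schema SCHEMA$ =
        new org.apache.avro.Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":[]}");

    private static class SerializationHolder {
        static final org.apache.avro.io.DatumWriter WRITER$ =
            new org.apache.avro.specific.SpecificDatumWriter(SCHEMA$);
        static final org.apache.avro.io.DatumReader READER$ =
            new org.apache.avro.specific.SpecificDatumReader(SCHEMA$);
    }

    static org.apache.avro.io.DatumWriter writer() { return SerializationHolder.WRITER$; }
    static org.apache.avro.io.DatumReader reader() { return SerializationHolder.READER$; }
}
{code}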



> Java serialization readers/writers in generated Java classes
> 
>
> Key: AVRO-1707
> URL: https://issues.apache.org/jira/browse/AVRO-1707
> Project: Avro
>  Issue Type: Improvement
>Affects Versions: 1.8.0
>Reporter: Zoltan Farkas
>
> the following static instances are declared in the generated classes:
>   private static final org.apache.avro.io.DatumWriter
> WRITER$ = new org.apache.avro.specific.SpecificDatumWriter(SCHEMA$);  
>   private static final org.apache.avro.io.DatumReader
> READER$ = new org.apache.avro.specific.SpecificDatumReader(SCHEMA$);  
>  the reader/writer hold on to a reference to the "Creator Thread":
> "private final Thread creator;"
> which inhibits GC-ing the thread locals for this thread.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Quarterly release goal

2016-04-17 Thread S G
+1 for quarterly releases !

On Sat, Apr 16, 2016 at 7:03 PM, Thiruvalluvan MG <
thiru...@yahoo.com.invalid> wrote:

> +1.
> Thanks
> Thiru
>
> On Sunday, 17 April 2016 5:45 AM, Ryan Blue  wrote:
>
>
>  Hi everyone,
>
> It's been about 3 months since we released Avro 1.8.0 and we've already
> accumulated several fixes that we should get out in a release. Sean
> suggested it a few days ago, but I'm not sure if everyone saw that thread.
> Anyone interested in being the release manager for 1.8.1?
>
> I think we should set a goal of a release about once each quarter. If we
> let ourselves go 18 months between releases, then contributors can't use
> their work soon enough to continue caring and contributing.
>
> I think a quarterly release goal is a good first step toward making the
> project more contributor-friendly. For example, we usually have good
> participation getting ready for releases so it's a good excuse to get some
> reviews done in addition to getting committed fixes out the door.
>
> Thoughts and comments?
>
> rb
>
>
> --
> Ryan Blue
>
>
>
>


[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244991#comment-15244991
 ] 

Ryan Blue commented on AVRO-1704:
-

Sorry if what I said wasn't clear. I'm not proposing that we get rid of the 
header. I'm saying that we make it one byte instead of 4. I think what I 
outlined addresses the case where the schema cache miss is expensive and 
balances that with the per-message overhead. (I'm fine moving forward with the 
FP considered part of the body.)

A one-byte header results in lower than a 1/256 chance of an expensive lookup 
(by choosing carefully). Why is that too high? Why 4 bytes and not, for 
example, 2 for a 1/65536 chance?

I disagree that the impact of extra bytes is too small to matter. It (probably) 
won't cause fragmentation when sending one message, but we're not talking about 
just one message. Kafka's performance depends on batching records together for 
network operations and each message takes up space on disk. What matters is the 
percentage of data that is overhead: 4 bytes is 0.8% if your messages are 500 
bytes, and 4% if your messages are 100 bytes.

In terms of how much older data I can keep in a Kafka topic, that accounts for 
11m 30s to 57m 30s per day. If I provision for a 3-day window of data in Kafka, 
I'm losing between half an hour and 3 hours of that just to store 'Avr0' over 
and over. That's why I think we have to strike a balance between the two 
concerns. 1 or 2 bytes should really be sufficient, depending on the 
probability of a false-positive we want. And false-positives are only that 
costly if each one causes an RPC, which we can avoid with a little failure 
detection logic.
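For illustration, the retention figures above follow directly from the overhead 
percentages; a rough sketch of the arithmetic (not from the ticket):

{code}
// Rough arithmetic: a fixed 4-byte header as a fraction of message size,
// expressed as minutes of a day's worth of retention.
public class HeaderOverheadSketch {
    public static void main(String[] args) {
        int headerBytes = 4;
        for (int messageBytes : new int[] {500, 100}) {
            double overhead = (double) headerBytes / messageBytes;   // 0.008 and 0.04
            double minutesPerDay = overhead * 24 * 60;               // ~11.5 and ~57.6 minutes
            System.out.printf("%d-byte messages: %.1f%% overhead, ~%.1f min of retention per day%n",
                messageBytes, overhead * 100, minutesPerDay);
        }
    }
}
{code}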

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
>  Issue Type: Improvement
>Reporter: Daniel Schierbeck
>Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
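A rough sketch of the {{SchemaStore}} abstraction described above (hypothetical; 
the interface and method name are illustrative, not an existing Avro API at the 
time of this thread):

{code}
// Sketch only: a SchemaStore as described in the issue text, used by a
// hypothetical MessageReader to resolve the writer's schema from a fingerprint.
import org.apache.avro.Schema;

public interface SchemaStore {
    /**
     * Return the writer's schema for the given fingerprint, or null if unknown.
     * fingerprintType identifies the hash used (e.g. "CRC-64-AVRO", "MD5", "SHA-256").
     */
    Schema findByFingerprint(String fingerprintType, byte[] fingerprint);
}
{code}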



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AVRO-1794) Update docs after migration to git

2016-04-17 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned AVRO-1794:
---

Assignee: Ryan Blue

> Update docs after migration to git
> --
>
> Key: AVRO-1794
> URL: https://issues.apache.org/jira/browse/AVRO-1794
> Project: Avro
>  Issue Type: Task
>  Components: doc
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>
> The [vote to move to 
> git|https://mail-archives.apache.org/mod_mbox/avro-dev/201602.mbox/%3C56AFB9B9.8000304%40apache.org%3E]
>  just passed. Once the INFRA ticket is completed, we will need to [update 
> docs|https://avro.apache.org/version_control.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AVRO-1794) Update docs after migration to git

2016-04-17 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved AVRO-1794.
-
Resolution: Fixed

Done. Thanks for pointing out the wiki page, Niels.

> Update docs after migration to git
> --
>
> Key: AVRO-1794
> URL: https://issues.apache.org/jira/browse/AVRO-1794
> Project: Avro
>  Issue Type: Task
>  Components: doc
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>
> The [vote to move to 
> git|https://mail-archives.apache.org/mod_mbox/avro-dev/201602.mbox/%3C56AFB9B9.8000304%40apache.org%3E]
>  just passed. Once the INFRA ticket is completed, we will need to [update 
> docs|https://avro.apache.org/version_control.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-17 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244947#comment-15244947
 ] 

Niels Basjes commented on AVRO-1704:


A few of the thoughts I had when creating the current patch:
# Regarding the 'Avro' header (which I still believe to be 'the way to go')
#* The cost of going to the schema registry is high on a 'cache miss'. Problems 
like the one I ran into with STORM-512 will occur in other systems too and may 
very well overload the schema registry.
#* I consider the cost of a fixed header of 4 bytes to be low. But that really 
depends on the size of the record being transmitted (my records are in the 
500-1000 bytes range).
#** These extra bytes will only be persisted in streaming systems like Kafka. 
Long term file formats (like AVRO, Parquet and ORC) won't store this.
#** In network traffic the overhead is 'unmeasurably small': it is unlikely that 
these 4 extra bytes will push a record over the size of a single TCP packet 
(1500 bytes).
# Regarding the schema fingerprint (which I consider a 'body' part).
#* The idea of the 'version' was that someone may want to use a different 
'hash' instead of the CRC-64-AVRO.
#* I think that in case of encryption we should have the fingerprint encrypted 
too.

*In light of the encryption option and your comments I'm now considering this 
_brainwave_*:
* The 'header of the message' should be pluggable.
** The default is a 'fixed shape' which includes a format id. (Same as what my 
current patch does).
** I expect that making this pluggable too is possible but that would have some 
restrictions like "all records of a schema must adhere to the same base format".
* The 'body of the message' should be pluggable too. 
** Format '0' is hardcoded (fingerprint+record). 
** Yet other versions (we should define a range like 0x80-0xFF) can be used by 
anyone to define a custom body definition (including encryption). I expect 
these versions to only exist within a specific company. If they need to 
exchange data with others they should share their format specification anyway.
* If we set the code up right we can have a layering system: I.e. someone can 
'insert' an encryption layer and still use the 'standard' body (after 
decryption).
** Such an 'encryption layer' would add additional parts, such as an encryption 
type and a key id.
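To make the layering concrete, one possible framing along the lines discussed 
above (an illustration only, not the current patch or a final spec; the 4-byte 
marker, the 1-byte format id and the 8-byte CRC-64-AVRO fingerprint are 
assumptions):

{code}
// Illustration of the discussed layout: [marker][format id][fingerprint][encoded datum].
import java.nio.ByteBuffer;

public class SingleMessageFramingSketch {
    private static final byte[] MARKER = {'A', 'v', 'r', 'o'};       // assumed 4-byte header marker
    private static final byte FORMAT_FINGERPRINT_AND_RECORD = 0x00;  // "format 0"; 0x80-0xFF reserved for custom bodies

    static byte[] frame(long crc64AvroFingerprint, byte[] encodedDatum) {
        ByteBuffer buf = ByteBuffer.allocate(MARKER.length + 1 + 8 + encodedDatum.length);
        buf.put(MARKER);                        // header: fixed marker bytes
        buf.put(FORMAT_FINGERPRINT_AND_RECORD); // header: body format id
        buf.putLong(crc64AvroFingerprint);      // body: writer-schema fingerprint
        buf.put(encodedDatum);                  // body: binary-encoded Avro datum
        return buf.array();
    }
}
{code}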


> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
>  Issue Type: Improvement
>Reporter: Daniel Schierbeck
>Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1750) GenericDatum API behavior breaking change

2016-04-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244945#comment-15244945
 ] 

Ryan Blue commented on AVRO-1750:
-

[~braden], could you take a look at the new patch?

> GenericDatum API behavior breaking change
> -
>
> Key: AVRO-1750
> URL: https://issues.apache.org/jira/browse/AVRO-1750
> Project: Avro
>  Issue Type: Bug
>  Components: c++
>Affects Versions: 1.7.7
>Reporter: Braden McDaniel
>Assignee: Thiruvalluvan M. G.
> Fix For: 1.9.0
>
> Attachments: AVRO-1750.patch
>
>
> It appears that a change was introduced to the {{avro::GenericDatum}} 
> implementation between 1.7.6 and 1.7.7 that causes unions to be handled 
> differently.
> The 1.7.6 implementation does this:
> {noformat}
> inline Type AVRO_DECL GenericDatum::type() const {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->type() : type_;
> }
> template<typename T>
> const T& GenericDatum::value() const {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->value<T>() :
>         *boost::any_cast<T>(&value_);
> }
> template<typename T>
> T& GenericDatum::value() {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->value<T>() :
>         *boost::any_cast<T>(&value_);
> }
> {noformat}
> …whereas the 1.7.7 implementation does this:
> {noformat}
> /**
>  * The avro data type this datum holds.
>  */
> Type type() const {
>     return type_;
> }
> /**
>  * Returns the value held by this datum.
>  * T The type for the value. This must correspond to the
>  * avro type returned by type().
>  */
> template<typename T> const T& value() const {
>     return *boost::any_cast<T>(&value_);
> }
> /**
>  * Returns the reference to the value held by this datum, which
>  * can be used to change the contents. Please note that only
>  * value can be changed, the data type of the value held cannot
>  * be changed.
>  *
>  * T The type for the value. This must correspond to the
>  * avro type returned by type().
>  */
> template<typename T> T& value() {
>     return *boost::any_cast<T>(&value_);
> }
> {noformat}
> The result of this is that, if the underlying value is an {{AVRO_UNION}}, 
> calls to {{GenericDatum::type}} and {{GenericDatum::value<>}} that previously 
> resolved to the union member type no longer do so (and user code relying on 
> that behavior has been broken).
> This change apparently was made as part of the changes for AVRO-1474; 
> however, looking at the comments in that issue, it's not clear to me why it 
> was required for that fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1821) Avro (Java) Memory Leak in ReflectData Caching

2016-04-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244938#comment-15244938
 ] 

ASF subversion and git services commented on AVRO-1821:
---

Commit b30b9e7a3365f50aa6f4481705937c462914764d in avro's branch 
refs/heads/master from [~rdblue]
[ https://git-wip-us.apache.org/repos/asf?p=avro.git;h=b30b9e7 ]

AVRO-1821: Add license header to TestReflectData.


> Avro (Java) Memory Leak in ReflectData Caching
> --
>
> Key: AVRO-1821
> URL: https://issues.apache.org/jira/browse/AVRO-1821
> Project: Avro
>  Issue Type: Bug
>  Components: java
> Environment: OS X 10.11.3
> {code}java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode){code}
>Reporter: Bryan Harclerode
>Assignee: Bryan Harclerode
> Attachments: 
> 0001-AVRO-1821-Fix-memory-leak-of-Schemas-in-ReflectData.patch
>
>
> I think I have encountered one of the memory leaks described by AVRO-1283 in 
> the way Java Avro implements field accessor caching in {{ReflectData}}. When 
> a reflected object is serialized, the key of {{ClassAccessorData.bySchema}} 
> (as retained by {{ReflectData.ACCESSOR_CACHE}}) retains a strong reference to 
> the schema that was used to serialize the object, but there exists no code 
> path for clearing these references after a schema will no longer be used.
> While in most cases, a class will probably only have one schema associated 
> with it (created and cached by {{ReflectData.getSchema(Type)}}), I 
> experienced {{OutOfMemoryError}} when serializing generic classes with 
> dynamically-generated schemas. The following is a minimal example which will 
> exhaust a 50MiB heap ({{-Xmx50m}}) after about 190K iterations:
> {code:title=AvroMemoryLeakMinimal.java|borderStyle=solid}
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.util.Collections;
> import org.apache.avro.Schema;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.reflect.ReflectDatumWriter;
> public class AvroMemoryLeakMinimal {
>     public static void main(String[] args) throws IOException {
>         long count = 0;
>         EncoderFactory encFactory = EncoderFactory.get();
>         try {
>             while (true) {
>                 // Create schema
>                 Schema schema = Schema.createRecord("schema", null, null, false);
>                 schema.setFields(Collections.emptyList());
>                 // serialize
>                 ByteArrayOutputStream baos = new ByteArrayOutputStream(1024);
>                 BinaryEncoder encoder = encFactory.binaryEncoder(baos, null);
>                 (new ReflectDatumWriter(schema)).write(new Object(), encoder);
>                 byte[] result = baos.toByteArray();
>                 count++;
>             }
>         } catch (OutOfMemoryError e) {
>             System.out.print("Memory exhausted after ");
>             System.out.print(count);
>             System.out.println(" schemas");
>             throw e;
>         }
>     }
> }
> {code}
> I was able to fix the bug in the latest 1.9.0-SNAPSHOT from git with the 
> following patch to {{ClassAccessorData.bySchema}} to use weak keys so that it 
> properly released the {{Schema}} objects if no other threads are still 
> referencing them:
> {code:title=ReflectData.java.patch|borderStyle=solid}
> --- a/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> +++ b/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> @@ -57,6 +57,7 @@ import org.apache.avro.io.DatumWriter;
>  import org.apache.avro.specific.FixedSize;
>  import org.apache.avro.specific.SpecificData;
>  import org.apache.avro.SchemaNormalization;
> +import org.apache.avro.util.WeakIdentityHashMap;
>  import org.codehaus.jackson.JsonNode;
>  import org.codehaus.jackson.node.NullNode;
>  
> @@ -234,8 +235,8 @@ public class ReflectData extends SpecificData {
>      private final Class<?> clazz;
>      private final Map<String, FieldAccessor> byName =
>          new HashMap<String, FieldAccessor>();
> -    private final IdentityHashMap<Schema, FieldAccessor[]> bySchema =
> -        new IdentityHashMap<Schema, FieldAccessor[]>();
> +    private final WeakIdentityHashMap<Schema, FieldAccessor[]> bySchema =
> +        new WeakIdentityHashMap<Schema, FieldAccessor[]>();
>  
>      private ClassAccessorData(Class<?> c) {
>        clazz = c;
> {code}
> Additionally, I'm not sure why an {{IdentityHashMap}} was used instead of a 
> standard {{HashMap}}, since two equivalent schemas have the same set of 
> {{FieldAccessor}}. Everything appears to work and all tests pass if I use a 
> {{WeakHashMap}} instead of an 

[jira] [Commented] (AVRO-1821) Avro (Java) Memory Leak in ReflectData Caching

2016-04-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244939#comment-15244939
 ] 

Ryan Blue commented on AVRO-1821:
-

Fixed. Thanks for catching that, [~nielsbasjes]!

> Avro (Java) Memory Leak in ReflectData Caching
> --
>
> Key: AVRO-1821
> URL: https://issues.apache.org/jira/browse/AVRO-1821
> Project: Avro
>  Issue Type: Bug
>  Components: java
> Environment: OS X 10.11.3
> {code}java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode){code}
>Reporter: Bryan Harclerode
>Assignee: Bryan Harclerode
> Attachments: 
> 0001-AVRO-1821-Fix-memory-leak-of-Schemas-in-ReflectData.patch
>
>
> I think I have encountered one of the memory leaks described by AVRO-1283 in 
> the way Java Avro implements field accessor caching in {{ReflectData}}. When 
> a reflected object is serialized, the key of {{ClassAccessorData.bySchema}} 
> (as retained by {{ReflectData.ACCESSOR_CACHE}}) retains a strong reference to 
> the schema that was used to serialize the object, but there exists no code 
> path for clearing these references after a schema will no longer be used.
> While in most cases, a class will probably only have one schema associated 
> with it (created and cached by {{ReflectData.getSchema(Type)}}), I 
> experienced {{OutOfMemoryError}} when serializing generic classes with 
> dynamically-generated schemas. The following is a minimal example which will 
> exhaust a 50MiB heap ({{-Xmx50m}}) after about 190K iterations:
> {code:title=AvroMemoryLeakMinimal.java|borderStyle=solid}
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.util.Collections;
> import org.apache.avro.Schema;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.reflect.ReflectDatumWriter;
> public class AvroMemoryLeakMinimal {
>     public static void main(String[] args) throws IOException {
>         long count = 0;
>         EncoderFactory encFactory = EncoderFactory.get();
>         try {
>             while (true) {
>                 // Create schema
>                 Schema schema = Schema.createRecord("schema", null, null, false);
>                 schema.setFields(Collections.emptyList());
>                 // serialize
>                 ByteArrayOutputStream baos = new ByteArrayOutputStream(1024);
>                 BinaryEncoder encoder = encFactory.binaryEncoder(baos, null);
>                 (new ReflectDatumWriter(schema)).write(new Object(), encoder);
>                 byte[] result = baos.toByteArray();
>                 count++;
>             }
>         } catch (OutOfMemoryError e) {
>             System.out.print("Memory exhausted after ");
>             System.out.print(count);
>             System.out.println(" schemas");
>             throw e;
>         }
>     }
> }
> {code}
> I was able to fix the bug in the latest 1.9.0-SNAPSHOT from git with the 
> following patch to {{ClassAccessorData.bySchema}} to use weak keys so that it 
> properly released the {{Schema}} objects if no other threads are still 
> referencing them:
> {code:title=ReflectData.java.patch|borderStyle=solid}
> --- a/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> +++ b/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> @@ -57,6 +57,7 @@ import org.apache.avro.io.DatumWriter;
>  import org.apache.avro.specific.FixedSize;
>  import org.apache.avro.specific.SpecificData;
>  import org.apache.avro.SchemaNormalization;
> +import org.apache.avro.util.WeakIdentityHashMap;
>  import org.codehaus.jackson.JsonNode;
>  import org.codehaus.jackson.node.NullNode;
>  
> @@ -234,8 +235,8 @@ public class ReflectData extends SpecificData {
>      private final Class<?> clazz;
>      private final Map<String, FieldAccessor> byName =
>          new HashMap<String, FieldAccessor>();
> -    private final IdentityHashMap<Schema, FieldAccessor[]> bySchema =
> -        new IdentityHashMap<Schema, FieldAccessor[]>();
> +    private final WeakIdentityHashMap<Schema, FieldAccessor[]> bySchema =
> +        new WeakIdentityHashMap<Schema, FieldAccessor[]>();
>  
>      private ClassAccessorData(Class<?> c) {
>        clazz = c;
> {code}
> Additionally, I'm not sure why an {{IdentityHashMap}} was used instead of a 
> standard {{HashMap}}, since two equivalent schemas have the same set of 
> {{FieldAccessor}}. Everything appears to work and all tests pass if I use a 
> {{WeakHashMap}} instead of an {{WeakIdentityHashMap}}, but I don't know if 
> there was some other reason object identity was important for this map. If a 
> non-identity map can be used, this will help reduce memory/CPU usage further 
> by not 

[jira] [Updated] (AVRO-1826) build.sh rat fails over extra license files and many others.

2016-04-17 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated AVRO-1826:
---
   Resolution: Fixed
Fix Version/s: 1.8.1
   Status: Resolved  (was: Patch Available)

Committed

> build.sh rat fails over extra license files and many others.
> 
>
> Key: AVRO-1826
> URL: https://issues.apache.org/jira/browse/AVRO-1826
> Project: Avro
>  Issue Type: Bug
>Reporter: Niels Basjes
>Assignee: Niels Basjes
> Fix For: 1.8.1
>
> Attachments: AVRO-1826-20160410.patch
>
>
> When running ./build.sh rat this will fail due to several license related 
> files we recently added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1826) build.sh rat fails over extra license files and many others.

2016-04-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244929#comment-15244929
 ] 

ASF subversion and git services commented on AVRO-1826:
---

Commit 2875665a5bace62f4e7dfb01060cd50a08637b86 in avro's branch 
refs/heads/master from [~nielsbasjes]
[ https://git-wip-us.apache.org/repos/asf?p=avro.git;h=2875665 ]

AVRO-1826: build.sh rat fails over extra license files and many others.


> build.sh rat fails over extra license files and many others.
> 
>
> Key: AVRO-1826
> URL: https://issues.apache.org/jira/browse/AVRO-1826
> Project: Avro
>  Issue Type: Bug
>Reporter: Niels Basjes
>Assignee: Niels Basjes
> Attachments: AVRO-1826-20160410.patch
>
>
> When running ./build.sh rat this will fail due to several license related 
> files we recently added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1821) Avro (Java) Memory Leak in ReflectData Caching

2016-04-17 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244927#comment-15244927
 ] 

Niels Basjes commented on AVRO-1821:


[~rdblue] 
I was just verifying AVRO-1826 (the build rat problem) and it failed ...
The file 
lang/java/avro/src/test/java/org/apache/avro/reflect/TestReflectData.java is 
missing the appropriate copyright message.

> Avro (Java) Memory Leak in ReflectData Caching
> --
>
> Key: AVRO-1821
> URL: https://issues.apache.org/jira/browse/AVRO-1821
> Project: Avro
>  Issue Type: Bug
>  Components: java
> Environment: OS X 10.11.3
> {code}java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode){code}
>Reporter: Bryan Harclerode
>Assignee: Bryan Harclerode
> Attachments: 
> 0001-AVRO-1821-Fix-memory-leak-of-Schemas-in-ReflectData.patch
>
>
> I think I have encountered one of the memory leaks described by AVRO-1283 in 
> the way Java Avro implements field accessor caching in {{ReflectData}}. When 
> a reflected object is serialized, the key of {{ClassAccessorData.bySchema}} 
> (as retained by {{ReflectData.ACCESSOR_CACHE}}) retains a strong reference to 
> the schema that was used to serialize the object, but there exists no code 
> path for clearing these references after a schema will no longer be used.
> While in most cases, a class will probably only have one schema associated 
> with it (created and cached by {{ReflectData.getSchema(Type)}}), I 
> experienced {{OutOfMemoryError}} when serializing generic classes with 
> dynamically-generated schemas. The following is a minimal example which will 
> exhaust a 50MiB heap ({{-Xmx50m}}) after about 190K iterations:
> {code:title=AvroMemoryLeakMinimal.java|borderStyle=solid}
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.util.Collections;
> import org.apache.avro.Schema;
> import org.apache.avro.io.BinaryEncoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.reflect.ReflectDatumWriter;
> public class AvroMemoryLeakMinimal {
>     public static void main(String[] args) throws IOException {
>         long count = 0;
>         EncoderFactory encFactory = EncoderFactory.get();
>         try {
>             while (true) {
>                 // Create schema
>                 Schema schema = Schema.createRecord("schema", null, null, false);
>                 schema.setFields(Collections.emptyList());
>                 // serialize
>                 ByteArrayOutputStream baos = new ByteArrayOutputStream(1024);
>                 BinaryEncoder encoder = encFactory.binaryEncoder(baos, null);
>                 (new ReflectDatumWriter(schema)).write(new Object(), encoder);
>                 byte[] result = baos.toByteArray();
>                 count++;
>             }
>         } catch (OutOfMemoryError e) {
>             System.out.print("Memory exhausted after ");
>             System.out.print(count);
>             System.out.println(" schemas");
>             throw e;
>         }
>     }
> }
> {code}
> I was able to fix the bug in the latest 1.9.0-SNAPSHOT from git with the 
> following patch to {{ClassAccessorData.bySchema}} to use weak keys so that it 
> properly released the {{Schema}} objects if no other threads are still 
> referencing them:
> {code:title=ReflectData.java.patch|borderStyle=solid}
> --- a/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> +++ b/lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java
> @@ -57,6 +57,7 @@ import org.apache.avro.io.DatumWriter;
>  import org.apache.avro.specific.FixedSize;
>  import org.apache.avro.specific.SpecificData;
>  import org.apache.avro.SchemaNormalization;
> +import org.apache.avro.util.WeakIdentityHashMap;
>  import org.codehaus.jackson.JsonNode;
>  import org.codehaus.jackson.node.NullNode;
>  
> @@ -234,8 +235,8 @@ public class ReflectData extends SpecificData {
>      private final Class<?> clazz;
>      private final Map<String, FieldAccessor> byName =
>          new HashMap<String, FieldAccessor>();
> -    private final IdentityHashMap<Schema, FieldAccessor[]> bySchema =
> -        new IdentityHashMap<Schema, FieldAccessor[]>();
> +    private final WeakIdentityHashMap<Schema, FieldAccessor[]> bySchema =
> +        new WeakIdentityHashMap<Schema, FieldAccessor[]>();
>  
>      private ClassAccessorData(Class<?> c) {
>        clazz = c;
> {code}
> Additionally, I'm not sure why an {{IdentityHashMap}} was used instead of a 
> standard {{HashMap}}, since two equivalent schemas have the same set of 
> {{FieldAccessor}}. Everything appears to work and all tests pass if I use a 
> {{WeakHashMap}} instead of an {{WeakIdentityHashMap}}, but I don't know if 
> 

[jira] [Updated] (AVRO-1825) Allow running build.sh dist under git

2016-04-17 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated AVRO-1825:
---
   Resolution: Fixed
Fix Version/s: 1.8.1
   Status: Resolved  (was: Patch Available)

Committed.

> Allow running build.sh dist under git
> -
>
> Key: AVRO-1825
> URL: https://issues.apache.org/jira/browse/AVRO-1825
> Project: Avro
>  Issue Type: Improvement
>  Components: build
>Reporter: Niels Basjes
>Assignee: Niels Basjes
> Fix For: 1.8.1
>
> Attachments: AVRO-1825-20160409.patch
>
>
> When working of a git clone instead of an svn checkout the build.sh dist 
> cannot run due to an explicit dependency on the fact that the working 
> directory must be an svn checkout.
> This should be a bit more flexible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1825) Allow running build.sh dist under git

2016-04-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244909#comment-15244909
 ] 

ASF subversion and git services commented on AVRO-1825:
---

Commit 4996990874d127f4908c9f55f7bc3a4334e2aed4 in avro's branch 
refs/heads/master from [~nielsbasjes]
[ https://git-wip-us.apache.org/repos/asf?p=avro.git;h=4996990 ]

AVRO-1825: Allow running build.sh dist under git


> Allow running build.sh dist under git
> -
>
> Key: AVRO-1825
> URL: https://issues.apache.org/jira/browse/AVRO-1825
> Project: Avro
>  Issue Type: Improvement
>  Components: build
>Reporter: Niels Basjes
>Assignee: Niels Basjes
> Attachments: AVRO-1825-20160409.patch
>
>
> When working of a git clone instead of an svn checkout the build.sh dist 
> cannot run due to an explicit dependency on the fact that the working 
> directory must be an svn checkout.
> This should be a bit more flexible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-1794) Update docs after migration to git

2016-04-17 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244906#comment-15244906
 ] 

Niels Basjes commented on AVRO-1794:


This one needs an update too 
https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute

> Update docs after migration to git
> --
>
> Key: AVRO-1794
> URL: https://issues.apache.org/jira/browse/AVRO-1794
> Project: Avro
>  Issue Type: Task
>  Components: doc
>Reporter: Ryan Blue
>
> The [vote to move to 
> git|https://mail-archives.apache.org/mod_mbox/avro-dev/201602.mbox/%3C56AFB9B9.8000304%40apache.org%3E]
>  just passed. Once the INFRA ticket is completed, we will need to [update 
> docs|https://avro.apache.org/version_control.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Avro union compatibility mode enhancement proposal

2016-04-17 Thread Matthieu Monsch
Great!

Summarizing:

+ For `enum`s, we can go with the approach described in AVRO-1340 [1]. The only 
missing step is to agree on the `.avsc` and `.avdl` syntax (can be discussed in 
that ticket).
+ For unions, we will add an optional catch-all attribute to mark a branch as 
resolution target when no names or aliases match (and come up with the 
corresponding syntax).

Does this sound like a good way forward?

[1] https://issues.apache.org/jira/browse/AVRO-1340
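Purely as a strawman for the union case (no syntax has been agreed in this 
thread; the attribute name "catchAll" is hypothetical), the reader's {{.avsc}} 
might mark the catch-all branch like this:

{code}
{
  "type": "record",
  "name": "Garage",
  "fields": [
    {
      "name": "vehicle",
      "type": [
        {"type": "record", "name": "Car", "fields": []},
        {"type": "record", "name": "Truck", "fields": []},
        {"type": "record", "name": "Vehicle", "fields": [], "catchAll": true}
      ]
    }
  ]
}
{code}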



> On Apr 17, 2016, at 10:14 AM, Ryan Blue  wrote:
> 
> Thanks, Zoltan! Good to know there's an issue open for the enum enhancement
> and that you already suggested the fix we came up with in this thread. Lets
> use AVRO-1340 to track this.
> 
> rb
> 
> On Sun, Apr 17, 2016 at 4:49 AM, Zoltan Farkas > wrote:
> 
>> +1 , it would resolve AVRO-1340 as well!
>> 
>> --z
>> 
>> 
>> 
>>> On Apr 16, 2016, at 10:26 PM, Ryan Blue 
>> wrote:
>>> 
>>> +1
>>> 
>>> On Sat, Apr 16, 2016 at 5:54 PM, Matthieu Monsch 
>>> wrote:
>>> 
> I think it may make sense for
> enums to add a special value for it.
 
 That would work great. To give slightly more flexibility to users, we
 could even allow the reader’s schema to explicitly specify which symbol
>> to
 use when it reads an unknown symbol. If not specified, resolution would
 fail (consistent with the current behavior).
 
 For example (assuming the corresponding key is called “onUnknown”):
 
> {
> “type”: “enum”,
> “name”: “Platform”,
> “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”],
> “onUnknown”: “UNSUPPORTED"
> }
 
 This `enum` would then be able to resolve schemas with extra symbols
>> (read
 as `UNSUPPORTED`).
 
 
 
>> On Apr 16, 2016, at 1:36 PM, Ryan Blue 
> wrote:
> 
> Matthieu, how would that work with enums? I think it may make sense for
> enums to add a special value for it.
> 
> rb
> 
> On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch 
> wrote:
> 
>>> I think that substituting different data for
>>> unrecognized branches in a union isn't the way to fix the problem,
>> but
 I
>>> have a proposal for a way to fix it that looks more like you'd expect
 in
>>> the OO example above: by adding the Vehicle class to your read
>> schema.
>>> 
>>> Right now, unions are first resolved by full class name as required
>> by
>> the
>>> spec. But after that, we have some additional rules to match schemas.
>> These
>>> rules are how we reconcile name differences from situations like
 writing
>>> with a generic class and reading with a specific class. I'm proposing
 you
>>> use a catch-all class (the superclass) with fields that are in all of
 the
>>> union's branches, and we update schema resolution to allow it.
>> 
>> 
>> That sounds good. The only thing I’d add is to make this catch-all
>> behavior explicit in the schema (similar to how field defaults must be
>> explicitly added).
>> To help fix another common writer evolution issue, we could also add a
>> similar catch-all for `enum`s (optional, to be explicitly specified in
 the
>> schema).
>> -Matthieu
>> 
>> 
>> 
 On Apr 12, 2016, at 2:41 PM, Ryan Blue 
>>> wrote:
>>> 
>>> Yacine,
>>> 
>>> Thanks for the extra information. Sorry for my delay in replying, I
>> wanted
>>> to time think about your suggestion.
>>> 
>>> I see what you mean that you can think of a union as the superclass
>> of
>> its
>>> options. The reflect object model has test code that does just that,
>> where
>>> classes B and C inherit from A and the schema for A is created as a
 union
>>> of B and C. But, I don't think that your suggestion aligns with the
>>> expectations of object oriented design. Maybe that's an easier way to
>>> present my concern:
>>> 
>>> Say I have a class, Vehicle, with subclasses Car and Truck. I have
>>> applications that work with my dataset, the vehicles that my company
>> owns,
>>> and we buy a bus. If I add a Bus class, what normally happens is that
 an
>>> older application can work with it. A maintenance tracker would call
>>> getVehicles and can still get the lastMaintenanceDate for my Bus,
>> even
>>> though it doesn't know about busses. But what you suggest is that it
>> is
>>> replaced with a default, say null, in cases like this.
>>> 
>>> I think that the problem is that Avro has no equivalent concept of
>>> inheritance.
>>> There's only one way to represent it for what you need right now,
>> like
>>> Matthieu suggested. I think that substituting different data for
>>> unrecognized branches in a union 

Re: Moved to git!

2016-04-17 Thread Niels Basjes
Hi,

I just noticed both these pages
 http://avro.apache.org/version_control.html
 https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute
have not been updated to reflect the change to git.

Can one of you guys pick this up please?

Niels Basjes


On Wed, Feb 10, 2016 at 5:13 PM, Sean Busbey  wrote:

> On Tue, Feb 9, 2016 at 8:56 AM, Tom White  wrote:
>
> > On Thu, Feb 4, 2016 at 11:23 PM, Ryan Blue  wrote:
> > > The new git repository is live! You can clone it from here:
> > >
> > >   https://git-wip-us.apache.org/repos/asf/avro.git
> > >
> > > It looks like the commit hashes are identical to the ones in the github
> > > mirror, so it should just appear like trunk has been renamed to master
> if
> > > you've already cloned the github mirror. In that case, just run this:
> > >
> > >   git remote add apache
> https://git-wip-us.apache.org/repos/asf/avro.git
> > >
> > > The old SVN repository is still RW so we can change the site/ folders,
> so
> > > please remember to push to the git repo's master instead of committing
> > code
> > > changes to SVN. Does anyone know if we still use those folders? If not,
> > then
> > > we can probably switch it over to read-only.
> >
> > The site folder (https://svn.apache.org/repos/asf/avro/site) contains
> > the website, so we do still need to able to write to it. As far as I
> > know, sites still use svnpubsub at the ASF so we need to keep these
> > pages in svn (see http://www.apache.org/dev/project-site.html).
> >
> > Tom
> >
> >
> There's a gitpubsub now, so we could move the site over to git if we so
> desired. It can either be a branch (asf-site) or it can be in its own git
> repo (useful if we want to automate site generation and publication at some
> point).
>
> --
> Sean
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


[jira] [Assigned] (AVRO-1750) GenericDatum API behavior breaking change

2016-04-17 Thread Thiruvalluvan M. G. (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. reassigned AVRO-1750:
-

Assignee: Thiruvalluvan M. G.

> GenericDatum API behavior breaking change
> -
>
> Key: AVRO-1750
> URL: https://issues.apache.org/jira/browse/AVRO-1750
> Project: Avro
>  Issue Type: Bug
>  Components: c++
>Affects Versions: 1.7.7
>Reporter: Braden McDaniel
>Assignee: Thiruvalluvan M. G.
> Fix For: 1.9.0
>
> Attachments: AVRO-1750.patch
>
>
> It appears that a change was introduced to the {{avro::GenericDatum}} 
> implementation between 1.7.6 and 1.7.7 that causes unions to be handled 
> differently.
> The 1.7.6 implementation does this:
> {noformat}
> inline Type AVRO_DECL GenericDatum::type() const {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->type() : type_;
> }
> template<typename T>
> const T& GenericDatum::value() const {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->value<T>() :
>         *boost::any_cast<T>(&value_);
> }
> template<typename T>
> T& GenericDatum::value() {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->value<T>() :
>         *boost::any_cast<T>(&value_);
> }
> {noformat}
> …whereas the 1.7.7 implementation does this:
> {noformat}
> /**
>  * The avro data type this datum holds.
>  */
> Type type() const {
>     return type_;
> }
> /**
>  * Returns the value held by this datum.
>  * T The type for the value. This must correspond to the
>  * avro type returned by type().
>  */
> template<typename T> const T& value() const {
>     return *boost::any_cast<T>(&value_);
> }
> /**
>  * Returns the reference to the value held by this datum, which
>  * can be used to change the contents. Please note that only
>  * value can be changed, the data type of the value held cannot
>  * be changed.
>  *
>  * T The type for the value. This must correspond to the
>  * avro type returned by type().
>  */
> template<typename T> T& value() {
>     return *boost::any_cast<T>(&value_);
> }
> {noformat}
> The result of this is that, if the underlying value is an {{AVRO_UNION}}, 
> calls to {{GenericDatum::type}} and {{GenericDatum::value<>}} that previously 
> resolved to the union member type no longer do so (and user code relying on 
> that behavior has been broken).
> This change apparently was made as part of the changes for AVRO-1474; 
> however, looking at the comments in that issue, it's not clear to me why it 
> was required for that fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AVRO-1750) GenericDatum API behavior breaking change

2016-04-17 Thread Thiruvalluvan M. G. (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. updated AVRO-1750:
--
Attachment: AVRO-1750.patch

> GenericDatum API behavior breaking change
> -
>
> Key: AVRO-1750
> URL: https://issues.apache.org/jira/browse/AVRO-1750
> Project: Avro
>  Issue Type: Bug
>  Components: c++
>Affects Versions: 1.7.7
>Reporter: Braden McDaniel
> Fix For: 1.9.0
>
> Attachments: AVRO-1750.patch
>
>
> It appears that a change was introduced to the {{avro::GenericDatum}} 
> implementation between 1.7.6 and 1.7.7 that causes unions to be handled 
> differently.
> The 1.7.6 implementation does this:
> {noformat}
> inline Type AVRO_DECL GenericDatum::type() const {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->type() : type_;
> }
> template<typename T>
> const T& GenericDatum::value() const {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->value<T>() :
>         *boost::any_cast<T>(&value_);
> }
> template<typename T>
> T& GenericDatum::value() {
>     return (type_ == AVRO_UNION) ?
>         boost::any_cast<GenericUnion>(&value_)->value<T>() :
>         *boost::any_cast<T>(&value_);
> }
> {noformat}
> …whereas the 1.7.7 implementation does this:
> {noformat}
> /**
>  * The avro data type this datum holds.
>  */
> Type type() const {
>     return type_;
> }
> /**
>  * Returns the value held by this datum.
>  * T The type for the value. This must correspond to the
>  * avro type returned by type().
>  */
> template<typename T> const T& value() const {
>     return *boost::any_cast<T>(&value_);
> }
> /**
>  * Returns the reference to the value held by this datum, which
>  * can be used to change the contents. Please note that only
>  * value can be changed, the data type of the value held cannot
>  * be changed.
>  *
>  * T The type for the value. This must correspond to the
>  * avro type returned by type().
>  */
> template<typename T> T& value() {
>     return *boost::any_cast<T>(&value_);
> }
> {noformat}
> The result of this is that, if the underlying value is an {{AVRO_UNION}}, 
> calls to {{GenericDatum::type}} and {{GenericDatum::value<>}} that previously 
> resolved to the union member type no longer do so (and user code relying on 
> that behavior has been broken).
> This change apparently was made as part of the changes for AVRO-1474; 
> however, looking at the comments in that issue, it's not clear to me why it 
> was required for that fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AVRO-992) AVRO Path

2016-04-17 Thread Ivan Balashov (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244617#comment-15244617
 ] 

Ivan Balashov commented on AVRO-992:


Looks like this was implemented in https://github.com/wandoulabs/avpath

> AVRO Path
> -
>
> Key: AVRO-992
> URL: https://issues.apache.org/jira/browse/AVRO-992
> Project: Avro
>  Issue Type: New Feature
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Jason Rutherglen
>Priority: Minor
>
> Like XPath or JSON Path, it would be useful for AVRO to support a 'path' like 
> system to query on an AVRO object to return selected results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)