Hi Sachneet,
On Wed, Mar 26, 2014 at 8:37 AM, Sachneet Singh Bains <[email protected]> wrote:

> Hi Sean,
>
> My use case is to store incoming data (various sources) into a database
> like Cassandra. The data will be serialized using AVRO.

It would be foolish of me NOT to put in a plug here for Apache Gora [0].
Gora is an acronym for Generic Object Representation using Avro, so it may
do exactly what you are trying to do out of the box. Cassandra is just one
of the NoSQL databases we support in Gora. You can see more by reading the
site documentation.

[0] http://gora.apache.org

> My questions are:
>
> 1. What is the best way to do this ?

Right now in the gora-cassandra module we support the following Avro data
types: Type.STRING, Type.BOOLEAN, Type.BYTES, Type.DOUBLE, Type.FLOAT,
Type.INT, Type.LONG, Type.FIXED, Type.ARRAY, Type.MAP, Type.UNION,
Type.RECORD. For a more comprehensive overview of how we actually store the
data, you can post your question to dev@gora and we will reply in full.

> 2. How should I keep the schema information along with each record ?
> For e.g. two columns , one storing data and another schema/fingerprints ?

Well, this is certainly an option. Right now, though, it appears that we
store (prepend) the Schema with the data as it is. The storage logic is
focused on the data and not on the data schema/fingerprints. Therefore,
when executing Gora queries in Cassandra, we query the Cassandra keyspace
by column families. When we add sub/supercolumns, Gora keys are mapped to
Cassandra partition keys only. This is because we follow the Cassandra
logic where column family data is partitioned across nodes based on the row
key. You would therefore need to change some aspect of the data modeling if
you really wished to store metadata such as Schema and fingerprints
separately.

> 3. I see fingerprints as one option but how to make use of it ;
> where to maintain the schema repository and how to add fingerprints to data

I've never used fingerprints, so I cannot comment. Sorry!

> 4. Also, I am wondering if there is ant feature to automatically
> generate a schema from an incoming data (CSV format) ?

Everything for Java is Mavenized. There will be no ant target. You could
possibly write an implementation for avro-tools which would achieve this
for you. You can see the current options in avro-tools by looking into the
Main#Main() method:
https://svn.apache.org/repos/asf/avro/trunk/lang/java/tools/src/main/java/org/apache/avro/tool/Main.java

> 5. Is there any recommended database to store data in AVRO format
> (relational or Nosql) ?

No, there is no recommended DB. LOADS of use cases use many different DBs.
I would suggest you consider your data and how you will be querying it
before you choose your DB.

Hopefully some of the above gives food for thought.

Lewis
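
P.S. On question 4: to give an idea of what such a helper might look like
before wiring anything into avro-tools, here is a minimal sketch in plain
Java with no Avro dependency. The class and method names (CsvSchemaSketch,
guessType, inferSchema) are made up for illustration and are not part of
avro-tools or Gora. It assumes a simple CSV with a header row and no quoted
fields, and it guesses each column's type from a single sample row only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of avro-tools or Gora: builds an Avro record
// schema (as a JSON string) from a CSV header line and one sample data line.
class CsvSchemaSketch {

    // Guess an Avro primitive type name for a single CSV cell.
    // Empty or non-numeric cells fall back to "string".
    static String guessType(String cell) {
        String v = cell.trim();
        if (v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false")) {
            return "boolean";
        }
        try {
            Long.parseLong(v);
            return "long";
        } catch (NumberFormatException ignored) {
            // not an integer, try floating point next
        }
        try {
            Double.parseDouble(v);
            return "double";
        } catch (NumberFormatException ignored) {
            // not a number either
        }
        return "string";
    }

    // Build the schema JSON. Field names are taken from the header as-is;
    // this sketch does not check that they are valid Avro names.
    static String inferSchema(String recordName, String headerLine, String sampleLine) {
        // Naive split: assumes no quoted fields and no embedded commas.
        String[] names = headerLine.split(",", -1);
        String[] cells = sampleLine.split(",", -1);
        if (names.length != cells.length) {
            throw new IllegalArgumentException(
                "header and sample row have different column counts");
        }
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            fields.add("{\"name\":\"" + names[i].trim()
                + "\",\"type\":\"" + guessType(cells[i]) + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName
            + "\",\"fields\":[" + String.join(",", fields) + "]}";
    }
}
```

For example, inferSchema("Reading", "id,temp,ok,label",
"42,21.5,true,kitchen") yields a record with long, double, boolean and
string fields. A real tool would want to sample many rows, represent
missing values with unions that include null, and validate field names.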
