Wednesday, 8 March 2017

Avro: nested complex types

I've been trying to use Apache Avro to serialize and deserialize some Java objects.

I've found the documentation so useless that I've decided to write down my notes on solving all the problems, in order to save other people time.

If you are using a hard-coded schema (i.e. an .avsc file) to describe your objects, you need to create entries through the GenericRecord interface (GenericData.Record is the concrete class) and then populate them from your objects, like this:


        Schema schema = new Schema.Parser().parse(new File("src/test/resources/user.avsc"));
        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("name", "Alyssa");
        user1.put("favorite_number", 256);
        // more users can be created this way

        File file = new File("users.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
        dataFileWriter.create(schema, file);
        dataFileWriter.append(user1);
        dataFileWriter.close();

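So it's a little bit cumbersome. This example writes to a File, but DataFileWriter.create is also overloaded to take an OutputStream, so you are not restricted to file output.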
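
To illustrate writing somewhere other than a file, here is a minimal sketch that writes one record to an in-memory byte array and reads it back. The class name InMemoryAvroExample, its helper methods, and the inline two-field schema are my own illustration (not part of the original user.avsc setup), and it assumes the Avro Java library is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class InMemoryAvroExample {

    // Hypothetical inline schema so the sketch does not need a file on disk.
    static final String USER_SCHEMA_JSON =
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
        + " {\"name\": \"favorite_number\", \"type\": [\"int\", \"null\"]}]}";

    // Serializes a single user record into an Avro container held in memory.
    static byte[] writeUser(Schema schema, String name, int favoriteNumber) {
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", name);
        user.put("favorite_number", favoriteNumber);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out); // the OutputStream overload of create
            writer.append(user);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads the first record back; the schema comes from the container header.
    static GenericRecord readFirst(byte[] bytes) {
        try (DataFileStream<GenericRecord> reader = new DataFileStream<>(
                 new ByteArrayInputStream(bytes), new GenericDatumReader<GenericRecord>())) {
            return reader.next();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that string fields come back as Avro's own CharSequence implementation rather than java.lang.String, so call toString() before comparing them.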

So what if you want to have a collection of some custom objects? 
For example:

        Dog ollie = new Dog();
        ollie.setName("Ollie");
        ollie.setBreed("bulldog");
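
The Dog class itself is not shown in these notes. A minimal sketch, assuming it is an ordinary Java bean with just the two properties used here, might look like this:

```java
// Assumed shape of the custom object: a plain bean with two String properties.
public class Dog {
    private String name;
    private String breed;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getBreed() {
        return breed;
    }

    public void setBreed(String breed) {
        this.breed = breed;
    }
}
```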


Here's one way of doing it: change your schema file so that it contains a union of two record schemas, with Dog defined before the User record that refers to it, like this:

[
{"namespace": "example.avro",
 "type": "record",
 "name": "Dog",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "breed",  "type": "string"}
     ]
},

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]},
     {"name": "dogs", "type":["null", {
                              "type": "array",
                              "items": "Dog"
                          }]
     }
 ]
}]


And this means you have to change your code as follows:

        Schema schema = new Schema.Parser().parse(new File("src/test/resources/user.avsc"));
        GenericRecord user1 = new GenericData.Record(schema.getTypes().get(1)); // the second schema (User)
        user1.put("name", "Alyssa");
        user1.put("favorite_number", 256);

        // now to add the user's dogs:
        List<GenericRecord> list = new ArrayList<>();
        Schema dogs = schema.getTypes().get(0); // the first schema (Dog)
        GenericRecord dog2 = new GenericData.Record(dogs);
        dog2.put("name", ollie.getName());
        dog2.put("breed", ollie.getBreed());
        // add more dogs as desired
        list.add(dog2);

        user1.put("dogs", list);

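So this works. Note that your schema object is no longer a single record schema but a union of 2 schemas, which is why you have to pick out each record schema with getTypes() before creating records.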
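
Picking union members by position is fragile if the file is ever reordered. As a hedged alternative, the sketch below looks each record schema up by its full name and then checks the populated record with GenericData's validate method. The class UnionLookupExample and the helper branch are hypothetical names of mine, and the schema is inlined as a string purely so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionLookupExample {

    static final String DOG_JSON =
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"Dog\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
        + " {\"name\": \"breed\", \"type\": \"string\"}]}";

    static final String USER_JSON =
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
        + " {\"name\": \"favorite_number\", \"type\": [\"int\", \"null\"]},"
        + " {\"name\": \"favorite_color\", \"type\": [\"string\", \"null\"]},"
        + " {\"name\": \"dogs\", \"type\": [\"null\", {\"type\": \"array\", \"items\": \"Dog\"}]}]}";

    // Same shape as the union file above: Dog first, then User.
    static final String UNION_JSON = "[" + DOG_JSON + "," + USER_JSON + "]";

    // Hypothetical helper: finds a union member by full name instead of by index.
    static Schema branch(Schema union, String fullName) {
        if (union.getType() != Schema.Type.UNION) {
            throw new IllegalArgumentException("Not a union schema");
        }
        for (Schema candidate : union.getTypes()) {
            if (fullName.equals(candidate.getFullName())) {
                return candidate;
            }
        }
        throw new IllegalArgumentException("No union member named " + fullName);
    }

    // Builds a User record that owns a one-element list of Dog records.
    static GenericRecord buildUser(Schema union) {
        GenericRecord dog = new GenericData.Record(branch(union, "example.avro.Dog"));
        dog.put("name", "Ollie");
        dog.put("breed", "bulldog");

        List<GenericRecord> dogs = new ArrayList<>();
        dogs.add(dog);

        GenericRecord user = new GenericData.Record(branch(union, "example.avro.User"));
        user.put("name", "Alyssa");
        user.put("favorite_number", 256);
        user.put("dogs", dogs);
        return user;
    }
}
```

Calling GenericData.get().validate(user.getSchema(), user) on the result is a cheap way to catch a misnamed or mistyped field before you try to write the record.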

However, there is another way of doing it: keep each record schema in its own file, as follows:

dog.avsc:
{"namespace": "example.avro",
 "type": "record",
 "name": "Dog",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "breed",  "type": "string"}
     ]
}

user2.avsc:
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]},
     {"name": "dogs", "type":["null", {
                              "type": "array",
                              "items": "Dog"
                          }]
     }
 ]
}


        Schema.Parser parser = new Schema.Parser(); //need to use the same parser
        Schema dogs = parser.parse(new File("src/test/resources/dog.avsc"));
        Schema schema = parser.parse(new File("src/test/resources/user2.avsc"));

Two important things to remember:
Firstly, you need to use the same Parser object to parse both schemas, because the parser keeps track of the named types it has already seen.
Secondly, you need to parse them in "dependency order", i.e. if the User schema refers to the Dog type, then you need to parse the Dog schema before you can parse the User schema.
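
The two rules can be sketched without any files by parsing the schemas from strings. The class ParseOrderExample and its methods are hypothetical names for illustration, and the JSON strings are inline copies of the two schemas above.

```java
import org.apache.avro.Schema;

public class ParseOrderExample {

    static final String DOG_JSON =
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"Dog\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
        + " {\"name\": \"breed\", \"type\": \"string\"}]}";

    static final String USER_JSON =
        "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
        + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
        + " {\"name\": \"favorite_number\", \"type\": [\"int\", \"null\"]},"
        + " {\"name\": \"favorite_color\", \"type\": [\"string\", \"null\"]},"
        + " {\"name\": \"dogs\", \"type\": [\"null\", {\"type\": \"array\", \"items\": \"Dog\"}]}]}";

    // One shared parser, Dog parsed before the User schema that refers to it.
    static Schema parseUserWithDogs() {
        Schema.Parser parser = new Schema.Parser();
        parser.parse(DOG_JSON);         // registers example.avro.Dog with the parser
        return parser.parse(USER_JSON); // "Dog" now resolves to the type parsed above
    }

    // Digs the array element type out of the nullable "dogs" field.
    static Schema dogElementType(Schema userSchema) {
        for (Schema member : userSchema.getField("dogs").schema().getTypes()) {
            if (member.getType() == Schema.Type.ARRAY) {
                return member.getElementType();
            }
        }
        throw new IllegalStateException("dogs field has no array member");
    }
}
```

The element type of the dogs array in the parsed User schema is the Dog record itself, so you can use it directly to create the nested GenericData.Record instances.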