Build Your Own Schema Registry Server Using Python and Django

Posted by Panos M. on 12/Mar/2019 (09:28)

In the era of big data and event-driven software architectures, one fundamental element that needs careful thought and design is the structure of the data and events that one system generates and another consumes. Companies that do business online ask their developers to emit events for anything that business users consider necessary: necessary for them to understand things like user behavior, application performance, security anomalies and more.

This is where the problem starts to appear. A lot of events are being generated, each with a different type and structure. Before long, no one knows how many different types of events are generated, when they are generated, which information they should bear as optional and which as mandatory, what the valid rules for their property values are, etc.

In order to solve this event-chaotic situation, you can use a schema registry, i.e. a repository of meta-information, that gives details about the events that your company generates and consumes.

  • It will be the central place to go to when one needs to understand which events are generated.
  • It lets you find out the structure of each event type: the properties and their types, their valid sets of values and their descriptions.

And not only that.

  • You can use the schema repository to validate any incoming event document against the schema information that it should be compatible with.

Validating events makes sure that you don't accept or generate events that might break the process consuming them. It also tells the sender of an event which property in the event document is problematic.

But let's take things from the beginning.

JSON Representation

JSON is a very popular format for exchanging information between software systems. Hence, you can use it to format the events that travel from system to system, generated and consumed.

Here is an example of an event in JSON format:

{
  "action": "OfferAccept",
  "appId": "12-2x12",
  "category": "Seller.Dashboard",
  "eventType": "click",
  "pageUrl": "https://www.acme.com/offers/",
  "platform": "web",
  "userId": 194246
}

But, is this a valid event? In other words:

  • Does it contain all the required properties?
  • Do all properties have valid values? For example, the event property eventType has the value click. Is this allowed?

Or even more...

What if you wanted to see all the values that the action property could take, besides the OfferAccept value?

There is a way to do that. First, you need to define the rules this document should be compatible with.

Let's see how.

Background

JSON Schema

JSON Schema is the tool that will allow you to tell things about a JSON document. These things are mainly validation rules, like what values a property can take, or other annotations, like a description, that give extra information about the properties of a JSON document, in order to make it easier for the reader to understand what's inside.

What kind of tool is that?

Simply enough, it is another JSON document. In other words, we use a JSON document to define annotations and validations for another JSON document.

JSON Schema Document Annotates JSON Event Document

Having said that, let's define the JSON schema document for the example click event that we saw earlier:

{
  "description": "JSON documents of this schema will be Events describing user click actions.",
  "type": "object",
  "properties": {
    "action": {
      "type": "string",
      "enum": [
        "RFPstart",
        "Completed",
        "OBRstart",
        "RFPSSStart",
        "LoginClick",
        "LoginFirstClick",
        "SignupClick",
        "OfferAccept",
        "OfferReject",
        "Continue"
      ],
      "description": "This is the action that the user has taken on our site."
    },
    "appId": {
      "type": "string",
      "enum": ["12-2x12"],
      "description": "Takes the value that identifies which application is sending the event."
    },
    "category": {
      "type": "string",
      "enum": [
        "Buyer.OBR",
        "Buyer.RFP",
        "Buyer.RFPSS",
        "Seller.Dashboard",
        "User.Registration",
        "YellowPages.SellerSnippet"
      ],
      "description": "The category of the event."
    },
    "eventType": {
      "const": "click"
    },
    "pageUrl": {
      "type": "string",
      "format": "uri",
      "description": "It is the absolute URL of the page the user / visitor was on when they clicked on an HTML element."
    },
    "platform": {
      "type": "string",
      "enum": ["app", "web"],
      "description": "Takes the value that identifies the platform the application runs on when sending the event."
    },
    "userId": {
      "type": "number",
      "description": "This is the User id uniquely identifying the user that did the action."
    }
  },
  "required": [
    "action",
    "appId",
    "category",
    "eventType",
    "pageUrl",
    "platform"
  ],
  "additionalProperties": false
}

You don't have to understand all the details of this document right away. But a quick read reveals things like:

  • properties, which is an object describing which properties can appear in the JSON document this schema is about.
  • You can see things like the type of a property. For example, the userId property has the type number. This means that when an event JSON document is validated, its userId property needs to be a number, otherwise, it will be considered invalid.

So, we use a special language, the JSON Schema language, to describe rules about a JSON document. Needless to say, if you want to write correct schemas you will need to learn this language; its specification is published at json-schema.org. Online schema-authoring tools can also help you write a correct JSON schema.
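To make a few of these keywords concrete, here is a minimal sketch of how required, enum, const and additionalProperties could be checked by hand in Python. It is an illustration only and covers a tiny subset of JSON Schema; the function name check_event is hypothetical, and a real project would rely on a full validator library instead.

```python
def check_event(event, schema):
    """Return a list of problems; an empty list means none were detected.

    Hypothetical helper: it understands only `required`, `enum`, `const`
    and `additionalProperties: false`. It is not a full JSON Schema validator.
    """
    problems = []
    properties = schema.get("properties", {})

    # Every name listed under `required` must be present in the event.
    for name in schema.get("required", []):
        if name not in event:
            problems.append("'{}' is a required property".format(name))

    for name, value in event.items():
        rules = properties.get(name)
        if rules is None:
            # Unknown properties are rejected only when the schema forbids them.
            if schema.get("additionalProperties") is False:
                problems.append("'{}' is not an allowed property".format(name))
            continue
        if "enum" in rules and value not in rules["enum"]:
            problems.append("'{}' is not one of {}".format(value, rules["enum"]))
        if "const" in rules and value != rules["const"]:
            problems.append("'{}' must be '{}'".format(name, rules["const"]))

    return problems


# A cut-down version of the click event schema, kept short for the example.
schema = {
    "properties": {
        "eventType": {"const": "click"},
        "platform": {"enum": ["app", "web"]},
    },
    "required": ["eventType", "platform"],
    "additionalProperties": False,
}

print(check_event({"eventType": "click", "platform": "web"}, schema))  # → []
print(check_event({"eventType": "click", "platform": "tv", "x": 1}, schema))
```

The second call reports two problems: a platform value outside the enum and a property the schema does not know about.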

Online JSON Schema Validator

There are a lot of online JSON schema validators that can help you understand the above idea, i.e. the idea of having a JSON document being validated by another JSON document, the schema.

Here is one: https://www.jsonschemavalidator.net/. You can try it out by putting the JSON schema document in the area below Select Schema and the JSON document in the area below Input JSON. Play around by changing schema details or changing input JSON properties and see how it validates or not.

Online JSON Validator using JSON Schema

Self-Describing JSON Documents

One problem that we haven't solved yet is the following:

Given a JSON document, like an event, how do we know which JSON schema document to use in order to validate it?

The answer is self-describing JSON documents, in other words, JSON documents that describe themselves. Or, to put it more simply, JSON documents that tell us where their JSON schema document resides. We do that by using a special property in the JSON document, named $schema. The $schema property needs to point to the JSON schema that describes the validation rules for the JSON document at hand.

Self-Describing JSON Document Using $schema Property

Hence, the self-describing JSON document for the click event would have been this:

{
  "$schema": "com.acme.event_click.jsonschema.1-0-0.json",
  "action": "OfferAccept",
  "appId": "12-2x12",
  "category": "Seller.Dashboard",
  "eventType": "click",
  "pageUrl": "https://www.acme.com/offers/",
  "platform": "web",
  "userId": 194246
}

Where "$schema": "com.acme.event_click.jsonschema.1-0-0.json" is the property whose value points to the JSON schema document that bears the rules for the click event at hand.
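As a small illustration, this is how a program could read the $schema value out of such a document. It is a minimal sketch using only the Python standard library; the helper name schema_uri_of is hypothetical.

```python
import json


def schema_uri_of(raw_document):
    """Return the `$schema` value of a JSON document given as text.

    Hypothetical helper: raises ValueError when the text is not valid JSON,
    is not a JSON object, or does not describe itself.
    """
    document = json.loads(raw_document)  # JSONDecodeError is a ValueError too
    if not isinstance(document, dict) or "$schema" not in document:
        raise ValueError("document is not self-describing: no $schema property")
    return document["$schema"]


raw = '{"$schema": "com.acme.event_click.jsonschema.1-0-0.json", "eventType": "click"}'
print(schema_uri_of(raw))  # → com.acme.event_click.jsonschema.1-0-0.json
```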

URL for $schema

In order to be more precise, the $schema value needs to be a URI, i.e. should precisely define where the JSON schema document resides, for example using a URL like: https://schema-registry.acme.com/event_click.jsonschema.1-0-0.json.

Hence, we know exactly where to go to in order to find the validation rules for the JSON document at hand.
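A URI can be taken apart programmatically, which is handy when a server has to decide how to fetch a schema. The following sketch uses urllib.parse from the standard library; the helper name describe_schema_uri and the keys of the returned dictionary are assumptions made for this example.

```python
import posixpath
from urllib.parse import urlparse


def describe_schema_uri(uri):
    """Split a $schema URI into the parts a registry would care about."""
    parts = urlparse(uri)
    return {
        'scheme': parts.scheme,   # e.g. "https" or "file"
        'host': parts.netloc,     # empty for file:/// URIs
        'file_name': posixpath.basename(parts.path),
    }


print(describe_schema_uri(
    'https://schema-registry.acme.com/event_click.jsonschema.1-0-0.json'
))
```

The scheme tells us which protocol to use, the host tells us which server to contact, and the file name identifies the schema document itself.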

$schema In Properties

And since we want all of our JSON documents to be self-describing, in other words, to have the $schema property, that property should be defined in the corresponding JSON schema. And not only that, it needs to be specified as mandatory. Hence, the proper JSON schema document for the example click event should be:

{
  "description": "JSON documents of this schema will be Events describing user click actions.",
  "type": "object",
  "properties": {
    "$schema": {
      "type": "string",
      "format": "uri-reference"
    },  
    "action": {
...
  },
  "required": [
    "$schema", 
    "action",
...
  ],
  "additionalProperties": false
}

Notice the $schema entry in the list of properties:

...
"$schema": {
      "type": "string",
      "format": "uri-reference"
    },
...      

The value of the format is uri-reference. See also how it has been specified in the required properties.

Self-Describing JSON Schema Documents

But this is recursive: the JSON schema documents, which are used to describe other JSON documents, are JSON documents themselves. Hence:

  • we need a way to describe their validation rules
  • we need a way to make them self-describing too, so that we know where to go to in order to find their validation rules.

In order to achieve these, we use the same logic. We create JSON schema documents that describe JSON schema documents and we add the property $schema inside the original JSON schema document in order to make them self-describing.

Self-Describing JSON Schema Document

Usually, the $schema property of a JSON schema document points to the version of the JSON schema specification that has been used to write the JSON schema. For example, the following JSON schema document is the JSON schema document for the click event, and its own JSON schema is defined to be JSON schema specification draft version 7.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "JSON documents of this schema will be Events describing user click actions.",
  "type": "object",
  "properties": {
    "$schema": {
      "type": "string",
      "format": "uri-reference"
    },
    "action": {
      "type": "string",
      "enum": [
...
  "additionalProperties": false
}

Meta: Sometimes, a document that describes rules about another document is called a meta-document. Hence, a JSON schema document is a meta-document for the JSON document it describes. Following this rule, the Draft 07 Schema specification, which describes the rules for JSON schema documents, is a meta-schema, because it describes a schema rather than a document.
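The chain from event to schema to meta-schema can be sketched in a few lines of Python. The example below follows $schema pointers through an in-memory dictionary that stands in for a registry; the function name schema_chain, the registry contents and the shortened documents are assumptions made for the illustration.

```python
EVENT_SCHEMA_URI = 'https://schema-registry.acme.com/event_click.jsonschema.1-0-0.json'
META_SCHEMA_URI = 'http://json-schema.org/draft-07/schema#'

# Hypothetical in-memory registry: URI -> (heavily shortened) document.
registry = {
    EVENT_SCHEMA_URI: {'$schema': META_SCHEMA_URI, 'type': 'object'},
    # The meta-schema describes itself, so it points back to its own URI.
    META_SCHEMA_URI: {'$schema': META_SCHEMA_URI, 'title': 'Core schema meta-schema'},
}


def schema_chain(document, registry):
    """Follow `$schema` pointers and return the list of URIs visited."""
    chain = []
    while isinstance(document, dict) and '$schema' in document:
        uri = document['$schema']
        if uri in chain:
            break  # we reached a document that describes itself
        chain.append(uri)
        document = registry.get(uri)  # None ends the walk for unknown URIs
    return chain


event = {'$schema': EVENT_SCHEMA_URI, 'eventType': 'click'}
for uri in schema_chain(event, registry):
    print(uri)
```

The walk visits the event schema first and the meta-schema second, then stops because the meta-schema points to itself.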

JSON Validation Programming

Having built the background, let's now see how we can develop our own schema registry and validation server. In other words, we will develop a Web server that will:

  • store our schemas
  • validate a JSON document against a schema

For this tutorial, we will use Python 3.7 and Django 2.1.

Start a new Django Project and Application

We are using pipenv to set up the environment for our new Django project.

$ pipenv --python 3.7
Creating a virtualenv for this project…
Pipfile: /Users/...erver/Pipfile
Using /usr/local/opt/python/libexec/bin/python (3.7.2) to create virtualenv…
⠙ Creating virtual environment...Using base prefix '/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7'
New python executable in /Users/...erver--0VpK2Ai/bin/python3.7
Also creating executable in /Users/...erver--0VpK2Ai/bin/python
Installing setuptools, pip, wheel...done.
Running virtualenv with interpreter /usr/local/opt/python/libexec/bin/python

✔ Successfully created virtual environment! 
Virtualenv location: /Users/...erver--0VpK2Ai
Creating a Pipfile for this project…
$ pipenv shell
Launching subshell in virtual environment…
 . /Users/...erver--0VpK2Ai/bin/activate
.bashrc...........
$  . /Users/...erver--0VpK2Ai/bin/activate
$ 

We edit the Pipfile to add the Django version we will use:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
Django = "==2.1.*"

[requires]
python_version = "3.7"

And we install with pipenv install -d:

$ pipenv install -d
Pipfile.lock not found, creating…
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
✔ Success! 
Updated Pipfile.lock (81d4bf)!
Installing dependencies from Pipfile.lock (81d4bf)…
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 2/2 — 00:00:07
$ 

We start a new Django project:

$ django-admin startproject json_validator .
$

And then we start a new application:

$ django-admin startapp main
$

Project Configuration Settings

This project needs only minimal configuration for the server to run. For example, it does not need any database configuration. Let's see the minimal json_validator/settings.py file:

# json_validator/settings.py
"""
Django settings for json_validator project.

Generated by 'django-admin startproject' using Django 2.1.7.

For more information on this file, see
https://docs.djangoproject.com/en/2.1/topics/settings/

For the full list of settings and their values, see
https://docs.djangoproject.com/en/2.1/ref/settings/
"""

import os

# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/2.1/howto/deployment/checklist/

# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = 'mn&d5u-@)tz*2mwm&gjeq(ptdz0xy3gx6vu=qg_7ockfe7yvzr'

# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True

ALLOWED_HOSTS = []


# Application definition

INSTALLED_APPS = [
    'django.contrib.contenttypes',
    'django.contrib.auth',
    'main',
]

MIDDLEWARE = [
]

ROOT_URLCONF = 'json_validator.urls'

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'DIRS': [],
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                'django.template.context_processors.debug',
                'django.template.context_processors.request',
                'django.contrib.auth.context_processors.auth',
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]

WSGI_APPLICATION = 'json_validator.wsgi.application'

# Internationalization
# https://docs.djangoproject.com/en/2.1/topics/i18n/

LANGUAGE_CODE = 'en-us'

TIME_ZONE = 'UTC'

USE_I18N = True

USE_L10N = True

USE_TZ = True


# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/2.1/howto/static-files/

STATIC_URL = '/static/'

Upload Your Schemas Into a Server-based Location

In order for the JSON validator to work as a server, it needs to have access to the JSON schemas. Simply enough, let's upload them to a new folder schemas. As an example, after creating the folder schemas, upload the click event JSON schema that we saw earlier. Use the filename com.acme.event_click.jsonschema.1-0-0.json:

JSON Schema Document for event click

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "JSON documents of this schema will be Events describing user click actions.",
  "type": "object",
  "properties": {
    "$schema": {
          "type": "string",
          "format": "uri-reference"
        },
    "action": {
      "type": "string",
      "enum": [
        "RFPstart",
        "Completed",
        "OBRstart",
        "RFPSSStart",
        "LoginClick",
        "LoginFirstClick",
        "SignupClick",
        "OfferAccept",
        "OfferReject",
        "Continue"
      ],
      "description": "This is the action that the user has taken on our site."
    },
    "appId": {
      "type": "string",
      "enum": ["12-2x12"],
      "description": "Takes the value that identifies which application is sending the event."
    },
    "category": {
      "type": "string",
      "enum": [
        "Buyer.OBR",
        "Buyer.RFP",
        "Buyer.RFPSS",
        "Seller.Dashboard",
        "User.Registration",
        "YellowPages.SellerSnippet"
      ],
      "description": "The category of the event."
    },
    "eventType": {
      "const": "click"
    },
    "pageUrl": {
      "type": "string",
      "format": "uri",
      "description": "It is the absolute URL of the page the user / visitor was on when they clicked on an HTML element."
    },
    "platform": {
      "type": "string",
      "enum": ["app", "web"],
      "description": "Takes the value that identifies the platform the application runs on when sending the event."
    },
    "userId": {
      "type": "number",
      "description": "This is the User id uniquely identifying the user that did the action."
    }
  },
  "required": [
    "$schema",
    "action",
    "appId",
    "category",
    "eventType",
    "pageUrl",
    "platform"
  ],
  "additionalProperties": false
}

Validation Endpoint - POST /validate

With the JSON schemas in place, we now need to expose an endpoint for the JSON validator to accept documents to be validated. Let's implement a POST /validate endpoint for that. The JSON document to be validated will be sent in the body of the request.

In order to implement this endpoint, we have to define the path details inside the json_validator/urls.py file:

# json_validator/urls.py
#
from django.urls import path
from main.views.schemas import validate

urlpatterns = [
    path(r'validate', validate),
]

We put our function-based views inside the file main/views/schemas.py.

Important: Don't forget to remove the file main/views.py that is automatically generated by the django-admin command.

Here is the view:

# main/views/schemas.py
#
import json
import jsonschema
from django.http import HttpResponse, JsonResponse
from django.views.decorators.http import require_http_methods

from lib.validate_json import validate_json_content

@require_http_methods(['POST'])
def validate(request):
    error = None

    try:
        validate_json_content(request.body)
    except jsonschema.exceptions.ValidationError as e:
        error = {
            'error': e.message
        }
    except json.decoder.JSONDecodeError as e:
        error = {
            'error': 'not valid JSON document. Specific error: {}'.format(' - '.join(e.args))
        }

    if error:
        return JsonResponse(data=error, safe=True, status=422)

    return HttpResponse(status=200)

This is a very simple implementation. It relies on the validate_json_content() function, which takes the JSON document arriving in request.body and

  1. returns without error if the JSON document is valid
  2. raises jsonschema.exceptions.ValidationError if JSON document is not valid according to the JSON schema rules, or
  3. raises json.decoder.JSONDecodeError if request.body is not a JSON document.
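One case is not covered by the three outcomes above: a well-formed JSON document that has no $schema property at all. In the view as written, that lookup would raise a KeyError that nobody catches, so the client would most likely get a server error instead of a helpful message. The sketch below shows one possible way to map all outcomes to a status code and a payload. It deliberately avoids Django and jsonschema so that it runs on its own; SchemaViolation, validation_outcome and the two toy validators are hypothetical names introduced for this example.

```python
import json


class SchemaViolation(Exception):
    """Hypothetical stand-in for the validator library's validation error."""


def validation_outcome(body, validator):
    """Return (status_code, payload) for a raw request body.

    `validator` is any callable that raises SchemaViolation when the
    document breaks its schema.
    """
    try:
        document = json.loads(body)
    except json.JSONDecodeError as e:
        return 422, {'error': 'not valid JSON document. Specific error: {}'.format(e)}

    # A document without $schema cannot be matched to any schema at all.
    if not isinstance(document, dict) or '$schema' not in document:
        return 422, {'error': "'$schema' is a required property"}

    try:
        validator(document)
    except SchemaViolation as e:
        return 422, {'error': str(e)}

    return 200, None


def always_valid(document):
    """Hypothetical validator that accepts everything."""


def always_invalid(document):
    """Hypothetical validator that rejects everything."""
    raise SchemaViolation("'action' is a required property")


print(validation_outcome(b'{"$schema": "file:///x.json"}', always_valid))  # → (200, None)
print(validation_outcome(b'{"appId": "12-2x12"}', always_valid))
print(validation_outcome(b'not json', always_valid)[0])  # → 422
```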

validate_json_content()

The validate_json_content() function should be implemented like this, inside the file lib/validate_json.py:

import json

import jsonschema
import urllib.request

from django.conf import settings

FILE_PROTOCOL = 'file://'


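def validate_json_content(json_content):
    # Parse the raw request body; raises json.decoder.JSONDecodeError on bad JSON.
    json_content = json.loads(json_content)
    # The document must be self-describing: $schema says where its schema lives.
    schema = json_content['$schema']

    if schema.startswith(FILE_PROTOCOL):
        # file:///name.json -> <SCHEMAS_LOCAL_DIR>/name.json
        schema_json_file_name = schema[len(FILE_PROTOCOL) + 1:]
        schema_json_file_name = "{}/{}".format(settings.SCHEMAS_LOCAL_DIR, schema_json_file_name)

        # Use a context manager so that the file handle is always closed.
        with open(schema_json_file_name) as schema_file:
            schema_content = json.load(schema_file)
    else:
        # Any other URI (e.g. https://...) is fetched over the Web.
        schema_content = urllib.request.urlopen(schema).read()
        schema_content = json.loads(schema_content)

    # Raises jsonschema.exceptions.ValidationError if the document breaks the schema.
    jsonschema.validate(json_content, schema_content)

    return True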

  1. The line schema = json_content['$schema'] takes the value of the $schema property of the incoming JSON document.
  2. Then, the program gets the actual JSON schema using one of two methods, depending on the protocol used in the URI:
     • When a file protocol is specified, like file:///event_click.json, it loads the JSON schema from the file system.
     • Otherwise, as with https://schema-registry.acme.com/schemas/event_click.json, it uses urllib.request.urlopen() to fetch the JSON schema content over the Web.
  3. In both cases, it puts the JSON schema in the schema_content variable.
  4. Finally, it calls jsonschema.validate(json_content, schema_content).

Important: The code above assumes that there is a SCHEMAS_LOCAL_DIR property inside your Django settings. Don't forget to define it like this:
SCHEMAS_LOCAL_DIR = os.path.join(BASE_DIR, 'schemas')
It is the folder in which you have stored your JSON schema documents.
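Since the $schema value comes from the client, a file:/// URI such as file:///../json_validator/settings.py could point outside the schemas folder. The following is a minimal sketch of one way to guard against that, using only the standard library. The helper name local_schema_path is hypothetical and is not part of the tutorial's source code.

```python
import os

FILE_PROTOCOL = 'file://'


def local_schema_path(schema_uri, schemas_dir):
    """Map a file:/// URI to a path inside schemas_dir, or raise ValueError."""
    if not schema_uri.startswith(FILE_PROTOCOL + '/'):
        raise ValueError('not a file:/// URI: {}'.format(schema_uri))

    # Same slicing as in validate_json_content(): drop the "file:///" prefix.
    relative_name = schema_uri[len(FILE_PROTOCOL) + 1:]

    base = os.path.realpath(schemas_dir)
    candidate = os.path.realpath(os.path.join(base, relative_name))

    # Refuse anything that resolves to the folder itself or outside of it,
    # e.g. "../settings.py" or an absolute path.
    if candidate == base or os.path.commonpath([base, candidate]) != base:
        raise ValueError('schema path escapes the schemas folder')

    return candidate


print(local_schema_path('file:///com.acme.event_click.jsonschema.1-0-0.json', 'schemas'))
```

The returned path could then be passed to open() in place of the string built with format().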

The package jsonschema

The Python package jsonschema does all the work here: it validates a JSON document against a JSON schema. You will have to add it to the [packages] section of your Pipfile and run pipenv install -d.

Example Call

Everything is in place for your server to validate an incoming JSON document. Let's try that.

Valid JSON Document

Let's send the following JSON document to the /validate endpoint:

{
  "$schema": "file:///com.acme.event_click.jsonschema.1-0-0.json",
  "action": "OfferAccept",
  "appId": "12-2x12",
  "category": "Seller.Dashboard",
  "eventType": "click",
  "pageUrl": "https://www.acme.com/offers/",
  "platform": "web",
  "userId": 194246
}

Let's do it using curl. Make sure that your server is up and running.

$ curl -X POST 'http://127.0.0.1:8000/validate' -H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{"$schema": "file:///com.acme.event_click.jsonschema.1-0-0.json", "action": "OfferAccept","appId": "12-2x12","category": "Seller.Dashboard",
"eventType": "click","pageUrl": "https://www.acme.com/offers/","platform": "web","userId": 194246}'
$

As you can see, for a valid JSON document, the response is empty.

Invalid JSON Document

As another example, let's send a JSON document that is invalid because it is missing the action property:

$ curl -X POST 'http://127.0.0.1:8000/validate' -H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{"$schema": "file:///com.acme.event_click.jsonschema.1-0-0.json","appId": "12-2x12","category": "Seller.Dashboard",
"eventType": "click","pageUrl": "https://www.acme.com/offers/","platform": "web","userId": 194246}'
{"error": "'action' is a required property"}
$

You can see that {"error": "'action' is a required property"} has been returned as the response, stating precisely what the error was.
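The same calls can be made from Python instead of curl. Here is a minimal client sketch that uses only the standard library; the function names build_validation_request and send_for_validation are hypothetical, and the URL assumes the development server is listening on 127.0.0.1:8000 as in the curl examples.

```python
import json
import urllib.error
import urllib.request

VALIDATE_URL = 'http://127.0.0.1:8000/validate'


def build_validation_request(document, url=VALIDATE_URL):
    """Build (but do not send) the POST request for a document."""
    body = json.dumps(document).encode('utf-8')
    headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}
    return urllib.request.Request(url, data=body, headers=headers, method='POST')


def send_for_validation(document, url=VALIDATE_URL):
    """Send the document and return (status_code, error_payload_or_None)."""
    request = build_validation_request(document, url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, None
    except urllib.error.HTTPError as e:
        # urlopen() raises HTTPError for 4xx responses such as our 422.
        return e.code, json.loads(e.read().decode('utf-8'))


event = {
    '$schema': 'file:///com.acme.event_click.jsonschema.1-0-0.json',
    'appId': '12-2x12',
    'category': 'Seller.Dashboard',
    'eventType': 'click',
    'pageUrl': 'https://www.acme.com/offers/',
    'platform': 'web',
    'userId': 194246,
}
request = build_validation_request(event)
print(request.get_method(), request.full_url)
```

With the server running, send_for_validation(event) should come back with status 422 and the error payload shown above, since this event is missing the action property.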

Validating Using External Schemas

Our JSON Validator can, of course, also validate documents using external JSON schemas. We have already said that it identifies the http(s):// protocol of the $schema value and fetches the JSON schema over the Web.

We have also said that a JSON schema document should have its own $schema property pointing to the JSON schema specification that the JSON schema has been written against.

Let's use these two facts to validate a JSON schema document. We will validate the JSON schema of the click event. We will first save the JSON schema document into a file. This is convenient when calling curl, since the JSON schema document is too big to pass literally on the curl command line.

Having saved the JSON schema in a file with name, e.g. click_event_json_schema.json, we can now use curl to validate it:

$ curl -X POST 'http://127.0.0.1:8000/validate' -H 'Content-Type: application/json' -H 'Accept: application/json' -d @click_event_json_schema.json
$

It returns nothing, which means that the JSON schema is valid against its schema specification (JSON schema specification draft version 7).

You can also try with an invalid JSON schema. For example, change the type of a string property to str and try again:

$ curl -X POST 'http://127.0.0.1:8000/validate' -H 'Content-Type: application/json' -H 'Accept: application/json' -d @click_event_json_schema.json
{"error": "'str' is not one of ['array', 'boolean', 'integer', 'null', 'number', 'object', 'string']"}
$

Do you see the error being returned? It is telling you that str is not a valid value for the type property.

Closing Notes

A Note about Schema Versions

You may have noticed that we have suffixed the filenames of the JSON schema documents with a sequence of numbers separated by -, for example 1-0-0. This is how we apply a versioning policy to a JSON schema. A JSON schema might evolve over time, and we want to keep track of the changes using a version number. The version number has three parts, which change depending on how big the change is and how it might affect the clients using that JSON schema.

The versioning technique that we use is called SchemaVer and it is described in detail in this blog post by Snowplow:

Introducing SchemaVer for semantic versioning of schemas.
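SchemaVer names the three parts MODEL, REVISION and ADDITION. As a small illustration, the sketch below extracts them from a schema filename. It assumes the name.jsonschema.M-R-A.json naming pattern used for the files in this tutorial; the function name parse_schema_filename is hypothetical.

```python
import re

# Assumed pattern, taken from the filenames used in this tutorial:
#   <name>.jsonschema.<MODEL>-<REVISION>-<ADDITION>.json
SCHEMA_FILE_PATTERN = re.compile(
    r'^(?P<name>.+)\.jsonschema\.'
    r'(?P<model>\d+)-(?P<revision>\d+)-(?P<addition>\d+)\.json$'
)


def parse_schema_filename(filename):
    """Return (name, (model, revision, addition)) or raise ValueError."""
    match = SCHEMA_FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError('unexpected schema filename: {}'.format(filename))
    version = tuple(int(match.group(part)) for part in ('model', 'revision', 'addition'))
    return match.group('name'), version


print(parse_schema_filename('com.acme.event_click.jsonschema.1-0-0.json'))
# → ('com.acme.event_click', (1, 0, 0))
```

Because the version is returned as a tuple of integers, two versions of the same schema can be compared directly, for example (1, 0, 1) > (1, 0, 0).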

A Note about Production Server

The server example that we have demonstrated here is for tutorial purposes only. If you want to put your server into real production mode, you will have to think about things like caching, in order to avoid fetching the schemas from disk or from a remote server on every validation request.
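As a starting point, here is a minimal caching sketch based on functools.lru_cache from the standard library. It is one possible approach, not something the tutorial code implements; make_cached_loader and fake_fetch are hypothetical names, and fake_fetch stands in for reading a file or calling urllib.request.urlopen().

```python
import json
from functools import lru_cache


def make_cached_loader(fetch, maxsize=128):
    """Wrap `fetch(uri) -> str` so each schema URI is fetched and parsed once."""
    @lru_cache(maxsize=maxsize)
    def load(uri):
        return json.loads(fetch(uri))
    return load


calls = []


def fake_fetch(uri):
    """Hypothetical fetcher that records every real fetch."""
    calls.append(uri)
    return '{"type": "object"}'


load_schema = make_cached_loader(fake_fetch)
load_schema('file:///com.acme.event_click.jsonschema.1-0-0.json')
load_schema('file:///com.acme.event_click.jsonschema.1-0-0.json')
print(len(calls))  # → 1, the second call was served from the cache
```

Two caveats apply. The cached dictionary is shared between callers, so it should be treated as read-only. Also, the cache never notices a schema that changes under the same URI, which is less of a concern when every change gets a new version in the filename.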

Besides that, you may also want to consider using a schema registry server that is already production-ready and in heavy daily use, with many features that you will not have to develop from scratch on your own.

We are fans of Confluent. Confluent, among other things, offers a Schema Registry Server with a REST API.

Source Code of the Tutorial

https://github.com/pmatsinopoulos/django_json_schema_server
