Offline handwritten text recognition datasets (optically scanned images), as opposed to online handwritten recognition datasets (record of the trajectory of the pen as a function of time), don’t contain images but strokes. We’ll try to explain how to create a pre-rendering pipeline for online handwritten that can be used for text recognition model training in python.
A stroke is a list of triplets (x, y, t) where (x, y) are the 2D coordinates of the points and (t) is the drawing time collected by the sensitive display, like a device with a touchscreen.
When training a text recognition model, we usually consider using datasets containing images because we use vision-based models. That’s why, most of the time, deep learning engineers orient themselves towards offline datasets and simply train their models with images and labels straight from the dataset, with image augmentations.
We must overcome this dependency on images: training a vision-based model by taking images as input does not necessarily require an image dataset. Online datasets contain a huge amount of precious data, which can be easily exploited and converted to images instantly. In addition, having access to the raw points of each stroke of each word, when using online datasets, allows us to perform a lot of NumPy operations directly on those points.
In this article, we will provide an entire python transformation pipeline for online handwritten datasets using IAM, starting with data points (strokes) to image rendering. It will include a collection of simple and fast Numpy augmentations performed directly on strokes and points.
Image rendering for online handwritten recognition datasets using IAM
Before we get started, it is important to note that these operations are carried out on points and not images, which makes it extremely fast and only requires Numpy dependency.
IAM online text data is given as an XML file. We need to parse it to get the strokes. Below is a Python code snippet on how to parse an XML data point of the IAM offline dataset:
If we simply draw the points on a white canvas, we obtain the raw rendering shown in the image below. For clarity of the code, the next code examples contain only the points manipulations: the canvas drawing will be shown later in the article. Note that for the following examples, we will use the first datapoint from the IAM online dataset (lineStrokes-all/lineStrokes/a01/a01-000/a01-000u-01.xml).
It is that easy, but let’s not stop there. We can augment the resolution of points randomly, to avoid the “dashlane effect” (points instead of lines) and better distinguish letters:
This is how it renders if we multiply the number of points by a factor of 2:
The more we add points to the canvas, the more it looks like a plain line. This is important in case you want to train a handwritten text recognition model as it fits better a real data distribution. Here is an illustration to compare the two canvases without enrichment and with a factor of 2:
Adding augmentation simulating real handwritten text
Let’s now perform random dilation (spacing them) on each stroke to displace letters relatively.
This is how it renders:
To operate on all points, let’s flatten the strokes in an array of points:
We have 2 examples of the rendering here:
Let’s resize points:
Now all those manipulations may not be useful if we don’t render them, so we are now going to compute a Numpy canvas to draw the points on:
These are generated samples of canvas:
Finally, let’s render our points on the canvas:
Here are some samples with drawing variations:
From here, it is easier to create a generative augmentation pipeline, taking a file path as input and rendering random augmented versions of the original datapoint from the IAM online dataset:
The next image contains 10 randomly generated samples done with the previous code snippet:
One can play with the parameters of each function in the pipeline to modify the transformations.
Online handwritten datasets can be exploited to generate a lot of very different image samples with simple augmentations. Since you manipulate points instead of images it is way faster than using offline datasets, and we are not even mentioning the dataset size to download. In the end, this is quick and easy, and it will surely help your handwritten text recognition model converge if you use this augmented dataset.
Feel free to join our slack community if you want to go further!