OCRAppModels - Deep Learning OCR Engine
A deep learning OCR system built from scratch in Python and TensorFlow, capable of recognising text in images via a CRNN model served through a FastAPI endpoint.
➜ Initializing services...
➜ Stack: Python, TensorFlow, Keras, FastAPI
➜ Status: Ready
█
Overview
OCRAppModels started as a university project in intelligent computing, but quickly became something I genuinely wanted to get right. The goal was straightforward: build a system that takes an image containing a word and returns that word as text - no third-party OCR libraries, no shortcuts, just a neural network trained from the ground up.
The project covers the full ML lifecycle under one roof: model architecture design, a data loading and preprocessing pipeline, training with callbacks and checkpointing, quantitative evaluation, and finally a REST API that lets any client send an image and get a prediction back in milliseconds.
Architecture & Tech Stack
Technologies
- Python 3.11
- TensorFlow / Keras
- FastAPI
- Pillow + NumPy
- MJSynth dataset - a large-scale synthetic dataset of 9+ million word images, the go-to benchmark for scene text recognition
Model: CRNN + CTC
The core architecture is a Convolutional Recurrent Neural Network (CRNN) - a classic choice for sequence-to-sequence OCR that reads a word character by character without needing per-character bounding boxes.
Input image (32 × 128, grayscale)
│
CNN feature extractor
Conv2D: 64 -> 128 -> 256 -> 512 filters
MaxPooling + BatchNormalization
│
Reshape -> sequence of column feature vectors
│
Bidirectional LSTM × 2 (256 units each, dropout 0.25)
│
Dense (vocab_size + 1, softmax)
│
CTC Loss (Connectionist Temporal Classification)
CTC is what makes this architecture work without aligned labels: it collapses repeated characters and blank tokens into the final decoded string, so the model never needs to know where a character is - only that it's there.
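Under stated assumptions, the diagram above can be sketched with the Keras functional API. The filter counts, LSTM sizes, and dropout value come from the diagram; the pooling schedule, vocabulary size, and the `build_crnn` name are illustrative, not necessarily the exact code in the repo:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_crnn(vocab_size: int = 62) -> Model:
    # 32 x 128 grayscale input, per the architecture diagram
    inputs = layers.Input(shape=(32, 128, 1), name="image")

    x = inputs
    # CNN feature extractor: 64 -> 128 -> 256 -> 512 filters.
    # Later blocks pool only vertically so the width (the reading
    # direction) keeps enough resolution to become the output sequence.
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(2, 2) if filters < 256 else (2, 1))(x)

    # Collapse the remaining height into each column's feature vector:
    # (batch, 2, 32, 512) -> (batch, 32, 1024), i.e. a 32-step sequence
    x = layers.Reshape((x.shape[2], x.shape[1] * x.shape[3]))(x)

    # Two bidirectional LSTMs, 256 units each, dropout 0.25
    for _ in range(2):
        x = layers.Bidirectional(
            layers.LSTM(256, return_sequences=True, dropout=0.25))(x)

    # Per-timestep distribution over the vocabulary plus the CTC blank
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return Model(inputs, outputs, name="crnn")
```

The extra output class is the CTC blank; the loss itself is applied at training time against the softmax sequence rather than baked into the model.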
Alternative: Vision Transformer (ViT)
I also implemented a ViT-based OCR model as a research comparison. It splits the input image into 8×8 patches, embeds them, passes them through 6 Transformer encoder layers (8 attention heads, embedding dim 256), and decodes the sequence with CTC. Two fundamentally different approaches to the same problem - a good forcing function for understanding why each architecture makes the choices it does.
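A rough sketch of that ViT variant follows, using the patch size, depth, head count, and embedding dim from the description above. The token layout and `build_vit_ocr` name are assumptions, and learned positional embeddings are omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_vit_ocr(vocab_size: int = 62, patch: int = 8,
                  dim: int = 256, depth: int = 6, heads: int = 8) -> Model:
    inputs = layers.Input(shape=(32, 128, 1), name="image")

    # Patch embedding: an 8x8 conv with stride 8 cuts the image into
    # non-overlapping patches and projects each to a 256-d token
    x = layers.Conv2D(dim, patch, strides=patch)(inputs)   # (4, 16, dim)
    x = layers.Reshape((-1, dim))(x)                       # 64 patch tokens
    # (learned positional embeddings omitted here for brevity)

    # 6 pre-norm Transformer encoder blocks with 8 attention heads
    for _ in range(depth):
        h = layers.LayerNormalization()(x)
        h = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(h, h)
        x = layers.Add()([x, h])
        h = layers.LayerNormalization()(x)
        h = layers.Dense(dim * 4, activation="gelu")(h)
        h = layers.Dense(dim)(h)
        x = layers.Add()([x, h])

    # One prediction per patch token, decoded with CTC as in the CRNN
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return Model(inputs, outputs, name="vit_ocr")
```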
Key Features
- Custom CRNN architecture - built from scratch using Keras functional API, not a pre-trained backbone
- Dual architecture support - the `OCRTrainer` class accepts `model_type='crnn'` or `model_type='vit'` at runtime
- CTC-based sequence decoding - no character segmentation or alignment required at training time
- MJSynth data pipeline - custom `MJSynthDataLoader` handles dataset parsing, vocabulary building, character-to-index encoding and CTC-compatible label formatting
- Training with checkpointing - best model saved automatically via `ModelCheckpoint`; training history exported for TensorBoard visualisation
- Quantitative evaluation - `evaluate.py` computes Character Error Rate (CER) and word-level accuracy on a held-out test set
- REST API - `POST /predict` endpoint accepts a raw image upload, preprocesses it server-side (grayscale -> 128×32 -> normalise), runs inference, and returns the decoded string in JSON
The Challenge & Solution
The problem: training a deep neural network on a laptop that really didn't want to
This is the part of the project where I learned the most - the hard way.
MJSynth is a 9-million-image dataset. A CRNN with bidirectional LSTMs is not a small model. My laptop, on the other hand, was very much just a laptop. Within minutes of starting a training run, CPU/GPU temperatures climbed above 100 °C. The machine would throttle aggressively, fan screaming, and any accidental click would trigger a thermal shutdown mid-epoch. One training run I managed to let run overnight for 20+ hours produced almost nothing usable.
This wasn't a bug I could fix with a Stack Overflow answer.
The solution: resourcefulness over raw compute
I had three options: give up, rent a cloud VM (no budget), or find a person. I went with the third option. I reached out to a classmate with a better machine, documented the entire training setup clearly (environment, data paths, hyperparameters, callback config), and handed off the training run. While the model trained on their hardware, I focused on the parts I could do locally: the API, the evaluation scripts, the data loader, the ViT prototype.
The result was crnn_best.keras - a trained model that I then plugged back into my own codebase and served via FastAPI.
The lesson wasn't about compute. It was about knowing which problems are engineering problems and which are logistics problems, and not wasting time trying to engineer your way out of the latter.
Lessons Learned
On machine learning: The gap between "I understand how CRNN works on paper" and "I can implement, train, and debug a CRNN end-to-end" is enormous. CTC loss in particular is non-trivial to get right - label length vs. input sequence length constraints will silently break training if you don't validate your data pipeline carefully.
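That pipeline check can be made explicit. CTC needs at least one timestep per output character, plus one extra timestep for every adjacent repeated pair (a blank must separate the two l's in "hello"), so samples violating that bound should be filtered out before training. A minimal sketch (the function name is hypothetical):

```python
def validate_ctc_sample(label: str, num_timesteps: int) -> bool:
    # CTC constraint: one timestep per character, plus a separating blank
    # for each pair of identical adjacent characters
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return num_timesteps >= len(label) + repeats

print(validate_ctc_sample("hello", 32))  # True: fits comfortably in 32 steps
print(validate_ctc_sample("hello", 5))   # False: the "ll" needs a sixth step
```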
On architecture decisions: Implementing both CRNN and ViT for the same task was one of the better decisions I made. It forced me to understand why CNN+RNN is still competitive for OCR (inductive bias for spatial locality and sequential structure) vs. the more general but data-hungry Transformer approach.
On working under constraints: When you can't change the environment, you change the plan. Knowing when to stop engineering a workaround and instead ask for help or restructure the workflow is a professional skill as much as a technical one. I'm glad I learned it early.
On API design: FastAPI made it trivial to go from "trained model" to "callable HTTP endpoint" in under 50 lines. Seeing how cleanly a model can be wrapped and exposed as a service made me much more interested in the deployment side of machine learning.