OCRAppModels - Deep Learning OCR Engine
A deep learning OCR system built from scratch in Python and TensorFlow, capable of recognising text in images via a CRNN model served through a FastAPI endpoint.
➜ Initializing services...
➜ Stack: Python, TensorFlow, Keras, FastAPI
➜ Status: Ready
█
Overview
OCRAppModels started as a university project in intelligent computing, but quickly became something I genuinely wanted to get right. The goal was straightforward: build a system that takes an image containing a word and returns that word as text - no third-party OCR libraries, no shortcuts, just a neural network trained from the ground up.
The project covers the full ML lifecycle under one roof: model architecture design, a data loading and preprocessing pipeline, training with callbacks and checkpointing, quantitative evaluation, and finally a REST API that lets any client send an image and get a prediction back in milliseconds.
Architecture & Tech Stack
Technologies
- Python 3.11
- TensorFlow / Keras
- FastAPI
- Pillow + NumPy
- MJSynth dataset - a large-scale synthetic dataset of 9+ million word images, the go-to benchmark for scene text recognition
Model: CRNN + CTC
The core architecture is a Convolutional Recurrent Neural Network (CRNN) - a classic choice for sequence-to-sequence OCR that reads a word character by character without needing per-character bounding boxes.
Input image (32 × 128, grayscale)
│
CNN feature extractor
Conv2D: 64 -> 128 -> 256 -> 512 filters
MaxPooling + BatchNormalization
│
Reshape -> sequence of column feature vectors
│
Bidirectional LSTM × 2 (256 units each, dropout 0.25)
│
Dense (vocab_size + 1, softmax)
│
CTC Loss (Connectionist Temporal Classification)
CTC is what makes this architecture work without aligned labels: it collapses repeated characters and blank tokens into the final decoded string, so the model never needs to know where a character is - only that it's there.
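Under stated assumptions, the diagram above can be sketched with the Keras functional API. The filter counts, LSTM sizes, and dropout value come from the diagram; the pooling schedule, vocabulary size, and the `build_crnn` name are illustrative, not necessarily the exact code in the repo:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_crnn(vocab_size: int = 62) -> Model:
    # 32 x 128 grayscale input, per the architecture diagram
    inputs = layers.Input(shape=(32, 128, 1), name="image")

    x = inputs
    # CNN feature extractor: 64 -> 128 -> 256 -> 512 filters.
    # Later blocks pool only vertically so the width (the reading
    # direction) keeps enough resolution to become the output sequence.
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(2, 2) if filters < 256 else (2, 1))(x)

    # Collapse the remaining height into each column's feature vector:
    # (batch, 2, 32, 512) -> (batch, 32, 1024), i.e. a 32-step sequence
    x = layers.Reshape((x.shape[2], x.shape[1] * x.shape[3]))(x)

    # Two bidirectional LSTMs, 256 units each, dropout 0.25
    for _ in range(2):
        x = layers.Bidirectional(
            layers.LSTM(256, return_sequences=True, dropout=0.25))(x)

    # Per-timestep distribution over the vocabulary plus the CTC blank
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return Model(inputs, outputs, name="crnn")
```

The extra output class is the CTC blank; the loss itself is applied at training time against the softmax sequence rather than baked into the model.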
Alternative: Vision Transformer (ViT)
I also implemented a ViT-based OCR model as a research comparison. It splits the input image into 8×8 patches, embeds them, passes them through 6 Transformer encoder layers (8 attention heads, embedding dim 256), and decodes the sequence with CTC. Two fundamentally different approaches to the same problem - a good forcing function for understanding why each architecture makes the choices it does.
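A rough sketch of that ViT variant follows, using the patch size, depth, head count, and embedding dim from the description above. The token layout and `build_vit_ocr` name are assumptions, and learned positional embeddings are omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_vit_ocr(vocab_size: int = 62, patch: int = 8,
                  dim: int = 256, depth: int = 6, heads: int = 8) -> Model:
    inputs = layers.Input(shape=(32, 128, 1), name="image")

    # Patch embedding: an 8x8 conv with stride 8 cuts the image into
    # non-overlapping patches and projects each to a 256-d token
    x = layers.Conv2D(dim, patch, strides=patch)(inputs)   # (4, 16, dim)
    x = layers.Reshape((-1, dim))(x)                       # 64 patch tokens
    # (learned positional embeddings omitted here for brevity)

    # 6 pre-norm Transformer encoder blocks with 8 attention heads
    for _ in range(depth):
        h = layers.LayerNormalization()(x)
        h = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(h, h)
        x = layers.Add()([x, h])
        h = layers.LayerNormalization()(x)
        h = layers.Dense(dim * 4, activation="gelu")(h)
        h = layers.Dense(dim)(h)
        x = layers.Add()([x, h])

    # One prediction per patch token, decoded with CTC as in the CRNN
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return Model(inputs, outputs, name="vit_ocr")
```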
Key Features
- Custom CRNN architecture - built from scratch using Keras functional API, not a pre-trained backbone
- Dual architecture support - the `OCRTrainer` class accepts `model_type='crnn'` or `model_type='vit'` at runtime
- CTC-based sequence decoding - no character segmentation or alignment required at training time
- MJSynth data pipeline - custom `MJSynthDataLoader` handles dataset parsing, vocabulary building, character-to-index encoding and CTC-compatible label formatting
- Training with checkpointing - best model saved automatically via `ModelCheckpoint`; training history exported for TensorBoard visualisation
- Quantitative evaluation - `evaluate.py` computes Character Error Rate (CER) and word-level accuracy on a held-out test set
- REST API - `POST /predict` endpoint accepts a raw image upload, preprocesses it server-side (grayscale -> 128×32 -> normalise), runs inference, and returns the decoded string in JSON
The Challenge & Solution
The problem: training a deep neural network on a laptop that really didn't want to
This is the part of the project where I learned the most - the hard way.
MJSynth is a 9-million-image dataset. A CRNN with bidirectional LSTMs is not a small model. My laptop, on the other hand, was very much just a laptop. Within minutes of starting a training run, CPU/GPU temperatures climbed above 100 °C. The machine would throttle aggressively, fan screaming, and any accidental click would trigger a thermal shutdown mid-epoch. One training run I managed to let run overnight for 20+ hours produced almost nothing usable.
This wasn't a bug I could fix with a Stack Overflow answer.
The solution: resourcefulness over raw compute
I had three options: give up, rent a cloud VM (no budget), or find a person. I went with the third option. I reached out to a classmate with a better machine, documented the entire training setup clearly (environment, data paths, hyperparameters, callback config), and handed off the training run. While the model trained on their hardware, I focused on the parts I could do locally: the API, the evaluation scripts, the data loader, the ViT prototype.
The result was crnn_best.keras - a trained model that I then plugged back into my own codebase and served via FastAPI.
The lesson wasn't about compute. It was about knowing which problems are engineering problems and which are logistics problems, and not wasting time trying to engineer your way out of the latter.
Lessons Learned
On machine learning: The gap between "I understand how CRNN works on paper" and "I can implement, train, and debug a CRNN end-to-end" is enormous. CTC loss in particular is non-trivial to get right - label length vs. input sequence length constraints will silently break training if you don't validate your data pipeline carefully.
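That pipeline check can be made explicit. CTC needs at least one timestep per output character, plus one extra timestep for every adjacent repeated pair (a blank must separate the two l's in "hello"), so samples violating that bound should be filtered out before training. A minimal sketch (the function name is hypothetical):

```python
def validate_ctc_sample(label: str, num_timesteps: int) -> bool:
    # CTC constraint: one timestep per character, plus a separating blank
    # for each pair of identical adjacent characters
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return num_timesteps >= len(label) + repeats

print(validate_ctc_sample("hello", 32))  # True: fits comfortably in 32 steps
print(validate_ctc_sample("hello", 5))   # False: the "ll" needs a sixth step
```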
On architecture decisions: Implementing both CRNN and ViT for the same task was one of the better decisions I made. It forced me to understand why CNN+RNN is still competitive for OCR (inductive bias for spatial locality and sequential structure) vs. the more general but data-hungry Transformer approach.
On working under constraints: When you can't change the environment, you change the plan. Knowing when to stop engineering a workaround and instead ask for help or restructure the workflow is a professional skill as much as a technical one. I'm glad I learned it early.
On API design: FastAPI made it trivial to go from "trained model" to "callable HTTP endpoint" in under 50 lines. Seeing how cleanly a model can be wrapped and exposed as a service made me much more interested in the deployment side of machine learning.