🪨Compiling tesseract-5.0 on Amazon Linux (AWS Lambda pack)
Compile Tesseract 5.0 on Amazon Linux for AWS Lambda
🎯 Why This Guide?
My goal is to build a Tesseract 5.0 package for AWS Lambda, which runs on Amazon Linux. Build instructions for Windows, Debian/Ubuntu, and macOS are easy to find, but Amazon Linux needs its own steps.
📋 Contents
- ✅ Amazon Linux Docker Setup
- ✅ Compiling Leptonica 1.79
- ✅ Compiling Tesseract 5.0
- ✅ Tesseract Packaging
- ✅ Download Language Models & Testing
- ✅ Troubleshooting Common Issues
1 Amazon Linux Docker Setup
⚠️ Note: If your OS is already Amazon Linux or CentOS, you can skip this step.
Docker Hub: https://hub.docker.com/_/amazonlinux
Pull Amazon Linux Image
# Pin the tag to 2: the untagged image may resolve to a newer Amazon Linux release
docker pull amazonlinux:2
Run Docker Container
# Start a container and open a shell in it
docker run -it amazonlinux:2 /bin/bash
# Share the current local folder with the container as /outputs (recommended)
docker run -v "$(pwd)":/outputs -it amazonlinux:2 /bin/bash
# Run the container in the background
docker run -dit amazonlinux:2 /bin/bash
Verify Amazon Linux Version
# Option 1: Detailed info
cat /etc/os-release
# Output:
# NAME="Amazon Linux"
# VERSION="2"
# ID="amzn"
# ID_LIKE="centos rhel fedora"
# VERSION_ID="2"
# PRETTY_NAME="Amazon Linux 2"
# Option 2: Simple version
cat /etc/system-release
# Output:
# Amazon Linux release 2 (Karoo)
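The version check above can also be scripted so that a build stops early on the wrong image. A minimal sketch, assuming the `ID` and `VERSION_ID` fields shown in the output above; the function name `is_amazon_linux_2` is hypothetical and not part of any tool:

```shell
# Hypothetical helper: succeed only if the given os-release file describes Amazon Linux 2.
# The file path is a parameter (default /etc/os-release) so the check can be tried on a copy.
is_amazon_linux_2() {
  file="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak into the caller.
  (
    [ -r "$file" ] || exit 1
    . "$file"
    [ "${ID:-}" = "amzn" ] && [ "${VERSION_ID:-}" = "2" ]
  )
}

if is_amazon_linux_2; then
  echo "Amazon Linux 2 detected"
else
  echo "Not Amazon Linux 2: package names in this guide may differ" >&2
fi
```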
Install Basic Dependencies
yum install autoconf automake libtool pkgconfig.x86_64 \
libpng12-devel.x86_64 libjpeg-devel libtiff-devel.x86_64 \
zlib-devel.x86_64
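The build steps that follow also call tools that a fresh container may lack (the Troubleshooting section lists the usual errors). A minimal sketch that reports which commands are missing before you start; the function name `missing_tools` is hypothetical, and the tool list is an assumption based on the commands this guide runs:

```shell
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Commands used later in this guide; extend the list as needed.
missing="$(missing_tools wget tar gzip make gcc g++ git)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found"
fi
```

Most of these are installed with a package of the same name; `g++` is the exception and comes from the `gcc-c++` package.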
2 Compile Leptonica 1.79
💡 What is Leptonica?
Leptonica is an open-source library for image processing and analysis. Tesseract depends on this library, so we need to install it first.
# Download Leptonica source
wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
# Extract archive
tar -zxvf leptonica-1.79.0.tar.gz
# Navigate to directory
cd leptonica-1.79.0
# Configure build
./configure --prefix=/usr/local/leptonica-1.79.0
# Compile
make
# Install
make install
3 Compile Tesseract 5.0
Install Git
yum install git
Configure PKG_CONFIG_PATH for Leptonica
export PKG_CONFIG_PATH=/usr/local/leptonica-1.79.0/lib/pkgconfig
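Tesseract's `./configure` locates Leptonica through pkg-config, so it helps to confirm the path before running it. A minimal sketch, assuming Leptonica installed a `lept.pc` file under the prefix used above; the helper name `has_lept_pc` is hypothetical:

```shell
# Hypothetical helper: succeed if any directory in a colon-separated
# pkg-config search path contains lept.pc.
has_lept_pc() {
  old_ifs="$IFS"
  IFS=":"
  for dir in $1; do
    if [ -f "$dir/lept.pc" ]; then
      IFS="$old_ifs"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}

if has_lept_pc "${PKG_CONFIG_PATH:-}"; then
  # When pkg-config is installed, ask it for the version it actually sees.
  pkg-config --modversion lept 2>/dev/null || true
else
  echo "lept.pc not found in PKG_CONFIG_PATH" >&2
fi
```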
Compile Tesseract
# Clone Tesseract repository (default branch; add --branch <tag> to build a specific release)
git clone https://github.com/tesseract-ocr/tesseract.git
# Navigate to directory
cd tesseract
# Generate configure script
./autogen.sh
# Configure build
./configure --prefix=/usr/local/tesseract-5.0
# Compile
make
# Install
make install
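Compiling on a single core can take a while. A small sketch of a parallel build, assuming GNU make and the coreutils `nproc` command are present in your image; the variable name `jobs` is arbitrary:

```shell
# Number of parallel jobs: CPU count if nproc exists, otherwise 1.
jobs="$(nproc 2>/dev/null || echo 1)"
echo "Building with $jobs parallel job(s)"

# Run from the source directory after ./configure; skipped when there is no Makefile.
if [ -f Makefile ]; then
  make -j "$jobs"
fi
```

The same approach applies to the Leptonica build in the previous section.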
Verify Installation
# Check version
/usr/local/tesseract-5.0/bin/tesseract -v
# Output:
# tesseract 5.0.0-alpha-20210401-90-g723eb
# leptonica-1.79.0
# libjpeg 6b (libjpeg-turbo 1.2.90) : libtiff 4.0.3 : zlib 1.2.7
# Found AVX2
# Found AVX
# Found FMA
# Found SSE4.1
# Found OpenMP 201511
4 Package Tesseract for Deployment
✨ Goal: Create a standalone Tesseract package that can be used on Amazon Linux or AWS Lambda.
# Create package directory
mkdir /outputs/tesseract-standalone
# Copy Tesseract executable
cp /usr/local/tesseract-5.0/bin/tesseract /outputs/tesseract-standalone/
# Create lib directory
mkdir /outputs/tesseract-standalone/lib
# Copy Tesseract library
cp /usr/local/tesseract-5.0/lib/libtesseract.so.5 /outputs/tesseract-standalone/lib/
# Copy Leptonica library
cp /usr/local/leptonica-1.79.0/lib/liblept.so.5 /outputs/tesseract-standalone/lib/
# Copy JPEG library
cp /usr/lib64/libjpeg.so.62 /outputs/tesseract-standalone/lib/
👉🏻 Result: The directory /outputs/tesseract-standalone now contains everything needed to run Tesseract 5.0!
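The `cp` commands above name each library by hand. To double-check the list, the dynamic loader can report what the binary actually resolves. A minimal sketch that parses `ldd` output; the function name `resolved_libs` is hypothetical, and which of the listed files you need to ship depends on what the Lambda runtime already provides, which this sketch does not check:

```shell
# Hypothetical helper: read `ldd` output on stdin and print the resolved
# library paths (lines of the form "name => /path (address)").
resolved_libs() {
  awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Inside the build container: list every library the installed binary loads.
bin=/usr/local/tesseract-5.0/bin/tesseract
if [ -x "$bin" ]; then
  ldd "$bin" | resolved_libs
fi
```

Any path you decide to ship can be copied into /outputs/tesseract-standalone/lib/ with `cp -L`, which follows symlinks to the real file.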
5 Download Language Models & Test
Create Tessdata Directory
# Create tessdata directory
mkdir /outputs/tesseract-standalone/tessdata
cd /outputs/tesseract-standalone/tessdata
# Download English language model (best quality)
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
Language Model Options
| Model Type | Repository | Description |
|---|---|---|
| Best | tessdata_best | Highest accuracy, slower |
| Fast | tessdata_fast | Faster processing, good accuracy |
| Normal | tessdata | Balanced performance |
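All three repositories follow the same URL pattern as the `wget` command above, so switching model type or language only changes two path components. A minimal sketch; the function name `traineddata_url` is hypothetical, and the `master` branch segment is copied from the URL used in this guide:

```shell
# Hypothetical helper: build a download URL from a repository name
# (tessdata, tessdata_fast, tessdata_best) and a language code.
traineddata_url() {
  repo="$1"
  lang="$2"
  echo "https://github.com/tesseract-ocr/${repo}/raw/master/${lang}.traineddata"
}

# Example: print the URL of the fast German model.
traineddata_url tessdata_fast deu
```

To download a model, pass the result to wget: `wget "$(traineddata_url tessdata_fast deu)"`.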
Test Tesseract
# Set tessdata path
export TESSDATA_PREFIX="/outputs/tesseract-standalone/tessdata"
# Navigate to tesseract directory
cd /outputs/tesseract-standalone
# Run OCR on image (the second argument is an output base name; Tesseract appends .txt)
./tesseract /outputs/example.png /outputs/text_result -l eng --psm 6
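On a machine other than the build container, the loader has to be told where the packaged libraries live. A minimal sketch of a wrapper, assuming the directory layout created in this guide (binary, `lib/`, and `tessdata/` side by side); the function name `run_packaged_tesseract` is hypothetical, and `LD_LIBRARY_PATH` is the standard Linux loader variable rather than anything Tesseract-specific:

```shell
# Hypothetical helper: run the packaged binary with library and model paths
# derived from the package directory given as the first argument.
run_packaged_tesseract() {
  pkg_dir="$1"
  shift
  LD_LIBRARY_PATH="$pkg_dir/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
  TESSDATA_PREFIX="$pkg_dir/tessdata" \
    "$pkg_dir/tesseract" "$@"
}

# Example call (runs only if the package exists at this path).
# Tesseract appends .txt to the output base name.
if [ -x /outputs/tesseract-standalone/tesseract ]; then
  run_packaged_tesseract /outputs/tesseract-standalone \
    /outputs/example.png /outputs/text_result -l eng --psm 6
fi
```

Setting the variables only for the one command keeps them from affecting other programs in the same shell.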
Page Segmentation Modes (PSM)
0 = Orientation and script detection (OSD) only
1 = Automatic page segmentation with OSD
2 = Automatic page segmentation, but no OSD, or OCR (not implemented)
3 = Fully automatic page segmentation, but no OSD (Default)
4 = Assume a single column of text of variable sizes
5 = Assume a single uniform block of vertically aligned text
6 = Assume a single uniform block of text
7 = Treat the image as a single text line
8 = Treat the image as a single word
9 = Treat the image as a single word in a circle
10 = Treat the image as a single character
11 = Sparse text. Find as much text as possible in no particular order
12 = Sparse text with OSD
13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific
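Which mode works best depends on the image layout, so it can be worth running a few and comparing the results. A minimal sketch that loops over several modes; the function name `ocr_with_psm_modes`, the output naming scheme, and the `TESSERACT_BIN` variable are all hypothetical:

```shell
# Hypothetical helper: OCR one image with several page segmentation modes.
# Writes <out_prefix>_psm<N>.txt for each mode N (Tesseract adds the .txt suffix).
ocr_with_psm_modes() {
  image="$1"
  out_prefix="$2"
  shift 2
  for psm in "$@"; do
    "${TESSERACT_BIN:-./tesseract}" "$image" "${out_prefix}_psm${psm}" -l eng --psm "$psm"
  done
}

# Example: compare the default mode (3), a single block (6), and sparse text (11).
if [ -f /outputs/example.png ]; then
  ocr_with_psm_modes /outputs/example.png /outputs/compare 3 6 11
fi
```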
📖 Learn More: Tesseract Command Line Documentation
6 Troubleshooting
| Issue | Solution |
|---|---|
| wget: command not found | yum install wget |
| gzip: Cannot exec | yum install gzip |
| make: command not found | yum install make |
| C++ compiler cannot create executables | yum install gcc-c++ |
| pip: command not found | yum -y install python3-pip |
| Install Python 3 | yum -y install python3 |
🛠️ Additional Utilities
Check Library Dependencies
readelf -d tesseract | grep NEEDED
# Output shows required libraries:
# libtesseract.so.5
# liblept.so.5
# libjpeg.so.62
# libtiff.so.5
# libz.so.1
# librt.so.1
# libpthread.so.0
# libstdc++.so.6
# libm.so.6
# libgomp.so.1
# libgcc_s.so.1
# libc.so.6
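The NEEDED list can be compared against the package's lib directory to see which libraries the target system would have to supply. A minimal sketch; the function names `needed_libs` and `not_packaged` are hypothetical, and the parsing assumes the usual `readelf -d` line format, `Shared library: [name]`:

```shell
# Hypothetical helper: read `readelf -d` output on stdin, print the NEEDED library names.
needed_libs() {
  sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# Hypothetical helper: read library names on stdin, print those absent from the given directory.
not_packaged() {
  lib_dir="$1"
  while read -r name; do
    [ -e "$lib_dir/$name" ] || echo "$name"
  done
}

# Run from the package directory (skipped when the binary is not there).
if [ -f ./tesseract ]; then
  readelf -d ./tesseract | needed_libs | not_packaged ./lib
fi
```

Names that remain, such as libc.so.6, are normally provided by the operating system; anything else is a candidate for copying into lib/.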
Find Files by Name
# Example: Find all JPEG libraries
find / -type f -name "libjpeg*"
# Output:
# /usr/share/doc/libjpeg-turbo-devel-1.2.90/libjpeg.txt
# /usr/lib64/pkgconfig/libjpeg.pc
# /usr/lib64/libjpeg.so.62.1.0
# /outputs/tesseract-standalone/lib/libjpeg.so.62
📦 Download Pre-compiled Binary
👉🏻 Skip the compilation? Download the pre-compiled Tesseract 5.0 executable binary file ready for AWS Lambda deployment.
🎯 Summary
- ✅ Compiled Tesseract 5.0 on Amazon Linux 2
- ✅ Created standalone package with all dependencies
- ✅ Ready for AWS Lambda deployment
- ✅ Includes language models and configuration
- ✅ Tested and verified OCR functionality
🙇🏻Thank you!
image by Salvatore Andrea Santacroce