🪨Compiling tesseract-5.0 on Amazon Linux (AWS Lambda pack)

🎯 Why This Guide?

My goal is to build a Tesseract 5.0 package for AWS Lambda, which runs on Amazon Linux. Build instructions for Windows, Debian/Ubuntu, and macOS are easy to find, but Amazon Linux needs its own steps.

📋 Contents

  1. ✅ Amazon Linux Docker Setup
  2. ✅ Compiling Leptonica 1.79
  3. ✅ Compiling Tesseract 5.0
  4. ✅ Tesseract Packaging
  5. ✅ Download Language Models & Testing
  6. ✅ Troubleshooting Common Issues

1 Amazon Linux Docker Setup

⚠️ Note: If your OS is already Amazon Linux or CentOS, you can skip this step.

Docker Hub: https://hub.docker.com/_/amazonlinux

Pull Amazon Linux Image

🐳 Pull Docker Image
# The :2 tag pins Amazon Linux 2, which the rest of this guide assumes
docker pull amazonlinux:2

Run Docker Container

🐳 Docker Run Commands
# Simple container run and enter
docker run -it amazonlinux:2 /bin/bash

# Share local folder with container (recommended)
docker run -v "$(pwd)":/outputs -it amazonlinux:2 /bin/bash

# Run container in background
docker run -dit amazonlinux:2 /bin/bash

Verify Amazon Linux Version

📋 Check OS Version
# Option 1: Detailed info
cat /etc/os-release

# Output:
# NAME="Amazon Linux"
# VERSION="2"
# ID="amzn"
# ID_LIKE="centos rhel fedora"
# VERSION_ID="2"
# PRETTY_NAME="Amazon Linux 2"

# Option 2: Simple version
cat /etc/system-release

# Output:
# Amazon Linux release 2 (Karoo)
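
Since the rest of this guide assumes Amazon Linux 2, a build script can check the release file before going any further. Here is a minimal sketch of that check. The function name is_amazon_linux_2 is made up for this example, and it only looks at the ID and VERSION_ID fields shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given os-release file
# describes Amazon Linux 2 (ID="amzn", VERSION_ID="2").
is_amazon_linux_2() {
    release_file="${1:-/etc/os-release}"
    [ -r "$release_file" ] || return 1
    # Source the file in a subshell so its variables do not leak out.
    (
        . "$release_file"
        [ "$ID" = "amzn" ] && [ "$VERSION_ID" = "2" ]
    )
}

if is_amazon_linux_2; then
    echo "Amazon Linux 2 detected"
else
    echo "Warning: this does not look like Amazon Linux 2" >&2
fi
```

Pass a path as the first argument to check a file other than /etc/os-release.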

Install Basic Dependencies

📦 Install Required Packages
yum install autoconf automake libtool pkgconfig.x86_64 \
    libpng12-devel.x86_64 libjpeg-devel libtiff-devel.x86_64 \
    zlib-devel.x86_64

2 Compile Leptonica 1.79

💡 What is Leptonica?

Leptonica is an open-source library for image processing and analysis. Tesseract depends on this library, so we need to install it first.

🔨 Compile Leptonica
# Download Leptonica source
wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz

# Extract archive
tar -zxvf leptonica-1.79.0.tar.gz

# Navigate to directory
cd leptonica-1.79.0

# Configure build
./configure --prefix=/usr/local/leptonica-1.79.0

# Compile
make

# Install
make install

3 Compile Tesseract 5.0

Install Git

📦 Install Git
yum install git

Configure PKG_CONFIG_PATH for Leptonica

⚙️ Set Environment Variable
export PKG_CONFIG_PATH=/usr/local/leptonica-1.79.0/lib/pkgconfig
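
The export above replaces whatever PKG_CONFIG_PATH held before. If other libraries on your machine rely on that variable, prepending may be safer. The sketch below shows one way to do it. The helper name prepend_pkg_config_path is my own, and the final check assumes Leptonica installed its lept.pc file under the prefix used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the front of PKG_CONFIG_PATH
# without discarding entries that are already there.
prepend_pkg_config_path() {
    dir="$1"
    if [ -n "${PKG_CONFIG_PATH:-}" ]; then
        PKG_CONFIG_PATH="$dir:$PKG_CONFIG_PATH"
    else
        PKG_CONFIG_PATH="$dir"
    fi
    export PKG_CONFIG_PATH
}

prepend_pkg_config_path /usr/local/leptonica-1.79.0/lib/pkgconfig

# If pkg-config is installed, this prints the Leptonica version once
# the library has been built and installed under that prefix.
if command -v pkg-config >/dev/null 2>&1; then
    pkg-config --modversion lept || echo "lept.pc not found on PKG_CONFIG_PATH" >&2
fi
```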

Compile Tesseract

🔨 Compile Tesseract 5.0
# Clone Tesseract repository
git clone https://github.com/tesseract-ocr/tesseract.git

# Navigate to directory
cd tesseract

# Generate configure script
./autogen.sh

# Configure build
./configure --prefix=/usr/local/tesseract-5.0

# Compile
make

# Install
make install
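
The plain make call builds on a single core, which can take a while for Tesseract. If your container has more than one core, a parallel build will probably help. This is only a sketch: make_jobs is a made-up helper name, and it assumes nproc from GNU coreutils is present, falling back to one job if it is not.

```shell
#!/bin/sh
# Hypothetical helper: print the number of parallel make jobs to use.
# Falls back to 1 when nproc (GNU coreutils) is not available.
make_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 1
    fi
}

echo "Suggested build command: make -j$(make_jobs)"
```

The same command can be used for the Leptonica build as well.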

Verify Installation

✅ Check Tesseract Version
# Check version
/usr/local/tesseract-5.0/bin/tesseract -v

# Output:
# tesseract 5.0.0-alpha-20210401-90-g723eb
#  leptonica-1.79.0
#   libjpeg 6b (libjpeg-turbo 1.2.90) : libtiff 4.0.3 : zlib 1.2.7
#  Found AVX2
#  Found AVX
#  Found FMA
#  Found SSE4.1
#  Found OpenMP 201511

4 Package Tesseract for Deployment

✨ Goal: Create a standalone Tesseract package that can be used on Amazon Linux or AWS Lambda.

📦 Create Standalone Package
# Create package directory
mkdir /outputs/tesseract-standalone

# Copy Tesseract executable
cp /usr/local/tesseract-5.0/bin/tesseract /outputs/tesseract-standalone/

# Create lib directory
mkdir /outputs/tesseract-standalone/lib

# Copy Tesseract library
cp /usr/local/tesseract-5.0/lib/libtesseract.so.5 /outputs/tesseract-standalone/lib/

# Copy Leptonica library
cp /usr/local/leptonica-1.79.0/lib/liblept.so.5 /outputs/tesseract-standalone/lib/

# Copy JPEG library
cp /usr/lib64/libjpeg.so.62 /outputs/tesseract-standalone/lib/

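👉🏻 Result: The directory /outputs/tesseract-standalone now contains the Tesseract 5.0 binary and the libraries copied above. The other libraries listed under "Check Library Dependencies" (for example libtiff.so.5 and libgomp.so.1) must either exist on the target system or be copied into lib/ the same way.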


5 Download Language Models & Test

Create Tessdata Directory

📁 Setup Language Models
# Create tessdata directory
mkdir /outputs/tesseract-standalone/tessdata
cd /outputs/tesseract-standalone/tessdata

# Download English language model (best quality)
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

Language Model Options

Model Type   Repository      Description
Best         tessdata_best   Highest accuracy, slower
Fast         tessdata_fast   Faster processing, good accuracy
Normal       tessdata        Balanced performance
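
To switch between the three model types without editing URLs by hand, the repository name can be picked in a small function. This is a sketch under one assumption: that tessdata_fast and tessdata follow the same URL pattern as the tessdata_best link used earlier. The name model_url is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the download URL for a language model.
#   $1 = model type: best | fast | normal
#   $2 = language code, e.g. eng
# Assumes all three repositories follow the same URL pattern as the
# tessdata_best link used in this guide.
model_url() {
    case "$1" in
        best)   repo="tessdata_best" ;;
        fast)   repo="tessdata_fast" ;;
        normal) repo="tessdata" ;;
        *)      echo "unknown model type: $1" >&2; return 1 ;;
    esac
    echo "https://github.com/tesseract-ocr/$repo/raw/master/$2.traineddata"
}

# Example: download the fast English model with
#   wget "$(model_url fast eng)"
model_url fast eng
```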

Test Tesseract

🧪 Run Tesseract Test
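# Set tessdata path
export TESSDATA_PREFIX="/outputs/tesseract-standalone/tessdata"

# Let the loader find the bundled libraries (needed wherever they are not installed system-wide)
export LD_LIBRARY_PATH="/outputs/tesseract-standalone/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"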

# Navigate to tesseract directory
cd /outputs/tesseract-standalone

# Run OCR on image (Tesseract appends .txt to the output base name)
./tesseract /outputs/example.png /outputs/text_result -l eng --psm 6
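
When the package is moved to another location, the hard-coded /outputs paths no longer apply. One option is a small wrapper that derives both paths from the package directory. This is a sketch, not a tested deployment script. The function name run_packaged_tesseract is made up, and it assumes the directory layout built in this guide (tesseract, lib/, tessdata/).

```shell
#!/bin/sh
# Hypothetical wrapper: run the packaged tesseract with library and
# tessdata paths derived from the package directory.
#   $1 = package directory (contains tesseract, lib/, tessdata/)
#   remaining arguments are passed to tesseract unchanged
run_packaged_tesseract() {
    pkg_dir="$1"
    shift
    LD_LIBRARY_PATH="$pkg_dir/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    TESSDATA_PREFIX="$pkg_dir/tessdata" \
        "$pkg_dir/tesseract" "$@"
}

# Example call (requires the package built in this guide):
#   run_packaged_tesseract /outputs/tesseract-standalone \
#       /outputs/example.png /outputs/text_result -l eng --psm 6
```

The two variables are set only for that one command, so the rest of the shell session is left unchanged.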

Page Segmentation Modes (PSM)

📋 PSM Options
0  = Orientation and script detection (OSD) only
1  = Automatic page segmentation with OSD
2  = Automatic page segmentation, but no OSD, or OCR
3  = Fully automatic page segmentation, but no OSD (Default)
4  = Assume a single column of text of variable sizes
5  = Assume a single uniform block of vertically aligned text
6  = Assume a single uniform block of text
7  = Treat the image as a single text line
8  = Treat the image as a single word
9  = Treat the image as a single word in a circle
10 = Treat the image as a single character
11 = Sparse text. Find as much text as possible in no particular order
12 = Sparse text with OSD
13 = Raw line. Treat the image as a single text line

📖 Learn More: Tesseract Command Line Documentation
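
If the PSM value comes from user input, for example a request parameter, it may be worth rejecting anything outside the range listed above before calling Tesseract. A minimal sketch follows. is_valid_psm is a hypothetical name, and the 0 to 13 range is taken from the list in this section.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is an integer from 0 to 13,
# the PSM range listed in this guide.
is_valid_psm() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 13 ]
}

psm=6
if is_valid_psm "$psm"; then
    echo "Using --psm $psm"
else
    echo "Invalid PSM value: $psm" >&2
fi
```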


6 Troubleshooting

Issue                                     Solution
wget: command not found                   yum install wget
gzip: Cannot exec                         yum install gzip
make: command not found                   yum install make
C++ compiler cannot create executables    yum install gcc-c++
pip: command not found                    yum -y install python3-pip
Install Python 3                          yum -y install python3
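
Most of the errors in this table come from a tool missing in the bare container image, so checking for them up front saves some back and forth. The sketch below does that. missing_tools is a made-up helper, and the tool list is my own guess at what the steps in this guide call.

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools that the steps in this guide call, or that configure needs.
missing=$(missing_tools wget tar gzip make gcc g++ git)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All build tools found"
fi
```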

🛠️ Additional Utilities

Check Library Dependencies

🔍 List Dependencies
readelf -d tesseract | grep NEEDED

# Output shows required libraries:
# libtesseract.so.5
# liblept.so.5
# libjpeg.so.62
# libtiff.so.5
# libz.so.1
# librt.so.1
# libpthread.so.0
# libstdc++.so.6
# libm.so.6
# libgomp.so.1
# libgcc_s.so.1
# libc.so.6
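
The readelf output can also be compared against the package's lib directory to see which libraries are not bundled. This is a rough sketch with two hypothetical helpers, needed_libs and unbundled_libs. It assumes readelf prints NEEDED entries in its usual "Shared library: [name]" form. Which of the unbundled libraries actually need copying depends on what the target system already provides.

```shell
#!/bin/sh
# Hypothetical helper: read `readelf -d` output on stdin and print the
# library names from its NEEDED entries, one per line.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# Hypothetical helper: print the NEEDED libraries of a binary that are
# not present in the given bundle directory.
#   $1 = binary, $2 = bundle lib directory
unbundled_libs() {
    readelf -d "$1" | needed_libs | while read -r lib; do
        [ -e "$2/$lib" ] || echo "$lib"
    done
}

# Example call (requires readelf and the package built in this guide):
#   unbundled_libs /outputs/tesseract-standalone/tesseract \
#       /outputs/tesseract-standalone/lib
```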

Find Files by Name

🔍 Search for Files
# Example: Find all JPEG libraries
find / -type f -name "libjpeg*"

# Output:
# /usr/share/doc/libjpeg-turbo-devel-1.2.90/libjpeg.txt
# /usr/lib64/pkgconfig/libjpeg.pc
# /usr/lib64/libjpeg.so.62.1.0
# /outputs/tesseract-standalone/lib/libjpeg.so.62

📦 Download Pre-compiled Binary

👉🏻 Skip the compilation? Download the pre-compiled Tesseract 5.0 executable binary file ready for AWS Lambda deployment.

🎯 Summary

  • ✅ Compiled Tesseract 5.0 on Amazon Linux 2
  • ✅ Created standalone package with all dependencies
  • ✅ Ready for AWS Lambda deployment
  • ✅ Includes language models and configuration
  • ✅ Tested and verified OCR functionality

🙇🏻Thank you!

image by Salvatore Andrea Santacroce
