Pytesseract | Batch Processing

This notebook covers batch processing with pytesseract.

import pytesseract

Batch File

You can process a batch of files by passing pytesseract a text file that lists the image paths, one per line. The paths are relative to the directory the script runs from. Here, the text file lists two image paths.

batch_file_path = "../../../../binder-datasets/ocr/batch/images.txt"
with open(batch_file_path) as f:
    print(f.read())
../../../binder-datasets/ocr/images/invoice.png
../../../binder-datasets/ocr/images/letter.jpg
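If the images live in a single directory, the batch file can be generated rather than written by hand. The following is a minimal sketch, not part of the original notebook: `write_batch_file` and `IMAGE_SUFFIXES` are hypothetical names, and the list of suffixes is an assumption you should adjust to your own data.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def write_batch_file(image_dir, batch_file):
    """Write one image path per line and return the paths that were written."""
    image_dir = Path(image_dir)
    # Keep only regular files with a known image suffix, in a stable order.
    paths = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # One path per line, matching the format of the batch file shown above.
    Path(batch_file).write_text("".join(f"{p.as_posix()}\n" for p in paths))
    return paths
```

Each path is written relative to the current working directory whenever `image_dir` is relative, so run the OCR step from the same directory that you generated the file from.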

Timeouts

Timeouts are not required for batch processing, but it is a good idea to terminate Tesseract if processing takes too long. When the limit is reached, pytesseract raises a RuntimeError. Let’s try two scenarios: one that times out after 1 second and one that times out after 30 seconds. Here, 1 second is not enough time to process both images.

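# Scenario 1: 1 second is too short for both images, so Tesseract is terminated.
timeout = 1
try:
    print(pytesseract.image_to_string(batch_file_path, timeout=timeout))
except RuntimeError as timeout_error:
    # Report the timeout instead of silently discarding it.
    print(f"Timed out after {timeout} s: {timeout_error}")

# Scenario 2: 30 seconds gives Tesseract more time to finish.
timeout = 30
try:
    print(pytesseract.image_to_string(batch_file_path, timeout=timeout))
except RuntimeError as timeout_error:
    print(f"Timed out after {timeout} s: {timeout_error}")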
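The two scenarios can also be folded into a single helper that retries with progressively longer limits. This is a rough sketch, not something pytesseract provides: `ocr_with_retries` is a hypothetical name, and it assumes the `ocr` callable accepts a `timeout` keyword and raises RuntimeError on timeout, as `pytesseract.image_to_string` does in the cells above.

```python
def ocr_with_retries(ocr, path, timeouts=(1, 30)):
    """Try each timeout in order; return the text, or None if every attempt times out."""
    for timeout in timeouts:
        try:
            return ocr(path, timeout=timeout)
        except RuntimeError:
            # A timeout surfaces as RuntimeError; move on to the next, longer limit.
            print(f"Timed out after {timeout} s")
    return None
```

With pytesseract, it could be called as `ocr_with_retries(pytesseract.image_to_string, batch_file_path)`. Passing the OCR function in as an argument keeps the retry logic independent of pytesseract, so it can be exercised with a stand-in function.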