Pytesseract | Batch Processing
This notebook covers batch processing with pytesseract.
import pytesseract
Batch File
You can process a batch of files by passing a text file that lists the image paths, one per line. The paths are relative to the directory where the script is running. Here, the text file contains two image paths.
batch_file_path = "../../../../binder-datasets/ocr/batch/images.txt"
with open(batch_file_path) as f:
    print(f.read())
../../../binder-datasets/ocr/images/invoice.png
../../../binder-datasets/ocr/images/letter.jpg
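A batch file like this does not have to be written by hand. The following is a minimal sketch, assuming a hypothetical helper named `write_batch_file` and an illustrative set of image extensions; neither is part of pytesseract. It collects the images in a directory and writes one path per line.

```python
from pathlib import Path


def write_batch_file(image_dir, batch_file,
                     extensions=(".png", ".jpg", ".jpeg", ".tif", ".tiff")):
    """Write one image path per line to batch_file and return the paths.

    Hypothetical helper for illustration; the extension list is an assumption.
    """
    image_dir = Path(image_dir)
    # Keep only regular files whose suffix looks like an image, in sorted order
    paths = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    lines = [p.as_posix() for p in paths]
    Path(batch_file).write_text("".join(f"{line}\n" for line in lines))
    return lines
```

The paths are written exactly as they are built from `image_dir`, so a relative `image_dir` produces relative paths, which should then be relative to where the script runs.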
Timeouts
Timeouts are not required for batch processing, but it is a good idea to terminate Tesseract if processing takes too long. Let's try two scenarios: one that times out after 1 second and one that times out after 30 seconds. In this case, 1 second is not enough time to process both images.
timeout = 1  # seconds
try:
    print(pytesseract.image_to_string(batch_file_path, timeout=timeout))
except RuntimeError as timeout_error:
    # Tesseract was terminated before it finished, so there is nothing to print
    pass
timeout = 30  # seconds
try:
    print(pytesseract.image_to_string(batch_file_path, timeout=timeout))
except RuntimeError as timeout_error:
    # Tesseract was terminated before it finished, so there is nothing to print
    pass
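The two attempts above can also be folded into one loop that retries with a longer timeout. This is a minimal sketch, not a pytesseract feature: `ocr_with_timeouts` is a hypothetical helper, and it takes the OCR function as an argument so that it works with any callable accepting a `timeout` keyword.

```python
def ocr_with_timeouts(ocr_func, path, timeouts=(1, 30)):
    """Try ocr_func with each timeout in turn.

    Returns (text, timeout_used), or (None, None) if every attempt raised
    RuntimeError. Hypothetical helper for illustration only.
    """
    for timeout in timeouts:
        try:
            return ocr_func(path, timeout=timeout), timeout
        except RuntimeError:
            # This attempt was terminated; move on to the next, longer timeout
            continue
    return None, None
```

With the names used in this notebook, a call would look like `text, used = ocr_with_timeouts(pytesseract.image_to_string, batch_file_path)`. Catching `RuntimeError` this broadly may also hide failures unrelated to timeouts, so logging the exception inside the loop is worth considering.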