Pytesseract | Orientation and Script Detection (OSD)
This example shows how to use the orientation and script detection (OSD) functions in pytesseract.
OSD detects the orientation of the input image and the script (the writing system, such as Latin or Hebrew) of its text. This information helps improve accuracy with Tesseract/pytesseract, as the examples below demonstrate.
from PIL import Image
import pytesseract
Image Rotation
If the input image is rotated, Tesseract will give poor results by default. Tesseract does not preprocess images to correct their rotation; it is up to the user to rotate the image before running OCR.
path = '../../../../binder-datasets/ocr/images/letter_rotated.jpg'
im = Image.open(path)
# Show a 30% thumbnail; pass the size as an explicit (width, height) tuple
display(im.resize(tuple(int(0.3 * s) for s in im.size)))
Let’s see the results as-is.
print(pytesseract.image_to_string(im))
a0q uyor
‘sIOARapUs pareys INO Ul yey pue wistumdo years WIM
“WOTRDIpap
SULIdARMUN pue AIIUN IMO 0} JUSUTe}Sa} B aUIOIAG SJUBWIBATTYIe MO pue ‘saaty) ATUNUIUIOD
AZoTouya} sup Jo sntuas BANaT[Od ay} aayM aININJ e 9810] UD aM “TayIasOL, ALO}sTY Noysnop
premio} Auewmy petjedoid aaey yew) sopdioutid ay) sproydn pue ‘AtAyeaso ey) saminu
‘s10}ea19 JO SUSU ayy suotdureyo yey) ATUNUIUIOD BATSNPUT Ue a}eaID sn jay ‘Ayadod [enDaTjaIut
sundaloid pue 8undadsai 0} JUBUNTUIWIOD paMmaual e UT aut uTOf 0} NOA aHAUT | “BUTSOTD UT
‘PLIOM SUTATOAS-~IaAa INO 0} SBULIQ
}f SUOTMNGLNUOD au) pue Ayadord JempdaTjaIUT Jo aUEIOdUTT ay} Jo WAWISpay|MOUye aATOaTIOD & YIM
suIgaq I pue ‘amyny sip adeys 0} JaMmod IMO UTYIIM ST I] ‘pauiejuN suTeUeI UOMeIOTAxe [eoTso;oUTya)
Jo Ids at pue ‘spunoge UOTRAOUUT ‘SaysLMOT] WONeIOGeT[OD a1ayM PLOM VY ‘peye101d
SI YIOM ITay}] PU PaleIqaTad are s1O}ea1D PUL SIOJUIAUT SSafaIN af} aoyM P]JOM e sUTSeUIT
‘ajoum
e se AjaIN0s s}ljauiaq ATAIEUIN[N pue ‘saLaAooSIp SUTYeIIQpUNOIS sasemMooua “UeTe} Mau see yey
JUSUIUOITAUA Ue aJeANTND am ‘os Surop Ag ‘ssaigoid aAUp OYM asoy} 10J Woddns pue uoneesdde
jo aimyno e ZULU ‘suoNealD pue aspayMouy SuLIeys Jo sueduI a_qeuoseal Pur Ie} 10J
JILOOAPE SN Jay ‘PeaIsuy “SuTtpPAIaAa 0} sSad0e aay Jo ainqye ay) Aq paprnSstut aq ou sn ja‘T
‘panyea pue pa}deloid aq IsNUI sUONeaD May) Wo Wauaq 0} IWS say) pue ‘UoTssed ITey)
‘YIOM May], ‘aiMNy Mo adeys Jey seap! OUT afi] aypearq 0} ‘UONeIIpap ssapueTer ay} Aq payany pue
SUOISTA Jtay) Aq USALIp ‘SINOY SsapUNOd ISBAUT OYM S10}eI1D JY ST II “pfinq pue weap 0} alep OYM *
asoy} JO SIYSII ay Joj JOadsai Jo UONepUNO] ay} UO SBATIU) VOPAOUUT Jey} JaqUIBUIal SN 39'T
‘s10}919 SY) JO sUojja pue syst sy}
JO] UONeIapIsuod anp NOTPIM painqunsip pue ‘paale ‘patdod Sureq suopea [eNsIp Jo uonesajtoid
eB SUISSAUIIM are am ‘S]Uasatdal II SIIO}Ja aaNeaID ay) pue AVadoid [en daTJaIUl Jo anjeA aU) SSTWISIp
0) Aouapus) ZuIMOIZ eB Aq patiayeartp st ‘AVTUNUTUIOD e se sn sa}TUN YDTYM ‘AZofouyd~a} 10j uotssed
paseys ING ‘aul je BULMeUS Udaq Sey Jey} WisdUOD e ssaidxa 0} pa[eduiod Jae] | ‘TaAaMOH
‘a8payMouy jo ymsind pareys
al) pue ‘UOTeUTSeUIT ‘UONeIOgeT[OD SpUeUIap Jey} SUIT) e ST I] “AJa{DOS JO JUAULIaNaq al) 10} pessaurey
aq 0) Suntem ‘dser8 no uTWIM sal] WoNeAOUUT Jo [enuaIOd ssaypunog ay], ‘santpiqissod ssaypua jo
aordiaid ay} 3e BUIPUEs SBATaSINO PUL] 9M ‘s}UaWIBOUeApe [edTZO[OUYIa} prides Jo ase stp UT
‘sIossadapaid mo jo auo
Aq pauuad ,,sisthqqox 0} Jana] uad¢,, dtuodt ay} Jo Wards ay) Suroyse ‘puru Aw UT SULMaIq Ueaq
JARY EY) SIYSNOUP at1os areys 0} YSTM | ‘AepoO], ‘AWUNUIWIOD JURIQIA MO JO aInyNy ay} noge Wrau0d
JO YONO} & PUR JUSTIa}1OXa JSOUNIN YIM NOL 0} aM | se ‘Tam NOA spuly Jane] sty adoy J
SISEISHIUy AsO[OUl al, 0} Jojlo |] UedQ uy
The text is completely inaccurate. You can see what Tesseract was trying to do: read the page top to bottom and extract the text, even though the page is upside down.
Orientation and Script Detection (OSD)
OSD can help here. It provides the information needed to fix the rotation issue, along with additional information such as the script of the text.
We can get OSD information with pytesseract by using image_to_osd.
It provides this information:
page_num: the page index of the current item
orientation: the detected orientation of the image, in degrees
rotate: the rotation angle, in degrees, required to bring the text upright
orientation_conf: Tesseract's confidence that the orientation was detected correctly (higher is better)
script: the script family to which the detected text belongs
script_conf: Tesseract's confidence that the script was detected correctly (higher is better)
According to the official documentation, a confidence score of 15.0 is ‘reasonably confident’ for orientation and script detection.
Setting output_type to 'dict' is convenient, because the values can then be accessed directly by key.
osd = pytesseract.image_to_osd(im, output_type='dict')
print(osd)
{'page_num': 0, 'orientation': 180, 'rotate': 180, 'orientation_conf': 20.69, 'script': 'Latin', 'script_conf': 33.33}
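When output_type is left at its default, image_to_osd returns Tesseract's report as a single string rather than a dictionary. The sketch below shows one way such a report could be converted by hand. The helper name parse_osd is hypothetical, and the sample report text is an assumption about the "Label: value" layout, filled in with the values from the output above, so check it against the string your own installation returns.

```python
def parse_osd(raw: str) -> dict:
    """Parse a plain-text OSD report into a dict keyed like output_type='dict'.

    Hypothetical helper: the labels below are assumed from the report layout
    in `sample` and may need adjusting for other Tesseract versions.
    """
    fields = {
        "Page number": ("page_num", int),
        "Orientation in degrees": ("orientation", int),
        "Rotate": ("rotate", int),
        "Orientation confidence": ("orientation_conf", float),
        "Script": ("script", str),
        "Script confidence": ("script_conf", float),
    }
    result = {}
    for line in raw.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Label: value" pairs
        label = label.strip()
        if label in fields:
            key, cast = fields[label]
            result[key] = cast(value.strip())
    return result


# Assumed report text, using the values from the dict output above.
sample = """Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 20.69
Script: Latin
Script confidence: 33.33"""

print(parse_osd(sample))
```

In practice output_type='dict' makes this unnecessary; the sketch is mainly useful for seeing how the dictionary keys relate to the fields of the text report.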
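Correcting the rotation
Let’s correct the rotation with Pillow, using the angle reported by OSD.
rotate = osd['rotate']
# rotate() returns a new image; expand=True avoids cropping at 90 or 270 degrees
im_fixed = im.rotate(rotate, expand=True)
display(im_fixed.resize(tuple(int(0.3 * s) for s in im_fixed.size)))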
print(pytesseract.image_to_string(im_fixed))
An Open Letter to Technology Enthusiasts
I hope this letter finds you well, as I write to you with utmost excitement and a touch of
concern about the future of our vibrant community. Today, I wish to share some thoughts that have
been brewing in my mind, echoing the spirit of the iconic "Open Letter to Hobbyists" penned by
one of our predecessors.
In this age of rapid technological advancements, we find ourselves standing at the precipice
of endless possibilities. The boundless potential of innovation lies within our grasp, waiting to be
harnessed for the betterment of society. It is a time that demands collaboration, imagination, and the
shared pursuit of knowledge.
However, I feel compelled to express a concern that has been gnawing at me. Our shared
passion for technology, which unites us as a community, is threatened by a growing tendency to
dismiss the value of intellectual property and the creative efforts it represents. We are witnessing a
proliferation of digital creations being copied, altered, and distributed without due consideration for
the rights and efforts of their creators.
Let us remember that innovation thrives on the foundation of respect for the rights of those
_who dare to dream and build. It is the creators who invest countless hours, driven by their visions
and fueled by their relentless dedication, to breathe life into ideas that shape our future. Their work,
their passion, and their right to benefit from their creations must be protected and valued.
Let us not be misguided by the allure of free access to everything. Instead, let us advocate
for fair and reasonable means of sharing knowledge and creations, nurturing a culture of
appreciation and support for those who drive progress. By doing so, we cultivate an environment
that attracts new talent, encourages groundbreaking discoveries, and ultimately benefits society as a
whole.
Imagine a world where the tireless inventors and creators are celebrated and their work is
protected. A world where collaboration flourishes, innovation abounds, and the spirit of
technological exploration remains untamed. It is within our power to shape this future, and it begins
with a collective acknowledgment of the importance of intellectual property and the contributions it
brings to our ever-evolving world.
In closing, I invite you to join me in a renewed commitment to respecting and protecting
intellectual property. Let us create an inclusive community that champions the rights of creators,
nurtures their creativity, and upholds the principles that have propelled humanity forward
throughout history. Together, we can forge a future where the collective genius of the technology
community thrives, and our achievements become a testament to our unity and unwavering
dedication.
With great optimism and faith in our shared endeavors,
John Doe
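The rotation above was applied unconditionally. A more defensive variant would rotate only when the orientation confidence reaches the 15.0 threshold mentioned earlier. The following is a minimal sketch of that idea; rotation_to_apply and its min_conf parameter are hypothetical names, not part of pytesseract, and the low-confidence dictionary is made up for illustration.

```python
def rotation_to_apply(osd: dict, min_conf: float = 15.0) -> int:
    """Return the angle from osd['rotate'], or 0 if the confidence is too low.

    Hypothetical helper; `osd` is assumed to be the dict returned by
    pytesseract.image_to_osd(..., output_type='dict').
    """
    if osd.get("orientation_conf", 0.0) < min_conf:
        return 0  # not confident enough: leave the image untouched
    return int(osd.get("rotate", 0)) % 360


# OSD result from the rotated letter above.
confident = {"page_num": 0, "orientation": 180, "rotate": 180,
             "orientation_conf": 20.69, "script": "Latin", "script_conf": 33.33}

# Made-up low-confidence result, for illustration only.
unsure = {"page_num": 0, "orientation": 270, "rotate": 90,
          "orientation_conf": 2.5, "script": "Latin", "script_conf": 10.0}

print(rotation_to_apply(confident))
print(rotation_to_apply(unsure))

# Possible usage with Pillow (not executed here):
# im_fixed = im.rotate(rotation_to_apply(osd), expand=True)
```

The 180 degree case shown in this example is unaffected by rotation direction. For 90 or 270 degrees it is worth checking on a test image that the angle reported by Tesseract and the direction used by Pillow's rotate agree.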
Use case for Script Detection
Where does script detection come into play? Here’s one potential example: what if you are creating a global OCR API? In this case you may not know the language or script of the input image.
In this example, I have an image in the Hebrew language. I can extract text from it without knowing in advance that it is Hebrew by using the script trained data that comes with tessdata_fast.
path = '../../../../binder-datasets/ocr/images/hebrew_text.png'
im = Image.open(path)
display(im)
osd = pytesseract.image_to_osd(im, output_type='dict')
print(osd)
{'page_num': 0, 'orientation': 0, 'rotate': 0, 'orientation_conf': 6.11, 'script': 'Hebrew', 'script_conf': 90.0}
print(pytesseract.image_to_string(im, lang='script/'+osd['script'], config='--psm 6'))
ב וכל אֲלֹהִים בַּיוֹם הַשְׁבֵיעי. מְלֵאכְתּוֹ אֲשֶׁר עֲשֶׂה; ויִשְׁבּת בַּיוֹם הַשְׁבֵיעי. מכָּל-מְלָאכְתּוֹ אֲשֶׁר
עֲשֶׂה.
This is not perfect. There are two potential issues with this approach.
Some languages share the same script type. For example, English, Spanish, and French are all classified as ‘Latin’.
The script type returned by image_to_osd does not map one-to-one onto the tessdata script files. For example, for Simplified Chinese and Traditional Chinese the output might be ‘Han’, but if you examine the tessdata scripts, you will find ‘HanS’, ‘HanS_vert’, ‘HanT’, and ‘HanT_vert’.
For the first issue, one resolution is to extract the text with the ‘Latin’ script, then use a separate Python library, such as langdetect, to find the best language match. You could then run OCR again with the detected language for more accurate results.
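A rough sketch of that two-pass idea follows. The helper pick_tesseract_lang and the ISO_TO_TESSERACT table are hypothetical and deliberately small. The detector is passed in as a function so the example runs without langdetect installed; in real use it would be langdetect's detect, which returns ISO 639-1 codes such as 'en' or 'fr'.

```python
from typing import Callable

# A few ISO 639-1 codes mapped to Tesseract language codes.
# Illustrative only; extend the table for the languages you expect.
ISO_TO_TESSERACT = {"en": "eng", "es": "spa", "fr": "fra", "de": "deu"}


def pick_tesseract_lang(text: str,
                        detect: Callable[[str], str],
                        fallback: str = "script/Latin") -> str:
    """Map the detected language of `text` to a Tesseract `lang` value.

    Hypothetical helper. Falls back to the script model when detection
    fails or the detected language is not in the table.
    """
    try:
        code = detect(text)
    except Exception:
        # Detectors can raise on empty or featureless text.
        return fallback
    return ISO_TO_TESSERACT.get(code, fallback)


# Stand-in detector so the sketch is runnable on its own.
def fake_detect(text: str) -> str:
    return "fr"


print(pick_tesseract_lang("Bonjour tout le monde", fake_detect))

# Possible usage (not executed here; assumes langdetect is installed):
# from langdetect import detect
# first_pass = pytesseract.image_to_string(im, lang='script/Latin')
# lang = pick_tesseract_lang(first_pass, detect)
# second_pass = pytesseract.image_to_string(im, lang=lang)
```

The second pass only helps if the trained data for the detected language is installed, so keeping the script model as the fallback is a reasonable default.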
For the second issue, you may need to combine several methods to make an educated guess.
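One possible starting point, sketched below, is to expand the script name reported by OSD into the candidate script files and then choose among them with some scoring function. The names SCRIPT_CANDIDATES, script_lang_candidates, and best_candidate are hypothetical, the table holds only the ‘Han’ case mentioned above, and the scores in the demo are made up.

```python
from typing import Callable, List

# OSD script name -> tessdata script files that may apply.
# Only the 'Han' case from the text is listed; other entries are left to the reader.
SCRIPT_CANDIDATES = {
    "Han": ["HanS", "HanT", "HanS_vert", "HanT_vert"],
}


def script_lang_candidates(script: str) -> List[str]:
    """Return possible `lang` values for a script name reported by OSD."""
    names = SCRIPT_CANDIDATES.get(script, [script])
    return ["script/" + name for name in names]


def best_candidate(candidates: List[str],
                   score: Callable[[str], float]) -> str:
    """Pick the candidate with the highest score (the scoring is up to the caller)."""
    return max(candidates, key=score)


print(script_lang_candidates("Han"))
print(script_lang_candidates("Hebrew"))

# Made-up scores, for illustration only.
fake_scores = {"script/HanS": 71.0, "script/HanT": 64.5,
               "script/HanS_vert": 12.0, "script/HanT_vert": 9.0}
print(best_candidate(script_lang_candidates("Han"), fake_scores.get))
```

One candidate scoring function would be to run OCR with each lang value and compare the average word confidence reported by pytesseract.image_to_data. That is only one heuristic among several and costs one OCR pass per candidate.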