Sunday, January 6, 2019

PowerShell script to perform OCR on images in a folder

This PowerShell script runs Tesseract OCR on the images in a folder and saves the recognized text as text files. The script's default language is Kannada (language code kan), but you can change it to whatever language you need.

The script asks you to choose the source folder and the output folder where the converted files are saved. You can also pass both folders as parameters, in which case no prompt is shown.

Requirements to run this script: Tesseract OCR for Windows must be installed, and tesseract.exe must be on the PATH, because the script calls it by name.
Running PowerShell scripts must be enabled by running Set-ExecutionPolicy Unrestricted in PowerShell.
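
Before running the script, it can help to confirm both requirements from a PowerShell console. The snippet below is only a rough sketch: Test-CommandAvailable is a hypothetical helper name of mine, not part of the script, and the language listing assumes a Tesseract build that supports the --list-langs option.

```powershell
# Hypothetical helper (not part of the original script):
# returns $true when the named command can be resolved, for example via the PATH.
function Test-CommandAvailable {
    param([string]$Name)
    return [bool](Get-Command -Name $Name -ErrorAction SilentlyContinue)
}

if (Test-CommandAvailable -Name 'tesseract.exe') {
    # List the installed language packs so you can pick a value for -LanguageCode.
    & tesseract.exe --list-langs
}
else {
    Write-Warning "tesseract.exe was not found. Install Tesseract OCR and add its folder to the PATH."
}

# Show the execution policy that is currently in effect.
Write-Host "Current execution policy: $(Get-ExecutionPolicy)"
```

If kan does not appear in the language list, the Kannada language data probably still needs to be installed before the default setting will work.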

Save the script below as convert-ImagestoText.ps1, or under any name with a .ps1 extension, and run it after setting the execution policy.

To run the script, you can right-click it and select Run with PowerShell, or run it from the PowerShell console by typing .\filename.ps1
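
To skip the folder prompts, the three parameters declared in the script's param block can be passed on the command line. Here is a small sketch using splatting. The folder paths are made-up examples, and eng (English) is assumed to be installed in your Tesseract setup.

```powershell
# Example values only: replace the folders with your own.
$ocrParams = @{
    SourceFolderPath = 'C:\Scans'       # hypothetical folder containing the images
    OutputFolderPath = 'C:\ScansText'   # hypothetical folder for the text files
    LanguageCode     = 'eng'            # overrides the default "kan"
}

# Assumes the script was saved under this name in the current folder.
$scriptPath = '.\convert-ImagestoText.ps1'

if (Test-Path $scriptPath) {
    & $scriptPath @ocrParams
}
else {
    Write-Warning "$scriptPath was not found in the current folder."
}
```

This is equivalent to typing .\convert-ImagestoText.ps1 -SourceFolderPath C:\Scans -OutputFolderPath C:\ScansText -LanguageCode eng in the console.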

Please contact me if you need any help with this.

Important note: Please make sure that there are no spaces in any folder name when providing the input and output folders, or else the script might fail (I faced this issue today, 7/1/2018).

param($SourceFolderPath, $OutputFolderPath, $LanguageCode = "kan")

# Shows a folder picker and returns the selected path, or nothing if the dialog is cancelled.
Function Get-Folder($rootFolder, $DialogBoxTitleMessage)
{
    Add-Type -AssemblyName System.Windows.Forms

    $folderDialog = New-Object System.Windows.Forms.FolderBrowserDialog
    $folderDialog.Description = $DialogBoxTitleMessage
    $folderDialog.SelectedPath = $rootFolder

    if ($folderDialog.ShowDialog() -eq [System.Windows.Forms.DialogResult]::OK)
    {
        return $folderDialog.SelectedPath
    }
}

# Ask for the folders only when they were not both passed as parameters.
if (!($SourceFolderPath -and $OutputFolderPath))
{
    $rootFolderForSelector = "$env:userprofile\desktop"
    $SourceFolderPath = Get-Folder -rootFolder $rootFolderForSelector -DialogBoxTitleMessage "Please select Source folder with images"
    $OutputFolderPath = Get-Folder -rootFolder $rootFolderForSelector -DialogBoxTitleMessage "Please select folder to save the converted files"
}

if ($SourceFolderPath -and $OutputFolderPath)
{
    $filterFiles = "*.jp*g", "*.png", "*.bmp"

    # Start from an empty array so that += always appends,
    # even when a filter matches exactly one file or none at all.
    $inputFiles = @()
    foreach ($filterString in $filterFiles)
    {
        Write-Information -MessageData "Getting files of type $filterString" -InformationAction Continue
        $inputFiles += @(Get-ChildItem $SourceFolderPath -Filter $filterString)
    }

    $totalFiles = $inputFiles.Count
    $count = 0
    $inputFiles | ForEach-Object {
        $inputFileFullName = $_.FullName
        # Tesseract appends .txt to this output base name itself.
        $outputFileName = Join-Path $OutputFolderPath $_.BaseName
        try {
            $count++
            # Whole-number percentage for the progress bar.
            $perc = [int]((100 * $count) / $totalFiles)
            Write-Progress -Activity "OCR conversion" -PercentComplete $perc -Status "$perc %" -CurrentOperation "Converting $inputFileFullName"
            # Quoting the paths is intended to avoid the failure seen with spaces in folder names.
            # -ErrorAction Stop makes sure a failed launch reaches the catch block.
            Start-Process tesseract.exe -ArgumentList "`"$inputFileFullName`"", "`"$outputFileName`"", "-l", $LanguageCode -NoNewWindow -Wait -ErrorAction Stop
        }
        catch {
            Write-Warning "Error while converting $inputFileFullName"
            Write-Warning $_
        }
    }
}

Screenshots:
Running from the command line with the source and destination folders as input

Running without inputs, choosing the folders via the Windows folder prompt

Tesseract OCR project page for more options and information:
https://github.com/tesseract-ocr/tesseract
