Sunday, January 6, 2019

PowerShell script to perform OCR on images in a folder

This PowerShell script runs Tesseract OCR on the images (jpg/jpeg, png and bmp) in a folder and converts them to text files. The default language in the script is Kannada, but you can change it to the language you need through the LanguageCode parameter.

The script asks you to choose the source folder and the output folder where the converted files are saved.

Requirements to run this script: Tesseract OCR for Windows must be installed, and tesseract.exe must be reachable through the PATH because the script calls it by name.
Running PowerShell scripts must be enabled by running Set-ExecutionPolicy Unrestricted in PowerShell.

Save the script below as convert-ImagestoText.ps1, or under any name with the .ps1 extension, and run it after setting the execution policy.

To run the script, right-click it and select Run with PowerShell, or run it from the PowerShell console by typing .\filename.ps1

Please contact me if you need any help with this.

Important note: Please make sure there are no spaces in any folder name in the input and output paths, or else the script might fail (I faced this issue today, 7/1/2018).
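
The failure with spaces most likely comes from how Start-Process handles -ArgumentList: the array elements are joined with spaces into one command line, so a path containing a space ends up split into two arguments. A minimal sketch of a possible workaround follows. It assumes a hypothetical helper named ConvertTo-QuotedArgument and made-up example paths, none of which are part of the script below, and wraps each path in double quotes before it is passed on:

```powershell
# Hypothetical helper: wraps a value in double quotes so that it
# survives as a single argument when the list is joined with spaces.
function ConvertTo-QuotedArgument {
    param([Parameter(Mandatory)] [string] $Value)
    return '"{0}"' -f $Value
}

# Example paths (assumptions, chosen only because they contain spaces).
$inputFile  = 'C:\My Scans\page 1.png'
$outputBase = 'C:\My Text\page 1'

# Same argument order the script uses: input, output base, -l, language.
$arguments = (ConvertTo-QuotedArgument $inputFile),
             (ConvertTo-QuotedArgument $outputBase),
             '-l', 'kan'

# Show the command line that would be built from the array.
$arguments -join ' '
```

An array built this way could then be handed to Start-Process tesseract.exe -ArgumentList $arguments -NoNewWindow -Wait. Treat this as a starting point to try, not a confirmed fix.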

param($SourceFolderPath,$OutputFolderPath,$LanguageCode = "kan")

# Shows a folder picker dialog and returns the selected path (nothing if cancelled).
Function Get-Folder($rootFolder,$DialogBoxTitleMessage)
{
    Add-Type -AssemblyName System.Windows.Forms

    $foldername = New-Object System.Windows.Forms.FolderBrowserDialog
    $foldername.Description = $DialogBoxTitleMessage
    $foldername.SelectedPath = $rootFolder

    if($foldername.ShowDialog() -eq "OK")
    {
        $folder = $foldername.SelectedPath
    }
    return $folder
}

# Ask for the folders only when they were not passed as parameters.
if(!($SourceFolderPath -and $OutputFolderPath))
{
    $rootFolderForSelector = "$env:userprofile\desktop"
    $SourceFolderPath = Get-Folder -rootFolder $rootFolderForSelector -DialogBoxTitleMessage "Please select Source folder with images"
    $OutputFolderPath = Get-Folder -rootFolder $rootFolderForSelector -DialogBoxTitleMessage "Please select folder to save the converted files"
}

if($SourceFolderPath -and $OutputFolderPath)
{
    $filterFiles = "*.jp*g","*.png","*.bmp"

    # Start from an empty array so that += always appends, even when a filter matches a single file.
    $inputFiles = @()
    foreach($filterString in $filterFiles)
    {
        Write-Information -MessageData "Getting files of type $filterString" -InformationAction Continue
        $inputFiles += @(Get-ChildItem $SourceFolderPath -Filter $filterString)
    }
    $totalFiles = $inputFiles.Count
    $count = 0
    $inputFiles | ForEach-Object{
        $inputFileFullName = $_.FullName
        # Tesseract appends the .txt extension to the output base name itself.
        $outputfileName = Join-Path $OutputFolderPath "$($_.BaseName)"
        try {
            $count++
            $perc = [math]::Round((100*$count)/$totalFiles)
            Write-Progress -Activity "OCR conversion" -PercentComplete $perc -Status "$perc %" -CurrentOperation "Converting $inputFileFullName"
            # The paths are passed unquoted, so folder names must not contain spaces (see the note above).
            Start-Process tesseract.exe -ArgumentList $inputFileFullName,$outputfileName,"-l",$LanguageCode -NoNewWindow -Wait
        }
        catch {
            Write-Warning "Error while converting $inputFileFullName"
            Write-Warning $_
        }
    }
}
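
The file-gathering step of the script can also be sketched on its own. The function name Get-ImageFile is an assumption made for this illustration and does not appear in the script. The point it shows is that the result starts as an empty array, because using += on an uninitialized variable can fail once one filter returns a single file and a later filter returns more.

```powershell
# Hypothetical helper: collects the files in one folder that match
# any of the given wildcard filters.
function Get-ImageFile {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [string[]] $Filter = @('*.jp*g', '*.png', '*.bmp')
    )
    # Start with an empty array so += always appends.
    $files = @()
    foreach ($pattern in $Filter) {
        # @() keeps a single match (or no match) as an array.
        $files += @(Get-ChildItem -LiteralPath $Path -Filter $pattern -File)
    }
    return $files
}

# Example call (the folder is an assumption):
# @(Get-ImageFile -Path 'C:\Scans').Count
```

Wrapping the call in @() on the caller's side keeps .Count reliable when only one file is found.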
Screenshots:
1. Running from the command line with the source folder and destination folder as input
2. Running without inputs, choosing the folders through the Windows prompt

Tesseract OCR project page for more options and information:
https://github.com/tesseract-ocr/tesseract

Thursday, January 3, 2019

How to convert DjVu to text

I had some DjVu documents that contained actual text: I could open them in a DjVu viewer, select the text and paste it into text files. I wanted an easy way to convert all of the files into text documents, since I had more than 50 files to convert, each with over 500 pages.

I found a command-line tool that can extract the hidden text from DjVu files, and wrote a PowerShell script that walks through the directories and converts the files.

Please note that this method will not help if the DjVu file does not contain any text. To confirm that your DjVu file has extractable text, do the following.

Open the DjVu file in a DjVu viewer, click Edit -> Select, select an area containing text, then copy it and paste it into a Notepad file. If you see the text pasted, this tutorial should be helpful.

Software Requirements
1. djvulibre package
2. Windows powershell

You need the djvutxt.exe tool from the DjVuLibre package to extract text from DjVu documents.

You can download it from the following link: http://djvu.sourceforge.net/ (select the Windows download link).
Once the setup file is downloaded, double-click it and install it.

Once the installation is complete, add the installation directory to the PATH environment variable.
Refer to this link for the steps.
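
For the current PowerShell session only, the PATH can also be extended from the console. The sketch below is one possible way to do it. The helper name Add-PathEntry and the install location in the example call are assumptions, so check where the installer actually placed DjVuLibre on your machine.

```powershell
# Hypothetical helper: appends a directory to PATH for this session,
# unless it is already listed.
function Add-PathEntry {
    param([Parameter(Mandatory)] [string] $Directory)
    $separator = [System.IO.Path]::PathSeparator
    # Split PATH into its entries and drop empty ones.
    $entries = @($env:Path -split [regex]::Escape($separator) | Where-Object { $_ })
    if ($entries -notcontains $Directory) {
        $env:Path = ($entries + $Directory) -join $separator
    }
}

# Example call (assumed install location, adjust as needed):
# Add-PathEntry -Directory 'C:\Program Files (x86)\DjVuLibre'
# Get-Command djvutxt.exe   # succeeds once the directory is on PATH
```

The change made this way is lost when the console is closed. A permanent change still has to go through the Windows environment variable settings.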

Open Windows PowerShell by searching for it in Windows search.

Run the following command to check the execution policy (the PowerShell execution policy must allow running scripts on your computer):
Get-ExecutionPolicy
If the result is Unrestricted, you can continue.
If your policy is Restricted, run the following command to change it:
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
(Once you complete the task, you can set it back to whatever the Get-ExecutionPolicy command above returned.)

Save the code below in the directory of your choice with the .ps1 extension; that is, when you save the file, name it something like DJVU-BulkConvertor.ps1

ps1 is the PowerShell script extension. In simple terms, if you are unfamiliar with it, PowerShell is a technology from Microsoft that provides scripting and a command-line shell similar to DOS, but behind the scenes it is built with many more features and a strong design, and it has changed the way system admins work on Windows.


You need to change the first two lines of the script below: the input directory path and the output folder name.

The folder structure is as follows:
FolderName3 contains multiple folders, and each of those folders contains only DjVu files, with no further directories.
Output is written to C:\FolderName1\FolderName2\MyOutput, where the script creates a folder for each folder found inside FolderName3,
that is,
C:\FolderName1\FolderName2\MyOutput\(name of a folder inside FolderName3) and so on
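
The path mapping described above can be written out as a small function. This is only an illustration of how the script derives its paths. The function name Get-DjvuOutputPath and the names Volume1 and book in the example are hypothetical.

```powershell
# Hypothetical helper: computes where the text for one DjVu file ends up,
# following the same Split-Path / Join-Path steps as the script.
function Get-DjvuOutputPath {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $OutputFolderName,
        [Parameter(Mandatory)] [string] $SubfolderName,
        [Parameter(Mandatory)] [string] $FileBaseName
    )
    # The output folder is a sibling of the source folder.
    $outputDir = Join-Path (Split-Path -Parent $Source) $OutputFolderName
    # Each subfolder of the source gets a folder of the same name.
    $subfolderOutput = Join-Path $outputDir $SubfolderName
    return Join-Path $subfolderOutput "$FileBaseName.txt"
}

# Example on Windows (Volume1 and book are made-up names):
# Get-DjvuOutputPath -Source 'C:\FolderName1\FolderName2\FolderName3' `
#     -OutputFolderName 'MyOutput' -SubfolderName 'Volume1' -FileBaseName 'book'
# -> C:\FolderName1\FolderName2\MyOutput\Volume1\book.txt
```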



$source = "C:\FolderName1\FolderName2\FolderName3"
$OutputFolderName = "MyOutput"
# The output folder is created next to the source folder.
$outputDir = Split-Path -Parent $source | ForEach-Object{Join-Path $_ $OutputFolderName}
$directories = Get-ChildItem -Path $source -Directory
foreach($inputDirectory in $directories)
{
    $djvuFiles = Get-ChildItem $inputDirectory.FullName -Filter *.djv*
    $outputPath = Join-Path $outputDir $inputDirectory.Name
    # Create the matching output folder if it does not exist yet.
    if(-not(Test-Path $outputPath))
    {
        New-Item $outputPath -ItemType Directory | Out-Null
    }
    $djvuFiles | ForEach-Object{
        $inputFileFullName = $_.FullName
        $outputfileName = Join-Path $outputPath "$($_.BaseName).txt"
        try {
            Write-Information -MessageData "Converting file $inputFileFullName" -InformationAction Continue
            djvutxt.exe $inputFileFullName $outputfileName
        }
        catch {
            Write-Warning "Error while converting $inputFileFullName"
            Write-Warning $_
        }
    }
}
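
One caveat about the try/catch in the script: by default PowerShell does not raise an error when an external program such as djvutxt.exe exits with a non-zero code, so the catch block mainly covers cases like the program not being found. A hedged sketch of an exit-code check follows. The wrapper name Invoke-NativeChecked is an assumption, and it presumes the tool reports failure through its exit code.

```powershell
# Hypothetical wrapper: runs an external program and throws when it
# reports failure through a non-zero exit code.
function Invoke-NativeChecked {
    param(
        [Parameter(Mandatory)] [string] $FilePath,
        [string[]] $ArgumentList = @()
    )
    # Run the program; its output passes through unchanged.
    & $FilePath @ArgumentList
    if ($LASTEXITCODE -ne 0) {
        throw "$FilePath exited with code $LASTEXITCODE"
    }
}

# Possible use inside the try block of the script:
# Invoke-NativeChecked -FilePath djvutxt.exe -ArgumentList $inputFileFullName, $outputfileName
```

With a wrapper like this, a failed conversion would reach the existing catch block and produce the warning.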
 
Please feel free to reach me by email if you have any questions.
 
You may need OCR if your DjVu files contain only scanned images; Tesseract OCR can be used from the command line for that purpose.