Thursday, January 3, 2019

How to convert djvu to text

I had some djvu documents that contained actual text: I could open them in a djvu viewer, select the text, and paste it into a text file. I wanted an easy way to convert all of the files into text documents, since I had more than 50 files to convert and each of them had over 500 pages.

I found a command-line tool that extracts the hidden text from djvu files, and I wrote a PowerShell script that walks through the directories and converts the files.

Please note that this method will not help if the djvu file does not contain any text. To confirm that your djvu file has extractable text, do the following.

Open the djvu file in a djvu viewer and click Edit -> Select. Select an area containing text, copy it, and paste it into a Notepad file. If the pasted text appears, this tutorial should be helpful.

Software Requirements
1. djvulibre package
2. Windows powershell

You need the djvutxt.exe tool from the djvulibre package to extract text from djvu documents.

You can download it from the following link: http://djvu.sourceforge.net/ (select the Windows download link).
Once the setup file is downloaded, double-click it and complete the installation.

Once the installation is complete, add the installation directory to the PATH environment variable.
 Refer to this link for the steps.

Open Windows PowerShell by searching for it in Windows search.
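
Before going further, it can help to confirm that PowerShell can actually find djvutxt.exe and to try it on one file. The snippet below is only a sketch: the install directory and the sample file path are hypothetical placeholders, and the PATH change it makes lasts only for the current PowerShell window. Without an output file name, djvutxt prints the text to the console.

```powershell
# Hypothetical locations; replace both with the real paths on your machine.
$djvuDir = "C:\Program Files (x86)\DjVuLibre"
$sample  = "C:\FolderName1\FolderName2\FolderName3\SubFolder\sample.djvu"

# Add the install directory to PATH for this session only, if it is not there yet.
if (($env:Path -split ';') -notcontains $djvuDir) {
    $env:Path = "$env:Path;$djvuDir"
}

if (-not (Get-Command djvutxt.exe -ErrorAction SilentlyContinue)) {
    Write-Warning "djvutxt.exe was not found; check the installation directory and PATH."
}
elseif (Test-Path $sample) {
    # Print the hidden text of page 1 only, as a quick check.
    djvutxt.exe '-page=1' $sample
}
```

If text from the first page appears in the console, the tool is installed correctly and the file has a text layer.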

Run the following command to check the execution policy (the PowerShell execution policy must allow scripts to run on your computer):
Get-ExecutionPolicy
If the result is Unrestricted, you can continue.
If the policy is Restricted, you can run the following command to change it (this may require opening PowerShell as administrator):
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
(Once you have finished the task, you can set the policy back to whatever Get-ExecutionPolicy returned above.)
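
If you would rather not change the policy for the whole computer, a narrower option is to change it for the current PowerShell window only. This is a sketch under two assumptions: no Group Policy setting overrides the process scope, and the script is saved locally (which is why RemoteSigned is used here in place of Unrestricted).

```powershell
# Note the current effective policy for reference.
$previousPolicy = Get-ExecutionPolicy
"Policy before the change: $previousPolicy"

# Allow locally saved scripts in this PowerShell window only.
# The setting is discarded when the window is closed, so nothing has to be restored.
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process -Force

# Show the policy at every scope to confirm the change.
Get-ExecutionPolicy -List
```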

Save the code below in the directory of your choice with a .ps1 extension; that is, when you save the file, name it something like DJVU-BulkConvertor.ps1

ps1 is the PowerShell script extension. In simple terms, if you are unfamiliar with it, PowerShell is a scripting language and command-line shell from Microsoft. It looks similar to DOS on the surface, but behind it there are far more features and a stronger design, and it has changed the way system admins work on Windows.


 In the script below, change the first two lines so that the source path and the output folder name match your own folders.

The folder structure looks like this:
FolderName3 -> contains multiple folders; each of these folders contains only djvu files and no further directories.
The output is written under C:\FolderName1\FolderName2\MyOutput, where the script creates one folder for each folder found inside FolderName3,
that is,
C:\FolderName1\FolderName2\MyOutput\(name of a folder inside FolderName3) and so on



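# Folder that contains one sub-folder per group of djvu files
$source = "C:\FolderName1\FolderName2\FolderName3"
# Name of the output folder, created next to $source
$OutputFolderName = "MyOutput"
$outputDir = Join-Path (Split-Path -Parent $source) $OutputFolderName
$directories = Get-ChildItem -Path $source -Directory
foreach ($inputDirectory in $directories)
{
    # Matches both .djvu and .djv files
    $djvuFiles = Get-ChildItem $inputDirectory.FullName -Filter *.djv*
    $outputPath = Join-Path $outputDir $inputDirectory.Name
    if (-not (Test-Path $outputPath))
    {
        # Create the output folder for this input folder
        New-Item $outputPath -ItemType Directory | Out-Null
    }
    $djvuFiles | ForEach-Object {
        $inputFileFullName = $_.FullName
        $outputFileName = Join-Path $outputPath "$($_.BaseName).txt"
        try {
            Write-Information -MessageData "Converting file $inputFileFullName" -InformationAction Continue
            djvutxt.exe $inputFileFullName $outputFileName
            # A failing external program does not raise an exception, so check its exit code
            if ($LASTEXITCODE -ne 0) {
                Write-Warning "djvutxt exited with code $LASTEXITCODE for $inputFileFullName"
            }
        }
        catch {
            # Reached when djvutxt.exe cannot be started, for example when it is not on the PATH
            Write-Warning "Error while converting $inputFileFullName"
            Write-Warning $_
        }
    }
}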
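
After the script above has run, it may be worth checking which output files came out empty or nearly empty, since those probably came from djvu files without a text layer. The helper below is a hypothetical sketch, not part of djvulibre: the function name and the 100-byte cut-off are arbitrary choices, and it assumes that a djvu file without hidden text yields a very small .txt file.

```powershell
# Hypothetical helper: list .txt files smaller than a given size under a folder.
function Get-SmallTextFile {
    param(
        [Parameter(Mandatory)] [string] $Path,
        # Arbitrary cut-off in bytes; adjust it to your documents.
        [int] $MaxBytes = 100
    )
    Get-ChildItem -Path $Path -Recurse -Filter *.txt |
        Where-Object { $_.Length -lt $MaxBytes }
}
```

For the folder layout used in this post, it could be called as Get-SmallTextFile -Path "C:\FolderName1\FolderName2\MyOutput" | Select-Object FullName, Length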
 
Please feel free to reach me by email if you have any questions.
 
If your djvu files contain only scanned images, you will need OCR instead; Tesseract OCR can be used from the command line for that purpose.
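
As a rough sketch of the OCR route for a single page: render the page to an image with ddjvu (another tool from the djvulibre package) and then run tesseract on the image. This assumes that ddjvu.exe and tesseract.exe are both installed and on the PATH, and that the ddjvu build supports TIFF output; all file names below are hypothetical.

```powershell
# Hypothetical file names; replace them with your own.
$inputFile = "C:\FolderName1\scanned.djvu"
$pageImage = "C:\FolderName1\scanned-page1.tif"
$ocrBase   = "C:\FolderName1\scanned-page1"   # tesseract appends .txt to this name

# Check that both tools can be found before trying to use them.
$missing = @('ddjvu.exe', 'tesseract.exe') |
    Where-Object { -not (Get-Command $_ -ErrorAction SilentlyContinue) }

if ($missing) {
    Write-Warning "Not found on PATH: $($missing -join ', ')"
}
else {
    # Render page 1 of the djvu file to a TIFF image.
    ddjvu.exe '-format=tiff' '-page=1' $inputFile $pageImage
    # Run OCR on the image; the recognized text goes to scanned-page1.txt.
    tesseract.exe $pageImage $ocrBase
}
```

Processing a whole document would mean repeating these two steps for every page, which is much slower than extracting existing hidden text with djvutxt.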
