Java Portal Experiment: July 2014

Recently I was given a task in which I had to read text (English and Kannada) from scanned documents (images and PDFs). I searched for a Java API that implements OCR and finally found Tess4J.

After a lot of R&D (to support both 32-bit and 64-bit systems), I implemented a solution.
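One part of supporting both 32-bit and 64-bit systems is loading native libraries that match the bitness of the JVM, which is not necessarily that of the operating system. The snippet below is only a rough sketch of such a check, not the setup used for this post. The class name NativeArch and the folder names win32-x86 and win32-x86-64 are assumptions (they follow the naming that JNA uses for Windows libraries).

```java
public class NativeArch {

    /** Returns true when the given os.arch value denotes a 64-bit JVM. */
    public static boolean is64Bit(String osArch) {
        // Typical values: "x86" or "i386" (32-bit), "amd64" or "x86_64" (64-bit).
        return osArch != null && osArch.contains("64");
    }

    /** Hypothetical layout: one sub-folder of native libraries per architecture. */
    public static String nativeFolder(String osArch) {
        return is64Bit(osArch) ? "win32-x86-64" : "win32-x86";
    }

    public static void main(String[] args) {
        // os.arch describes the running JVM, so a 32-bit JVM on 64-bit Windows reports "x86".
        String arch = System.getProperty("os.arch");
        System.out.println("JVM architecture: " + arch);
        System.out.println("Native libraries expected in: " + nativeFolder(arch));
    }
}
```

The chosen folder could then be handed to JNA, for example through the jna.library.path system property, before the first OCR call is made.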

I am sharing the code here.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.vietocr.PdfUtilities;

public class OCRKannad {

    /**
     * @param args args[0] is the path of the scanned PDF,
     *             args[1] is the path of the UTF-8 text file to write
     */
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java OCRKannad <input.pdf> <output.txt>");
            return;
        }
        String inputPath = args[0];
        String outputPath = args[1];
        System.out.println("---->" + inputPath);

        Tesseract tesseract = Tesseract.getInstance();
        // Recognise English and Kannada text together.
        tesseract.setLanguage("eng+kan");

        // Each page of the PDF is rendered to a temporary PNG file.
        File[] pageImages = PdfUtilities.convertPdf2Png(new File(inputPath));

        // Open the output once so that every page is written, not only the last one.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(outputPath), "UTF-8"))) {
            for (File pageImage : pageImages) {
                try {
                    String text = tesseract.doOCR(pageImage);
                    out.write(text);
                    System.out.println(text);
                } catch (TesseractException e) {
                    // Report the failed page and continue with the next one.
                    System.err.println(e.getMessage());
                }
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException and UnsupportedEncodingException.
            e.printStackTrace();
        } finally {
            // Remove the temporary page images.
            for (File pageImage : pageImages) {
                pageImage.delete();
            }
        }
    }
}
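The output writer in the code above is opened with an explicit "UTF-8" charset. That matters here because Kannada characters cannot be represented in single-byte encodings such as ISO-8859-1. As a small standalone illustration (the class name EncodingCheck and the sample string are mine, not part of the OCR program), the following encodes a Kannada word and decodes it again with two different charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    /** Encodes the text with the given charset and decodes it again. */
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The word "Kannada" written in Kannada script, followed by English text.
        String sample = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1 text";

        // UTF-8 can encode every character, so the text survives unchanged.
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.UTF_8)));      // true

        // ISO-8859-1 has no Kannada characters; each one is replaced by '?'.
        System.out.println(sample.equals(roundTrip(sample, StandardCharsets.ISO_8859_1))); // false
    }
}
```

If the recognised text shows up as question marks in the output file, the encoding used when writing (or when viewing the file) is the first thing worth checking.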

The API is available at http://sourceforge.net/projects/tess4j

For the complete set of dependencies and DLLs, please write to me at saurabh.ranu@hotmail.com

I hope it'll be helpful. Please comment if you have any issues.

Friday, July 11, 2014

OCR with Tess4J (wrapper for Tesseract OCR API) - Reading Text (English and Kannada) from Scanned Image and PDF