Recently I got a task in which I had to read text (English and Kannada) from scanned documents (images and PDFs). I searched for a Java API implementing OCR and finally found Tess4J.
After a lot of R&D (to support both 32-bit and 64-bit systems), I implemented a solution.
I am sharing the code here.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.vietocr.PdfUtilities;

public class OCRKannad {

    /**
     * @param args args[0] is the scanned PDF to read, args[1] is the text file to write
     */
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: OCRKannad <input.pdf> <output.txt>");
            return;
        }
        String pdfPath = args[0];
        String textFilePath = args[1];
        System.out.println("---->" + pdfPath);

        Tesseract localTesseract = Tesseract.getInstance();
        // Recognize English and Kannada together
        localTesseract.setLanguage("eng+kan");

        // Each PDF page is converted to a temporary PNG file
        File[] pageImages = PdfUtilities.convertPdf2Png(new File(pdfPath));

        Writer out = null;
        try {
            // Open the output file once, so later pages do not overwrite earlier ones
            out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(textFilePath), "UTF-8"));
            for (File pageImage : pageImages) {
                try {
                    String pageText = localTesseract.doOCR(pageImage);
                    out.write(pageText);
                    System.out.println(pageText);
                } catch (TesseractException localTesseractException) {
                    // Report the failed page and continue with the next one
                    System.err.println(localTesseractException.getMessage());
                }
            }
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            System.out.println("Exception ex " + e);
        } finally {
            if (out != null) {
                try {
                    out.close();
                } catch (IOException e) {
                    System.err.println(e.getMessage());
                }
            }
            // Remove the temporary page images
            for (File pageImage : pageImages) {
                pageImage.delete();
            }
        }
    }
}
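
The program above takes a PDF and splits it into page images before running OCR. A scanned image does not need that step, since doOCR accepts an image file directly. Below is a rough sketch of how the input type could be checked first. The class ScanInput and its method names are hypothetical (they are not part of Tess4J), the check trusts the file extension only, and the list of image extensions is an assumption that should be adjusted to whatever your setup can actually read.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tess4J: decides from the file name
// whether the PDF-to-PNG conversion step is needed before OCR.
final class ScanInput {

    // Assumed set of image extensions; adjust to what your installation supports.
    private static final String[] IMAGE_EXTENSIONS = {
        ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif"
    };

    private ScanInput() {
    }

    // True when the name ends in ".pdf" (case-insensitive), i.e. the file
    // has to be converted to page images first.
    static boolean needsPdfConversion(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // True when the name ends in one of the assumed image extensions,
    // i.e. the file can be handed to OCR as it is.
    static boolean isImage(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String extension : IMAGE_EXTENSIONS) {
            if (lower.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Print the decision for every file name given on the command line.
        for (String name : args) {
            if (needsPdfConversion(name)) {
                System.out.println(name + ": convert to page images, then OCR each page");
            } else if (isImage(name)) {
                System.out.println(name + ": OCR directly");
            } else {
                System.out.println(name + ": unsupported input");
            }
        }
    }
}
```

In the OCR program, the PDF branch would call PdfUtilities.convertPdf2Png as before, while the image branch would pass new File(name) straight to doOCR and must not delete the file afterwards, because it is the user's original and not a temporary page image.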
The API is available at http://sourceforge.net/projects/tess4j
For the complete dependencies and DLLs, please write to me at saurabh.ranu@hotmail.com
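
On the 32-bit versus 64-bit question: the native DLLs have to match the bitness of the JVM, not of Windows itself, so a 32-bit JVM on a 64-bit machine still needs the 32-bit DLLs. Here is a small sketch of how that check might look. The class NativeLibFolder and the folder names dll32 and dll64 are placeholders I made up for illustration. The property sun.arch.data.model is set by Oracle/OpenJDK HotSpot JVMs and may be missing on other JVMs, so the sketch falls back to os.arch.

```java
// Hypothetical helper: picks the folder holding the native DLLs that
// match the running JVM. Folder names are placeholders, not Tess4J names.
final class NativeLibFolder {

    private NativeLibFolder() {
    }

    // Decides bitness from the two property values. The data model value
    // ("32" or "64") wins when present; otherwise the architecture name
    // is checked for "64" (as in "amd64" or "x86_64").
    static boolean is64Bit(String dataModel, String osArch) {
        if (dataModel != null && !dataModel.trim().isEmpty()) {
            return "64".equals(dataModel.trim());
        }
        return osArch != null && osArch.contains("64");
    }

    // Maps the bitness to a placeholder folder name.
    static String folderFor(boolean is64Bit) {
        return is64Bit ? "dll64" : "dll32";
    }

    public static void main(String[] args) {
        boolean is64 = is64Bit(System.getProperty("sun.arch.data.model"),
                               System.getProperty("os.arch"));
        System.out.println("Load native libraries from: " + folderFor(is64));
    }
}
```

The chosen folder then has to be visible to the JVM when the native libraries are loaded, for example by putting it on the PATH or passing it with -Djava.library.path, depending on how your version of the library loads its DLLs.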
I hope it will be helpful. Please comment if you have any issues.