Friday, July 11, 2014

OCR with Tess4J (wrapper for Tesseract OCR API) - Reading Text (English and Kannada) from Scanned Images and PDFs

Recently I got a task in which I had to read text (English and Kannada) from scanned documents (images and PDFs). I searched for a Java API that implements OCR and finally found Tess4J.
After a lot of R&D (to support both 32-bit and 64-bit systems), I implemented a solution.
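
The native Tesseract libraries have to match the bitness of the JVM that loads them, so one way to support both 32-bit and 64-bit systems is to detect the JVM architecture at startup. The following is only a minimal sketch of that idea, not necessarily how the solution in this post does it. The class name NativeArch and the folder names "win32-x86" and "win32-x86-64" are assumptions made for illustration, and the "sun.arch.data.model" property is only set by some JVMs, so the sketch falls back to "os.arch".

```java
final class NativeArch {

    /**
     * Returns true when the given property values describe a 64-bit JVM.
     * dataModel is the value of "sun.arch.data.model" (may be null or "unknown"),
     * osArch is the value of "os.arch" (for example "amd64" or "x86").
     */
    static boolean is64Bit(String dataModel, String osArch) {
        if ("64".equals(dataModel)) {
            return true;
        }
        if ("32".equals(dataModel)) {
            return false;
        }
        // Property missing or unrecognized: fall back to the architecture name.
        return osArch != null && osArch.contains("64");
    }

    /** Hypothetical folder layout: one folder of native libraries per architecture. */
    static String nativeFolder(boolean is64Bit) {
        return is64Bit ? "win32-x86-64" : "win32-x86";
    }

    public static void main(String[] args) {
        boolean is64 = is64Bit(System.getProperty("sun.arch.data.model"),
                               System.getProperty("os.arch"));
        System.out.println("Load native libraries from: " + nativeFolder(is64));
    }
}
```

Note that both properties describe the JVM, not the operating system: a 32-bit JVM on 64-bit Windows still needs the 32-bit DLLs.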

I am sharing the code here.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.vietocr.PdfUtilities;


public class OCRKannad {

       /**
        * @param args args[0] = path of the scanned PDF,
        *             args[1] = path of the UTF-8 text file to write
        */
       public static void main(String[] args) {
              if (args.length < 2) {
                     System.err.println("Usage: java OCRKannad <input.pdf> <output.txt>");
                     return;
              }
              String pdfPath = args[0];
              String textFilePath = args[1];

              Tesseract tesseract = Tesseract.getInstance();
              // Recognize English and Kannada; needs eng.traineddata and
              // kan.traineddata in the tessdata folder
              tesseract.setLanguage("eng+kan");

              // Each page of the PDF is rendered to a temporary PNG file
              File[] pageImages = PdfUtilities.convertPdf2Png(new File(pdfPath));

              // Open the output file once, so that the text of a later page
              // does not overwrite the text of an earlier page
              try (Writer out = new BufferedWriter(new OutputStreamWriter(
                            new FileOutputStream(textFilePath), "UTF-8"))) {
                     for (File pageImage : pageImages) {
                            try {
                                   String text = tesseract.doOCR(pageImage);
                                   out.write(text);
                                   System.out.println(text);
                            } catch (TesseractException e) {
                                   // Report the failed page and continue with the next one
                                   System.err.println(e.getMessage());
                            }
                     }
              } catch (IOException e) {
                     e.printStackTrace();
              } finally {
                     // Remove the temporary page images
                     for (File pageImage : pageImages) {
                            pageImage.delete();
                     }
              }
       }

}
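
The program above expects a PDF as input. For a scanned image the PDF-to-PNG conversion is not needed, because the image file can be passed to doOCR directly. Here is a minimal sketch of how the input type could be chosen from the file extension. The class name InputKind and the list of image extensions are assumptions for illustration; which image formats actually work depends on the imaging libraries installed alongside Tess4J.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class InputKind {

    // Assumed list of image extensions; adjust to the formats your setup can read.
    static final List<String> IMAGE_EXTENSIONS =
            Arrays.asList("png", "jpg", "jpeg", "tif", "tiff", "bmp", "gif");

    /** Returns the lower-case extension of the file name, or "" if there is none. */
    static String extensionOf(String path) {
        String name = new File(path).getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    static boolean isPdf(String path) {
        return extensionOf(path).equals("pdf");
    }

    static boolean isImage(String path) {
        return IMAGE_EXTENSIONS.contains(extensionOf(path));
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : "scan.PDF";
        if (isPdf(path)) {
            // Here the pages would be converted to PNG and each page passed to doOCR.
            System.out.println(path + ": PDF, convert pages to images first");
        } else if (isImage(path)) {
            // Here the file itself would be passed to doOCR.
            System.out.println(path + ": image, OCR the file directly");
        } else {
            System.out.println(path + ": unsupported input type");
        }
    }
}
```

Checking the extension is only a convenience; a file with the wrong extension will still fail later, so the exception handling around doOCR is still needed.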

The API is available at http://sourceforge.net/projects/tess4j

For the complete set of dependencies and DLLs, please write to me at saurabh.ranu@hotmail.com

Hope it'll be helpful. Please comment if you have any issues.