Notes on Training Tesseract-OCR for Text Recognition

Date: 2016-11-14 21:40 | Source: 清屏网 | Author: 那一抹忧伤

Tesseract official documentation page

https://github.com/tesseract-ocr/tesseract

jTessBoxEditor official documentation page

http://vietocr.sourceforge.net/training.html

[root@docker01 tesseract]# tesseract --list-langs
 
List of available languages (2):
eng

Only English is available.

Directory containing the language packs

[root@docker01 tessdata]# pwd
/usr/share/tesseract/tessdata
[root@docker01 tessdata]# ll
总用量 37624
drwxr-xr-x 2 root root 4096 10月 25 22:51 configs
-rw-r--r-- 1 root root 171918 6月 25 2015 eng.cube.bigrams
-rw-r--r-- 1 root root 38 6月 25 2015 eng.cube.fold
-rw-r--r-- 1 root root 181 6月 25 2015 eng.cube.lm
-rw-r--r-- 1 root root 857304 6月 25 2015 eng.cube.nn
-rw-r--r-- 1 root root 254 6月 25 2015 eng.cube.params
-rw-r--r-- 1 root root 13020078 6月 25 2015 eng.cube.size
-rw-r--r-- 1 root root 2444187 6月 25 2015 eng.cube.word-freq
-rw-r--r-- 1 root root 996 6月 25 2015 eng.tesseract_cube.nn
-rw-r--r-- 1 root root 21876550 6月 25 2015 eng.traineddata
-rw-r--r-- 1 root root 124215 10月 25 23:08 normal.traineddata
-rw-r--r-- 1 root root 568 1月 26 2016 pdf.ttf
drwxr-xr-x 2 root root 92 10月 25 22:51 tessconfigs

To add more language packs later, download them and place them in this directory.
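Since a language pack is just a `<lang>.traineddata` file in this directory, a quick check can be scripted. The following is a minimal sketch; the helper names `list_langs_in_dir` and `has_lang` are made up for illustration and are not part of Tesseract.

```shell
# list_langs_in_dir DIR: print one language name per *.traineddata file in DIR.
list_langs_in_dir() {
    for f in "$1"/*.traineddata; do
        [ -e "$f" ] || continue      # skip the unmatched glob when DIR has none
        basename "$f" .traineddata
    done
}

# has_lang DIR LANG: succeed only if DIR/LANG.traineddata exists.
has_lang() {
    [ -f "$1/$2.traineddata" ]
}

# Example (directory taken from the listing above):
# has_lang /usr/share/tesseract/tessdata eng && echo "eng is installed"
```

`tesseract --list-langs` remains the authoritative check; the helpers only look at file names.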

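Installation notes for tesseract on pkgs.org, together with a list of the files the package installs

https://pkgs.org/centos-7/epel-x86_64/tesseract-3.04.00-3.el7.x86_64.rpm.html

Installing jTessBoxEditor

jTessBoxEditor requires JRE 7 (Java Runtime Environment) or later.

After installing the JRE, download jTessBoxEditor, unzip it, and run train.bat to start it.

Screenshot of the interface after launch

Both required tools are now installed.

Initial recognition

Prepare a few images

Upload the images to the machine where tesseract is installed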

[root@docker01 test01]# ll
总用量 24
-rw-r--r-- 1 root root 1829 10月 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 10月 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 10月 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 10月 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 10月 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 10月 24 16:06 5.gif

Recognize 0.gif

[root@docker01 test01]# tesseract 0.gif out.0 -l eng
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory

A new file, out.0.txt, now appears in the directory

[root@docker01 test01]# ll
总用量 28
-rw-r--r-- 1 root root 1829 10月 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 10月 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 10月 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 10月 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 10月 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 10月 24 16:06 5.gif
-rw-r--r-- 1 root root 6 10月 26 00:52 out.0.txt

View the recognized content

[root@docker01 test01]# cat out.0.txt
[54v

This differs slightly from the I54v shown in the image.

Batch-recognize the remaining images

[root@docker01 test01]# for i in {1..5};do tesseract $i.gif out.$i -l eng;done 
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
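The loop above hard-codes the range 1 to 5. A variant that picks up whatever .gif files are present could look like the sketch below. It assumes the same `out.N` naming used in this session; `out_base` and `ocr_all` are hypothetical helper names.

```shell
# out_base FILE: map "3.gif" to "out.3", the output base name used above.
out_base() {
    name=$(basename "$1")
    printf 'out.%s\n' "${name%.*}"
}

# ocr_all DIR LANG: run tesseract on every .gif in DIR.
# Each result is written to DIR/out.N.txt (tesseract appends ".txt" itself).
ocr_all() {
    command -v tesseract >/dev/null 2>&1 || { echo "tesseract not found" >&2; return 1; }
    for img in "$1"/*.gif; do
        [ -e "$img" ] || continue    # no .gif files: nothing to do
        tesseract "$img" "$1/$(out_base "$img")" -l "$2"
    done
}

# Example: ocr_all . eng
```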

View the recognized content

[root@docker01 test01]# ll
总用量 48
-rw-r--r-- 1 root root 1829 10月 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 10月 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 10月 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 10月 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 10月 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 10月 24 16:06 5.gif
-rw-r--r-- 1 root root 6 10月 26 00:52 out.0.txt
-rw-r--r-- 1 root root 9 10月 26 01:00 out.1.txt
-rw-r--r-- 1 root root 5 10月 26 01:00 out.2.txt
-rw-r--r-- 1 root root 6 10月 26 01:00 out.3.txt
-rw-r--r-- 1 root root 7 10月 26 01:00 out.4.txt
-rw-r--r-- 1 root root 5 10月 26 01:00 out.5.txt
[root@docker01 test01]# cat *.txt
[54v
ikhb‘
ymm
7y28
nl 9c
mzb

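Checked against the images above, only 3.gif was recognized correctly.

Training

Merging the images

Back on Windows, run jTessBoxEditor and merge all the images into a single .tif file

Open all the images to be merged

Name the merged image

Note: the file name is supposed to follow this format

tif naming convention: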

[lang].[fontname].exp[num].tif

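Here lang is the language name, fontname is the font name, and num is a sequence number that can be chosen freely.

However, I tried other names, and naming the file directly worked as well.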
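For anyone who prefers to stick to the convention, the name can be built or checked mechanically. This is a small sketch; `tif_name` and `follows_convention` are illustrative helpers, and treating num as digits only is an assumption.

```shell
# tif_name LANG FONT NUM: build "[lang].[fontname].exp[num].tif".
tif_name() {
    printf '%s.%s.exp%s.tif\n' "$1" "$2" "$3"
}

# follows_convention FILE: succeed if FILE looks like lang.fontname.expN.tif.
# Assumption: lang and fontname contain no dots, and N is made of digits.
follows_convention() {
    printf '%s\n' "$1" | grep -Eq '^[^.]+\.[^.]+\.exp[0-9]+\.tif$'
}

# Example: tif_name eng myfont 0    prints eng.myfont.exp0.tif
```

A name such as mytest.tif, used in the next step, does not match the pattern, which is consistent with the observation that the convention is not strictly enforced for this workflow.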

A message confirms that the file was created, and mytest.tif appears in the image directory

Generating the box file

Upload mytest.tif to the CentOS 7 system

[root@docker01 04test]# ll
 
总用量 100
-rw-r--r-- 1 root root 99212 10月 26 15:42 mytest.tif

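Open a command line in the directory containing mytest.tif and generate the corresponding box file (*.box).

The box file records every character tesseract recognized, together with its position coordinates.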

[root@docker01 04test]# tesseract mytest.tif mytest batch.nochop makebox
 
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Page 2
Page 3
Empty page!!
Empty page!!
Empty page!!
Page 4
Page 5
Page 6
Page 7
Empty page!!
Empty page!!
Empty page!!
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Page 14
Page 15
Page 16
Page 17
Empty page!!
Empty page!!
Empty page!!
Page 18
Page 19
Page 20
Page 21
Empty page!!
Empty page!!
Empty page!!
Warning in pixReadMemTiff: tiff page 21 not found

The directory now contains two new files, mytest.box and mytest.txt

[root@docker01 04test]# ll
 
总用量 108
-rw-r--r-- 1 root root 1005 10月 26 23:52 mytest.box
-rw-r--r-- 1 root root 99212 10月 26 15:42 mytest.tif
-rw-r--r-- 1 root root 101 10月 26 23:52 mytest.txt
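Before opening the box file in an editor, it can help to see what it contains. In Tesseract 3.x each line of a box file has the form `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image and pages counted from 0. The sketch below (the function name `box_text` is made up) prints the characters found on each page, so pages that came out empty stand out.

```shell
# box_text FILE: print "page: characters" for every page up to the last one
# that has at least one box; pages without boxes print an empty string.
box_text() {
    awk '{ text[$6] = text[$6] $1; if ($6 > max) max = $6 }
         END { for (p = 0; p <= max; p++) print p ": " text[p] }' "$1"
}

# Example: box_text mytest.box
```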

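Correcting the text

Download mytest.box to the Windows machine and place it in the same directory as mytest.tif.

Use jTessBoxEditor to start correcting the text

Cases you may run into while correcting

Normal case

The first recognized value is 6, but the character in the image is e, so edit it by hand

After editing, press Enter, then click Save

Then move on to correct the next image

If the recognized text matches the image, simply continue to the next image

Empty table

Some images may not be recognized at all because of their background color; skip them and continue with the next image.

Partially recognized

For example, in the image below, four characters were split into only two boxes

In this case, use the box-splitting feature and adjust the box positions

The boxes after adjustment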

Run Tesseract for Training

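Generate the character feature file (*.tr)

Delete the original box file on the CentOS 7 system, then transfer the corrected box file back. Note that the commands from here on operate on a training set named 200test rather than mytest.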

[root@docker01 03test]# rm 200test.box
 
rm：是否删除普通文件 "200test.box"？y
[root@docker01 03test]# rz -by
 
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring 200test.box...
100% 9 KB 9 KB/sec 00:00:01 0 Errors

[root@docker01 03test]# tesseract 200test.tif 200test nobatch box.train

The directory now contains a new .tr file

[root@docker01 03test]# ll
 
总用量 1756
-rw-r--r-- 1 root root 10210 10月 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 10月 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 10月 27 00:58 200test.tr
-rw-r--r-- 1 root root 325 10月 27 00:58 200test.txt

Compute the Character Set

Generate the character set (unicharset)

[root@docker01 03test]# unicharset_extractor 200test.box
 
Extracting unicharset from 200test.box
Wrote unicharset file ./unicharset.

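Define the font properties file and cluster the character features

Create a file named "font_properties" in the directory and enter the text shown below.

Note: the name 200test here must match the font name used for training. Each line has the form <fontname> <italic> <bold> <fixed> <serif> <fraktur>; all flags are 0 here, meaning the font is not italic, not bold, and so on.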

200test 0 0 0 0 0

[root@docker01 03test]# ll
总用量 1764
-rw-r--r-- 1 root root 10210 10月 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 10月 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 10月 27 00:58 200test.tr
-rw-r--r-- 1 root root 325 10月 27 00:58 200test.txt
-rw-r--r-- 1 root root 18 10月 27 01:02 font_properties
-rw-r--r-- 1 root root 2301 10月 27 01:00 unicharset

[root@docker01 03test]# cat font_properties

200test 0 0 0 0 0
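The file is small enough to type by hand, but generating it avoids mistakes in the number or order of flags. This is a minimal sketch with a hypothetical helper name, `font_properties_line`; it only checks that each flag is 0 or 1.

```shell
# font_properties_line FONT ITALIC BOLD FIXED SERIF FRAKTUR: print one entry.
font_properties_line() {
    for flag in "$2" "$3" "$4" "$5" "$6"; do
        case "$flag" in
            0|1) ;;
            *) echo "flags must be 0 or 1" >&2; return 1 ;;
        esac
    done
    printf '%s %s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example, reproducing the file shown above:
# font_properties_line 200test 0 0 0 0 0 > font_properties
```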

Run the command:

[root@docker01 03test]# mftraining -F font_properties -U unicharset 200test.tr
Warning: No shape table file present: shapetable
Reading 200test.tr ...
Flat shape table summary: Number of shapes = 43 max unichars = 1 number with multiple unichars = 0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
Done!

Then run:

[root@docker01 03test]# cntraining 200test.tr
Reading 200test.tr ...
Clustering ...
Writing normproto ...

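At this point several files have been generated in the directory. Add the prefix "test200." to unicharset, inttemp, normproto and pffmtable (shapetable was renamed the same way here), then combine the training files. The prefix becomes the name of the language pack.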

[root@docker01 03test]# ll
总用量 2100
-rw-r--r-- 1 root root 10210 10月 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 10月 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 10月 27 00:58 200test.tr
-rw-r--r-- 1 root root 325 10月 27 00:58 200test.txt
-rw-r--r-- 1 root root 18 10月 27 01:02 font_properties
-rw-r--r-- 1 root root 323869 10月 27 01:03 inttemp
-rw-r--r-- 1 root root 5342 10月 27 01:04 normproto
-rw-r--r-- 1 root root 341 10月 27 01:03 pffmtable
-rw-r--r-- 1 root root 778 10月 27 01:03 shapetable
-rw-r--r-- 1 root root 2301 10月 27 01:00 unicharset

Rename the files, then combine the training files

[root@docker01 03test]# ll
总用量 2100
-rw-r--r-- 1 root root 10210 10月 26 16:53 200test.box
-rw-r--r-- 1 root root 949532 10月 26 15:13 200test.tif
-rw-r--r-- 1 root root 830214 10月 27 00:58 200test.tr
-rw-r--r-- 1 root root 325 10月 27 00:58 200test.txt
-rw-r--r-- 1 root root 18 10月 27 01:02 font_properties
-rw-r--r-- 1 root root 323869 10月 27 01:03 test200.inttemp
-rw-r--r-- 1 root root 5342 10月 27 01:04 test200.normproto
-rw-r--r-- 1 root root 341 10月 27 01:03 test200.pffmtable
-rw-r--r-- 1 root root 778 10月 27 01:03 test200.shapetable
-rw-r--r-- 1 root root 2301 10月 27 01:00 test200.unicharset
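The renaming step was done by hand in this session. It could be scripted roughly as follows; `add_prefix` is an illustrative helper, not a Tesseract tool.

```shell
# add_prefix PREFIX FILE...: rename each FILE to PREFIX.FILE.
# Stops at the first missing file so a failed training step is noticed.
add_prefix() {
    prefix=$1
    shift
    for f in "$@"; do
        [ -e "$f" ] || { echo "missing: $f" >&2; return 1; }
        mv -- "$f" "$prefix.$f"
    done
}

# Example, matching the listing above:
# add_prefix test200 unicharset inttemp normproto pffmtable shapetable
```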

Combine the files

[root@docker01 03test]# combine_tessdata test200.
 
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 (test200.config ) is -1
Offset for type 1 (test200.unicharset ) is 140
Offset for type 2 (test200.unicharambigs ) is -1
Offset for type 3 (test200.inttemp ) is 2441
Offset for type 4 (test200.pffmtable ) is 326310
Offset for type 5 (test200.normproto ) is 326651
Offset for type 6 (test200.punc-dawg ) is -1
Offset for type 7 (test200.word-dawg ) is -1
Offset for type 8 (test200.number-dawg ) is -1
Offset for type 9 (test200.freq-dawg ) is -1
Offset for type 10 (test200.fixed-length-dawgs ) is -1
Offset for type 11 (test200.cube-unicharset ) is -1
Offset for type 12 (test200.cube-word-dawg ) is -1
Offset for type 13 (test200.shapetable ) is 331993
Offset for type 14 (test200.bigram-dawg ) is -1
Offset for type 15 (test200.unambig-dawg ) is -1
Offset for type 16 (test200.params-model ) is -1
Output test200.traineddata created sucessfully.

Copy the resulting test200.traineddata file into the tessdata directory under the tesseract installation.

[root@docker01 03test]# cp test200.traineddata /usr/share/tesseract/tessdata

Check which language packs are now available

[root@docker01 tesseract_test]# tesseract --list-langs
List of available languages (4):
eng
normal
myfont
test200

The new language pack is now trained; the next step is to use it to recognize the images.
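For reference, the command-line part of the training run can be collected in one place. The sketch below only prints the commands used in this walkthrough, in order, so they can be reviewed before running; `training_commands` is a hypothetical helper, and the tessdata path is the one from this particular machine.

```shell
# training_commands BASE LANG: print the training commands for image BASE.tif
# with its corrected BASE.box, producing the language pack LANG.traineddata.
# Assumes font_properties already exists in the current directory.
training_commands() {
    base=$1
    lang=$2
    cat <<EOF
tesseract $base.tif $base nobatch box.train
unicharset_extractor $base.box
mftraining -F font_properties -U unicharset $base.tr
cntraining $base.tr
for f in unicharset inttemp normproto pffmtable shapetable; do mv \$f $lang.\$f; done
combine_tessdata $lang.
cp $lang.traineddata /usr/share/tesseract/tessdata
EOF
}

# Review the plan:   training_commands 200test test200
# Run it:            training_commands 200test test200 | sh
```

Printing instead of executing keeps the manual box correction in jTessBoxEditor as a separate, deliberate step before any of these commands are run.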

Recognizing again

The same images as at the beginning

[root@docker01 test01]# ll
总用量 44
-rw-r--r-- 1 root root 1829 10月 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 10月 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 10月 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 10月 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 10月 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 10月 24 16:06 5.gif

Batch-recognize them with a loop

[root@docker01 test01]# for i in {1..5};do tesseract $i.gif out.$i -l test200;done
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory

Output files after recognition

[root@docker01 test01]# ll
总用量 48
-rw-r--r-- 1 root root 1829 10月 24 16:05 0.gif
-rw-r--r-- 1 root root 1930 10月 24 16:05 1.gif
-rw-r--r-- 1 root root 1890 10月 24 16:05 2.gif
-rw-r--r-- 1 root root 1986 10月 24 16:05 3.gif
-rw-r--r-- 1 root root 1828 10月 24 16:05 4.gif
-rw-r--r-- 1 root root 1866 10月 24 16:06 5.gif
-rw-r--r-- 1 root root 6 10月 27 01:18 out.0.txt
-rw-r--r-- 1 root root 6 10月 27 01:18 out.1.txt
-rw-r--r-- 1 root root 6 10月 27 01:18 out.2.txt
-rw-r--r-- 1 root root 6 10月 27 01:18 out.3.txt
-rw-r--r-- 1 root root 7 10月 27 01:18 out.4.txt
-rw-r--r-- 1 root root 6 10月 27 01:18 out.5.txt

View the file contents and compare them with the images

[root@docker01 test01]# cat out.*
l54v
 
ikh6
 
ynxn
 
7y28
 
nl 9c
 
w4zb

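Image contents

Compared with the first attempt, the recognition rate is much higher: the output changed for 0.gif, 1.gif, 2.gif and 5.gif, while 3.gif and 4.gif read the same as before.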
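To put a number on the improvement rather than comparing by eye, the outputs can be scored against hand-typed labels. The sketch below assumes a labels file, which is not part of the original walkthrough, with one line per image in the form `N expected-text` (no spaces inside the text), and ignores whitespace in the OCR output. The function name `score` is made up.

```shell
# score LABELS DIR: compare DIR/out.N.txt against each "N expected-text" line
# in LABELS and print "correct/total". A missing output file counts as wrong.
score() {
    correct=0
    total=0
    while read -r n expected; do
        [ -n "$n" ] || continue                  # skip blank lines
        got=""
        if [ -f "$2/out.$n.txt" ]; then
            got=$(tr -d ' \t\r\n' < "$2/out.$n.txt")
        fi
        if [ "$got" = "$expected" ]; then
            correct=$((correct + 1))
        fi
        total=$((total + 1))
    done < "$1"
    echo "$correct/$total"
}

# Example: score labels.txt .
```

Running it once after recognizing with `-l eng` and once after `-l test200` gives two comparable figures.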

Original article: https://blog.csdn.net/haluoluo211/article/details/53483534
