author | Matěj Cepl <mcepl@cepl.eu> | 2016-11-24 19:48:06 +0100 |
---|---|---|
committer | Matěj Cepl <mcepl@cepl.eu> | 2016-11-24 19:48:06 +0100 |
commit | cd4935a574b160bcfd5487c068d97df8160fb56b (patch) | |
tree | 947fd691e46d9de83da671bf0f3a809615deaf15 /README.txt | |
parent | eca83a778a8e277410f8621569518c839a6a6e99 (diff) | |
download | havlicek_co_jest_obec-cd4935a574b160bcfd5487c068d97df8160fb56b.tar.gz |
Clean up the scans, resulting in much better text.
Diffstat (limited to 'README.txt')
-rw-r--r-- | README.txt | 8 |
1 files changed, 8 insertions, 0 deletions
diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..106d111
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,8 @@
+
+Procedure for generating high-quality OCR
+
+pdfimages Co_jest_obect_KHB.pdf cojeimg
+pdfinfo Co_jest_obect_KHB.pdf
+for img in cojeimg-*.ppm ; do textcleaner -g $img ${img%*.ppm}-clean.ppm ; done
+for img in cojeimg-*-clean.ppm ; do tesseract $img ${img%*-clean.ppm} -l ces -c tessedit_create_hocr=1 ; done
+cat cojeimg-0*.txt >cojestobec-img.txt
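
Both loops in the README derive their output names by stripping a suffix with shell parameter expansion. The sketch below isolates that step so it can be checked without running pdfimages, textcleaner, or tesseract. The helper names `clean_name` and `ocr_base` and the sample file name are hypothetical and are not part of the commit. The sketch writes `${1%.ppm}` where the README writes `${img%*.ppm}`. Because `%` removes the shortest matching suffix, the `*` matches nothing and the two forms give the same result.

```shell
#!/bin/sh
# Hypothetical helpers (not in the commit) mirroring the README's name handling.

# Name of the cleaned image the first loop writes for a given scan.
clean_name() {
    printf '%s\n' "${1%.ppm}-clean.ppm"
}

# Output base the second loop passes to tesseract for a cleaned image.
ocr_base() {
    printf '%s\n' "${1%-clean.ppm}"
}

# Assumed sample name in the root-NNN.ppm form used by the README's glob.
img=cojeimg-000.ppm
cleaned=$(clean_name "$img")
base=$(ocr_base "$cleaned")

# Quoting the variables keeps names with spaces intact.
printf '%s -> %s -> %s\n' "$img" "$cleaned" "$base"
```

For the sample name this prints `cojeimg-000.ppm -> cojeimg-000-clean.ppm -> cojeimg-000`, so tesseract's output files for that page would share the base `cojeimg-000`.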