author Matěj Cepl <mcepl@cepl.eu> 2016-11-24 19:48:06 +0100
committer Matěj Cepl <mcepl@cepl.eu> 2016-11-24 19:48:06 +0100
commit cd4935a574b160bcfd5487c068d97df8160fb56b
tree 947fd691e46d9de83da671bf0f3a809615deaf15 /README.txt
parent eca83a778a8e277410f8621569518c839a6a6e99
Clean up the scans, resulting in much better text.
Diffstat (limited to 'README.txt')
-rw-r--r-- README.txt 8
1 files changed, 8 insertions, 0 deletions
diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..106d111
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,8 @@
+
+How to generate high-quality OCR
+
+pdfimages Co_jest_obect_KHB.pdf cojeimg
+pdfinfo Co_jest_obect_KHB.pdf
+for img in cojeimg-*.ppm ; do textcleaner -g "$img" "${img%.ppm}-clean.ppm" ; done
+for img in cojeimg-*-clean.ppm ; do tesseract "$img" "${img%-clean.ppm}" -l ces -c tessedit_create_hocr=1 ; done
+cat cojeimg-0*.txt >cojestobec-img.txt
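
The two loops in the README depend on shell suffix stripping to derive each output name from the input name. The sketch below isolates that naming logic and prints the commands for one page instead of running them, so it can be checked without the OCR tools installed. The function names `clean_name`, `ocr_base`, and `plan_page` are hypothetical helpers introduced for illustration; they are not part of the repository. The sample file name `cojeimg-000.ppm` is assumed to match what `pdfimages` writes for the `cojeimg` root.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not from the repository.

# Name of the cleaned image written by textcleaner for a raw page image.
clean_name() {
    printf '%s\n' "${1%.ppm}-clean.ppm"
}

# Output base name passed to tesseract for a cleaned image.
ocr_base() {
    printf '%s\n' "${1%-clean.ppm}"
}

# Dry run: print the two commands for one page without executing them.
plan_page() {
    img=$1
    clean=$(clean_name "$img")
    printf 'textcleaner -g %s %s\n' "$img" "$clean"
    printf 'tesseract %s %s -l ces -c tessedit_create_hocr=1\n' \
        "$clean" "$(ocr_base "$clean")"
}
```

Calling `plan_page cojeimg-000.ppm` prints the `textcleaner` and `tesseract` invocations for that page, which makes it easy to confirm the derived names before processing a whole scan.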