.TH tlgu 1 "Feb, 2005" "Version 1.1" "TLG to Unicode Converter"
.SH NAME

tlgu \- convert TLG (D) CD-ROM txt files to Unicode

.SH SYNOPSIS
.B tlgu
[
.I options
]
.I input_file
.I output_file

.SH DESCRIPTION
.B tlgu
converts an \fIinput_file\fP from Thesaurus Linguae Graecae (TLG) representation
to a Unicode (UTF-8) \fIoutput_file\fP.  The TLG representation consists of \fBbeta-code\fP
text and \fBcitation\fP information.

.SH OPTIONS
.TP
.B \-b
inserts a form feed and citation information (levels a, b, c, d) on every "book" citation
change.  By default the program will output line feeds only (see also \fB\-p\fP).
.TP
.B \-p
observes paging instructions.  
By default the program will output line feeds only.
.TP
.B \-r
primarily Roman text.  Some TLG texts, notably doccan1.txt and doccan2.txt, are mainly
Roman texts lacking explicit language change codes.  Setting this option forces
a change to Roman text after each citation block is encountered.
.TP
.B \-v
highest-level reference citation is included before each text line (v-level).
.TP
.B \-w
reference citation is included before each text line (w-level).
.TP
.B \-x
reference citation is included before each text line (x-level).
.TP
.B \-y
reference citation is included before each text line (y-level).
.TP
.B \-z
lowest-level reference citation is included before each text line (z-level).
.sp 1
.TP
.B \-B
inserts blank space (a tab) before each and every line.
.TP
.B \-C
citation debug information is output.
.TP
.B \-S
special code debug information is output.
.TP
.B \-V
block processing information is output (verbose).
.TP
.B \-W
each work (book) is output as a separate file named \fIoutput_file\fP\-xxx.txt.

.SH HISTORY AND INTENDED USE
The purpose of \fBtlgu\fP is to translate binary TLG-format files into readable and editable text.
It is based on an earlier program written in 80x86 assembly language (1996), which output codes for
a home-made font that used the prevalent Hellenic font encodings of that time, complemented
by dead accent characters: not very attractive, but readable.
.sp 1
Then came Unicode and a plethora of accented character glyphs; nice-looking but
with the well-known drawback that special processing is needed to do wild-card searches.
Nice polytonic fonts have now been made available (Cardo, Gentium, Athena, Athenian,
Porson) and, surely, these will be expanded as special-use code points are included
in the Unicode definition (musical symbols, other special symbols) and more fonts will be created.
.sp 1
So, at this point in time, \fBtlgu\fP will crunch a file which has been formatted
according to the published TLG-D format and produce codes for most glyphs
generally available.  No attempt has been made to introduce multi-character sequences
or formatting codes (font changes).  If a code has not been defined, the program will output
the respective "code family" glyph.  You may use the \fB\-S\fP option to check such codes
against the published beta code definition.
.sp 1
You may not like the character output for a specific code.  Check out the \fBtlgcodes.h\fP file
containing the special symbol and punctuation codes and select one to suit you better.  It will
probably be a while before the beta to Unicode correspondence settles down.


.SH EXAMPLES
.TP
.B ./tlgu \-r DOCCAN2.TXT doccanu.txt
Translate the TLG canon to a Unicode text file. Note the use of the \fB\-r\fP option (this file
expects Roman as the default font).
.TP
.B ./tlgu -x -y -z TLG1799.TXT tlg1799u.txt
Generate a continuous file with the texts of Euclid. Available citations (\-x \-y \-z)
are Book//demonstratio/line as shown in the respective "cit" field of doccan2.txt.
.TP
.B ./tlgu -b -B TLG1799.TXT tlg1799u.txt
Generate the same texts, this time with a form feed and book citation information on the first
page of each book and a tab before each line (use with OpenOffice.org versions earlier than 1.1.4).
.TP
.B ./tlgu -C TLG1799.TXT tlg1799u.txt
See how the citation information changes within each TLG block.
.TP
.B ./tlgu -S TLG1799.TXT tlg1799u.txt | sort > symbols1799.txt
Check out the symbols used in a work.  Book and x, y, z references are printed on a separate
line for each symbol. Sort / grep the output to locate specific symbols of interest; save in
a file for later use.
.TP
.B ./tlgu -W TLG0006.TXT tlg0006u
Produce a separate file for each work, named tlg0006u-001.txt and so on.
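.PP
The naming pattern used in the examples above (TLG1799.TXT becomes tlg1799u.txt) can be
applied to a batch of files.  The following Python sketch is not part of \fBtlgu\fP; the
helper names \fBoutput_name\fP and \fBbuild_command\fP are hypothetical, the naming rule is
only an assumption drawn from these examples, and the script merely prints command lines
without running anything.
.nf
```python
from pathlib import Path


def output_name(input_file):
    """Map TLG1799.TXT to tlg1799u.txt (assumed naming convention)."""
    return Path(input_file).stem.lower() + "u.txt"


def build_command(input_file, options=()):
    """Return the argument list for one tlgu run; nothing is executed."""
    return ["./tlgu", *options, input_file, output_name(input_file)]


# Print one command line per input file, with x, y and z citations.
for name in ["TLG0006.TXT", "TLG1799.TXT"]:
    print(" ".join(build_command(name, ["-x", "-y", "-z"])))
```
.fi
Review the printed lines before pasting them into a shell; adjust the option list as needed.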

.SH POST-PROCESSING EXAMPLES
I use the OpenOffice suite for most of my work.  This example shows one of many possible
ways of using the search and replace facility to create a readable version of the Suda lexicon.
.TP
.B ./tlgu -B TLG4085.TXT tlg4085u.txt
A Unicode file with the text is created.
.TP
.B Open the generated file with OOo:
File | Open | Filename: tlg4085u.txt,
File Type: Text Encoded \-\- Press Open
.sp 1
The ASCII Filter Options window appears. Select the Unicode (UTF-8) character set and
a suitable Unicode font installed on your machine (e.g. Cardo).  Press OK.
.TP
.B Replace angle brackets with expanded text
Lexicon terms are enclosed in <angle brackets>.  The actual beta codes indicate the use of
expanded text for emphasis.  Select Edit | Find & Replace.  The \fBFind & Replace\fP window appears.
.sp 1
In the \fBSearch For\fP field, type the following expression: \fB<[^<>]*>\fP
This means "find any characters between angle brackets, not including angle brackets".
.sp 1
In the \fBReplace With\fP field insert a single ampersand: \fB&\fP
The ampersand stands for the text found, so the replacement keeps that text and
\fBadds\fP formatting information (as in this case) or additional text to it.
Press \fBFormat...\fP and select the \fBPosition\fP tab; select Spacing
Expanded by 2.0 points.  Press OK.
.sp 1
Check the \fBRegular Expressions\fP box and press \fBReplace All\fP.
.sp 1
You may now remove the angle brackets themselves by replacing them with nothing.
.sp 1
Repeat the above procedure for titles enclosed in {braces}.  Write a macro...
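.sp 1
To check what the search expression matches before editing a large file, it can be tried
outside the word processor.  The following Python sketch is an illustration only: the
sample string and the helper names \fBfind_terms\fP and \fBstrip_brackets\fP are made up,
and because Python does not treat a bare ampersand as the text found, a function is used
for the replacement instead.
.nf
```python
import re

# Same pattern as typed into the Search For field.
TERM = re.compile("<[^<>]*>")


def find_terms(text):
    """Return every bracketed span, brackets included."""
    return TERM.findall(text)


def strip_brackets(text):
    """Drop the angle brackets but keep the enclosed text."""
    return TERM.sub(lambda m: m.group(0)[1:-1], text)


# Made-up sample line; real input would come from the tlgu output file.
sample = "<alpha> first entry; <beta> second entry"
print(find_terms(sample))
print(strip_brackets(sample))
```
.fi
The character class [^<>] is what keeps each match inside a single pair of brackets.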
.TP
.B Other useful information
In the "Execute" tab of the "Properties" window of my KDesktop Link to Application
I have the following command (single line):
.br
\fBLC_CTYPE=el_GR.UTF-8 /whereitsat/OpenOffice.org1.1.x/soffice\fP
.br
The prefix, an environment variable, allows you to use the same program with different locales;
in this case, Hellenic Unicode (UTF-8).
.sp 1
I put my default locale and keyboard definitions in my \fB.profile\fP: 
.br
.na
.B export LC_CTYPE=el_GR.UTF-8
.br
.na
.B setxkbmap us+el polytonic -option grp:ctrl_shift_toggle
.br
.sp 1
This way multilingual text can be entered; keyboard layout switching is done by pressing Ctrl+Shift.
.SH REFERENCES
There are several texts describing the internal representation of \fBPHI\fP and 
\fBTLG\fP text, ID data, citation data and index files.  The originator of this
format is the Packard Humanities Institute.  The TLG is maintained by UCI \- see
\fBwww.tlg.uci.edu\fP \- where you may find the \fBTLG Beta Code Manual\fP and the 
\fBTLG Beta Code Quick Reference Guide\fP.
.sp 1
See also the Unicode Consortium publications pertaining to the encoding
of characters used in Hellenic literature, scientific and musical texts.
.sp 1
The OpenOffice suite (\fBwww.openoffice.org\fP) includes a word processor that you
can use to load, process and create new polytonic texts.

.SH COPYRIGHT
Copyright (C) 2004, 2005 Dimitri Marinakis (dm ssa gr).

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License (version 2) as published by
the Free Software Foundation.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA