path: root/roff.c
Commit message [Author, Age, Files, Lines]
* Purge duplicate error reporting from the .tr request parser: [Ingo Schwarze, 2022-06-07, 1 file, -11/+2]
  the error was already reported earlier when roff_expand() called roff_escape().
* During identifier parsing, handle undefined escape sequences [Ingo Schwarze, 2022-06-03, 1 file, -10/+48]
  in the same way as groff:
    * \\ is always reduced to \
    * \. is always reduced to .
    * other undefined escape sequences are usually reduced to the escape name, for example \G to G, except during the expansion of expanding escape sequences having the standard argument form (in particular \* and \n), in which case the backslash is preserved literally.

  Yes, this is confusing indeed. For example, the following have the same meaning:
    * .ds \. and .ds . which is not the same as .ds \\.
    * \*[\.] and \*[.] which is not the same as \*[\\.]
    * .ds \G and .ds G which is not the same as .ds \\G
    * \*[\G] and \*[\\G] which is not the same as \*[G] <- sic!

  To feel less dirty, have a leaning toothpick, if you are so inclined.

  This patch also slightly improves the string shown by the "escaped character not allowed in a name" error message.
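The reduction rules listed in this commit message can be modelled with a small toy function. This is only a sketch under stated assumptions: reduce_name() and its in_expansion parameter are hypothetical names that do not appear in roff.c, and only \\, \. and one-character undefined escapes are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from roff.c: copy the identifier
 * "in" to "out", reducing undefined escape sequences according to
 * the rules in the commit message.  A non-zero in_expansion models
 * the \* and \n case, where the backslash of an undefined escape
 * such as \G is preserved literally.
 */
static void
reduce_name(const char *in, int in_expansion, char *out, size_t outsz)
{
	size_t	 o = 0;

	assert(outsz > 0);
	/* Each iteration writes at most two bytes; keep room for NUL. */
	while (*in != '\0' && o + 2 < outsz) {
		if (*in != '\\' || in[1] == '\0') {
			out[o++] = *in++;	/* ordinary character */
			continue;
		}
		in++;				/* skip the backslash */
		if (*in == '\\' || *in == '.') {
			out[o++] = *in++;	/* \\ -> \ and \. -> . */
		} else {
			if (in_expansion)
				out[o++] = '\\';	/* keep backslash */
			out[o++] = *in++;	/* \G -> G, or \G kept */
		}
	}
	out[o] = '\0';
}
```

With this model, reducing \. and . gives the same name, and inside an expansion \G and \\G both yield the two bytes backslash and G, which matches the "sic!" example.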
* Avoid the layering violation of re-parsing for \E in roff_expand(). [Ingo Schwarze, 2022-06-02, 1 file, -11/+2]
  To that end, add another argument to roff_escape() returning the index of the escape name. This also makes the code in roff_escape() a bit more uniform in so far as it no longer needs the "char esc_name" local variable but now does everything with indices into buf[]. No functional change.
* Rudimentary implementation of the \A escape sequence, following groff [Ingo Schwarze, 2022-05-31, 1 file, -0/+5]
  semantics (test identifier for syntactical validity), not at all following the completely unrelated Heirloom semantics (define hyperlink target position).

  The main motivation for providing this implementation is to get \A into the parsing class ESCAPE_EXPAND that corresponds to groff parsing behaviour, which is quite similar to the \B escape sequence (test numerical expression for syntactical validity). This is likely to improve parsing of nested escape sequences in the future.

  Validation isn't perfect yet. In particular, this implementation rejects \A arguments containing some escape sequences that groff allows to slip through. But that is unlikely to cause trouble even in documents using \A for non-trivial purposes. Rejecting the nested escapes in question might even improve robustness because the rejected names are unlikely to really be usable for practical purposes - no matter that groff dubiously considers them syntactically valid.
* Trivial patch to put the roff(7) \g (interpolate format of register) [Ingo Schwarze, 2022-05-31, 1 file, -0/+2]
  escape sequence into the correct parsing class, ESCAPE_EXPAND.

  Expansion of \g is supposed to work exactly like the expansion of the related escape sequence \n (interpolate register value), but since we ignore the .af (assign output format) request, we just interpolate an empty string to replace the \g sequence.

  Surprising as it may seem, this actually makes a formatting difference for deviant input like ".O\gNx" which used to raise bogus "escaped character not allowed in a name" and "skipping unknown macro" errors and printed nothing, whereas now it correctly prints "OpenBSD".
* Dummy implementation of the roff(7) \V (interpolate environment variable) [Ingo Schwarze, 2022-05-30, 1 file, -3/+8]
  escape sequence. This is needed to get \V into the correct parsing class, ESCAPE_EXPAND.

  It is intentional that mandoc(1) output is *not* influenced by environment variables, so interpolate the name of the variable with some decorating punctuation rather than interpolating its value.
* Make roff_expand() parse left-to-right rather than right-to-left. [Ingo Schwarze, 2022-05-19, 1 file, -256/+181]
  Some escape sequences have side effects on global state, implying that the order of evaluation matters. For example, this fixes the long-standing bug that "\n+x\n+x\n+x" after ".nr x 0 1" used to print "321"; now it correctly prints "123".

  Right-to-left parsing was convenient because it implicitly handled nested escape sequences. With correct left-to-right parsing, nesting now requires an explicit implementation, here solved as follows:
    1. Handle nested expanding escape sequences iteratively. When finding one, expand it, then retry parsing the enclosing escape sequence from the beginning, which will ultimately succeed as soon as it no longer contains any nested expanding escape sequences.
    2. Handle nested non-expanding escape sequences recursively. When finding one, the escape sequence parser calls itself to find the end of the inner sequence, then continues parsing the outer sequence after that point.

  This requires the mandoc_escape() function to operate in two different modes. The roff(7) parser uses it in a mode where it generates diagnostics and may return an expansion request instead of a parse result. All other callers, in particular the formatters, use it in a simpler mode that never generates diagnostics and always returns a definite parsing result, but that requires all expanding escape sequences to already have been expanded earlier. The bulk of the code is the same for both modes.

  Since this required a major rewrite of the function anyway, move it into its own new file roff_escape.c and out of the file mandoc.c, which was misnamed in the first place and lacks a clear focus.

  As a side benefit, this also fixes a number of assertion failures that tb@ found with afl(1), for example "\n\\\\*0", "\v\-\\*0", and "\w\-\\\\\$0*0".

  As another side benefit, it also resolves some code duplication between mandoc_escape() and roff_expand() and centralizes all handling of escape sequences (except for expansion) in roff_escape.c, hopefully easing maintenance and feature improvements in the future.

  While here, also move end-of-input handling out of the complicated function roff_expand() and into the simpler function roff_parse_comment(), making the logic easier to understand.

  Since this is a major reorganization of a central component of mandoc(1), stability of the program might slightly suffer for a few weeks, but i believe that's not a problem at this point of the release cycle. The new code already satisfies the regression suite, but more tweaking and regression testing to further improve the handling of various escape sequences will likely follow in the near future.
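The "321" versus "123" example from this commit message can be reproduced with a toy model of an auto-incrementing register. This is a sketch only: struct toy_reg and toy_expand() are hypothetical and bear no relation to the real roff_expand() code, and the only escape recognized is the literal text \n+x.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy register: value and auto-increment step, as after ".nr x 0 1". */
struct toy_reg {
	int	 val;
	int	 step;
};

/*
 * Hypothetical sketch: expand every "\n+x" in "in", applying the
 * side effect (increment, then interpolate) to the occurrences
 * either left-to-right or right-to-left.  At most 16 occurrences
 * are handled.
 */
static void
toy_expand(const char *in, struct toy_reg *reg, int left_to_right,
    char *out, size_t outsz)
{
	static const char pat[] = "\\n+x";
	const size_t	 plen = sizeof(pat) - 1;
	const char	*cp;
	int		 vals[16];
	size_t		 n, i, o;
	int		 len;

	assert(outsz > 0);

	/* Count the occurrences of the escape sequence. */
	n = 0;
	cp = in;
	while (n < 16 && (cp = strstr(cp, pat)) != NULL) {
		n++;
		cp += plen;
	}

	/* Apply the side effects in the requested order. */
	for (i = 0; i < n; i++) {
		reg->val += reg->step;
		vals[left_to_right ? i : n - 1 - i] = reg->val;
	}

	/* Assemble the output in textual order. */
	i = o = 0;
	cp = in;
	while (*cp != '\0' && o + 1 < outsz) {
		if (i < n && strncmp(cp, pat, plen) == 0) {
			len = snprintf(out + o, outsz - o, "%d", vals[i++]);
			if (len < 0 || (size_t)len >= outsz - o)
				break;	/* output buffer too small */
			o += (size_t)len;
			cp += plen;
		} else
			out[o++] = *cp++;
	}
	out[o] = '\0';
}
```

Starting from a register with value 0 and step 1, expanding the input from the commit message with left_to_right set yields "123", whereas the right-to-left order yields "321".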
* Split a new function roff_parse_comment() out of roff_expand() because this [Ingo Schwarze, 2022-05-01, 1 file, -96/+106]
  functionality is not needed when called from roff_getarg(). This makes the long and complicated function roff_expand() significantly shorter, and also simpler in so far as it no longer needs to return ROFF_APPEND. No functional change intended.
* Provide a new function roff_req_or_macro() to parse and handle a request [Ingo Schwarze, 2022-04-30, 1 file, -35/+34]
  or macro, including context-dependent error handling inside tbl(7) code and inside .ce/.rj blocks. Use it both in the top level roff(7) parser and inside conditional blocks.

  This fixes an assertion failure triggered by ".if 1 .ce" inside tbl(7) code, found by tb@ using afl(1).

  As a side benefit for readability, only one place remains in the code that calls the main handler functions for the various roff(7) requests. This patch also improves column numbers in some error messages and various comments.
* Refactor the handler function roff_block_sub() for clarity and simplicity. [Ingo Schwarze, 2022-04-30, 1 file, -16/+9]
    1. Do not needlessly access the function pointer table roffs[]. Instead, simply call the block closing function directly.
    2. Sort code: handle both cases of block closing at the beginning of the function rather than one at the beginning and one at the end.
    3. Trim excessive, partially repetitive and obvious comments, also making the comments considerably more precise.
  No functional change.
* The syntax of the roff(7) .mc request is quite special [Ingo Schwarze, 2022-04-28, 1 file, -1/+50]
  and the roff_onearg() parsing function is too generic, so provide a dedicated parsing function instead.

  This fixes an assertion failure when an \o escape sequence is passed as the argument; the bug was found by tb@ using afl(1). It also makes mandoc output more similar to groff in various cases.
* When we open a new .while loop, let's not attempt to close out [Ingo Schwarze, 2022-04-24, 1 file, -2/+4]
  another enclosing .while loop at the same time. Instead, postpone the closing until the next iteration of ROFF_RERUN.

  This prevents one-line constructions like ".while 0 .while 0 something" and ".while rx .while rx .rr x" (which admittedly aren't particularly useful) from dying of abort(3), which was a bug tb@ found with afl(1).
* If a .shift request has a negative argument, do not use a negative array [Ingo Schwarze, 2022-04-24, 1 file, -2/+7]
  index but use 0 instead of the argument, just like groff. Warn about the invalid argument. While here, fix the column number in another warning message.

  Segfault reported by tb@, found with afl(1).
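The clamping described for .shift can be sketched as follows. The function toy_shift() and its parameters are hypothetical, and the additional upper clamp to argc is an assumption of this sketch rather than something stated in the commit message.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch: drop the first "levels" arguments from argv.
 * A negative count is replaced by 0 and flagged via *warned instead
 * of being used as an array index.  Returns the new argument count.
 */
static int
toy_shift(const char **argv, int argc, int levels, int *warned)
{
	assert(argc >= 0);
	*warned = 0;
	if (levels < 0) {
		*warned = 1;		/* caller would emit a warning */
		levels = 0;
	}
	if (levels > argc)		/* assumption: also clamp upwards */
		levels = argc;
	memmove(argv, argv + levels,
	    (size_t)(argc - levels) * sizeof(*argv));
	return argc - levels;
}
```

A call with levels set to -2 leaves the argument list untouched and sets *warned, which is the behaviour the commit message attributes to groff.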
* Surprisingly, groff supports multiple copy mode escapes at the [Ingo Schwarze, 2022-04-13, 1 file, -2/+2]
  beginning of an escape sequence: \, \E, \EE, \EEE, and so on all do the same outside copy mode, so let them do the same in mandoc(1), too.

  This fixes an assertion failure triggered by \EE*X that tb@ found with afl(1). The first E was consumed by roff_expand(), but that function failed to recognize the escape sequence as the expansion of a user-defined string and handed it over to mandoc_escape(), which consumed the second E and then died on an assertion because it is not prepared to handle user-defined strings. Fix this by letting *both* functions handle arbitrary numbers of 'E's correctly.
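Skipping an arbitrary number of copy mode escapes amounts to a short loop. The helper below is a hypothetical illustration, not the code used in roff_expand() or mandoc_escape().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: "cp" points to the byte right after the
 * initial backslash.  Skip any number of 'E' characters and return
 * a pointer to the first byte of the actual escape name.
 */
static const char *
skip_copy_mode_escapes(const char *cp)
{
	assert(cp != NULL);
	while (*cp == 'E')
		cp++;
	return cp;
}
```

For the input from the bug report, the bytes after the backslash are "EE*X", and the returned pointer refers to the '*', so both \*X and \EE*X are recognized as the same string expansion.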
* store the operating system name obtained from uname(3) in the adequate [Ingo Schwarze, 2021-10-04, 1 file, -1/+2]
  struct together with similar state data rather than in a function-scope static variable, such that it can be free(3)d in roff_man_free(); no functional change
* Do not leak 64 bytes of heap memory every time a manual page calls [Ingo Schwarze, 2021-10-04, 1 file, -2/+0]
  a user-defined macro. Calls of standard mdoc(7) and man(7) macros were unaffected, so the effect on OpenBSD manual pages was small, about 80 Kilobytes grand total for a full run of "makewhatis /usr/share/man".

  Argument expansion contexts for user-defined macros are stored on a stack that grows as needed if calls of user-defined macros are nested or recursive. Individual stack entries contain dynamically allocated arrays of pointers to arguments; these argument arrays also grow as needed if user-defined macros take more than eight arguments. The mistake was that argument arrays of already initialized expansion contexts were leaked rather than reused on subsequent macro calls.

  I found this issue in a systematic hunt for memory leaks after Michael <Stapelberg at Debian> reported memory exhaustion problems on the production server manpages.debian.org. This sub-Megabyte leak is not the cause of Michael's trouble, though, where Gigabytes of memory are being wasted. We are still investigating whether the original problem may be related to his supervisor process, which is written in Go, rather than to mandoc.
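The reuse of argument arrays described here can be illustrated with a simplified context structure. The names struct toy_ctx and toy_ctx_prepare() are hypothetical; the initial size of eight slots follows the "more than eight arguments" remark, and eight pointers of eight bytes each would account for the 64 bytes mentioned, assuming a platform with 8-byte pointers.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified expansion context for one user-defined macro call. */
struct toy_ctx {
	char	**argv;		/* array of argument pointers */
	int	  argsz;	/* allocated slots in argv */
	int	  argc;		/* slots currently in use */
};

/*
 * Hypothetical sketch: prepare a context for a call with "argc"
 * arguments.  An array left over from an earlier call is reused
 * and only grown when needed.  Resetting argv to NULL and argsz
 * to 0 here on every call would leak the previous array.
 * Returns 0 on success, -1 if memory allocation fails.
 */
static int
toy_ctx_prepare(struct toy_ctx *ctx, int argc)
{
	char	**newv;
	int	  newsz;

	assert(argc >= 0);
	if (argc > ctx->argsz) {
		newsz = ctx->argsz == 0 ? 8 : ctx->argsz;
		while (newsz < argc)
			newsz *= 2;
		newv = realloc(ctx->argv, (size_t)newsz * sizeof(*newv));
		if (newv == NULL)
			return -1;
		ctx->argv = newv;
		ctx->argsz = newsz;
	}
	ctx->argc = argc;
	return 0;
}
```

Two consecutive calls with at most eight arguments then share one allocation, and the array is released only once, when the context itself is destroyed.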
* Support two-character font names (BI, CW, CR, CB, CI) [Ingo Schwarze, 2021-08-10, 1 file, -1/+3]
  in the tbl(7) layout font modifier.

  Get rid of the TBL_CELL_BOLD and TBL_CELL_ITALIC flags and use the usual ESCAPE_FONT* enum mandoc_esc members from mandoc.h instead, which simplifies and unifies some code.

  While here, also support CB and CI in roff(7) \f escape sequences and in roff(7) .ft requests for all output modes. Using those is certainly not recommended because portability is limited even with groff, but supporting them makes some existing third-party manual pages look better, in particular in HTML output mode.

  Bug-compatible with groff as far as i'm aware, except that i consider font names starting with the '\n' (ASCII 0x0a line feed) character so insane that i decided to not support them.

  Missing feature reported by nabijaczleweli dot xyz in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=992002. I used none of the code from the initial patch submitted by nabijaczleweli, but some of their ideas. Final patch tested by them, too.
* add a style message about overlong text lines, [Ingo Schwarze, 2021-06-27, 1 file, -1/+9]
  trying very hard to avoid false positives, not at all trying to catch as many cases as possible; feature originally suggested by tb@, OK tb@ kn@ jmc@
* Avoid artifacts in the most common case of closing conditional blocks [Ingo Schwarze, 2020-08-27, 1 file, -1/+3]
  when no arguments follow the closing brace, \}. For example, the line "'br\}" contained in the pod2man(1) preamble would throw a bogus "escaped character not allowed in a name" error.

  This issue was originally reported by Chris Bennett on ports@, and afresh1@ noticed it came from the pod2man(1) preamble.
* Put the code handling \} into a new function roff_cond_checkend() [Ingo Schwarze, 2020-08-03, 1 file, -58/+63]
  and call that function not only from both places where copies existed - when processing text lines and when processing request/macro lines in conditional block scope - but also when closing a macro definition request, such that this construction works:

    .if n \{.de macroname
    macro content
    .. \} ignored arguments
    .macroname

  This fixes a bug reported by John Gardner <gardnerjohng at gmail dot com>.

  While here, avoid a confusing decrement of the line scope counter in roffnode_cleanscope() for conditional blocks that do not have line scope in the first place (no functional change for this part). Also improve validation of an internal invariant in roff_cblock() and polish some comments.
* Use a separate node->tag attribute rather than abusing the node->string [Ingo Schwarze, 2020-04-08, 1 file, -0/+1]
  attribute for the purpose. No functional change intended. The purpose is to make it possible to later attach tags to text nodes.
* Support manual tagging of .Pp, .Bd, .D1, .Dl, .Bl, and .It. [Ingo Schwarze, 2020-04-06, 1 file, -1/+7]
  In HTML output, improve the logic for writing inside permalinks: skip them when there is no child content or when there is a risk that the children might contain flow content.
* Remove some stray argument names from function prototypes, [Ingo Schwarze, 2020-04-03, 1 file, -3/+5]
  for consistency with the dominant style used in mandoc. No functional change. Patch from Martin Vahlensieck <academicsolutions dot ch>.
* Fully support explicit tagging of .Sh and .Ss. [Ingo Schwarze, 2020-02-27, 1 file, -1/+1]
  This fixes the offset of two lines in terminal output and this improves HTML output by putting the id= attribute and <a> element into the respective <h1> or <h2> element rather than writing an additional <mark> element.

  To that end, introduce node flags NODE_ID (to make the node a link target, for example by writing an HTML id= attribute or by calling tag_put()) and NODE_HREF (to make the node a link source, used only in HTML output, used only to write an <a class="permalink"> element).

  In particular:
    * In the validator, generalize the concept of the "next node" such that it also works before .Sh and .Ss.
    * If the first argument of .Tg is empty, don't forget to complain if there are additional arguments, which will be ignored.
    * In the terminal formatter, support writing of explicit tags for all kinds of nodes, not just for .Tg.
    * In deroff(), allow nodes to have an explicit string representation even when they aren't text nodes. Use this for explicitly tagged section headers. Surprisingly, this is sufficient to make HTML output work, without explicit code changes in the HTML formatter.
    * In syntax tree output, display NODE_ID and NODE_HREF.
* Introduce the concept of nodes that are semantically transparent: [Ingo Schwarze, 2020-02-27, 1 file, -0/+53]
  they are skipped when looking for previous or following high-level macros. Examples include roff(7) .ft, .ll, and .ta, mdoc(7) .Sm and .Tg, and man(7) .DT and .PD. Use this concept for a variety of improved decisions in various validators and formatters.

  While here,
    * remove a few const qualifiers on struct arguments that caused trouble;
    * get rid of some more Yoda notation in the vicinity;
    * and apply some other stylistic improvements in the vicinity.

  I found this class of issues while considering .Tg patches from kn@.
* Introduce a new mdoc(7) macro .Tg ("tag") to explicitly mark a place [Ingo Schwarze, 2020-01-19, 1 file, -2/+2]
  as defining a term. Please only use it when automatic tagging does not work. Manual page authors will not be required to add the new macro; using it remains optional. HTML output is still rudimentary in this version and will be polished later.

  Thanks to kn@ for reminding me that i have been considering since BSDCan 2014 whether something like this might be useful. Given that possibilities of making automatic tagging better are running out and there are still several situations where automatic tagging cannot do the job, i think the time is now ripe.

  Feedback and no objection from millert@; OK espie@ inoguchi@ kn@.
* Do not fail an assertion when a high level macro occurs in the body [Ingo Schwarze, 2019-12-26, 1 file, -1/+13]
  of a conditional inside a .ce request block. Instead, abort the .ce block just like when there is no conditional in between. Bug found by espie@ working on the textproc/fstrcmp port.
* In the past, generating comment nodes stopped at the .TH or .Dd [Ingo Schwarze, 2019-11-09, 1 file, -3/+7]
  macro, which is usually close to the beginning of the file, right after the Copyright header comments. But espie@ found horrible input files in the textproc/fstrcmp port that generate lots of parse nodes before even getting to the header macro. In some formatters, comment nodes after some kinds of real content triggered assertions.

  So make sure generation of comment nodes stops once real content is encountered.
* delete trailing whitespace and space-tab sequences; no code change; [Ingo Schwarze, 2019-07-01, 1 file, -2/+2]
  patch from Michal Nowak <mnowak at startmail dot com> who found these with git pbchk in the illumos tree
* When calling an empty macro, do not clobber existing arguments. [Ingo Schwarze, 2019-04-21, 1 file, -1/+6]
  Fixing a bug found with the groffer(1) version 1.19 manual page following a report from Jan Stary.
* Implement the roff .break request (break out of a .while loop). [Ingo Schwarze, 2019-04-21, 1 file, -8/+45]
  Jan Stary <hans at stare dot cz> found it in an ancient groffer(1) manual page (version 1.19) on MacOS X Mojave. Having .break not implemented wasn't a particularly bright idea because obviously, it tended to cause infinite loops.
* Let roff_getname() end the roff identifier at a tab character [Ingo Schwarze, 2019-02-06, 1 file, -7/+18]
  and audit all its callers whether termination is handled correctly.

  Resulting improvements:
    * An escape or tab ending the macro name in a macro invocation is discarded, and argument processing is started after it.
    * An escape or tab ending a name in ".if d" and ".if r" is preserved.
    * An escape ending a name in ".ds" causes the whole request to be ignored.
    * A tab ending a name in ".ds" becomes part of the string.
    * An escape or tab ending a name in ".rm" causes the rest of the line to be ignored.
    * An escape or tab ending the first name in ".als", ".rn", or ".nr" causes the whole request to be ignored.

  Kurt Jaeger <pi at FreeBSD> made me aware of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235456#c0 and in that bug report, comment 0 item (3) is a special case of this class of issues. Yes, the "mh" manual pages are no doubt among the worst on the planet.
* adjust style and comments in roff_getname(); no functional change [Ingo Schwarze, 2019-02-06, 1 file, -11/+14]
* no-fill mode has to be suspended during tbl(7) rendering, too [Ingo Schwarze, 2019-01-05, 1 file, -0/+2]
* Some high-level block macros have an effect similar to temporarily [Ingo Schwarze, 2019-01-05, 1 file, -2/+2]
  suspending no-fill mode during their head. Model this with an additional roff parser state flag ROFF_NONOFILL. That is much simpler than it would be to save and restore the ROFF_NOFILL flag itself, in particular since the latter can be switched (with lasting effect) by the .nf and .fi requests even while its effect is temporarily suspended.

  This commit does not change formatting yet, but prepares for future formatting simplifications and improvements.
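The advantage of a separate suspension flag over saving and restoring the fill-mode flag can be seen in a small model. The TOY_ constants, their bit values, and both functions are hypothetical; only the flag names ROFF_NOFILL and ROFF_NONOFILL come from the commit message.

```c
#include <assert.h>

#define	TOY_NOFILL	(1 << 0)	/* set by .nf, cleared by .fi */
#define	TOY_NONOFILL	(1 << 1)	/* no-fill temporarily suspended */

/* No-fill mode takes effect only while it is not suspended. */
static int
toy_nofill_active(int flags)
{
	return (flags & TOY_NOFILL) != 0 && (flags & TOY_NONOFILL) == 0;
}

/*
 * Hypothetical walk-through: .nf, then a block head suspends
 * no-fill mode, then .fi occurs inside the head, then the head
 * ends.  The .fi keeps its lasting effect without any save and
 * restore logic.
 */
static void
toy_nofill_demo(void)
{
	int	 flags = 0;

	flags |= TOY_NOFILL;		/* .nf */
	assert(toy_nofill_active(flags));
	flags |= TOY_NONOFILL;		/* block head begins */
	assert(!toy_nofill_active(flags));
	flags &= ~TOY_NOFILL;		/* .fi inside the head */
	flags &= ~TOY_NONOFILL;		/* block head ends */
	assert(!toy_nofill_active(flags));
}
```

Had the head saved the old flag value and restored it at the end, the .fi inside the head would have been lost; with two independent bits, nothing needs to be restored.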
* Store the fill mode with a new flag NODE_NOFILL in every node, [Ingo Schwarze, 2018-12-31, 1 file, -0/+4]
  like it is already done with NODE_SYNPRETTY, such that the fill mode becomes more directly available to the formatters. Not used yet, but will be used by upcoming commits.
* Move parsing of the .nf and .fi (fill mode) requests from the man(7) [Ingo Schwarze, 2018-12-31, 1 file, -20/+28]
  parser to the roff(7) parser. As a side effect, .nf and .fi are now also parsed in mdoc(7) input, though the mdoc(7) formatters still ignore most of their effect.
* Cleanup, minus 15 LOC, no functional change: [Ingo Schwarze, 2018-12-31, 1 file, -5/+12]
  Simplify the way the man(7) and mdoc(7) validators are called. Reset the parser state with a common function before calling them. There is no need to again reset the parser state afterwards, the parsers are no longer used after validation. This allows getting rid of man_node_validate() and mdoc_node_validate() as separate functions.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-30, 1 file, -16/+13]
  The struct roff_man used to be a bad mixture of internal parser state and public parsing results. Move the public results to the parsing result struct roff_meta, which is already public. Move the rest of struct roff_man to the parser-internal header roff_int.h.

  Since the validators need access to the parser state, call them from the top level parser during mparse_result() rather than from the main programs, also reducing code duplication.

  This keeps parser internal state out of the main programs (five in mandoc portable) and out of eight formatters.
* Rename mandoc_getarg() to roff_getarg() and pass it the roff parser [Ingo Schwarze, 2018-12-21, 1 file, -20/+53]
  struct as an argument such that after copy-in, it can call roff_expand() once again, which used to be called roff_res() before this. This fixes a subtle low-level roff(7) parsing bug reported by Fabio Scotoni <fabio at esse dot ch> in the 4.4BSD-Lite2 mdoc.samples(7) manual page, because that page used an escaped escape sequence in a macro argument.

  To expand escaped escape sequences in quoted mdoc(7) arguments, too, stop bypassing the call to roff_getarg() in mdoc_argv.c, function args() for this case. This does not solve the case of escaped escape sequences in quoted .Bl -column phrases yet.

  Because roff_expand() can make the string longer, roff_getarg() can no longer operate in-place but needs to malloc(3) the returned string. In the high-level parsers, free(3) that string after processing it.
* Bugfix: [Ingo Schwarze, 2018-12-20, 1 file, -1/+1]
  When after a \\, \t, or \a, another \t or \a had to be resolved in copy mode within the same argument, the argument got corrupted. Found while working on a loosely related bug report from Fabio Scotoni <fabio at esse dot ch>.
* As a first step towards making roff_res() callable from mandoc_getarg(), [Ingo Schwarze, 2018-12-18, 1 file, -0/+97]
  move the function mandoc_getarg() from mandoc.c to roff.c. It was misplaced in mandoc.c in the first place; that file is intended for utilities needed both by parsers and by formatters, while reading macro arguments in copy mode is purely a task of the roff(7) parser. Needed as a preliminary for an upcoming bugfix. No code change.
* Several improvements to escape sequence handling. [Ingo Schwarze, 2018-12-15, 1 file, -14/+33]
    * Add the missing special character \_ (underscore).
    * Partial implementations of \a (leader character) and \E (uninterpreted escape character).
    * Parse and ignore \r (reverse line feed).
    * Add a WARNING message about undefined escape sequences.
    * Add an UNSUPP message about unsupported escape sequences.
    * Mark \! and \? (transparent throughput) and \O (suppress output) as unsupported.
    * Treat the various variants of zero-width spaces as one-byte escape sequences rather than as special characters, to avoid defining bogus forms with square brackets.
    * For special characters with one-byte names, do not define bogus forms with square brackets, except for \[-], which is valid.
    * In the form with square brackets, undefined special characters do not fall back to printing the name verbatim, not even for one-byte names.
    * Starting a special character name with a blank is an error.
    * Undefined escape sequences never abort formatting of the input string, not even in HTML output mode.
    * Document the newly handled escapes, and a few that were missing.
    * Regression tests for most of the above.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-14, 1 file, -9/+5]
  Now that message handling is properly encapsulated, remove struct mparse pointers from four structs (roff, roff_man, tbl_node, eqn_node) and from the argument lists of five functions (roff_alloc, roff_man_alloc, mandoc_getarg, tbl_alloc, eqn_alloc). Except for being passed to the main program as an opaque object, it now only occurs in read.c, as it should, and not across 15 files like in the past.
* Almost mechanical diff to remove the "struct mparse *" argument [Ingo Schwarze, 2018-12-14, 1 file, -91/+76]
  from mandoc_msg(), where it is no longer used. While here, rename mandoc_vmsg() to mandoc_msg() and retire the old version: There is really no point in having another function merely to save "%s" in a few places. Minus 140 lines of code.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-13, 1 file, -0/+1]
  Split the top level parser interface out of the utility header mandoc.h, into a new header mandoc_parse.h, for use in the main program and in the main parser only. Move enum mandoc_os into roff.h because struct roff_man is the place where it is stored. This allows removal of mandoc.h from seven files in low-level parsers and in formatters.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-13, 1 file, -2/+1]
  No need to expose the eqn(7) syntax tree data structures everywhere. Move them to their own include file, "eqn.h". While here, delete the unused enum eqn_pilet.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-13, 1 file, -5/+3]
  In libroff.h, nothing was left except the eqn(7) parser interface, which isn't really part of the roff(7) parser, so rename it to eqn_parse.h. While here, move struct eqn_def to eqn.c because that's the only file using it, and let eqn_box_free() and eqn_free() handle NULL.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-13, 1 file, -21/+13]
  Move tbl(7)-specific parser internals out of libroff.h. Move some tbl(7)-internal processing from roff.c to tbl.c.
* Cleanup, no functional change: [Ingo Schwarze, 2018-12-12, 1 file, -1/+2]
  No need to expose the tbl(7) syntax tree data structures everywhere. Move them to their own include file, "tbl.h", and improve comments.