Copyright (C) 2006 Steve Cheng <stevecheng@users.sourceforge.net>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALNGS IN THE SOFTWARE.
A novel feature of this program is that it uses its own mini-language to define the transformation from TEX math to MathML. The skeptical reader might think that this is an attempt at over-engineering. Actually, the author would rather not employ a mini-language at all, but write the entire program in JavaScript --- but the use of the mini-language is essentially forced upon us, given our requirements. We need the program be usable on the client-side Web browser, which requires that the program be in JavaScript; on the other hand, we would also like the program to be available on the server side as well.
While JavaScript can run on the server-side, JavaScript implementations outside Web browsers are terrible — they are not widely available, and they do not hold a cradle to more standard scripting languages such as Perl or Python.
The other alternative is to write two versions of the transform rules, one for JavaScript, and one for Python, say. This is not pleasant, to say the least. And we should also consider that some people might want a PHP implementation, for example, if that is what they are using to generate Web pages on the server side.
Therefore, we should write a single version of the program in one language, then employ automatic translation of that program into the other computer languages.
A solution that avoids the mini-language is to translate JavaScript directly to the other languages. But this solution takes too much effort to implement.
It is better instead, to start with a minimalist mini-language, designed specifically to tackle the problem of TEX to MathML translation, and translate that to the actual scripting languages. We require that the mini-language be easy to parse — and we achieve this goal by using the S-expression syntax from the Lisp languages (i.e. sub-expressions are wrapped in parentheses, and operations are written in prefix notation).
Although the use of S-expressions makes the mini-language superficially look like Scheme, it is fundamentally very different from Scheme. Our mini-language is a procedural or imperative language, rather than a functional language like Scheme. Its structure closely mirrors that of JavaScript (the first implementations of this program were in JavaScript). A procedural design is used, to make translation from the mini-language to JavaScript and Python simple.
On the other hand, if the mini-language is functional, then it would take a non-trivial amount of effort to translate it to imperative languages (one of the obstacles being to optimize tail recursion). Of course, making the mini-language imperative disfavors the functional languages. But this is less of a problem than disfavoring the imperative languages. Firstly, (pure) functional languages, such as Scheme and Haskell, are not as widely used as scripting languages, compared to the procedural languages such as Perl and Python. (This is no accident, since scripting languages are often used for one-time, small-scale hacks, where the function formalism becomes more of a hindrance.)
And secondly, it is actually slightly more complicated to write the TEX-to-MathML transformation rules in a functional style than in an imperative style (yes, the author has tried it). This is because TEX syntax is somewhat irregular, being intended to be read and written by humans, and most efficiencies of functional programming over imperative programming comes from exploiting the structure of data. Also, it should be noted that a certain naïve way of writing the transformation rules functionally will lead to an inefficient program with an execution time that is quadratic in the number of input tokens. (The author unwittingly fell into this trap on his original implementation.)
The mini-language in our program can be regarded as analogous to the specification formats that drive other data converters, which are usually declarative. However, a declarative specification format falls down in the present application, for the same reasons that functional programming does not well either — TEX syntax is irregular enough that designing a declarative format that captures all its subtleties is hard.
The mini-language has to be minimal, for the obvious reason that small languages take less effort to translate. The less effort required, the better.
Although minimal often implies elegance in design, we are not writing this program in the mini-language for the sake of elegance. We are interested in:
For us, the mini-language itself is mostly a distraction that has to be tolerated. (However, we cannot deny that it is a nice bonus that the transformation rules can be specified more concisely in the mini-language than in direct JavaScript.)
Thus, we will not “design” the mini-language as such, but simply tack on features only as they are needed to describe the algorithms. For instance, the mini-language currently has no support for arrays; if we want to write a transformation rule that uses arrays, then we simply add that to the language. And the array operations would be implemented by translating them to array operations in the underlying language (JavaScript, Python, etc.).
This strategy of ad-hoc, unprincipled extension can sometimes be disasterous, but there is little cause for concern in our case. Since the mini-language syntax is based on S-expressions, it can be easily extended without worrying about the parsing details. Also, the problem of TEX-to-MathML conversion is small enough, and fairly restricted, so that we should not need to extend it very much: the input is TEX math markup; the output is MathML markup. Or, as the saying goes: we do not need to expand our program to the point that it can read e-mail.
This strategy saves the author from wasting effort to design yet another extension language. We do need some implementation experience to know what features to include, and what features can be safely omitted — this was provided to the author by his first JavaScript-only implementation of this program.
It hardly needs to be said, but any glue code or interface code (e.g. interacting with the user, or the Web browser) need not, should not, and will not be written in the mini-language.
We will also point out that, because the mini-language is so minimal, sometimes the procedures written in it are forced to be written in a low-level way — looping over conditions, reading and advancing token pointers, etc. — somewhat like programming in Forth or assembly language. This mostly cannot be helped, because TEX syntax is so varied that it is difficult to define higher-level abstractions to parse it.
As our mini-language is all about “anti-design” — the author refuses to even name the mini-language — naturally it is defined only by its implementations — that is, its translations to other computer languages. But to help the reader understand the use of the mini-language in this document, we briefly describe its syntax here:
Expressions are either variable (or procedure/function) references,
or one of the following forms enclosed in parentheses.
Expressions never have side effects except for the (call …)
form (for the procedure being called may generate side effects).
(call f x1 x2…) |
call a procedure or function f with arguments x1, x2…, and evaluates to the return or result from f |
(not x) |
logical negation of expression |
(or x1 x2…) |
logical or (disjunction) of two or more expressions |
(and x1 x2…) |
logical and (conjunction) of two or more expressions |
(null? x) |
test for x being null |
(not-null? x) |
test for x not being null |
(result-element n x1
x2…) |
create MathML element (sub-tree) whose name is n, and whose children are x1, x2… in order. |
(if t x y) |
test if t is true; if it is, evaluate expression x, otherwise evaluate expression y |
(str-eq? x y) |
test if two strings are equal |
(in? x A) |
test if x is a key in the associative array A |
(get x A) |
return value which is paired with key x in the associative array A |
(lambda x1 x2… xn e) |
define an inline, anonymous function that returns the expression e, with the given arguments x1, x2… |
(read-token) |
get token at current pointer |
(nil) |
the null object |
The mini-language does not have lists or arrays. For the most common usage of lists in this application, namely, to build up content for MathML elements, direct MathML tree manipulation may be used instead. This turns out to be more efficient, and not unlike another popular transformation language, XSLT.
This section gives the definitions for the “simple” TEX and LATEX commands that produce a single mathematical symbol. To translate to MathML, each symbol must be mapped to its Unicode representation.
Punctuation and space Delimiters Operator symbols Relation symbols Named identifiers Word operators Greek letters
The majority of the following TEX definitions were derived by taking the lists of characters, given as TEX escapes, in the Short Math Guide for LATEX by the American Mathematical Society, and cross-mapping them with the STIX character table, which contains TEX escapes for various Unicode characters used in mathematical expressions.
The lists of characters from the Short Guide to LATEX were entered by hand, as the author was not able to find a more official machine-readable list. Each list was entered as simple text files in the tables/ directory, with one TEX escape on each line. For convenience, there is also a copy of the STIX character table in the tables/ directory.
A Python script and Unix shell script (using standard Unix commands like sort, cut, join and awk) together take the AMS-LATEX lists of characters and the STIX table as input, and output the mappings of the TEX escapes to their Unicode characters. This output, in the format of HTML tables, can then be pasted in the XHTML source for this document.
The STIX character table was not used alone as the source, because it does not contain all the TEX escapes, and it is not clear whether it categorizes the TEX characters in the way that we want them to be used in the user script. It is better to double-check the character mappings, and add any missing characters, manually.
Any entries added directly to the JavaScript source will be separated out in their own sections, so that when we need to automatically generate the majority of the entries again, we will not clobber over the manual changes.
|
|
Punctuation and space characters are to be displayed with the mtext element.
These were all added manually.
(var $punct-and-space (table See table below ))
|
|
(var $left-delimiters (table Left delimiters)) (var $right-delimiters (table Right delimiters ))
|
|
|
|
Operator symbols are to be displayed with the mo element.
(var $operator-symbols (table Operator symbols More operator symbols Extensible vertical arrows Big operators Big operators without limit style Miscellaneous simple symbols Other alphabetic symbols ))
This listing comes from tables/tex-ops.txt, which in turn comes from the list under “3.7 Binary Operator Symbols”, p. 6, in the AMS LATEX Guide.
|
|
|
|
This listing comes from tables/tex-bigops.txt, which in turn comes from the list under “3.11 Cumulative (variable-size) operators”, p. 7, in the AMS LATEX Guide.
|
|
The integral symbol is special from the others, in that even in display style, subscripts and superscripts are not in the limit style (for the obvious reason that it would not look good).
|
|
This listing comes from tables/tex-ext-arrows.txt, which in turn comes from the list under “3.14 Extensible vertical arrows”, p. 8, in the AMS LATEX Guide.
|
|
This listing comes from tables/tex-miscsym.txt, which in turn comes from the list under “3.6 Miscellaneous simple symbols”, p. 5, in the AMS LATEX Guide.
|
|
|
|
(var $relation-symbols (table
("=" "=")
("<" "<")
(">" ">")
Comparison symbols
Arrows
Miscellaneous
))
This listing comes from tables/tex-relsym.txt, which in turn comes from the list under “3.8 Relation Symbols”, p. 6, in the AMS LATEX Guide.
Equality, inequality, etc.
|
|
This listing comes from tables/tex-relmisc.txt, which in turn comes from the list under “3.10 Relation Symbols: Miscellaneous”, p. 7, in the AMS LATEX Guide.
|
|
This listing comes from tables/tex-arrows.txt, which in turn comes from the list under “3.9 Relation Symbols: Arrows”, p. 7, in the AMS LATEX Guide.
|
|
Named identifiers include all Greek letters, named functions and operators, and other alphabetic identifiers that are not single Roman letters.
These are to be displayed using the mi element.
(var $named-identifiers (table Word operators Big word operators Greek letters Roman letters Ellipsis characters ))
(var $word-operators (table Word operators )) (var $big-word-operators (table Big word operators))
This listing comes from “3.17 Named operators”, p. 8, in the AMS LATEX Guide.
These TEX commands represent common operators and functions written with several letters (e.g. the sine function). Obviously, the STIX table generally contains only single-character mappings and not these commands, so we must type enter all of them in manually. The mapping is trivial.
|
|
|
|
(var $greek-letters (table See table below))
|
|
Although the Roman letters obviously map to themselves, we still list them explicitly, so that we do not have to add a special case (check for Roman letters) to the transformation rules.
|
|
By a complex command, we mean a TeX command that is not simple; that is, the TeX command does more than just output a character or string of characters.
Complex commands include any sort of TeX command that takes arguments.
Fractions
Binomial coefficients
Square roots and radicals
Parenthesized mod
User-named functions and operators
Setting display style
Setting display style: displaymath
Changing math font styles
Changing math fonts (old-style commands)
Changing math font sizes
Accents on characters
Matrices
Array environment
Matrices: mtable subroutine
Under- and over-decorations
Matching sizes of delimiters
Matching sizes of delimiters: get delimiter subroutine
LATEX blocks
LATEX blocks: end
Combining operators
Character escapes
Embedded text
List of all commands
List of all LATEX blocks
(var $tex-commands (table
("\\frac" fraction-to-mathml )
("\\dfrac" fraction-to-mathml )
("\\tfrac" fraction-to-mathml )
("\\binom" binom-to-mathml )
("\\sqrt" sqrt-to-mathml )
("\\operatorname" operatorname-to-mathml )
("\\displaystyle" displaystyle-to-mathml )
Parenthesized mod
Changing math font styles
Changing math fonts (old-style commands)
Changing font sizes
Accents on characters
Under- and over- decorations
Combining operators
Matching sizes of delimiters
("\\char" char-escape-to-mathml)
("\\!" (lambda () (nil)))
Embedded text commands
("\\begin" latex-block-to-mathml)
))
A LaTeX environment is a special form of commands,
where the logical argument is enclosed
in between
two tags \begin{name}
and \end{name},
instead of the braces { and }.
A LaTeX environment is also called a LaTeX block (a more intuitive term).
The following is a list of all the LaTeX blocks we can parse.
(var $tex-environments (table
("smallmatrix" (lambda () (call matrix-to-mathml "(" ")" ) ))
("pmatrix" (lambda () (call matrix-to-mathml "(" ")" ) ))
("bmatrix" (lambda () (call matrix-to-mathml "[" "]" ) ))
("Bmatrix" (lambda () (call matrix-to-mathml "{" "}" ) ))
("vmatrix" (lambda () (call matrix-to-mathml "\u007c" "\u007c") ))
("Vmatrix" (lambda () (call matrix-to-mathml "\u2016" "\u2016") ))
("cases" (lambda () (call matrix-to-mathml "{" (nil) ) ))
("array" array-to-mathml )
("displaymath" displaymath-to-mathml)
))
\begin dispatch
The real command name in \begin{name}
is name,
so we need the following function to extract
this name, and dispatch to the procedure
that handles that command specifically.
(procedure (latex-block-to-mathml)
(set cmd (read-token))
(cond ((in? cmd $tex-environments)
(advance-token)
(return (call (get cmd $tex-environments))))
(else
(throw "unknown command"))))
\end dispatch
This procedure should be called when
the \end{name} tag of
a LaTeX block is encountered.
It performs error-checking, and prepares
for the parse after the \end{name}
tag.
(procedure (finish-latex-block)
(cond ((null? (read-token))
(throw "unexpected eof")))
(advance-token)
(advance-token))
(procedure (fraction-to-mathml) (var numerator (call piece-to-mathml)) (var denominator (call piece-to-mathml)) (return (result-element "mfrac" (numerator denominator))))
The common “choose” notation for the binomial coefficients are essentially fractions with added parentheses but without the fraction bar, and is encoded that way in presentational MathML.
(procedure (binom-to-mathml)
(var top (call piece-to-mathml))
(var bottom (call piece-to-mathml))
(return (result-element "mrow"
((result-element "mo" ( "(" ))
(result-element "mfrac" (("linethickness" "0"))
( top bottom ))
(result-element "mo" ( ")" ))))))
A radical in TeX is written \sqrt[i]{x},
where i is the index of the radical,
and x is the expression that appears under the radical.
The index part is optional; if it is omitted,
then the radical is a square root and is translated to the MathML
sqrt element.
Otherwise, the MathML mroot element
is used for the radical with index.
(procedure (sqrt-to-mathml)
(var index (call optional-arg-to-mathml))
(var object (call piece-to-mathml))
(cond ((not-null? index)
(return (result-element "mroot" (object index))))
(else
(return (result-element "msqrt" (object))))))
("\\pod" (lambda () (call parenthesized-operator (nil))))
("\\pmod" (lambda () (call parenthesized-operator "mod")))
(procedure (parenthesized-operator word)
(var object (call piece-to-mathml))
(cond ((not-null? word)
(return (result-element "mrow"
((result-element "mo" ("("))
(result-element "mo" (word))
object
(result-element "mo" (")"))))))
(else
(return (result-element "mrow"
((result-element "mo" ("("))
object
(result-element "mo" (")"))))))))
Note that, by a special case in the tokenization,
\operatorname always receives its argument
as one token.
(procedure (operatorname-to-mathml) (var result (result-element "mo" ((read-token)))) (advance-token) (return result))
When the \displaystyle command is encountered,
everything following it should be typeset as a math “display”.
In MathML, presentational aspects such as the “display mode”
and script size are controlled by attributes on the container element.
However, in Mozilla, some of these attributes do not take effect
unless they are applied to the MathML mstyle.
This is a bug, but the work-around is simple: wrap
the content around a mstyle element.
The only disadvantage is that the output MathML becomes slightly
more bloated.
(procedure (displaystyle-to-mathml)
(var result (call subexpr-chain-to-mathml $hard-stop-tokens))
(return (result-element "mstyle" (("displaystyle" "true")
("scriptlevel" "0")) (result))))
For reference, here is the procedure
to generate a display without the mstyle
work-around:
(procedure (displaystyle-to-mathml) (var result (call subexpr-chain-to-mathml $hard-stop-tokens)) (set-attr result "displaystyle" "true") (set-attr result "scriptlevel" "0") (return result))
Strictly speaking, the displaymath environment
is not an environment to be used inside TEX math mode.
Rather, it activates TEX math mode when
used in text mode, and typesets the formula inside it
as a “display”.
However, this environment appears often in the TEX markup
generated by the LaTeX2HTML translator,
and it is trivially supported.
(procedure (displaymath-to-mathml)
(var result (call subexpr-chain-to-mathml $hard-stop-tokens))
(call finish-latex-block)
(return (result-element "mstyle" (("displaystyle" "true")
("scriptlevel" "0")) (result))))
As with the displaystyle-to-mathml procedure,
the version that outputs standard MathML without the mstyle
work-around is as follows:
(procedure (displaymath-to-mathml) (var result (call subexpr-chain-to-mathml $hard-stop-tokens)) (set-attr result "displaystyle" "true") (set-attr result "scriptlevel" "0") (call finish-latex-block) (return result))
("\\boldsymbol" (lambda () (call font-to-mathml "bold")) )
("\\bold" (lambda () (call font-to-mathml "bold")) )
("\\Bbb" (lambda () (call font-to-mathml "double-struck")) )
("\\mathbb" (lambda () (call font-to-mathml "double-struck")) )
("\\mathbbmss" (lambda () (call font-to-mathml "double-struck")) )
("\\mathbf" (lambda () (call font-to-mathml "bold")) )
("\\mathop" (lambda () (call font-to-mathml "normal")) )
("\\mathrm" (lambda () (call font-to-mathml "normal")) )
("\\mathfrak" (lambda () (call font-to-mathml "fraktur")) )
("\\mathit" (lambda () (call font-to-mathml "italic")) )
("\\mathscr" (lambda () (call font-to-mathml "script")) )
("\\mathcal" (lambda () (call font-to-mathml "script")) )
("\\mathsf" (lambda () (call font-to-mathml "sans-serif")) )
("\\mathtt" (lambda () (call font-to-mathml "monospace")) )
("\\EuScript" (lambda () (call font-to-mathml "script")) )
(procedure (font-to-mathml font-name)
(cond ((str-neq? (read-token) "{")
(var result (result-element "mi" (("mathvariant" font-name))
((read-token))))
(cond ((str-eq? font-name "normal")
(set-attr result "fontstyle" "normal")))
(advance-token)
(return result))
(else
(var result (call piece-to-mathml))
(set-attr result "mathvariant" font-name)
(cond ((str-eq? font-name "normal")
(set-attr result "fontstyle" "normal")))
(return result))))
("\\bf" (lambda () (call old-font-to-mathml "bold")) )
("\\rm" (lambda () (call old-font-to-mathml "normal")) )
(procedure (old-font-to-mathml font-name)
(return
(result-element "mstyle"
(("mathvariant" font-name)
("fontstyle" (if (str-eq? font-name "normal") "normal" (nil))))
((call subexpr-chain-to-mathml $hard-stop-tokens)))))
("\\big" (lambda () (call size-to-mathml "2" "2")))
("\\Big" (lambda () (call size-to-mathml "3" "3")))
("\\bigg" (lambda () (call size-to-mathml "4" "4")))
("\\Bigg" (lambda () (call size-to-mathml "5" "5")))
(procedure (size-to-mathml min-size max-size) (var result (call piece-to-mathml)) (set-attr result "minsize" min-size) (set-attr result "maxsize" max-size) (return result))
In MathML, the inside of the matrix or array
is encoded as an mtable element.
The borders, parentheses, or brackets
are encoded separately.
The procedure matrix-to-mtable
converts the inside of a matrix
into a mtable.
It works
in this straightforward manner:
The procedure starts the mtable with one row
(mtr)
containing one cell (mtd),
which become the current row and current cell.
As we parse TeX content, we add that content
(translated to MathML) to the current cell.
If & is encountered, indicating that we
should break the cell, then we create a new cell
for the current row, and the new cell becomes the current cell.
If \\ is encountered, indicating a line break,
then we insert a new row in the table, with a new cell,
and the new row and the new cell become the current ones.
The procedure should be passed an empty mtable
result element. It does not create one itself,
because the caller may want to add some attributes to the
mtable before processing the table contents.
(procedure (matrix-to-mtable mtable)
(var mtr (result-element "mtr"))
(var mtd (result-element "mtd"))
(var token (read-token))
(append mtable mtr)
(append mtr mtd)
(while (and (not-null? token) (str-neq? token "\\end"))
(cond ((str-eq? token "\\\\")
(set mtr (result-element "mtr"))
(set mtd (result-element "mtd"))
(append mtable mtr)
(append mtr mtd)
(advance-token))
((str-eq? token "&")
(set mtd (result-element "mtd"))
(append mtr mtd)
(advance-token))
(else
(append mtd (call subexpr-chain-to-mathml $hard-stop-tokens))))
(set token (read-token)))
(call finish-latex-block)
(return mtable))
The following procedure, matrix-to-mathml,
handles parsing of TEX matrix blocks.
It adds the borders to the mtable
obtained from the procedure matrix-to-mtable, with
a mrow wrapper around the whole thing.
The border delimiters for matrix-to-mathml
are specified by parameters.
See LATEX block commands list
for the actual parameters for each TEX matrix or array command.
(procedure (matrix-to-mathml open-delim close-delim)
(var mtable (call matrix-to-mtable (result-element "mtable")))
(cond ((or (not-null? open-delim) (not-null? close-delim))
(var mrow (result-element "mrow"))
(cond ((not-null? open-delim)
(append mrow (result-element "mo" (open-delim)))))
(append mrow mtable)
(cond ((not-null? close-delim)
(append mrow (result-element "mo" (close-delim)))))
(return mrow))
(else
(return mtable))))
The following procedure, array-to-mathml,
converts TEX array environments.
As in TEX, no delimiters are put around the array.
(procedure (array-to-mathml) (var mtable (result-element "mtable"))Process the specifications for the column alignment in cells
(cond ((str-eq? (read-token) "{") (advance-token) (while (and (not-null? (read-token)) (str-neq? (read-token) "}")) (cond ((str-eq? (read-token) "c") (append-attr mtable "columnalign" "center ")) ((str-eq? (read-token) "l") (append-attr mtable "columnalign" "left ")) ((str-eq? (read-token) "r") (append-attr mtable "columnalign" "right "))) (advance-token)) (cond ((not-null? (read-token)) (advance-token)))))Process the table itself
(return (call matrix-to-mtable mtable)))
("\\acute" (lambda () (call accent-to-mathml "\u0301")) )
("\\grave" (lambda () (call accent-to-mathml "\u0300")) )
("\\tilde" (lambda () (call accent-to-mathml "\u0303")) )
("\\bar" (lambda () (call accent-to-mathml "\u0304")) )
("\\breve" (lambda () (call accent-to-mathml "\u0306")) )
("\\check" (lambda () (call accent-to-mathml "\u030c")) )
("\\hat" (lambda () (call accent-to-mathml "\u0302")) )
("\\vec" (lambda () (call accent-to-mathml "\u20d7")) )
("\\dot" (lambda () (call accent-to-mathml "\u0307")) )
("\\ddot" (lambda () (call accent-to-mathml "\u0308")) )
("\\dddot" (lambda () (call accent-to-mathml "\u20db")) )
Accents are displayed in MathML by placing
the corresponding Unicode character in an mover element,
over the base letter, and with the accent
attribute set to true.
(procedure (accent-to-mathml char)
(return (result-element "mover" (("accent" "true"))
((call piece-to-mathml)
(result-element "mo" (char))))))
("\\underbrace" (lambda () (call under-to-mathml "\ufe38")) )
("\\overbrace" (lambda () (call over-to-mathml "\ufe37")) )
("\\underline" (lambda () (call under-to-mathml "\u0332")) )
("\\overline" (lambda () (call over-to-mathml "\u00af")) )
("\\widetilde" (lambda () (call over-to-mathml "\u0303")) )
("\\widehat" (lambda () (call over-to-mathml "\u0302")) )
(procedure (over-to-mathml char)
(return (result-element "mover" ((call piece-to-mathml)
(result-element "mo" (char))))))
(procedure (under-to-mathml char)
(return (result-element "munder" ((call piece-to-mathml)
(result-element "mo" (char))))))
Combining operators refer to TEX commands
such as \not which draws a slash through the next
operator. In general, it may refer to any operator
that modifies the next one by superimposing some other
mark over the glyph for the next operator.
The term “combining” is from the Unicode standards; a “combining character” (or “combining diacritic”) modifies the base character in some way. The only difference is that in Unicode, the combining character or diacritic comes after the base character rather than before it.
("\\not" (lambda () (call combining-to-mathml 0338 ̸)))
(procedure (combining-to-mathml char)
(var base (read-token))
(advance-token)
(return
(result-element "mo" (base char))))
("\\left" (lambda () (call delimiter-to-mathml "\\right" "1" (nil))) )
("\\bigl" (lambda () (call delimiter-to-mathml "\\bigr" "2" "2" )) )
("\\Bigl" (lambda () (call delimiter-to-mathml "\\Bigr" "3" "3" )) )
("\\biggl" (lambda () (call delimiter-to-mathml "\\biggr" "4" "4" )) )
("\\Biggl" (lambda () (call delimiter-to-mathml "\\Biggr" "5" "5" )) )
The commands for matching sizes of delimiters
are unusual in that they do not take arguments
enclosed between { and },
but the character that occurs right after the command.
Also the command for the left delimiter
must be matched by the command for the right delimiter.
So we must know to stop parsing when we encounter
the command for the right delimiter.
(procedure (delimiter-to-mathml end-command min-size max-size)
(var mrow (result-element "mrow"))
(append mrow (result-element "mo" (("minsize" min-size)
("maxsize" max-size))
((call read-delimiter))))
(append mrow (call subexpr-chain-to-mathml $hard-stop-tokens))
(cond ((str-neq? (read-token) end-command)
(return mrow)))
(advance-token)
(append mrow (result-element "mo" (("minsize" min-size)
("maxsize" max-size))
((call read-delimiter))))
(return mrow))
This subroutine reads the next token, checks that it is a valid delimiter, and handles the special cases:
\left. | produces a blank instead of a dot |
\right. | |
\left< | produces the left-pointing angle bracket instead of a less-than sign |
\right< | |
\right> | produces the right-pointing bracket instead of a greater-than sign |
\left> |
(procedure (read-delimiter)
(var token (read-token))
(cond ((null? token)
(throw "unexpected eof"))
((str-eq? token ".")
(advance-token)
(return ""))
((str-eq? token "<")
(advance-token)
(return "\u2329"))
((str-eq? token ">")
(advance-token)
(return "\u232a"))
((in? token $punct-and-space)
(advance-token)
(return (get token $punct-and-space)))
((in? token $left-delimiters)
(advance-token)
(return (get token $left-delimiters)))
((in? token $right-delimiters)
(advance-token)
(return (get token $right-delimiters)))
((in? token $operator-symbols)
(advance-token)
(return (get token $operator-symbols)))
(else
(throw "invalid delimiter"))))
This is weird stuff that comes out from LaTeX2HTML. Normally the TeX typist should not enter character codes for any character.
(var $char-escape-codes (table
("93" "#")))
(procedure (char-escape-to-mathml)
(var result (nil))
(cond ((in? (read-token) $char-escape-codes)
(set result (result-element "mtext"
((get (read-token) $char-escape-codes)))))
(else
(set result (result-element "merror" ("\\char" (read-token))))))
(advance-token)
(return result))
("\\text" text-to-mathml )
("\\textnormal" text-to-mathml )
("\\textrm" text-to-mathml )
("\\textsl" text-to-mathml )
("\\textit" text-to-mathml )
("\\texttt" text-to-mathml )
("\\textbf" text-to-mathml )
("\\hbox" text-to-mathml )
("\\mbox" text-to-mathml )
The embedded text is not split into words or tokens;
it is split only when it is interrupted by an inline math
formula. Each segment of text goes in a mtext
element. Embedded inline math formulas in the embedded text
are signaled, as in TEX, by $ tokens.
When $ is encountered,
normal math parsing begins again, until interrupted by another
$ (at the same nesting level).
The parsing knows when to stop because $
is considered a “hard stop token”.
(procedure (text-to-mathml)
(cond ((str-neq? (read-token) "{")
(var result (result-element "mtext" ((read-token))))
(advance-token)
(return result)))
(advance-token)
(var result (nil))
(var mrow (nil))
(var node (nil))
(while (not-null? (read-token))
(cond ((str-eq? (read-token) "}")
(advance-token)
(return result))
((str-eq? (read-token) "$")
(advance-token)
(set node (call subexpr-chain-to-mathml $hard-stop-tokens))
(advance-token))
(else
(set node (result-element "mtext" ((read-token))))
(advance-token)))
Collect results)
(return result))
At the most fundamental level, TEX markup is parsed by reading the markup
one character at a time, and taking actions based on what that character is.
This is the approach taken by the TEX
typesetting system itself. So, for example, the
string typeset means, in TEX, to construct
the glyphs for each character (
t
y
p
e
s
e
t
), and then paste them together to form (part of) a paragraph.
We do not take this approach in our TEX parser,
because, in most scripting languages, and in XML, processing
text by looping through each individual character is inefficient.
Also, some character sequences have to be processed together
anyway — these are called “tokens”.
Examples of tokens are: \catcode and
3.14 (as opposed to interpreting them as
\
c
a
t
c
o
d
e
or
3
.
1
4
)
.
A well-understood and widely implemented specification for matching character sequences is that of regular expressions. Our program considers the TEX markup that is input as a stream of characters; the head of the character stream is matched against a regular expression. The matching characters are grouped into one token (or sometimes a few tokens), and then those characters are consumed (removed from the head of the input character stream). Much of the LATEX-to-MathML translator only deals with the tokens, not the character stream.
The following regular expression greedily matches sequences of characters to be considered as tokens in “math mode”. The matching sequences are grouped by parentheses.
| (\\begin|\\operatorname|\\mathrm|\\mathop|\\end)\s*\{\s*([A-Z a-z]+)\s*\} | match commands for LATEX environments | |
| | | (\\[a-zA-Z]+|\\[\\#\{\},:;!]) | match other commands |
| | | (\s+) | match blank token |
| | | ([0-9\.]+) | match numbers |
| | | ([\$!"#%&'()*+,-.\/:;<=>?\[\]^_`\{\|\}~]) | match tokens for operators |
For reasons that will be explained later, the regular expression
just given omits the matching of
identifiers (constants or variables denoted by Roman letters).
Assuming, for the moment, that such identifiers
are simply the letters of the Roman alphabet,
we can give an example of applying this regular expression.
Using it, the TEX markup \sin(\frac 1 {xy^2}) (in math mode)
will be split into these tokens:
\sin
(
\frac
1
{
x
y
^
2
}
)
Here, for clarity, the blank tokens consisting of only spaces have been ignored, because blank tokens will eventually be displayed as nothing anyway.
The regular expression for math mode works well for mathematics written in TEX that do not require switching out of “math mode”. To elaborate, some mathematical formulae make use of natural-language descriptions like this one:
B = \{ z \in A : \text{where $f(z)$ is purely imaginary} \}
In this case, the English text “where … is purely imaginary” must
be parsed as English text, and not as the mathematical variables
w, h, e, r, ….
TEX has the notion of modes that distinguish between symbolic formulae and natural-language text. The former is handled by the “math mode”, and the latter by the “horizontal mode” or “paragraph mode”. Since TEX always processes markup a character at a time, the modes do not affect the syntax of the markup, but rather they affect how the input characters are typeset into the resulting document.
However, our TEX markup parser works differently; it parses a formulae written in TEX into tokens using regular expressions, and the regular expressions are not valid for parsing natural-language text. So to parse natural-language text embedded into mathematical formulae, we need a separate set of regular expressions to parse natural-language text. When embedded text is encountered we must switch into the separate set of regular expressions for parsing it.
Here is the regular expression for parsing natural-language text (outside of “math mode”), possibly with embedded TEX commands:
| [\${}\\] | ||
| | | \\[a-zA-Z]+ | match commands |
| | | [^{}\$]+ |
So far, we have presented two sets of regular expressions.
The two sets cannot be combined into one, because selecting
the correct parsing mode depends on counting nesting levels
of matching braces ({ and }) or
math-mode delimiters ($). And regular expressions
provably cannot do this — they do not provide
recursively-defined subexpressions.
Thus, full TEX markup cannot be parsed using regular expressions alone. It could be parsed with a context-free grammar instead. However, parsing a general context-free grammar, using a parser generator, is overkill for our purposes — parser generators are not readily available for the JavaScript language, and generators from scripting languages are likely to be less efficient from a hand-rolled parser.
Actually, a hand-rolled parser is not difficult to implement.
The text, textrm, textsl, textit, texttt, and textbf commands switch
out of math mode and into text mode (TEX’s horizontal mode to be precise.
This means that whatever that occurs inside
these commands cannot be tokenized as usual math tokens.
(\\textrm|\\textsl|\\textit|\\texttt|\\textbf|\\text|\\hbox)
const tokenize_re = /Regular expression for tokenizing TEX input|([a-zA-Z])/g; const tokenize_text_re = /[\${}\\]|\\[a-zA-Z]+|[^{}\$]+/g; const tokenize_text_commands = { '\\textrm': 1, '\\textsl': 1, '\\textit': 1, '\\texttt': 1, '\\textbf': 1, '\\textnormal': 1, '\\text': 1, '\\hbox': 1, '\\mbox': 1 }; function tokenize_latex_math(input) { var result = new Array(); var in_text_mode = 0; var brace_level = []; var pos = 0; if(input.charAt(0) == '$' && input.charAt(input.length-1) == '$') input = input.slice(1, input.length-1); while(1) { if(!in_text_mode) { tokenize_re.lastIndex = pos; var m = tokenize_re.exec(input); pos = tokenize_re.lastIndex; if(m == null) { return result; } else if(m[1] != null) { result.push(m[1], m[2]); } else if(m[3] == '\\sp') { result.push('^'); } else if(m[3] == '\\sb') { result.push('_'); } else { if(m[0] == '$') { in_text_mode = 1; } else if(m[4] != null) { continue; } else if(m[3] != null && m[3] in tokenize_text_commands) { in_text_mode = 2; brace_level.push(0); } result.push(m[0]); } } else { tokenize_text_re.lastIndex = pos; var m = tokenize_text_re.exec(input); pos = tokenize_text_re.lastIndex; if(m == null) { return result; } else if(m[0] == '$') { in_text_mode = 0; } else if(m[0] == '{') { brace_level[brace_level.length-1]++; } else if(m[0] == '}') { if(--brace_level[brace_level.length-1] <= 0) { in_text_mode = 0; brace_level.pop(); } } result.push(m[0]); } } }
tokenize_re = re.compile(ur"""Regular expression for tokenizing TEX input|([a-zA-Z])""") tokenize_text_re = re.compile(ur"""[\${}\\]|\\[a-zA-Z]+|[^{}\$]+""") tokenize_text_commands = { u'\\textrm': 1, u'\\textsl': 1, u'\\textit': 1, u'\\texttt': 1, u'\\textbf': 1, u'\\text': 1, u'\\textnormal': 1, u'\\hbox': 1, u'\\mbox': 1, } def tokenize_latex_math(self, input): in_text_mode = 0; brace_level = []; pos = 0; input = unicode(input) if input[0] == u'$' and input[-1] == u'$': input = input[1:-1] while True: if not in_text_mode: m = self.tokenize_re.match(input, pos) if m is None: return pos = m.end() if m.group(1) is not None: self.tokens.extend(m.group((1,2))) elif m.group(3) == u"\\sp": self.tokens.append(u"^") elif m.group(3) == u"\\sb": self.tokens.append(u"_") else: if m.group(0) == u"$": in_text_mode = 1 elif m.group(4) is not None: continue elif m.group(3) in self.tokenize_text_commands: in_text_mode = 2; brace_level.append(0) self.tokens.append(m.group(0)) else: m = self.tokenize_text_re.match(input, pos) if m is None: return pos = m.end() if m.group(0) == u"$": in_text_mode = 0 elif m.group(0) == u"{": brace_level[-1] += 1 elif m.group(0) == u"}": brace_level[-1] -= 1 if brace_level[-1] <= 0: in_text_mode = 0 brace_level.pop() self.tokens.append(m.group(0))
We parse TEX and LATEX in two stages: the first stage splits the input string into tokens, then the transformation rules march through these tokens (generally from left to right), translating them to the MathML.
The transformation rules are written as procedures that call each other recursively to build up MathML sub-trees.
List of limit commands Parse a piece Parse a subexpression Parse a subexpression chain Parse optional arguments
In our parlance, a piece is a token or a group of tokens that TEX takes as one piece. Examples:
xy |
consists of two pieces, x and y, each
piece being also a token |
{xy} |
has only one piece consisting of 4 tokens, {xy} |
2^nm |
The piece occurring after ^ is the token n |
2^{nm} |
The piece occurring after ^ is {nm},
consisting of 4 tokens |
2^\mathrm{F}_2 |
The piece occurring after ^ is \mathrm{F},
consisting of 4 tokens |
The necessity of distinguishing between a piece and a token
should be clear from these examples. In 2^mn,
the superscript, as parsed by TEX, is just m.
To get an mn superscript, the two tokens must
be enclosed as one piece: {mn}.
More precisely, a piece is defined to be one of the following:
{ or } or a TEX command token{ and ending with
a matching }
The following procedure, piece-to-mathml,
takes the next piece and translates it to MathML.
To do this, it reads in the next token,
and determines the appropriate action depending on the type of
the token.
(procedure (piece-to-mathml)
(var token (read-token))
(var result (nil))
(cond ((str-eq? token "{")
(advance-token)
(set result (call subexpr-chain-to-mathml $hard-stop-tokens))
(cond ((str-eq? (read-token) "}")
(advance-token))))
((in? token $relation-symbols)
(set result (result-element "mo"
((get token $relation-symbols))))
(advance-token))
((in? token $operator-symbols)
(set result (result-element "mo"
((get token $operator-symbols))))
(advance-token))
((in? token $left-delimiters)
(set result (result-element "mo"
((get token $left-delimiters))))
(advance-token))
((in? token $right-delimiters)
(set result (result-element "mo"
((get token $right-delimiters))))
(advance-token))
((in? token $word-operators)
(set result (result-element "mi" (("mathvariant" "normal"))
((get token $word-operators))))
(advance-token))
((in? token $greek-letters)
(set result (result-element "mi" (("fontstyle" "normal"))
((get token $greek-letters))))
(advance-token))
((in? token $named-identifiers)
(set result (result-element "mi"
((get token $named-identifiers))))
(advance-token))
((in? token $punct-and-space)
(set result (result-element "mtext"
((get token $punct-and-space))))
(advance-token))
((in? token $tex-commands)
(advance-token)
(set result (call (get token $tex-commands))))
(else
(set result (result-element "mn" (token)))
(advance-token)))
(return result))
In TEX the concept that corresponds roughly to a piece is instead called an atom. However, “atom” is the wrong word for the concept, since “atom” means indivisible, but pieces can be groups of atoms.
The two terms piece and atom technically do not refer
to the same thing, because of a technical difference in which
we do parsing.
Recall that we parse a character sequence
consisting of digits (and decimal points)
as one token representing a number, rather than as separate tokens
for each digit.
So for example, 431
is one token — and hence one piece — in our sense, while in TEX
there are 3 atoms: 4 3 1.
This can lead to incompatibilities when parsing TEX markup.
For example, in TEX, the markup
\frac42 6 means the same as \frac{4}{2} 6,
and not \frac{42}{6}.
Also, TEX does not allow LATEX blocks to occur whenever the context demands an atom. So for example, this markup is illegal:
x^\begin{smallmatrix} 1 & 2 \\ 3 & 4\end{smallmatrix},
although this is legal:
x^{\begin{smallmatrix} 1 & 2 \\ 3 & 4\end{smallmatrix}}.
In our LATEX parser both are legal — that simplifies the implementation slightly. Technically it constitutes an incompatibility, but only invalid TEX markup is effected, so there is little to worry about.
A subexpression, in our case, means a piece along with its attached subscripts and superscripts.
FIX ME. This definition will have to be expanded eventually, to allow semantic parsing.
(procedure (subexpr-to-mathml) (var result (nil)) Parse any prescripts for tensor indices (var limit-style (in? (read-token) $limit-commands)) (cond ((null? (read-token)) (cond ((not-null? mmultiscripts) (prepend mmultiscripts (result-element "mrow") mprescripts) (return mmultiscripts)) (else (return (result-element "mrow"))))) ((in? (read-token) $left-delimiters) (set result (call heuristic-subexpression))) (else (set result (call piece-to-mathml)))) (var base result) Incorporate the following TEX subscript and superscript, if present Parse any postscripts for tensor indices Place the sub- and superscripts (return result))
Parsing subscripts and superscripts is not entirely straightforward.
Firstly, TEX signals subscripts and superscripts by placing ^
or _ tokens after the base object,
whereas MathML signals subscripts and superscripts by an element
that wraps the base object. In practice, this means
we must check for the ^ or _ tokens
separately after processing other kinds of objects,
and go back and wrap the previous object if ^
or _ is found.
Secondly, subscripts and superscripts in TEX actually stand
for underscripts and overscripts, when they are used in big operators
(in display style).
This is called the “limit style” of subscripts and superscripts.
MathML makes clear distinctions between subscripts/superscripts,
and underscripts/overscripts, and we must disambiguate between
the two styles by looking at the object that the TEX ^
or _ token is being applied to.
Finally, parsing tensor indices (multiple indices for a single base expression) opens another can of worms.
The following subportion of the subexpr-to-mathml procedure
determines if the token immediately following a piece
is ^ or _,
and if so, puts the previous result element
into a MathML subscript (msub), superscript
(msup),
underscript (munder), or overscript
(mover) wrapper.
Also, a combination of a subscript and a superscript
should result in a combined MathML msubsup
or munderover element, to avoid the scripts
being staggered.
If no TEX subscript or superscript is found, the previous result element is left alone.
(var subscript (nil))
(var superscript (nil))
(cond
((str-eq? (read-token) "_")
(advance-token)
(set subscript (call piece-to-mathml)))
((str-eq? (read-token) "^")
(advance-token)
(set superscript (call piece-to-mathml))))
(cond
((str-eq? (read-token) "_")
(advance-token)
(set subscript (call piece-to-mathml)))
((str-eq? (read-token) "^")
(advance-token)
(set superscript (call piece-to-mathml))))
(cond
((not-null? mmultiscripts)
(prepend mmultiscripts
base
mprescripts)
(prepend mmultiscripts
(if (not-null? subscript)
subscript
(result-element "none"))
mprescripts)
(prepend mmultiscripts
(if (not-null? superscript)
superscript
(result-element "none"))
mprescripts)))
(cond
((not-null? mmultiscripts)
(set result mmultiscripts))
((and (not-null? subscript) (not-null? superscript))
(set result (result-element
(if limit-style "munderover" "msubsup")
(base subscript superscript))))
((not-null? subscript)
(set result (result-element
(if limit-style "munder" "msub")
(base subscript))))
((not-null? superscript)
(set result (result-element
(if limit-style "mover" "msup")
(base superscript)))))
TEX subscripts and superscripts applied to the following commands should be understood as MathML underscripts and overscripts. For other commands, subscripts and superscripts continue to be mapped to subscripts and superscripts.
(var $limit-commands (table Big operators Big word operators ("\\underbrace" (nil)) ("\\overbrace" (nil)) ("\\underline" (nil)) ("\\overline" (nil)) ))
When mathematical expressions are to be typeset inline,
even the big operators drop the limit style.
Nevertheless, the limits are still encoded as underscripts
and overscripts in MathML; MathML has a separate attribute
(movablelimits) to indicate that the limit style
should be dropped.
(var mmultiscripts (nil))
(var mprescripts (nil))
(cond ((and (str-eq? (read-token 0) "{")
(str-eq? (read-token 1) "}")
(or (str-eq? (read-token 2) "_")
(str-eq? (read-token 2) "^")))
(set mmultiscripts (result-element "mmultiscripts"))
(set mprescripts (result-element "mprescripts"))
(append mmultiscripts mprescripts)
(while (and (str-eq? (read-token 0) "{")
(str-eq? (read-token 1) "}")
(or (str-eq? (read-token 2) "_")
(str-eq? (read-token 2) "^")))
(var subscript (nil))
(var superscript (nil))
(advance-token)
(advance-token)
(cond ((str-eq? (read-token) "_")
(advance-token)
(set subscript (call piece-to-mathml)))
((str-eq? (read-token) "^")
(advance-token)
(set superscript (call piece-to-mathml))))
(cond ((str-eq? (read-token) "_")
(advance-token)
(set subscript (call piece-to-mathml)))
((str-eq? (read-token) "^")
(advance-token)
(set superscript (call piece-to-mathml))))
(append mmultiscripts
(if (not-null? subscript)
subscript
(result-element "none")))
(append mmultiscripts
(if (not-null? superscript)
superscript
(result-element "none"))))))
(while (and (str-eq? (read-token 0) "{")
(str-eq? (read-token 1) "}")
(or (str-eq? (read-token 2) "_")
(str-eq? (read-token 2) "^")))
(cond ((null? mmultiscripts)
(set mmultiscripts (result-element "mmultiscripts" (base)))
(set mprescripts (nil))
(cond ((or? (not-null? superscript)
(not-null? subscript))
(append mmultiscripts
(if (not-null? subscript)
subscript
(result-element "none")))
(append mmultiscripts
(if (not-null? superscript)
superscript
(result-element "none")))))))
(var subscript (nil))
(var superscript (nil))
(advance-token)
(advance-token)
(cond ((str-eq? (read-token) "_")
(advance-token)
(set subscript (call piece-to-mathml)))
((str-eq? (read-token) "^")
(advance-token)
(set superscript (call piece-to-mathml))))
(cond ((str-eq? (read-token) "_")
(advance-token)
(set subscript (call piece-to-mathml)))
((str-eq? (read-token) "^")
(advance-token)
(set superscript (call piece-to-mathml))))
(prepend mmultiscripts
(if (not-null? subscript)
subscript
(result-element "none"))
mprescripts)
(prepend mmultiscripts
(if (not-null? superscript)
superscript
(result-element "none"))
mprescripts))
(cond ((and (str-eq? token2 "^") (str-eq? (read-token) "\\circ"))
(advance-token)
(append result 00b0 °))
(cond ((str-eq? (read-token) "_")
(set token2 "_")
(advance-token))
(else
(set token2 (nil)))))
In TEX, the markup ' is equivalent to
^{\prime}. This sets the \prime symbol
— which is a vertically-centered glyph — to the superscript position,
so that it appears as the familiar mathematical prime mark.
There is some ambiguity in how a prime mark should be translated:
In view of these two interpretations of the prime mark, we employ the following heuristic approach when converting formulae with prime marks into MathML:
'),
but it should be translated to
the proper Unicode prime mark character for MathML
(2032 ′).
<mi>a′</mi>,
for example.
In the case of two prime marks,
the marks should be coalesced into one Unicode double-prime character
(2033 ″).
<mi>f</mi>
<mo>′</mo>
<mo>′</mo>
<mo>′</mo>
.
The following procedure fragment implements this heuristic.
((in? token $named-identifiers)
(set result (call piece-to-mathml))
(cond ((str-eq? (read-token) "'")
(advance-token)
(cond ((str-eq? (read-token) "'")
(advance-token)
(cond ((str-eq? (read-token) "'")
(advance-token)
Add arbitrary number of prime marks as operators)
(else
(append result 2033 ″))))
(else
(append result 2032 ′))))))
For three or more prime marks, loop through them and enter them separately as operators.
(set result (result-element "mrow"
(result
(result-element "mo" 2032 ′)
(result-element "mo" 2032 ′)
(result-element "mo" 2032 ′))))
(while (str-eq? (read-token) "'")
(append result (result-element "mo" 2032 ′))
(advance-token))
(procedure (subexpr-chain-to-mathml stop-tokens)
(var result (nil))
(var mrow (nil))
(var mfrac (nil))
(var wrapped-result (nil))
(while (and (not-null? (read-token))
(not (in? (read-token) stop-tokens)))
(cond
Parse \over
Parse \choose
(else
(var node
Parse subexpressions, organizing them in a tree according to precedence
)
Collect results
)))
(cond ((not-null? mfrac)
(append mfrac result)
(return wrapped-result))
(else
(return result))))
(var node (call subexpr-to-mathml)) Collect results
(cond ((not-null? mrow)
(append mrow node))
((not-null? result)
(set mrow (result-element "mrow" (result node)))
(set result mrow))
(else
(set result node)))
Fractions and binomial coefficients in the old TEX style
are written using the \over and \choose commands.
The design of these commands, to put it mildly, is terrible.
These commands take their arguments on the left- and right- sides,
up to the edges of the current block (delimited by braces {
and }).
They are a pain to parse because they do not behave like other
TEX and LATEX commands and require special cases in code.
Nevertheless some authors still use them instead of the better designed — if a little more verbose — LATEX equivalents, so we must support them.
Whenever we encounter an \over or \choose,
the current subexpression up to that point — i.e. the left argument
to the command — is stowed away. Then an empty
mrow is made, which will accumulate the rest
of the arguments (the right argument to the command).
At the end of the expression, the two pieces are put together.
Two separate variables, mfrac and frac
are needed, to provide a location to store the arguments,
and to signal at the end that a fraction (or binomial coefficient)
is to be returned.
((str-eq? (read-token) "\\over") (advance-token) (set mfrac (result-element "mfrac" (result))) (set wrapped-result mfrac) (set mrow (nil)) (set result (nil)))
((str-eq? (read-token) "\\choose")
(advance-token)
(set mfrac (result-element "mfrac" (("linethickness" "0")) (result)))
(set wrapped-result (result-element "mrow" ((result-element "mo" "(")
mfrac
(result-element "mo" ")"))))
(set mrow (nil))
(set result (nil)))
A required argument to a command
(such as the argument {ab} in
the TEX markup \frac{ab}{2} is,
by definition, a piece.
An argument is almost always enclosed in between {
and }, and books and articles on LATEX always talk
about arguments as if this is always the case.
But the TEXbook, and the LATEX implementation itself,
says otherwise: arguments can be atoms without the braces.
(In fact, TEX itself (but not LATEX) does not even
have the concept of a “command with arguments”.)
Consequently, our TEX parser does not need to specifically care about arguments either — we can just parse them as pieces.
LATEX does have a concept of “optional argument”
to commands. At a fundamental level optional arguments are different
from required arguments. An optional argument is not a piece,
but a series of pieces: the first piece being
the opening bracket [, and the last
piece being the closing bracket ].
Note that the brackets are not syntax characters of TEX,
and can be used literally in TEX markup outside
of optional arguments.
All this means we must optional argument processing must be handled by a special handler (below), and the handler must be explicitly called. Fortunately, optional arguments are very rarely used, so the added complexity is not too much to bear.
(var $optional-arg-stop-tokens (table Hard stop tokens ("]" (nil)) )) (procedure (optional-arg-to-mathml) (cond ((str-neq? (read-token) "[") (return (nil)))) (advance-token) (var result (call subexpr-chain-to-mathml $optional-arg-stop-tokens)) (cond ((str-eq? (read-token) "]") (advance-token))) (return result))
("&" (nil))
("\\\\" (nil))
("}" (nil))
("$" (nil))
("\\end" (nil))
("\\right" (nil))
("\\bigr" (nil))
("\\Bigr" (nil))
("\\biggr" (nil))
("\\Biggr" (nil))
("\\choose" (nil))
("\\over" (nil))
Grouping Parsing according to precedence levels
The TEX-to-MathML parser also does grouping of subexpressions.
This means to convert a series of symbols that occur without nesting
in TEX, but really constitute a logical subexpression and should
be grouped as such in the MathML output.
For example, the TEX markup
(x+y+2z)^2
should be translated
as:
<msup>
<mrow>
<mo>(</mo>
<mrow>
<mi>x</mi>
<mo>+</mo>
<mi>y</mi>
<mrow>
<mn>2</mn>
<mi>z</mi>
</mrow>
</mo>)</mo>
</mrow>
<mn>2</mn>
</msup>
The straightforward translation misses the proper semantic encoding, and is not good MathML markup:
<mrow> <mo>(</mo> <mi>x</mi> <mo>+</mo> <mi>y</mi> <mo>+</mo> <mn>2</mn> <mi>z</mi> <msup> <mo>)</mo> <mn>2</mn> </msup> </mrow>
(var $hard-stop-tokens (table Hard stop tokens )) (var $right-delimiter-stop-tokens (table Hard stop tokens Right delimiters )) (procedure (heuristic-subexpression) (var result (result-element "mrow")) (append result (call piece-to-mathml)) (append result (call subexpr-chain-to-mathml $right-delimiter-stop-tokens)) (cond ((and (not-null? (read-token)) (not (in? (read-token) $hard-stop-tokens))) (append result (call piece-to-mathml)))) (return result))
In many situations, the first step for a computer to understand a language is being able to deduce a parse tree from the input (a list of tokens), in which the substructures in the input are clearly isolated.
Although basic LATEX parsing can be done without a parse tree — indeed, the TEX system itself does not generate parse trees to digest its input — the vast majority of math formulae and expressions, as used by practitioners of mathematics, are based on nesting substructures. An example has already been shown at the beginning of this chapter.
The heuristic semantic parsing pursued by our parser, in essence, is the deduction of a parse tree for mathematical formulae without being given the exact formal grammar that the formulae are written in. The exact formal grammar, of course, does not exist.
Nevertheless, we can try to construct the required formal grammar. We will not actually be using a parser generator on the formal grammar to write the LATEX parser, because of problems with efficiency, robustness, and ambiguity. But it is still useful to write down pieces of the formal grammar, as a concise summary of what parse tree (i.e. MathML tree) we are to generate from the input.
| Relation | Expr | Expr RelationOp Relation |
|---|---|
| Expr | Term | Term AddOp Expr |
| Term | Subexpr (MultiplyOp Term | Term)? |
| Subexpr | Identifier | Numeral | ( Relation ) | BigOperation | FunctionApplication |
| FunctionApplication | Identifier ( (Identifier | Numeral | Relation | BigOperation) ) |
| BigOperation | BigOp BigOperandExpr |
| BigOperandExpr | Term | BigOperation |
| AddOp | + | − | ⊕ | ∪ | ∨ | … |
| MultiplyOp | ⋅ | ∕ | × | ⊗ | ∩ | ∧ | … |
| BigOp | ∑ | ∫ | ⋃ | ⋂ | … |
| RelationOp | = | ∊ | ⊂ | ⊆ | … |
| Identifier | [a-zA-Z] | α | … | Ω | sin | exp | … |
| Numeral | −? [0-9]* (. [0-9]+)? |
The goal is to expressions containing operators mixed together in a list, but turn the output into a tree, where each node is a subexpression where the operator leaf nodes all have the same precedence level. And of course, operators that bind more tightly are to occur nearer to the bottom of the tree (away from the root).
Definition of precedence levels for common operators The general parsing procedure, for infix operators Procedure to parse invisible groups
The general procedure to parse infix operators with precedence is (surprisingly) easy and straightforward.
Consider first a simple case, an input like 3*4*5 + 6*7,
where we are to separate the addition terms and the multiplication factors.
Suppose that the current token is at 3.
We read that in, and process it.
Next, we look at the next token, which is *.
This tells us that the next token (4)
that follows is the other operand
of the multiplication *.
So we accumulate in our buffer: 3*4.
Then, we read the next token, the second * operator.
This tells us again that whatever follows (5) is an operand.
We then accumulate in the buffer: 3*4*5.
Finally, we encounter the + operator.
Since this is not the same precedence level as the *
operators we have been considering, we immediately flush the buffer,
and start anew. Then the + is output,
and processing at the token 6 progresses
as at the first token 3.
Parsing, in general, of more than two precedence levels uses stacked functions. Each level of depth of the function calls handles one level of precedence. The operation of “flushing the buffer” then is just returning from a function call with the buffered result, passing control to the function on top — the caller — which will handle the next precedence level. See the next section for a concrete example.
Here is the actual implementation:
(procedure (collect-precedence-group operators stop-tokens reader)
(var result (call reader stop-tokens))
(var mrow (nil))
(while (and (not-null? (read-token))
(not (in? (read-token) stop-tokens))
(in? (read-token) operators))
(cond ((null? mrow)
(set mrow (result-element "mrow" (result)))
(set result mrow)))
(append mrow (call piece-to-mathml))
(cond ((and (not-null? (read-token))
(in? (read-token) stop-tokens))
(return result))
(else
(append mrow (call reader stop-tokens)))))
(return result))
The parameter operators is a table of operators
at the same given precedence level.
The parameter reader is a function object
that will be called to parse tokens at the next precedence
level (deeper into the tree).
The mrow variable points to a mrow
result element which serves as a buffer.
But if the buffer is to contain one object (subtree) only,
the mrow wrapper will be omitted.
Most of the apparent complexity in the collect-precedence-group
procedure is actually checking when to stop
processing. Clearly, the procedure must not loop past
the end of input, but it also must stop processing at
certain tokens (given by the parameter table stop-tokens).
For example, the procedure typically stops processing
at the } token marking the end of the current block.
(Obviously, expression chains should not leak through a TEX syntactic block.)
Compared to parsing precedence levels based on a context-free grammar, our implementation has a serious “win” in that no explicit back-tracking of tokens is required. In effect, the back-tracking happens by having the function call unwind. Thus the implementation is simplified with good performance characteristics: the time to process the input remains linear in the number of tokens; and the run-time memory required is linear in the nesting depth of the mathematical structures represented by the input markup. Yet, because each parsing level can be implemented by different functions, we still retain a lot of flexibility.
This is the hook into subexpr-chain-to-mathml
that parses subexpressions with precedence levels.
Level 1: relations
(call collect-precedence-group $relations-precedence-group stop-tokensLevel 2: addition-like operators
(lambda (stop-tokens) (call collect-precedence-group $addition-precedence-group stop-tokensLevel 3: multiplication-like operators
(lambda (stop-tokens) (call collect-precedence-group $multiplication-precedence-group stop-tokensLevel 4: invisible multiplication-like operators
collect-invisible-group)))))
(var $relations-precedence-group $relation-symbols)
(var $addition-precedence-group (table
("+" (nil))
("-" (nil))
("\\oplus" (nil))
))
(var $multiplication-precedence-group (table
("*" (nil))
("\\times" (nil))
("\\cdot" (nil))
("/" (nil))
))
(procedure (collect-invisible-group stop-tokens)
(var result (call subexpr-to-mathml))
(var mrow (nil))
(while (and (not-null? (read-token))
(not (in? (read-token) stop-tokens))
(or (in? (read-token) $named-identifiers)
(in? (read-token) $left-delimiters)))
(cond ((null? mrow)
(set mrow (result-element "mrow" (result)))
(set result mrow)))
(append mrow (result-element "mo" ("\u2062")))
(cond ((and (not-null? (read-token))
(in? (read-token) stop-tokens))
(return result))
(else
(append mrow (call subexpr-to-mathml)))))
(return result))
// This script was automatically generated from a literate source.
// Do not edit this file; look at the literate source instead!
//
// Greasemonkey user script to
// Display LaTeX in Web pages by transforming to MathML
//
// Home page: http://gold-saucer.afraid.org/mathml/greasemonkey/
//
// --------------------------------------------------------------------
Copyright notice
// --------------------------------------------------------------------
User script data
Subroutines for result trees
Workaround for Mozilla not supporting mathvariant
The following are basic metadata describing the user script, required by the Greasemonkey extension. Also sets the default Web pages where the user script is to activate.
// ==UserScript== // @name Display LaTeX // @namespace http://gold-saucer.afraid.org/mathml/greasemonkey/ // @description Display LaTeX in Web pages by transforming into MathML // @include http://gold-saucer.afraid.org/mathml/greasemonkey/ // @include http://gold-saucer.afraid.org/writings/Display_LaTeX_sandbox // @include http://planetmath.org/* // ==/UserScript==
The MathML XML namespace, needed for creating MathML elements
const mmlns = 'http://www.w3.org/1998/Math/MathML'; Create MathML result element Append MathML result element to another element’s content Prepend MathML result element before another child element Change attribute of result element Change attribute of result element Compatibility layer for the Epiphany browser Greasemonkey configuration
function result_element(tag, num_attrs)
{
var node = document.createElementNS(mmlns, tag);
var k = 2;
while(--num_attrs >= 0) {
if(arguments[k+1] != null) {
node.setAttribute(arguments[k], arguments[k+1]);
}
k += 2;
}
for(; k < arguments.length; k++) {
if(arguments[k] != null) {
if(typeof(arguments[k]) == 'string')
node.appendChild(document.createTextNode(arguments[k]));
else
node.appendChild(arguments[k]);
}
}
return node;
}
function result_element_append(parent, child)
{
if(parent != null && child != null) {
if(typeof(child) == 'string')
parent.appendChild(document.createTextNode(child));
else
parent.appendChild(child);
}
}
function result_element_prepend(parent, child, next)
{
if(next == null)
result_element_append(parent, child);
else if (parent != null && child != null)
parent.insertBefore(child, next);
}
function result_set_attr(elem, attr, value)
{
if(elem != null && attr != null) {
if(value != null)
elem.setAttribute(attr, value);
else
elem.removeAttribute(attr);
}
}
function result_append_attr(elem, attr, value)
{
if(elem != null && attr != null) {
var old_value = elem.getAttribute(elem, attr);
if(old_value == null)
elem.setAttribute(attr, value);
else
elem.setAttribute(attr, old_value + value);
}
}
if(!this.GM_getValue) {
this.GM_getValue = function(key, value) { return value; }
this.GM_log = function() {}
}
if(this.GM_registerMenuCommand) {
GM_registerMenuCommand("Enable native display of math images",
function() {
GM_setValue("patch-images", true);
do_patch_images = true;
patch_element(document.documentElement);
});
GM_registerMenuCommand("Disable native display of math images",
function() {
GM_setValue("patch-images", false);
});
}
mathvariant
Unfortunately, Mozilla does not support the mathvariant
attribute in MathML 2.0.
We fix this by directly substituting any Roman letters
by their variant characters.
This process is done in JavaScript and not in the S-Expressions mini-language, because the mini-language is not powerful enough for this, and the resultant MathML is not portable anyway. (The list of character codes seems to be specific to the Mozilla and not standard.)
const char_map = {
'script': [
'\uEF35', '\u212C', '\uEF36', '\uEF37', '\u2130', '\u2131',
'\uEF38', '\u210B', '\u2110', '\uEF39', '\uEF3A', '\u2112',
'\u2133', '\uEF3B', '\uEF3C', '\uEF3D', '\uEF3E', '\u211B',
'\uEF3F', '\uEF40', '\uEF41', '\uEF42', '\uEF43', '\uEF44',
'\uEF45', '\uEF46' ],
'fraktur': [
'\uEF5D', '\uEF5E', '\u212D', '\uEF5F', '\uEF60', '\uEF61',
'\uEF62', '\u210C', '\u2111', '\uEF63', '\uEF64', '\uEF65',
'\uEF66', '\uEF67', '\uEF68', '\uEF69', '\uEF6A', '\u211C',
'\uEF6B', '\uEF6C', '\uEF6D', '\uEF6E', '\uEF6F', '\uEF70',
'\uEF71', '\u2128' ],
'double-struck': [
'\uEF8C', '\uEF8D', '\u2102', '\uEF8E', '\uEF8F', '\uEF90',
'\uEF91', '\u210D', '\uEF92', '\uEF93', '\uEF94', '\uEF95',
'\uEF96', '\u2115', '\uEF97', '\u2119', '\u211A', '\u211D',
'\uEF98', '\uEF99', '\uEF9A', '\uEF9B', '\uEF9C', '\uEF9D',
'\uEF9E', '\u2124' ],
};
const uppercase_re = /[A-Z]/;
function fix_mathvariant(node, style)
{
if(node.nodeType == node.TEXT_NODE) {
if(style != null && style != '' && style in char_map) {
node.data = node.data.replace(uppercase_re,
function(s) { return char_map[style][s.charCodeAt(0)-65] });
}
} else if(node.nodeType == node.ELEMENT_NODE) {
var new_style = node.getAttribute('mathvariant');
if(new_style != null && new_style != '')
style = new_style;
for(var i=0; i < node.childNodes.length; i++)
fix_mathvariant(node.childNodes.item(i), style);
}
}
The user script calls patch_element on the root
element to patch all LATEX markup within the document.
Tokenizing TEX input
Supplemental processing of MathML
Procedure to patch an img
Procedure to patch a text node
Procedure to patch a DOM subtree
var do_patch_images = GM_getValue("patch-images", false);
var delayed_patch = GM_getValue("delayed-patch", false);
patch_element(document.documentElement);
The patch_element procedure takes the given node,
and replaces all LATEX text in $ signs occurring within that node.
It recursively calls itself to handle the children of the given node.
The procedure can be written more succinctly, without an explicit recursion, as:
var iter = document.createNodeIterator(
document.documentElement, NodeFilter.SHOW_TEXT, null, true);
var n;
while(n = iter.nextNode()) {
patch_text(n);
}
However, the DOM node iteration API is not supported by Mozilla, so we have to write the procedure the “long” way, as given below.
function patch_element(node)
{
if(node.nodeType == node.TEXT_NODE)
patch_text(node);
else if(node.nodeType == node.ELEMENT_NODE) {
If the current node is a text box control of an HTML form,
then do not replace LATEX markup there. Not only does it not display properly
(the text box only displays only plain text),
the user may well be editing some LATEX markup in the text box!
Also, do not attempt to replace LATEX markup in (JavaScript) scripts either;
the dollar sign used in JavaScript confuses the LATEX parser.
if(node.tagName == 'TEXTAREA' || node.tagName == 'textarea' ||
node.tagName == 'INPUT' || node.tagName == 'input' ||
node.tagName == 'SCRIPT' || node.tagName == 'script')
return;
if(do_patch_images && (node.tagName == 'IMG' || node.tagName == 'img')) {
if(!delayed_patch)
patch_img(node);
else
node.addEventListener("click", patch_img, false);
return;
}
var child = node.firstChild;
while(child) {
var next = child.nextSibling;
patch_element(child);
child = next;
}
}
}
It is easy to extend the math patching code so that it also patches the math images produced by LaTeX2HTML translator, and display them with the browser’s native MathML renderer instead.
Sometimes the browser’s MathML rendering turns out to look better than the raster images from LaTeX2HTML — especially if the user is reading Web pages at a font size smaller or larger than usual. Sometimes the MathML rendering is worse. So patching images is only done optionally.
But more importantly, by having the MathML translation code activate on LaTeX2HTML pages, we get more opportunities to test and debug our program.
function patch_img(node)
{
patch_img can also called from an event handler; in that case, the argument is an event object rather than a DOM node. We extract the DOM node stored in that event object.
if(node.currentTarget)
node = node.currentTarget;
var alt = node.getAttribute('alt');
if(alt == null ||
/^\\includegraphics|^\$\\displaystyle \\xymatrix/.test(alt))
return;
var latex_string = null;
Prefer use of the MATH comment, if present instead of ALT attribute
if(!latex_string && /^\$.+\$$/.test(alt)
&& !(/\.{3} \.{3}/.test(alt)))
{
latex_string = alt;
}
if(latex_string == null)
return;
Do the LATEX-to-MathML translation
if(mathml == null)
return;
node.parentNode.replaceChild(mathml, node);
}
LaTeX2HTML has a stupid misfeature where it will
truncate a long math formula in an ALT attribute.
Thus we cannot always recover the original math formula
from the ALT attribute alone.
But at least, LaTeX2HTML does — most of the time — put the full formula
in a <-- MATH --> HTML comment
occurring before the image. Use the contents
of that instead, if the comment is present.
Check for images of formulae produced by LaTeX2HTML
if((node.parentNode.tagName == 'DIV' && node.parentNode.getAttribute('CLASS') == 'mathdisplay') || (node.parentNode.tagName == 'SPAN' && node.parentNode.getAttribute('CLASS') == 'MATH')) { var parent = node.parentNode; var previous = parent.previousSibling; const non_whitespace = /[^\s]/;Skip over whitespace nodes in the DOM
if(previous && previous.nodeType == node.TEXT_NODE && !non_whitespace.test(previous.data)) previous = previous.previousSibling;Sometimes the comment appears in the previous paragraph
if(previous && previous.nodeType == node.ELEMENT_NODE && previous.tagName == 'P' && previous.lastChild) { previous = previous.lastChild; if(previous && previous.nodeType == node.TEXT_NODE && !non_whitespace.test(previous.data)) previous = previous.previousSibling; }Extract the full TEX formulae for the image from the comment, if found
if(previous && previous.nodeType == node.COMMENT_NODE) { latex_string = previous.data.replace(/^\s*MATH\s*/, '') .replace(/\s+$/, ''); } }
tokens = new Object();
tokens.list = tokenize_latex_math(latex_string);
tokens.list.push(null);
tokens.index = 0;
var mathml = null;
try {
var mrow = v_subexpr_chain_to_mathml(tokens, {});
Fix display of variant characters, as described in
fix_mathvariant(mrow, null);
mathml = document.createElementNS(mmlns, 'math');
mathml.setAttribute("latex", latex_string);
mathml.setAttribute("mathvariant", "normal");
mathml.appendChild(mrow);
mathml.addEventListener("click", post_process_mathml, false);
} catch(e) {
GM_log("Display LaTeX failed with error " + e + " on " + latex_string);
}
function patch_text(node0)
{
var text = node0.nodeValue;
var results = /\$[^$]+\$|\[tex\](.+?)\[\/tex\]/.exec(text);
if(results) {
var latex_string = (results[1] == null ? results[0] : '$'+results[1]+'$');
Do the LATEX-to-MathML translation
if(mathml == null)
return;
Split up node0 into two nodes,
at the point where the first $ sign occurs
var node2 = node0.splitText(results.index);
Now delete the original LATEX markup
node2.deleteData(0, results[0].length);
Make and insert a math element in place of the deleted LATEX markup
node2.parentNode.insertBefore(mathml, node2);
There may be more than one $...$ LATEX block in a single text node,
and we processed only one of them, so process the others by
a recursive call.
patch_text(node2);
}
}
The LATEX-to-MathML translator can post the generated MathML content to a Web service. This feature is intended for advanced users and developers — for example, the author of this program uses it to activate a separate script on his computer that speaks the MathML to him (voice synthesis), and to collect the MathML into a database for the program test suite.
To use this feature, the user must initialize it manually by setting the Mozilla preference “greasemonkey.scriptvals.http://gold-saucer.afraid.org/mathml/greasemonkey//Display LaTeX.click-post-url” to the desired Web service URL. Then, to post any displayed MathML content in the Web browser, click on the MathML.
It would be more ideal if this functionality was implemented by calling an external program directly instead, but unfortunately, Greasemonkey user scripts do not have access to the XPCOM interfaces required for this task.
Warning: This feature, if used carelessly, may lead to security or privacy breaches. (Hence, it is not activated by default.)
function post_process_mathml(event)
{
var url = GM_getValue('click-post-url', null);
if(url == null)
return;
var ser = new XMLSerializer();
var xhr = GM_xmlhttpRequest({
method: 'POST',
url: url,
headers: { 'Content-Type': 'text/xml; charset=utf-8',
'Content-Location': document.location },
data: ser.serializeToString(event.currentTarget),
onerror: function(details) {
alert("There was an error processing the request. " +
"HTTP status code " + details.status + ' ' + details.statusText);
},
onload: function(details) {
window.status = "Successfully posted MathML. Status: "
+ details.status + ' ' + details.statusText;
}});
window.status = "Posting MathML to " + url + "...";
}
Trees Serializing XML Tokenizing TEX input Program
String.prototype.repeat = function(n) {
return new Array(n+1).join(this);
}
function xml_escape(s)
{
s = s.replace('&', '&').
replace('<', '<').
replace('>', '>');
return s.replace(/[\u0080-\uffff]/, function(x) { return '&#' + x.charCodeAt(0) + ';' });
}
function xml_attr_escape(s)
{
s = s.replace('&', '&').
replace('"', '"').
replace('<', '<').
replace('>', '>');
return s.replace(/[\u0080-\uffff]/, function(x) { return '&#' + x.charCodeAt(0) + ';' });
}
function serialize_mathml(tree, indent_level)
{
var indent_string = ' '.repeat(indent_level);
if(tree instanceof PlainXMLNode) {
var start_tag = '<' + tree.tag;
if(tree.attrs != null) {
for(var a in tree.attrs)
start_tag += ' ' + a + '="' + xml_attr_escape(tree.attrs[a]) + '"';
}
var empty_tag = start_tag + ' />';
start_tag += '>';
var end_tag = '</' + tree.tag + '>';
if(tree.content.length == 0) {
print(indent_string + empty_tag);
} else if(tree.content.length == 1 &&
typeof(tree.content[0]) == 'string') {
print(indent_string + start_tag + xml_escape(tree.content[0]) + end_tag);
} else {
print(indent_string + start_tag);
for(var i=0; i < tree.content.length; ++i)
serialize_mathml(tree.content[i], indent_level+1);
print(indent_string + end_tag);
}
} else if(typeof(tree) == 'string') {
print(indent_string + xml_escape(tree));
}
}
for(var j=0; j < arguments.length; ++j) {
var input = arguments[j];
var tokens = new Object();
tokens.list = tokenize_latex_math(input);
tokens.list.push(null);
tokens.index = 0;
print('<!--');
for(var i=0; i < tokens.list.length; ++i)
print('token ' + i + ': ' + tokens.list[i]);
print('-->');
var mathml = v_subexpr_chain_to_mathml(tokens, {});
print('<math xmlns="http://www.w3.org/1998/Math/MathML">');
serialize_mathml(mathml, 1);
print('</math>');
}
PlainXMLNode result_element result_element_append result_element_prepend result_set_attr result_append_attr
function PlainXMLNode(tag)
{
this.tag = tag;
this.content = [];
this.attrs = {};
}
function result_element(tag, num_attrs)
{
var node = new PlainXMLNode(tag);
var k = 2;
while(--num_attrs >= 0) {
if(arguments[k+1] != null) {
node.attrs[arguments[k]] = arguments[k+1];
}
k += 2;
}
for(; k < arguments.length; k++) {
if(arguments[k] != null) {
node.content.push(arguments[k]);
}
}
return node;
}
function result_element_append(parent, child)
{
if(parent != null && child != null) {
parent.content.push(child);
}
}
function result_element_prepend(parent, child, next)
{
if(next == null)
result_element_append(parent, child);
else if(parent != null && child != null) {
for(var i = 0; i < parent.content.length; i++) {
if(parent.content[i] == next) {
parent.content.splice(i, 0, child);
return;
}
}
}
}
function result_set_attr(elem, attr, value)
{
if(elem != null && attr != null) {
if(value != null)
elem.attrs[attr] = value;
else
delete elem.attrs[attr];
}
}
function result_append_attr(elem, attr, value)
{
if(elem != null && attr != null) {
if(elem.attrs[attr] == null)
elem.attrs[attr] = value;
else
elem.attrs[attr] += value;
}
}
# This script was automatically generated from a literate source.
# Do not edit this file; look at the literate source instead!
#
import LaTeX2MathMLModule
import sys
import re
class TokenInput:
def __init__(self, input):
self.tokens = []
self.tokens_index = 0
self.tokenize_latex_math(input)
self.tokens.append(None)
Tokenize LATEX input
for input in sys.argv[1:]:
t = TokenInput(input)
sys.stdout.write("<math xmlns='%s'>\n" % LaTeX2MathMLModule.mmlns)
sys.stdout.write(LaTeX2MathMLModule.v_subexpr_chain_to_mathml(t, {}).toxml("utf-8"))
sys.stdout.write("\n</math>\n")
import xml.dom.minidomThe MathML XML namespace, needed for creating MathML elements
mmlns = 'http://www.w3.org/1998/Math/MathML' document = xml.dom.minidom.getDOMImplementation().createDocument(None,None,None) Create MathML result element Append MathML result element to another element’s content Prepend MathML result element before another child element Change attribute of result element Change attribute of result element
def result_element(tag, num_attrs, *args):
node = document.createElementNS(mmlns, tag)
for i in range(0, num_attrs):
if args[2*i+1] is not None:
node.setAttribute(args[2*i], args[2*i+1])
for i in range(num_attrs*2, len(args)):
if args[i] is not None:
if isinstance(args[i], unicode):
node.appendChild(document.createTextNode(args[i]))
else:
node.appendChild(args[i])
return node
def result_element_append(parent, child):
if (parent is not None) and (child is not None):
if isinstance(child, unicode):
parent.appendChild(document.createTextNode(child))
else:
parent.appendChild(child)
def result_element_prepend(parent, child, next):
if next is None:
result_element_append(parent, child)
elif (parent is not None) and (child is not None):
parent.insertBefore(child, next)
def result_set_attr(elem, attr, value):
if (elem is not None) and (attr is not None):
if value is not None:
elem.setAttribute(attr, value)
else:
elem.removeAttribute(attr)
def result_append_attr(elem, attr, value):
if (elem is not None) and (attr is not None):
old_value = elem.getAttribute(elem, attr)
if old_value is None:
elem.setAttribute(attr, value)
else:
elem.setAttribute(attr, old_value + value)