Copyright (C) 2006 Steve Cheng <stevecheng@users.sourceforge.net>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
A novel feature of this program is that it uses its own minilanguage to define the transformation from T_{E}X math to MathML. The skeptical reader might think that this is an attempt at over-engineering. Actually, the author would rather not employ a minilanguage at all and would write the entire program in JavaScript, but the use of the minilanguage is essentially forced upon us by our requirements. We need the program to be usable in the client-side Web browser, which requires that the program be in JavaScript; on the other hand, we would also like the program to be available on the server side.
While JavaScript can run on the server side, JavaScript implementations outside Web browsers are terrible: they are not widely available, and they do not hold a candle to more standard scripting languages such as Perl or Python.
The other alternative is to write two versions of the transform rules, one for JavaScript, and one for Python, say. This is not pleasant, to say the least. And we should also consider that some people might want a PHP implementation, for example, if that is what they are using to generate Web pages on the server side.
Therefore, we should write a single version of the program in one language, then employ automatic translation of that program into the other computer languages.
A solution that avoids the minilanguage is to translate JavaScript directly to the other languages. But this solution takes too much effort to implement.
It is better, instead, to start with a minimalist minilanguage, designed specifically to tackle the problem of T_{E}X-to-MathML translation, and translate that to the actual scripting languages. We require that the minilanguage be easy to parse, and we achieve this goal by using the S-expression syntax from the Lisp languages (i.e. subexpressions are wrapped in parentheses, and operations are written in prefix notation).
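To illustrate why S-expressions are easy to parse, here is a minimal sketch of a reader in Python. It is not the parser used by this program; the function name parse_sexpr and the choice to represent forms as nested Python lists (with string literals kept in their quoted form) are assumptions made for the example.

```python
import re

def parse_sexpr(source):
    """Parse a string of S-expressions into nested Python lists of atoms."""
    # A token is a parenthesis, a double-quoted string (backslash escapes
    # allowed), or a bare atom containing no whitespace, parens, or quotes.
    tokens = re.findall(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+', source)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing closing parenthesis")
            pos += 1  # consume the ")"
            return items
        if tok == ")":
            raise SyntaxError("unexpected closing parenthesis")
        return tok  # atom or quoted string, kept as written

    forms = []
    while pos < len(tokens):
        forms.append(read())
    return forms
```

The whole reader is one regular expression plus one recursive function, which is the property the minilanguage relies on: a translator for each target language can share such a front end and differ only in how it prints the nested lists.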
Although the use of S-expressions makes the minilanguage superficially look like Scheme, it is fundamentally very different from Scheme. Our minilanguage is a procedural or imperative language, rather than a functional language like Scheme. Its structure closely mirrors that of JavaScript (the first implementations of this program were in JavaScript). A procedural design is used to make translation from the minilanguage to JavaScript and Python simple.
On the other hand, if the minilanguage were functional, it would take a nontrivial amount of effort to translate it to imperative languages (one of the obstacles being the optimization of tail recursion). Of course, making the minilanguage imperative disfavors the functional languages. But this is less of a problem than disfavoring the imperative languages. Firstly, (pure) functional languages, such as Scheme and Haskell, are not as widely used for scripting as procedural languages such as Perl and Python. (This is no accident, since scripting languages are often used for one-time, small-scale hacks, where the functional formalism becomes more of a hindrance.)
And secondly, it is actually slightly more complicated to write the T_{E}X-to-MathML transformation rules in a functional style than in an imperative style (yes, the author has tried it). This is because T_{E}X syntax is somewhat irregular, being intended to be read and written by humans, and most efficiencies of functional programming over imperative programming come from exploiting the structure of data. Also, it should be noted that a certain naïve way of writing the transformation rules functionally will lead to an inefficient program with an execution time that is quadratic in the number of input tokens. (The author unwittingly fell into this trap in his original implementation.)
The minilanguage in our program can be regarded as analogous to the specification formats that drive other data converters, which are usually declarative. However, a declarative specification format falls down in the present application, for the same reasons that functional programming does not do well either: T_{E}X syntax is irregular enough that designing a declarative format that captures all its subtleties is hard.
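The text does not say which formulation caused the quadratic behaviour. One common way it arises, shown here purely as an assumed illustration in Python, is to pass "the rest of the tokens" to each step as a fresh copy of the list instead of advancing a pointer. The function names are hypothetical.

```python
def consume_by_slicing(tokens):
    """Consume one token per step by copying the remainder of the list.

    Returns the total number of list elements copied along the way.
    """
    copied = 0
    while tokens:
        tokens = tokens[1:]      # builds a new list of len(tokens) - 1 items
        copied += len(tokens)
    return copied

def consume_by_index(tokens):
    """Consume one token per step by advancing an index; returns the step count."""
    pos = 0
    steps = 0
    while pos < len(tokens):
        pos += 1                 # constant work per token
        steps += 1
    return steps
```

For n tokens the slicing version copies n(n-1)/2 elements in total, while the pointer version does n constant-time steps; the imperative read-and-advance style of the minilanguage corresponds to the second form.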
The minilanguage has to be minimal, for the obvious reason that small languages take less effort to translate. The less effort required, the better.
Although minimal often implies elegance in design, we are not writing this program in the minilanguage for the sake of elegance. We are interested in:
For us, the minilanguage itself is mostly a distraction that has to be tolerated. (However, we cannot deny that it is a nice bonus that the transformation rules can be specified more concisely in the minilanguage than in direct JavaScript.)
Thus, we will not “design” the minilanguage as such, but simply tack on features only as they are needed to describe the algorithms. For instance, the minilanguage currently has no support for arrays; if we want to write a transformation rule that uses arrays, then we simply add that to the language. And the array operations would be implemented by translating them to array operations in the underlying language (JavaScript, Python, etc.).
This strategy of ad hoc, unprincipled extension can sometimes be disastrous, but there is little cause for concern in our case. Since the minilanguage syntax is based on S-expressions, it can be easily extended without worrying about the parsing details. Also, the problem of T_{E}X-to-MathML conversion is small enough, and fairly restricted, so that we should not need to extend the language very much: the input is T_{E}X math markup; the output is MathML markup. Or, as the saying goes: we do not need to expand our program to the point that it can read email.
This strategy saves the author from wasting effort on designing yet another extension language. We do need some implementation experience to know what features to include, and what features can be safely omitted; this was provided to the author by his first JavaScript-only implementation of this program.
It hardly needs to be said, but any glue code or interface code (e.g. interacting with the user, or the Web browser) need not, should not, and will not be written in the minilanguage.
We will also point out that, because the minilanguage is so minimal, sometimes the procedures written in it are forced to be written in a low-level way (looping over conditions, reading and advancing token pointers, etc.), somewhat like programming in Forth or assembly language. This mostly cannot be helped, because T_{E}X syntax is so varied that it is difficult to define higher-level abstractions to parse it.
As our minilanguage is all about “anti-design” (the author refuses to even name the minilanguage), it is naturally defined only by its implementations, that is, its translations to other computer languages. But to help the reader understand the use of the minilanguage in this document, we briefly describe its syntax here:
Expressions are either variable (or procedure/function) references, or one of the following forms enclosed in parentheses. Expressions never have side effects except for the (call …) form (for the procedure being called may generate side effects).
(call f x1 x2…) 
call a procedure or function f with arguments x1, x2…, and evaluates to the return or result from f 
(not x) 
logical negation of expression 
(or x1 x2…) 
logical or (disjunction) of two or more expressions 
(and x1 x2…) 
logical and (conjunction) of two or more expressions 
(null? x) 
test for x being null 
(notnull? x) 
test for x not being null 
(resultelement n x1 x2…) 
create MathML element (subtree) whose name is n, and whose children are x1, x2… in order. 
(if t x y) 
test if t is true; if it is, evaluate expression x, otherwise evaluate expression y 
(streq? x y) 
test if two strings are equal 
(in? x A) 
test if x is a key in the associative array A 
(get x A) 
return value which is paired with key x in the associative array A 
(lambda x1 x2… xn e) 
define an inline, anonymous function that returns the expression e, with the given arguments x1, x2… 
(readtoken) 
get token at current pointer 
(nil) 
the null object 
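As a rough illustration of how directly these expression forms map onto an imperative target language, the following Python sketch translates a parsed expression (nested lists of strings) into Python source text. It is not the author's translator: the function name to_python, the nested-list input format, and the decision to pass identifiers and quoted strings through unchanged are all assumptions, and forms such as resultelement, lambda and readtoken are left out.

```python
def to_python(expr):
    """Translate a parsed minilanguage expression into Python source text."""
    if isinstance(expr, str):
        return expr  # variable reference or quoted string, passed through as-is
    head = expr[0]
    args = [to_python(arg) for arg in expr[1:]]
    if head == "call":
        return f"{args[0]}({', '.join(args[1:])})"
    if head == "not":
        return f"(not {args[0]})"
    if head in ("or", "and"):
        return "(" + f" {head} ".join(args) + ")"
    if head == "null?":
        return f"({args[0]} is None)"
    if head == "notnull?":
        return f"({args[0]} is not None)"
    if head == "streq?":
        return f"({args[0]} == {args[1]})"
    if head == "in?":
        return f"({args[0]} in {args[1]})"
    if head == "get":
        return f"{args[1]}[{args[0]}]"
    if head == "if":
        return f"({args[1]} if {args[0]} else {args[2]})"
    if head == "nil":
        return "None"
    raise ValueError(f"unknown form: {head}")
```

For example, to_python(["if", ["in?", "tok", "table"], ["get", "tok", "table"], ["nil"]]) yields the text (table[tok] if (tok in table) else None). Each form becomes a single target-language expression, which is what keeps the per-language translators small.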
The minilanguage does not have lists or arrays. For the most common usage of lists in this application, namely, to build up content for MathML elements, direct MathML tree manipulation may be used instead. This turns out to be more efficient, and not unlike another popular transformation language, XSLT.
This section gives the definitions for the “simple” T_{E}X and L_{A}T_{E}X commands that produce a single mathematical symbol. To translate to MathML, each symbol must be mapped to its Unicode representation.
Punctuation and space
Delimiters
Operator symbols
Relation symbols
Named identifiers
Word operators
Greek letters
The majority of the following T_{E}X definitions were derived by taking the lists of characters, given as T_{E}X escapes, in the Short Math Guide for L_{A}T_{E}X by the American Mathematical Society, and cross-mapping them with the STIX character table, which contains T_{E}X escapes for various Unicode characters used in mathematical expressions.
The lists of characters from the Short Math Guide for L_{A}T_{E}X were entered by hand, as the author was not able to find a more official machine-readable list. Each list was entered as a simple text file in the tables/ directory, with one T_{E}X escape on each line. For convenience, there is also a copy of the STIX character table in the tables/ directory.
A Python script and a Unix shell script (using standard Unix commands like sort, cut, join and awk) together take the AMS-L_{A}T_{E}X lists of characters and the STIX table as input, and output the mappings of the T_{E}X escapes to their Unicode characters. This output, in the format of HTML tables, can then be pasted into the XHTML source for this document.
The STIX character table was not used alone as the source, because it does not contain all the T_{E}X escapes, and it is not clear whether it categorizes the T_{E}X characters in the way that we want them to be used in the user script. It is better to double-check the character mappings, and add any missing characters, manually.
Any entries added directly to the JavaScript source will be separated out in their own sections, so that when we need to automatically generate the majority of the entries again, we will not clobber the manual changes.


Punctuation and space characters are to be displayed with the mtext element.
These were all added manually.
(var $punctandspace (table See table below ))


(var $leftdelimiters (table Left delimiters)) (var $rightdelimiters (table Right delimiters ))




Operator symbols are to be displayed with the mo element.
(var $operatorsymbols (table Operator symbols More operator symbols Extensible vertical arrows Big operators Big operators without limit style Miscellaneous simple symbols Other alphabetic symbols ))
This listing comes from tables/texops.txt, which in turn comes from the list under “3.7 Binary Operator Symbols”, p. 6, in the AMS L_{A}T_{E}X Guide.




This listing comes from tables/texbigops.txt, which in turn comes from the list under “3.11 Cumulative (variablesize) operators”, p. 7, in the AMS L_{A}T_{E}X Guide.


The integral symbol differs from the others in that, even in display style, its subscripts and superscripts are not in the limit style (for the obvious reason that it would not look good).


This listing comes from tables/texextarrows.txt, which in turn comes from the list under “3.14 Extensible vertical arrows”, p. 8, in the AMS L_{A}T_{E}X Guide.


This listing comes from tables/texmiscsym.txt, which in turn comes from the list under “3.6 Miscellaneous simple symbols”, p. 5, in the AMS L_{A}T_{E}X Guide.




(var $relationsymbols (table ("=" "=") ("<" "<") (">" ">") Comparison symbols Arrows Miscellaneous ))
This listing comes from tables/texrelsym.txt, which in turn comes from the list under “3.8 Relation Symbols”, p. 6, in the AMS L_{A}T_{E}X Guide.
Equality, inequality, etc.


This listing comes from tables/texrelmisc.txt, which in turn comes from the list under “3.10 Relation Symbols: Miscellaneous”, p. 7, in the AMS L_{A}T_{E}X Guide.


This listing comes from tables/texarrows.txt, which in turn comes from the list under “3.9 Relation Symbols: Arrows”, p. 7, in the AMS L_{A}T_{E}X Guide.


Named identifiers include all Greek letters, named functions and operators, and other alphabetic identifiers that are not single Roman letters.
These are to be displayed using the mi element.
(var $namedidentifiers (table Word operators Big word operators Greek letters Roman letters Ellipsis characters ))
(var $wordoperators (table Word operators )) (var $bigwordoperators (table Big word operators))
This listing comes from “3.17 Named operators”, p. 8, in the AMS L_{A}T_{E}X Guide.
These T_{E}X commands represent common operators and functions written with several letters (e.g. the sine function). Obviously, the STIX table generally contains only single-character mappings and not these commands, so we must enter all of them manually. The mapping is trivial.




(var $greekletters (table See table below))


Although the Roman letters obviously map to themselves, we still list them explicitly, so that we do not have to add a special case (check for Roman letters) to the transformation rules.


By a complex command, we mean a TeX command that is not simple; that is, the TeX command does more than just output a character or string of characters.
Complex commands include any sort of TeX command that takes arguments.
Fractions
Binomial coefficients
Square roots and radicals
Parenthesized mod
Usernamed functions and operators
Setting display style
Setting display style: displaymath
Changing math font styles
Changing math fonts (oldstyle commands)
Changing math font sizes
Accents on characters
Matrices
Array environment
Matrices: mtable subroutine
Under and overdecorations
Matching sizes of delimiters
Matching sizes of delimiters: get delimiter subroutine
L_{A}T_{E}X blocks
L_{A}T_{E}X blocks: end
Combining operators
Character escapes
Embedded text
List of all commands
List of all L_{A}T_{E}X blocks
(var $texcommands (table ("\\frac" fractiontomathml ) ("\\dfrac" fractiontomathml ) ("\\tfrac" fractiontomathml ) ("\\binom" binomtomathml ) ("\\sqrt" sqrttomathml ) ("\\operatorname" operatornametomathml ) ("\\displaystyle" displaystyletomathml ) Parenthesized mod Changing math font styles Changing math fonts (oldstyle commands) Changing font sizes Accents on characters Under and over decorations Combining operators Matching sizes of delimiters ("\\char" charescapetomathml) ("\\!" (lambda () (nil))) Embedded text commands ("\\begin" latexblocktomathml) ))
A LaTeX environment is a special form of command, where the logical argument is enclosed between two tags, \begin{name} and \end{name}, instead of the braces { and }.
A LaTeX environment is also called a LaTeX block (a more intuitive term).
The following is a list of all the LaTeX blocks we can parse.
(var $texenvironments (table ("smallmatrix" (lambda () (call matrixtomathml "(" ")" ) )) ("pmatrix" (lambda () (call matrixtomathml "(" ")" ) )) ("bmatrix" (lambda () (call matrixtomathml "[" "]" ) )) ("Bmatrix" (lambda () (call matrixtomathml "{" "}" ) )) ("vmatrix" (lambda () (call matrixtomathml "\u007c" "\u007c") )) ("Vmatrix" (lambda () (call matrixtomathml "\u2016" "\u2016") )) ("cases" (lambda () (call matrixtomathml "{" (nil) ) )) ("array" arraytomathml ) ("displaymath" displaymathtomathml) ))
\begin dispatch
The real command name in \begin{name} is name, so we need the following function to extract this name, and dispatch to the procedure that handles that command specifically.
(procedure (latexblocktomathml) (set cmd (readtoken)) (cond ((in? cmd $texenvironments) (advancetoken) (return (call (get cmd $texenvironments)))) (else (throw "unknown command"))))
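The table-driven dispatch above can be sketched in Python as follows. This is only an illustration under assumptions: the TokenStream class, its method names, and the handler signature (each handler receives the stream) are hypothetical and not taken from the program's generated code.

```python
class TokenStream:
    """Hypothetical token pointer: read the current token, or advance past it."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def read_token(self):
        # Returns None at end of input, mirroring a null token.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance_token(self):
        self.pos += 1


def latex_block_to_mathml(stream, environments):
    """Look up the environment name under the pointer and call its handler."""
    name = stream.read_token()
    if name in environments:
        stream.advance_token()  # consume the name before the handler runs
        return environments[name](stream)
    raise ValueError("unknown command")
```

With a table such as {"pmatrix": handler}, a stream positioned at the token pmatrix is advanced by one token and handed to handler; any other name, including the end of input, raises the error.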
\end dispatch
This procedure should be called when the \end{name} tag of a LaTeX block is encountered. It performs error checking, and prepares for the parse after the \end{name} tag.
(procedure (finishlatexblock) (cond ((null? (readtoken)) (throw "unexpected eof"))) (advancetoken) (advancetoken))
(procedure (fractiontomathml) (var numerator (call piecetomathml)) (var denominator (call piecetomathml)) (return (resultelement "mfrac" (numerator denominator))))
The common “choose” notation for binomial coefficients is essentially a fraction with added parentheses but without the fraction bar, and is encoded that way in presentational MathML.
(procedure (binomtomathml) (var top (call piecetomathml)) (var bottom (call piecetomathml)) (return (resultelement "mrow" ((resultelement "mo" ( "(" )) (resultelement "mfrac" (("linethickness" "0")) ( top bottom )) (resultelement "mo" ( ")" ))))))
A radical in TeX is written \sqrt[i]{x}, where i is the index of the radical, and x is the expression that appears under the radical. The index part is optional; if it is omitted, then the radical is a square root and is translated to the MathML msqrt element. Otherwise, the MathML mroot element is used for the radical with index.
(procedure (sqrttomathml) (var index (call optionalargtomathml)) (var object (call piecetomathml)) (cond ((notnull? index) (return (resultelement "mroot" (object index)))) (else (return (resultelement "msqrt" (object))))))
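To make the two output shapes concrete, here is a small Python sketch using the standard xml.etree.ElementTree module. It is a simplification under stated assumptions: every argument is a single token, the optional index is delimited by separate [ and ] tokens, and the helper piece stands in for the real piece parser. None of these names come from the program itself.

```python
import xml.etree.ElementTree as ET

def piece(token):
    """Hypothetical stand-in for the piece parser: wrap one token in mn or mi."""
    element = ET.Element("mn" if token.isdigit() else "mi")
    element.text = token
    return element

def sqrt_to_mathml(tokens):
    """Build msqrt or mroot from the tokens that follow the sqrt command."""
    pos = 0
    index = None
    if tokens[pos:pos + 1] == ["["]:
        index = piece(tokens[pos + 1])
        pos += 3  # skip "[", the index token, and "]"
    radicand = piece(tokens[pos])
    if index is not None:
        root = ET.Element("mroot")
        root.append(radicand)  # mroot takes the base first, then the index
        root.append(index)
    else:
        root = ET.Element("msqrt")
        root.append(radicand)
    return root
```

Note the child order for mroot: the radicand comes before the index, matching (resultelement "mroot" (object index)) in the procedure above.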
("\\pod" (lambda () (call parenthesizedoperator (nil)))) ("\\pmod" (lambda () (call parenthesizedoperator "mod")))
(procedure (parenthesizedoperator word) (var object (call piecetomathml)) (cond ((notnull? word) (return (resultelement "mrow" ((resultelement "mo" ("(")) (resultelement "mo" (word)) object (resultelement "mo" (")")))))) (else (return (resultelement "mrow" ((resultelement "mo" ("(")) object (resultelement "mo" (")"))))))))
Note that, by a special case in the tokenization, \operatorname always receives its argument as one token.
(procedure (operatornametomathml) (var result (resultelement "mo" ((readtoken)))) (advancetoken) (return result))
When the \displaystyle command is encountered, everything following it should be typeset as a math “display”.
In MathML, presentational aspects such as the “display mode” and script size are controlled by attributes on the container element. However, in Mozilla, some of these attributes do not take effect unless they are applied to the MathML mstyle element. This is a bug, but the workaround is simple: wrap the content in an mstyle element. The only disadvantage is that the output MathML becomes slightly more bloated.
(procedure (displaystyletomathml) (var result (call subexprchaintomathml $hardstoptokens)) (return (resultelement "mstyle" (("displaystyle" "true") ("scriptlevel" "0")) (result))))
For reference, here is the procedure to generate a display without the mstyle workaround:
(procedure (displaystyletomathml) (var result (call subexprchaintomathml $hardstoptokens)) (setattr result "displaystyle" "true") (setattr result "scriptlevel" "0") (return result))
Strictly speaking, the displaymath environment is not an environment to be used inside T_{E}X math mode. Rather, it activates T_{E}X math mode when used in text mode, and typesets the formula inside it as a “display”. However, this environment appears often in the T_{E}X markup generated by the LaTeX2HTML translator, and it is trivially supported.
(procedure (displaymathtomathml) (var result (call subexprchaintomathml $hardstoptokens)) (call finishlatexblock) (return (resultelement "mstyle" (("displaystyle" "true") ("scriptlevel" "0")) (result))))
As with the displaystyletomathml procedure, the version that outputs standard MathML without the mstyle workaround is as follows:
(procedure (displaymathtomathml) (var result (call subexprchaintomathml $hardstoptokens)) (setattr result "displaystyle" "true") (setattr result "scriptlevel" "0") (call finishlatexblock) (return result))
("\\boldsymbol" (lambda () (call fonttomathml "bold")) ) ("\\bold" (lambda () (call fonttomathml "bold")) ) ("\\Bbb" (lambda () (call fonttomathml "doublestruck")) ) ("\\mathbb" (lambda () (call fonttomathml "doublestruck")) ) ("\\mathbbmss" (lambda () (call fonttomathml "doublestruck")) ) ("\\mathbf" (lambda () (call fonttomathml "bold")) ) ("\\mathop" (lambda () (call fonttomathml "normal")) ) ("\\mathrm" (lambda () (call fonttomathml "normal")) ) ("\\mathfrak" (lambda () (call fonttomathml "fraktur")) ) ("\\mathit" (lambda () (call fonttomathml "italic")) ) ("\\mathscr" (lambda () (call fonttomathml "script")) ) ("\\mathcal" (lambda () (call fonttomathml "script")) ) ("\\mathsf" (lambda () (call fonttomathml "sansserif")) ) ("\\mathtt" (lambda () (call fonttomathml "monospace")) ) ("\\EuScript" (lambda () (call fonttomathml "script")) )
(procedure (fonttomathml fontname) (cond ((strneq? (readtoken) "{") (var result (resultelement "mi" (("mathvariant" fontname)) ((readtoken)))) (cond ((streq? fontname "normal") (setattr result "fontstyle" "normal"))) (advancetoken) (return result)) (else (var result (call piecetomathml)) (setattr result "mathvariant" fontname) (cond ((streq? fontname "normal") (setattr result "fontstyle" "normal"))) (return result))))
("\\bf" (lambda () (call oldfonttomathml "bold")) ) ("\\rm" (lambda () (call oldfonttomathml "normal")) )
(procedure (oldfonttomathml fontname) (return (resultelement "mstyle" (("mathvariant" fontname) ("fontstyle" (if (streq? fontname "normal") "normal" (nil)))) ((call subexprchaintomathml $hardstoptokens)))))
("\\big" (lambda () (call sizetomathml "2" "2"))) ("\\Big" (lambda () (call sizetomathml "3" "3"))) ("\\bigg" (lambda () (call sizetomathml "4" "4"))) ("\\Bigg" (lambda () (call sizetomathml "5" "5")))
(procedure (sizetomathml minsize maxsize) (var result (call piecetomathml)) (setattr result "minsize" minsize) (setattr result "maxsize" maxsize) (return result))
In MathML, the inside of the matrix or array is encoded as an mtable element. The borders, parentheses, or brackets are encoded separately.
The procedure matrixtomtable converts the inside of a matrix into an mtable. It works in this straightforward manner:
The procedure starts the mtable with one row (mtr) containing one cell (mtd), which become the current row and current cell.
As we parse TeX content, we add that content (translated to MathML) to the current cell.
If & is encountered, indicating that we should break the cell, then we create a new cell for the current row, and the new cell becomes the current cell.
If \\ is encountered, indicating a line break, then we insert a new row in the table, with a new cell, and the new row and the new cell become the current ones.
The procedure should be passed an empty mtable result element. It does not create one itself, because the caller may want to add some attributes to the mtable before processing the table contents.
(procedure (matrixtomtable mtable) (var mtr (resultelement "mtr")) (var mtd (resultelement "mtd")) (var token (readtoken)) (append mtable mtr) (append mtr mtd) (while (and (notnull? token) (strneq? token "\\end")) (cond ((streq? token "\\\\") (set mtr (resultelement "mtr")) (set mtd (resultelement "mtd")) (append mtable mtr) (append mtr mtd) (advancetoken)) ((streq? token "&") (set mtd (resultelement "mtd")) (append mtr mtd) (advancetoken)) (else (append mtd (call subexprchaintomathml $hardstoptokens)))) (set token (readtoken))) (call finishlatexblock) (return mtable))
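The row-and-cell bookkeeping described above can be sketched in Python with xml.etree.ElementTree. This is a simplified illustration, not the program's output: it assumes the tokens are already split, wraps every ordinary token in an mi element instead of calling the real subexpression parser, and creates the mtable itself rather than receiving it from the caller.

```python
import xml.etree.ElementTree as ET

def matrix_to_mtable(tokens):
    """Build an mtable from matrix tokens, stopping at the \\end token."""
    mtable = ET.Element("mtable")
    mtr = ET.SubElement(mtable, "mtr")   # current row
    mtd = ET.SubElement(mtr, "mtd")      # current cell
    for token in tokens:
        if token == "\\end":
            break
        if token == "\\\\":
            # Line break: start a new row with a fresh cell.
            mtr = ET.SubElement(mtable, "mtr")
            mtd = ET.SubElement(mtr, "mtd")
        elif token == "&":
            # Cell break: start a new cell in the current row.
            mtd = ET.SubElement(mtr, "mtd")
        else:
            # Simplification: each content token becomes one mi element.
            ET.SubElement(mtd, "mi").text = token
    return mtable
```

Calling matrix_to_mtable(["a", "&", "b", "\\\\", "c", "&", "d", "\\end"]) produces two mtr rows of two mtd cells each.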
The following procedure, matrixtomathml, handles parsing of T_{E}X matrix blocks. It adds the borders to the mtable obtained from the procedure matrixtomtable, with an mrow wrapper around the whole thing.
The border delimiters for matrixtomathml are specified by parameters. See the list of all L_{A}T_{E}X blocks for the actual parameters for each T_{E}X matrix or array command.
(procedure (matrixtomathml opendelim closedelim) (var mtable (call matrixtomtable (resultelement "mtable"))) (cond ((or (notnull? opendelim) (notnull? closedelim)) (var mrow (resultelement "mrow")) (cond ((notnull? opendelim) (append mrow (resultelement "mo" (opendelim))))) (append mrow mtable) (cond ((notnull? closedelim) (append mrow (resultelement "mo" (closedelim))))) (return mrow)) (else (return mtable))))
The following procedure, arraytomathml, converts T_{E}X array environments. As in T_{E}X, no delimiters are put around the array.
(procedure (arraytomathml) (var mtable (resultelement "mtable"))
Process the specifications for the column alignment in cells
(cond ((streq? (readtoken) "{") (advancetoken) (while (and (notnull? (readtoken)) (strneq? (readtoken) "}")) (cond ((streq? (readtoken) "c") (appendattr mtable "columnalign" "center ")) ((streq? (readtoken) "l") (appendattr mtable "columnalign" "left ")) ((streq? (readtoken) "r") (appendattr mtable "columnalign" "right "))) (advancetoken)) (cond ((notnull? (readtoken)) (advancetoken)))))
Process the table itself
(return (call matrixtomtable mtable)))
("\\acute" (lambda () (call accenttomathml "\u0301")) ) ("\\grave" (lambda () (call accenttomathml "\u0300")) ) ("\\tilde" (lambda () (call accenttomathml "\u0303")) ) ("\\bar" (lambda () (call accenttomathml "\u0304")) ) ("\\breve" (lambda () (call accenttomathml "\u0306")) ) ("\\check" (lambda () (call accenttomathml "\u030c")) ) ("\\hat" (lambda () (call accenttomathml "\u0302")) ) ("\\vec" (lambda () (call accenttomathml "\u20d7")) ) ("\\dot" (lambda () (call accenttomathml "\u0307")) ) ("\\ddot" (lambda () (call accenttomathml "\u0308")) ) ("\\dddot" (lambda () (call accenttomathml "\u20db")) )
Accents are displayed in MathML by placing the corresponding Unicode character in an mover element, over the base letter, and with the accent attribute set to true.
(procedure (accenttomathml char) (return (resultelement "mover" (("accent" "true")) ((call piecetomathml) (resultelement "mo" (char))))))
("\\underbrace" (lambda () (call undertomathml "\ufe38")) ) ("\\overbrace" (lambda () (call overtomathml "\ufe37")) ) ("\\underline" (lambda () (call undertomathml "\u0332")) ) ("\\overline" (lambda () (call overtomathml "\u00af")) ) ("\\widetilde" (lambda () (call overtomathml "\u0303")) ) ("\\widehat" (lambda () (call overtomathml "\u0302")) )
(procedure (overtomathml char) (return (resultelement "mover" ((call piecetomathml) (resultelement "mo" (char)))))) (procedure (undertomathml char) (return (resultelement "munder" ((call piecetomathml) (resultelement "mo" (char))))))
Combining operators refer to T_{E}X commands such as \not, which draws a slash through the next operator. In general, the term may refer to any operator that modifies the next one by superimposing some other mark over the glyph for the next operator.
The term “combining” is from the Unicode standards; a “combining character” (or “combining diacritic”) modifies the base character in some way. The only difference is that in Unicode, the combining character or diacritic comes after the base character rather than before it.
("\\not" (lambda () (call combiningtomathml "\u0338")))
(procedure (combiningtomathml char) (var base (readtoken)) (advancetoken) (return (resultelement "mo" (base char))))
("\\left" (lambda () (call delimitertomathml "\\right" "1" (nil))) ) ("\\bigl" (lambda () (call delimitertomathml "\\bigr" "2" "2" )) ) ("\\Bigl" (lambda () (call delimitertomathml "\\Bigr" "3" "3" )) ) ("\\biggl" (lambda () (call delimitertomathml "\\biggr" "4" "4" )) ) ("\\Biggl" (lambda () (call delimitertomathml "\\Biggr" "5" "5" )) )
The commands for matching sizes of delimiters are unusual in that they do not take arguments enclosed between { and }, but rather the character that occurs right after the command. Also, the command for the left delimiter must be matched by the command for the right delimiter. So we must know to stop parsing when we encounter the command for the right delimiter.
(procedure (delimitertomathml endcommand minsize maxsize) (var mrow (resultelement "mrow")) (append mrow (resultelement "mo" (("minsize" minsize) ("maxsize" maxsize)) ((call readdelimiter)))) (append mrow (call subexprchaintomathml $hardstoptokens)) (cond ((strneq? (readtoken) endcommand) (return mrow))) (advancetoken) (append mrow (resultelement "mo" (("minsize" minsize) ("maxsize" maxsize)) ((call readdelimiter)))) (return mrow))
This subroutine reads the next token, checks that it is a valid delimiter, and handles the special cases:
\left. and \right. produce a blank instead of a dot
\left< and \right< produce the left-pointing angle bracket instead of a less-than sign
\right> and \left> produce the right-pointing angle bracket instead of a greater-than sign
(procedure (readdelimiter) (var token (readtoken)) (cond ((null? token) (throw "unexpected eof")) ((streq? token ".") (advancetoken) (return "")) ((streq? token "<") (advancetoken) (return "\u2329")) ((streq? token ">") (advancetoken) (return "\u232a")) ((in? token $punctandspace) (advancetoken) (return (get token $punctandspace))) ((in? token $leftdelimiters) (advancetoken) (return (get token $leftdelimiters))) ((in? token $rightdelimiters) (advancetoken) (return (get token $rightdelimiters))) ((in? token $operatorsymbols) (advancetoken) (return (get token $operatorsymbols))) (else (throw "invalid delimiter"))))
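The same lookup order can be sketched in Python. This is an illustration only: the function signature (a token list, a position, and a list of lookup tables) and the tiny LEFT and RIGHT tables are hypothetical stand-ins for the real symbol tables.

```python
def read_delimiter(tokens, pos, tables):
    """Return (delimiter_text, new_pos) for the delimiter token at tokens[pos]."""
    if pos >= len(tokens):
        raise ValueError("unexpected eof")
    token = tokens[pos]
    # Special cases: a dot is a blank delimiter; < and > become angle brackets.
    special = {".": "", "<": "\u2329", ">": "\u232a"}
    if token in special:
        return special[token], pos + 1
    # Otherwise search the symbol tables in the given order.
    for table in tables:
        if token in table:
            return table[token], pos + 1
    raise ValueError("invalid delimiter")

# Hypothetical miniature versions of the delimiter tables.
LEFT = {"(": "(", "\\{": "{"}
RIGHT = {")": ")", "\\}": "}"}
```

The special cases are checked before the tables so that, for instance, < is never looked up as a relation symbol when it follows a sizing command.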
This is weird stuff that comes out of LaTeX2HTML. Normally the TeX typist should not enter character codes for any character.
(var $charescapecodes (table ("93" "#"))) (procedure (charescapetomathml) (var result (nil)) (cond ((in? (readtoken) $charescapecodes) (set result (resultelement "mtext" ((get (readtoken) $charescapecodes))))) (else (set result (resultelement "merror" ("\\char" (readtoken)))))) (advancetoken) (return result))
("\\text" texttomathml ) ("\\textnormal" texttomathml ) ("\\textrm" texttomathml ) ("\\textsl" texttomathml ) ("\\textit" texttomathml ) ("\\texttt" texttomathml ) ("\\textbf" texttomathml ) ("\\hbox" texttomathml ) ("\\mbox" texttomathml )
The embedded text is not split into words or tokens; it is split only when it is interrupted by an inline math formula. Each segment of text goes in an mtext element. Embedded inline math formulas in the embedded text are signaled, as in T_{E}X, by $ tokens. When $ is encountered, normal math parsing begins again, until interrupted by another $ (at the same nesting level). The parsing knows when to stop because $ is considered a “hard stop token”.
(procedure (texttomathml) (cond ((strneq? (readtoken) "{") (var result (resultelement "mtext" ((readtoken)))) (advancetoken) (return result))) (advancetoken) (var result (nil)) (var mrow (nil)) (var node (nil)) (while (notnull? (readtoken)) (cond ((streq? (readtoken) "}") (advancetoken) (return result)) ((streq? (readtoken) "$") (advancetoken) (set node (call subexprchaintomathml $hardstoptokens)) (advancetoken)) (else (set node (resultelement "mtext" ((readtoken)))) (advancetoken))) Collect results) (return result))
At the most fundamental level, T_{E}X markup is parsed by reading the markup one character at a time, and taking actions based on what that character is. This is the approach taken by the T_{E}X typesetting system itself. So, for example, the string typeset means, in T_{E}X, to construct the glyphs for each character (t y p e s e t), and then paste them together to form (part of) a paragraph.
We do not take this approach in our T_{E}X parser, because, in most scripting languages, and in XML, processing text by looping through each individual character is inefficient. Also, some character sequences have to be processed together anyway; these are called “tokens”. Examples of tokens are \catcode and 3.14 (as opposed to interpreting them as \ c a t c o d e or 3 . 1 4).
A well-understood and widely implemented specification for matching character sequences is that of regular expressions. Our program considers the T_{E}X markup that is input as a stream of characters; the head of the character stream is matched against a regular expression. The matching characters are grouped into one token (or sometimes a few tokens), and then those characters are consumed (removed from the head of the input character stream). Much of the L_{A}T_{E}X-to-MathML translator deals only with the tokens, not the character stream.
The following regular expression greedily matches sequences of characters to be considered as tokens in “math mode”. The matching sequences are grouped by parentheses.
(\\begin|\\operatorname|\\mathrm|\\mathop|\\end)\s*\{\s*([A-Z a-z]+)\s*\}    match commands for L_{A}T_{E}X environments
|(\\[a-zA-Z]+|\\[\\#\{\},:;!])    match other commands
|(\s+)    match blank token
|([0-9\.]+)    match numbers
|([\$!"#%&'()*+,-.\/:;<=>?\[\]^_`\{\|\}~])    match tokens for operators
For reasons that will be explained later, the regular expression just given omits the matching of identifiers (constants or variables denoted by Roman letters). Assuming, for the moment, that such identifiers are simply the letters of the Roman alphabet, we can give an example of applying this regular expression. Using it, the T_{E}X markup \sin(\frac 1 {xy^2}) (in math mode) will be split into these tokens:
\sin   (   \frac   1   {   x   y   ^   2   }   )
Here, for clarity, the blank tokens consisting of only spaces have been ignored, because blank tokens will eventually be displayed as nothing anyway.
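The tokenization just described can be tried out with a short Python sketch. The pattern below is a reconstruction of the math-mode expression, with one extra group for single-letter identifiers; the names MATH_TOKEN_RE and tokenize_math are assumptions made for this illustration, not part of the program.

```python
import re

# Assumed reconstruction of the math-mode token pattern, with a final
# group for single-letter identifiers.
MATH_TOKEN_RE = re.compile(
    r"(\\begin|\\operatorname|\\mathrm|\\mathop|\\end)\s*\{\s*([A-Z a-z]+)\s*\}"
    r"|(\\[a-zA-Z]+|\\[\\#\{\},:;!])"
    r"|(\s+)"
    r"|([0-9.]+)"
    r"|([$!\"#%&'()*+,\-./:;<=>?\[\]^_`{|}~])"
    r"|([a-zA-Z])"
)

def tokenize_math(markup):
    """Split math-mode markup into tokens, dropping blank tokens."""
    tokens = []
    pos = 0
    while pos < len(markup):
        m = MATH_TOKEN_RE.match(markup, pos)
        if m is None:
            break  # stop at the first character that is not recognized
        pos = m.end()
        if m.group(1) is not None:
            # Command plus its brace-enclosed name become two tokens.
            tokens.extend(m.group(1, 2))
        elif m.group(4) is not None:
            continue  # blank token: ignored
        else:
            tokens.append(m.group(0))
    return tokens

print(tokenize_math(r"\sin(\frac 1 {xy^2})"))
```

The sketch handles math mode only; switching to a different pattern for embedded text is a separate concern.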
The regular expression for math mode works well for mathematics written in T_{E}X that does not require switching out of “math mode”. To elaborate, some mathematical formulae make use of natural-language descriptions like this one:
B = \{ z \in A : \text{where $f(z)$ is purely imaginary} \}
In this case, the English text “where … is purely imaginary” must be parsed as English text, and not as the mathematical variables w, h, e, r, ….
T_{E}X has the notion of modes that distinguish between symbolic formulae and natural-language text. The former is handled by the “math mode”, and the latter by the “horizontal mode” or “paragraph mode”. Since T_{E}X always processes markup a character at a time, the modes do not affect the syntax of the markup, but rather they affect how the input characters are typeset into the resulting document.
However, our T_{E}X markup parser works differently; it parses a formula written in T_{E}X into tokens using regular expressions, and the regular expressions for math mode are not valid for parsing natural-language text. So to parse natural-language text embedded in mathematical formulae, we need a separate set of regular expressions. When embedded text is encountered, we must switch to that separate set to parse it.
Here is the regular expression for parsing natural-language text (outside of “math mode”), possibly with embedded T_{E}X commands:
[\${}\\]
|\\[a-zA-Z]+    match commands
|[^{}\$]+
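As a rough illustration of how the text-mode pattern behaves, the following Python sketch applies it to the contents of a \text group. The name TEXT_TOKEN_RE is an assumption made for this example, and the pattern is a reconstruction of the expression above. The sketch does no mode switching of its own, so the formula between the $ tokens is left as one chunk here, whereas the real tokenizer would hand it back to the math-mode pattern.

```python
import re

# Assumed reconstruction of the text-mode pattern shown above.
TEXT_TOKEN_RE = re.compile(r"[\${}\\]|\\[a-zA-Z]+|[^{}\$]+")

# Runs of ordinary text stay in one token; only $, { and } split them.
tokens = TEXT_TOKEN_RE.findall("{where $f(z)$ is purely imaginary}")
print(tokens)
```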
So far, we have presented two sets of regular expressions. The two sets cannot be combined into one, because selecting the correct parsing mode depends on counting nesting levels of matching braces ({ and }) or math-mode delimiters ($). And regular expressions provably cannot do this, because they do not provide recursively defined subexpressions.
Thus, full T_{E}X markup cannot be parsed using regular expressions alone. It could be parsed with a context-free grammar instead. However, parsing a general context-free grammar, using a parser generator, is overkill for our purposes: parser generators are not readily available for the JavaScript language, and parsers generated for scripting languages are likely to be less efficient than a hand-rolled parser.
Actually, a hand-rolled parser is not difficult to implement.
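To make the counting argument concrete, here is a minimal Python sketch of the bookkeeping that a hand-rolled parser can do and a single regular expression cannot: tracking the nesting depth of braces. The function name matching_brace is hypothetical, and the sketch ignores escaped braces such as \{.

```python
def matching_brace(markup, start):
    """Return the index of the } that closes the { at markup[start], or -1."""
    depth = 0
    for i in range(start, len(markup)):
        if markup[i] == "{":
            depth += 1          # one level deeper
        elif markup[i] == "}":
            depth -= 1          # one level back out
            if depth == 0:
                return i        # this brace closes the one we started at
    return -1                   # unbalanced input

print(matching_brace("{a{b}c}d", 0))
```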
The text, textrm, textsl, textit, texttt, and textbf commands switch out of math mode and into text mode (T_{E}X’s horizontal mode, to be precise). This means that whatever occurs inside these commands cannot be tokenized as usual math tokens.
(\\textrm|\\textsl|\\textit|\\texttt|\\textbf|\\text|\\hbox)
// Regular expression for tokenizing T_{E}X input, plus a final group for one-letter identifiers
const tokenize_re = /(\\begin|\\operatorname|\\mathrm|\\mathop|\\end)\s*\{\s*([A-Z a-z]+)\s*\}|(\\[a-zA-Z]+|\\[\\#\{\},:;!])|(\s+)|([0-9\.]+)|([\$!"#%&'()*+,-.\/:;<=>?\[\]^_`\{\|\}~])|([a-zA-Z])/g;
const tokenize_text_re = /[\${}\\]|\\[a-zA-Z]+|[^{}\$]+/g;
const tokenize_text_commands = {
  '\\textrm': 1, '\\textsl': 1, '\\textit': 1, '\\texttt': 1, '\\textbf': 1,
  '\\textnormal': 1, '\\text': 1, '\\hbox': 1, '\\mbox': 1
};

function tokenize_latex_math(input)
{
  var result = new Array();
  var in_text_mode = 0;   // nonzero while tokenizing embedded text
  var brace_level = [];   // brace nesting depth, one entry per open text command
  var pos = 0;

  // Strip a pair of enclosing $ delimiters
  if(input.charAt(0) == '$' && input.charAt(input.length-1) == '$')
    input = input.slice(1, input.length-1);

  while(1) {
    if(!in_text_mode) {
      tokenize_re.lastIndex = pos;
      var m = tokenize_re.exec(input);
      pos = tokenize_re.lastIndex;

      if(m == null) {
        return result;
      } else if(m[1] != null) {
        // Command with a brace-enclosed name: two tokens
        result.push(m[1], m[2]);
      } else if(m[3] == '\\sp') {
        result.push('^');
      } else if(m[3] == '\\sb') {
        result.push('_');
      } else {
        if(m[0] == '$') {
          in_text_mode = 1;
        } else if(m[4] != null) {
          continue;   // blank token
        } else if(m[3] != null && m[3] in tokenize_text_commands) {
          in_text_mode = 2;
          brace_level.push(0);
        }
        result.push(m[0]);
      }
    } else {
      tokenize_text_re.lastIndex = pos;
      var m = tokenize_text_re.exec(input);
      pos = tokenize_text_re.lastIndex;

      if(m == null) {
        return result;
      } else if(m[0] == '$') {
        in_text_mode = 0;
      } else if(m[0] == '{') {
        brace_level[brace_level.length-1]++;
      } else if(m[0] == '}') {
        // Leave text mode when the brace that opened it is closed
        if(--brace_level[brace_level.length-1] <= 0) {
          in_text_mode = 0;
          brace_level.pop();
        }
      }
      result.push(m[0]);
    }
  }
}
# Regular expression for tokenizing T_{E}X input, plus a final group for one-letter identifiers
tokenize_re = re.compile(r"""(\\begin|\\operatorname|\\mathrm|\\mathop|\\end)\s*\{\s*([A-Z a-z]+)\s*\}|(\\[a-zA-Z]+|\\[\\#\{\},:;!])|(\s+)|([0-9\.]+)|([\$!"#%&'()*+,-.\/:;<=>?\[\]^_`\{\|\}~])|([a-zA-Z])""")
tokenize_text_re = re.compile(r"""[\${}\\]|\\[a-zA-Z]+|[^{}\$]+""")
tokenize_text_commands = {
    u'\\textrm': 1, u'\\textsl': 1, u'\\textit': 1, u'\\texttt': 1,
    u'\\textbf': 1, u'\\text': 1, u'\\textnormal': 1,
    u'\\hbox': 1, u'\\mbox': 1,
}

def tokenize_latex_math(self, input):
    in_text_mode = 0    # nonzero while tokenizing embedded text
    brace_level = []    # brace nesting depth, one entry per open text command
    pos = 0
    input = str(input)

    # Strip a pair of enclosing $ delimiters
    if input[0] == u'$' and input[-1] == u'$':
        input = input[1:-1]

    while True:
        if not in_text_mode:
            m = self.tokenize_re.match(input, pos)
            if m is None:
                return
            pos = m.end()
            if m.group(1) is not None:
                # Command with a brace-enclosed name: two tokens
                self.tokens.extend(m.group(1, 2))
            elif m.group(3) == u"\\sp":
                self.tokens.append(u"^")
            elif m.group(3) == u"\\sb":
                self.tokens.append(u"_")
            else:
                if m.group(0) == u"$":
                    in_text_mode = 1
                elif m.group(4) is not None:
                    continue    # blank token
                elif m.group(3) in self.tokenize_text_commands:
                    in_text_mode = 2
                    brace_level.append(0)
                self.tokens.append(m.group(0))
        else:
            m = self.tokenize_text_re.match(input, pos)
            if m is None:
                return
            pos = m.end()
            if m.group(0) == u"$":
                in_text_mode = 0
            elif m.group(0) == u"{":
                brace_level[-1] += 1
            elif m.group(0) == u"}":
                # Leave text mode when the brace that opened it is closed
                brace_level[-1] -= 1
                if brace_level[-1] <= 0:
                    in_text_mode = 0
                    brace_level.pop()
            self.tokens.append(m.group(0))
We parse T_{E}X and L_{A}T_{E}X in two stages: the first stage splits the input string into tokens; then the transformation rules march through these tokens (generally from left to right), translating them to MathML.
The transformation rules are written as procedures that call each other recursively to build up MathML subtrees.
List of limit commands Parse a piece Parse a subexpression Parse a subexpression chain Parse optional arguments
In our parlance, a piece is a token or a group of tokens that T_{E}X takes as one piece. Examples:
xy    consists of two pieces, x and y, each piece being also a token
{xy}    has only one piece consisting of 4 tokens, {xy}
2^nm    The piece occurring after ^ is the token n
2^{nm}    The piece occurring after ^ is {nm}, consisting of 4 tokens
2^\mathrm{F}_2    The piece occurring after ^ is \mathrm{F}, consisting of 4 tokens
The necessity of distinguishing between a piece and a token should be clear from these examples. In 2^mn, the superscript, as parsed by T_{E}X, is just m. To get an mn superscript, the two tokens must be enclosed as one piece: {mn}.
More precisely, a piece is defined to be one of the following:
a single token other than { or }, or a T_{E}X command token;
a group of tokens beginning with { and ending with a matching }.
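The notion of a piece can be illustrated with a small Python sketch over a list of tokens. The helper read_piece is hypothetical and simplified: it treats any token other than { as a one-token piece, so a command token is returned on its own, without whatever arguments the command would consume.

```python
def read_piece(tokens, pos):
    """Return (piece, next_pos) for the piece that starts at tokens[pos]."""
    if tokens[pos] != "{":
        return tokens[pos:pos + 1], pos + 1      # a single token
    depth = 0
    for end in range(pos, len(tokens)):
        if tokens[end] == "{":
            depth += 1
        elif tokens[end] == "}":
            depth -= 1
            if depth == 0:                       # matching close brace found
                return tokens[pos:end + 1], end + 1
    return tokens[pos:], len(tokens)             # unbalanced: take the rest

# The superscript of 2^nm is n alone; that of 2^{nm} is the whole group.
print(read_piece(["2", "^", "n", "m"], 2))
print(read_piece(["2", "^", "{", "n", "m", "}"], 2))
```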
The following procedure, piecetomathml, takes the next piece and translates it to MathML. To do this, it reads in the next token, and determines the appropriate action depending on the type of the token.
(procedure (piecetomathml)
  (var token (readtoken))
  (var result (nil))
  (cond
    ((streq? token "{")
      (advancetoken)
      (set result (call subexprchaintomathml $hardstoptokens))
      (cond ((streq? (readtoken) "}") (advancetoken))))
    ((in? token $relationsymbols)
      (set result (resultelement "mo" ((get token $relationsymbols))))
      (advancetoken))
    ((in? token $operatorsymbols)
      (set result (resultelement "mo" ((get token $operatorsymbols))))
      (advancetoken))
    ((in? token $leftdelimiters)
      (set result (resultelement "mo" ((get token $leftdelimiters))))
      (advancetoken))
    ((in? token $rightdelimiters)
      (set result (resultelement "mo" ((get token $rightdelimiters))))
      (advancetoken))
    ((in? token $wordoperators)
      (set result (resultelement "mi" (("mathvariant" "normal")) ((get token $wordoperators))))
      (advancetoken))
    ((in? token $greekletters)
      (set result (resultelement "mi" (("fontstyle" "normal")) ((get token $greekletters))))
      (advancetoken))
    ((in? token $namedidentifiers)
      (set result (resultelement "mi" ((get token $namedidentifiers))))
      (advancetoken))
    ((in? token $punctandspace)
      (set result (resultelement "mtext" ((get token $punctandspace))))
      (advancetoken))
    ((in? token $texcommands)
      (advancetoken)
      (set result (call (get token $texcommands))))
    (else
      (set result (resultelement "mn" (token)))
      (advancetoken)))
  (return result))
In T_{E}X the concept that corresponds roughly to a piece is instead called an atom. However, “atom” is the wrong word for the concept, since “atom” means indivisible, but pieces can be groups of atoms.
The two terms piece and atom technically do not refer to the same thing, because of a technical difference in how we do parsing. Recall that we parse a character sequence consisting of digits (and decimal points) as one token representing a number, rather than as separate tokens for each digit. So for example, 431 is one token, and hence one piece, in our sense, while in T_{E}X there are 3 atoms: 4, 3, 1. This can lead to incompatibilities when parsing T_{E}X markup. For example, in T_{E}X, the markup \frac42 6 means the same as \frac{4}{2} 6, and not \frac{42}{6}.
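The difference can be seen by tokenizing the same markup in two ways. The following Python sketch uses two deliberately simplified patterns (they are assumptions for illustration, not the program's tokenizer): one groups digits into a number, as we do, and the other treats each digit separately, as T_{E}X does.

```python
import re

COMMAND = r"\\[a-zA-Z]+"

# Our reading: a run of digits and decimal points is one token.
ours = re.findall(COMMAND + r"|[0-9.]+|\S", r"\frac42 6")

# TeX-like reading: every digit stands on its own.
tex_like = re.findall(COMMAND + r"|\S", r"\frac42 6")

print(ours)
print(tex_like)
```

With the first reading, \frac receives 42 and 6 as its two arguments; with the second, it receives 4 and 2, and the 6 is left over.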
Also, T_{E}X does not allow L_{A}T_{E}X blocks to occur wherever the context demands an atom. So for example, this markup is illegal:
x^\begin{smallmatrix} 1 & 2 \\ 3 & 4\end{smallmatrix}
although this is legal:
x^{\begin{smallmatrix} 1 & 2 \\ 3 & 4\end{smallmatrix}}
In our L_{A}T_{E}X parser both are legal, which simplifies the implementation slightly. Technically it constitutes an incompatibility, but only invalid T_{E}X markup is affected, so there is little to worry about.
A subexpression, in our case, means a piece along with its attached subscripts and superscripts.
FIX ME. This definition will have to be expanded eventually, to allow semantic parsing.
(procedure (subexprtomathml)
  (var result (nil))
  Parse any prescripts for tensor indices
  (var limitstyle (in? (readtoken) $limitcommands))
  (cond
    ((null? (readtoken))
      (cond
        ((notnull? mmultiscripts)
          (prepend mmultiscripts (resultelement "mrow") mprescripts)
          (return mmultiscripts))
        (else (return (resultelement "mrow")))))
    ((in? (readtoken) $leftdelimiters)
      (set result (call heuristicsubexpression)))
    (else (set result (call piecetomathml))))
  (var base result)
  Incorporate the following T_{E}X subscript and superscript, if present
  Parse any postscripts for tensor indices
  Place the sub and superscripts
  (return result))
Parsing subscripts and superscripts is not entirely straightforward.
Firstly, T_{E}X signals subscripts and superscripts by placing ^ or _ tokens after the base object, whereas MathML signals subscripts and superscripts by an element that wraps the base object. In practice, this means we must check for the ^ or _ tokens separately after processing other kinds of objects, and go back and wrap the previous object if ^ or _ is found.
Secondly, subscripts and superscripts in T_{E}X actually stand for underscripts and overscripts, when they are used in big operators (in display style). This is called the “limit style” of subscripts and superscripts. MathML makes clear distinctions between subscripts/superscripts, and underscripts/overscripts, and we must disambiguate between the two styles by looking at the object that the T_{E}X ^ or _ token is being applied to.
Finally, parsing tensor indices (multiple indices for a single base expression) opens another can of worms.
The following subportion of the subexprtomathml procedure determines if the token immediately following a piece is ^ or _, and if so, puts the previous result element into a MathML subscript (msub), superscript (msup), underscript (munder), or overscript (mover) wrapper. Also, a combination of a subscript and a superscript should result in a combined MathML msubsup or munderover element, to avoid the scripts being staggered.
If no T_{E}X subscript or superscript is found, the previous result element is left alone.
(var subscript (nil)) (var superscript (nil)) (cond ((streq? (readtoken) "_") (advancetoken) (set subscript (call piecetomathml))) ((streq? (readtoken) "^") (advancetoken) (set superscript (call piecetomathml)))) (cond ((streq? (readtoken) "_") (advancetoken) (set subscript (call piecetomathml))) ((streq? (readtoken) "^") (advancetoken) (set superscript (call piecetomathml)))) (cond ((notnull? mmultiscripts) (prepend mmultiscripts base mprescripts) (prepend mmultiscripts (if (notnull? subscript) subscript (resultelement "none")) mprescripts) (prepend mmultiscripts (if (notnull? superscript) superscript (resultelement "none")) mprescripts)))
(cond ((notnull? mmultiscripts) (set result mmultiscripts)) ((and (notnull? subscript) (notnull? superscript)) (set result (resultelement (if limitstyle "munderover" "msubsup") (base subscript superscript)))) ((notnull? subscript) (set result (resultelement (if limitstyle "munder" "msub") (base subscript)))) ((notnull? superscript) (set result (resultelement (if limitstyle "mover" "msup") (base superscript)))))
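The choice of wrapper element can be summarized in a few lines of Python. This is only a sketch of the selection logic, using the standard xml.etree.ElementTree module; the function names wrap_scripts and leaf are hypothetical, and tensor indices (mmultiscripts) are left out.

```python
import xml.etree.ElementTree as ET

def leaf(tag, text):
    """Build a MathML leaf element such as <mi>x</mi>."""
    element = ET.Element(tag)
    element.text = text
    return element

def wrap_scripts(base, subscript=None, superscript=None, limit_style=False):
    """Wrap base in the element appropriate for the scripts that are present."""
    if subscript is not None and superscript is not None:
        tag = "munderover" if limit_style else "msubsup"
        children = [base, subscript, superscript]
    elif subscript is not None:
        tag = "munder" if limit_style else "msub"
        children = [base, subscript]
    elif superscript is not None:
        tag = "mover" if limit_style else "msup"
        children = [base, superscript]
    else:
        return base                 # no scripts: leave the base alone
    wrapper = ET.Element(tag)
    wrapper.extend(children)
    return wrapper

squared = wrap_scripts(leaf("mi", "x"), superscript=leaf("mn", "2"))
print(ET.tostring(squared, encoding="unicode"))
```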
T_{E}X subscripts and superscripts applied to the following commands should be understood as MathML underscripts and overscripts. For other commands, subscripts and superscripts continue to be mapped to subscripts and superscripts.
(var $limitcommands (table Big operators Big word operators ("\\underbrace" (nil)) ("\\overbrace" (nil)) ("\\underline" (nil)) ("\\overline" (nil)) ))
When mathematical expressions are to be typeset inline, even the big operators drop the limit style. Nevertheless, the limits are still encoded as underscripts and overscripts in MathML; MathML has a separate attribute (movablelimits) to indicate that the limit style should be dropped.
(var mmultiscripts (nil)) (var mprescripts (nil)) (cond ((and (streq? (readtoken 0) "{") (streq? (readtoken 1) "}") (or (streq? (readtoken 2) "_") (streq? (readtoken 2) "^"))) (set mmultiscripts (resultelement "mmultiscripts")) (set mprescripts (resultelement "mprescripts")) (append mmultiscripts mprescripts) (while (and (streq? (readtoken 0) "{") (streq? (readtoken 1) "}") (or (streq? (readtoken 2) "_") (streq? (readtoken 2) "^"))) (var subscript (nil)) (var superscript (nil)) (advancetoken) (advancetoken) (cond ((streq? (readtoken) "_") (advancetoken) (set subscript (call piecetomathml))) ((streq? (readtoken) "^") (advancetoken) (set superscript (call piecetomathml)))) (cond ((streq? (readtoken) "_") (advancetoken) (set subscript (call piecetomathml))) ((streq? (readtoken) "^") (advancetoken) (set superscript (call piecetomathml)))) (append mmultiscripts (if (notnull? subscript) subscript (resultelement "none"))) (append mmultiscripts (if (notnull? superscript) superscript (resultelement "none"))))))
(while (and (streq? (readtoken 0) "{") (streq? (readtoken 1) "}") (or (streq? (readtoken 2) "_") (streq? (readtoken 2) "^"))) (cond ((null? mmultiscripts) (set mmultiscripts (resultelement "mmultiscripts" (base))) (set mprescripts (nil)) (cond ((or? (notnull? superscript) (notnull? subscript)) (append mmultiscripts (if (notnull? subscript) subscript (resultelement "none"))) (append mmultiscripts (if (notnull? superscript) superscript (resultelement "none"))))))) (var subscript (nil)) (var superscript (nil)) (advancetoken) (advancetoken) (cond ((streq? (readtoken) "_") (advancetoken) (set subscript (call piecetomathml))) ((streq? (readtoken) "^") (advancetoken) (set superscript (call piecetomathml)))) (cond ((streq? (readtoken) "_") (advancetoken) (set subscript (call piecetomathml))) ((streq? (readtoken) "^") (advancetoken) (set superscript (call piecetomathml)))) (prepend mmultiscripts (if (notnull? subscript) subscript (resultelement "none")) mprescripts) (prepend mmultiscripts (if (notnull? superscript) superscript (resultelement "none")) mprescripts))
(cond ((and (streq? token2 "^") (streq? (readtoken) "\\circ"))
(advancetoken)
(append result "\u00b0"))
(cond ((streq? (readtoken) "_")
(set token2 "_")
(advancetoken))
(else
(set token2 (nil)))))
In T_{E}X, the markup ' is equivalent to ^{\prime}. This sets the \prime symbol, which is a vertically centered glyph, to the superscript position, so that it appears as the familiar mathematical prime mark.
There is some ambiguity in how a prime mark should be translated:
In view of these two interpretations of the prime mark, we employ the following heuristic approach when converting formulae with prime marks into MathML:
The prime mark is entered in T_{E}X as the ASCII apostrophe ('), but it should be translated to the proper Unicode prime mark character for MathML (U+2032 ′).
A single prime mark following an identifier is appended to the identifier itself: <mi>a′</mi>, for example. In the case of two prime marks, the marks should be coalesced into one Unicode double-prime character (U+2033 ″).
Three or more prime marks are entered separately as operators: <mi>f</mi> <mo>′</mo> <mo>′</mo> <mo>′</mo>.
The following procedure fragment implements this heuristic.
((in? token $namedidentifiers)
  (set result (call piecetomathml))
  (cond
    ((streq? (readtoken) "'")
      (advancetoken)
      (cond
        ((streq? (readtoken) "'")
          (advancetoken)
          (cond
            ((streq? (readtoken) "'")
              (advancetoken)
              Add arbitrary number of prime marks as operators)
            (else (append result "\u2033"))))
        (else (append result "\u2032"))))))
For three or more prime marks, loop through them and enter them separately as operators.
(set result
  (resultelement "mrow"
    (result
      (resultelement "mo" "\u2032")
      (resultelement "mo" "\u2032")
      (resultelement "mo" "\u2032"))))
(while (streq? (readtoken) "'")
  (append result (resultelement "mo" "\u2032"))
  (advancetoken))
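The outcome of the prime-mark heuristic for an identifier can be sketched in Python as a function from the number of prime marks to the MathML markup. The function name identifier_with_primes is hypothetical, and the markup is built as plain strings purely for illustration.

```python
PRIME = "\u2032"         # single prime mark
DOUBLE_PRIME = "\u2033"  # double prime mark

def identifier_with_primes(name, primes):
    """Return MathML for an identifier followed by a number of prime marks."""
    if primes == 0:
        return "<mi>%s</mi>" % name
    if primes == 1:
        return "<mi>%s%s</mi>" % (name, PRIME)
    if primes == 2:
        return "<mi>%s%s</mi>" % (name, DOUBLE_PRIME)
    # Three or more: each prime mark becomes a separate operator.
    operators = ("<mo>%s</mo>" % PRIME) * primes
    return "<mrow><mi>%s</mi>%s</mrow>" % (name, operators)

print(identifier_with_primes("f", 3))
```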
(procedure (subexprchaintomathml stoptokens)
  (var result (nil))
  (var mrow (nil))
  (var mfrac (nil))
  (var wrappedresult (nil))
  (while (and (notnull? (readtoken)) (not (in? (readtoken) stoptokens)))
    (cond
      Parse \over
      Parse \choose
      (else
        (var node
          Parse subexpressions, organizing them in a tree according to precedence
        )
        Collect results
      )))
  (cond
    ((notnull? mfrac) (append mfrac result) (return wrappedresult))
    (else (return result))))
(var node (call subexprtomathml)) Collect results
(cond ((notnull? mrow) (append mrow node)) ((notnull? result) (set mrow (resultelement "mrow" (result node))) (set result mrow)) (else (set result node)))
Fractions and binomial coefficients in the old T_{E}X style are written using the \over and \choose commands. The design of these commands, to put it mildly, is terrible. These commands take their arguments on the left and right sides, up to the edges of the current block (delimited by braces { and }). They are a pain to parse because they do not behave like other T_{E}X and L_{A}T_{E}X commands and require special cases in code.
Nevertheless some authors still use them instead of the better designed, if a little more verbose, L_{A}T_{E}X equivalents, so we must support them.
Whenever we encounter an \over or \choose, the current subexpression up to that point (i.e. the left argument to the command) is stowed away. Then an empty mrow is made, which will accumulate the rest of the arguments (the right argument to the command). At the end of the expression, the two pieces are put together. Two separate variables, mfrac and wrappedresult, are needed, to provide a location to store the arguments, and to signal at the end that a fraction (or binomial coefficient) is to be returned.
((streq? (readtoken) "\\over") (advancetoken) (set mfrac (resultelement "mfrac" (result))) (set wrappedresult mfrac) (set mrow (nil)) (set result (nil)))
((streq? (readtoken) "\\choose") (advancetoken) (set mfrac (resultelement "mfrac" (("linethickness" "0")) (result))) (set wrappedresult (resultelement "mrow" ((resultelement "mo" "(") mfrac (resultelement "mo" ")")))) (set mrow (nil)) (set result (nil)))
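The stow-away technique can be shown in miniature with a Python sketch that works on a flat list of tokens from a single block. The function name parse_block and the tuple-based result are assumptions made for illustration; nested braces and \choose are not handled.

```python
def parse_block(tokens):
    """Group the tokens of one block, honoring a single \\over command."""
    current = []        # accumulates the subexpression read so far
    numerator = None    # left argument, stowed away when \over is seen
    for token in tokens:
        if token == "\\over":
            numerator = current   # stow away everything read so far
            current = []          # start accumulating the right argument
        else:
            current.append(token)
    if numerator is not None:
        return ("mfrac", numerator, current)
    return ("mrow", current)

print(parse_block(["a", "+", "b", "\\over", "2"]))
```

As in the procedure above, the decision to return a fraction is made only at the end of the block, once both arguments are known.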
A required argument to a command (such as the argument {ab} in the T_{E}X markup \frac{ab}{2}) is, by definition, a piece. An argument is almost always enclosed between { and }, and books and articles on L_{A}T_{E}X always talk about arguments as if this is always the case. But the T_{E}Xbook, and the L_{A}T_{E}X implementation itself, say otherwise: arguments can be atoms without the braces. (In fact, T_{E}X itself (but not L_{A}T_{E}X) does not even have the concept of a “command with arguments”.)
Consequently, our T_{E}X parser does not need to specifically care about arguments either; we can just parse them as pieces.
L_{A}T_{E}X does have a concept of “optional argument” to commands. At a fundamental level optional arguments are different from required arguments. An optional argument is not a piece, but a series of pieces: the first piece being the opening bracket [, and the last piece being the closing bracket ]. Note that the brackets are not syntax characters of T_{E}X, and can be used literally in T_{E}X markup outside of optional arguments.
All this means that optional argument processing must be handled by a special handler (below), and the handler must be explicitly called. Fortunately, optional arguments are very rarely used, so the added complexity is not too much to bear.
(var $optionalargstoptokens (table Hard stop tokens ("]" (nil)) )) (procedure (optionalargtomathml) (cond ((strneq? (readtoken) "[") (return (nil)))) (advancetoken) (var result (call subexprchaintomathml $optionalargstoptokens)) (cond ((streq? (readtoken) "]") (advancetoken))) (return result))
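For comparison, here is a simplified Python sketch of an optional-argument reader over a list of tokens. The name read_optional_arg is hypothetical. Unlike the procedure above, the sketch does not parse the contents into MathML and does not honor the other stop tokens; it only shows that the handler returns nothing when no [ is present and otherwise consumes everything up to the closing ].

```python
def read_optional_arg(tokens, pos):
    """Return (argument_tokens, next_pos); argument_tokens is None if absent."""
    if pos >= len(tokens) or tokens[pos] != "[":
        return None, pos            # no optional argument here
    end = pos + 1
    while end < len(tokens) and tokens[end] != "]":
        end += 1
    argument = tokens[pos + 1:end]
    if end < len(tokens):
        end += 1                    # consume the closing bracket
    return argument, end

# \sqrt[3]{x}: the optional argument is the token 3.
print(read_optional_arg(["\\sqrt", "[", "3", "]", "{", "x", "}"], 1))
```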
("&" (nil)) ("\\\\" (nil)) ("}" (nil)) ("$" (nil)) ("\\end" (nil)) ("\\right" (nil)) ("\\bigr" (nil)) ("\\Bigr" (nil)) ("\\biggr" (nil)) ("\\Biggr" (nil)) ("\\choose" (nil)) ("\\over" (nil))
Grouping Parsing according to precedence levels
The T_{E}X-to-MathML parser also does grouping of subexpressions. This means taking a series of symbols that occur without nesting in T_{E}X, but really constitute a logical subexpression, and grouping them as such in the MathML output. For example, the T_{E}X markup (x+y+2z)^2 should be translated as:
<msup>
  <mrow>
    <mo>(</mo>
    <mrow>
      <mi>x</mi> <mo>+</mo> <mi>y</mi> <mo>+</mo>
      <mrow> <mn>2</mn> <mi>z</mi> </mrow>
    </mrow>
    <mo>)</mo>
  </mrow>
  <mn>2</mn>
</msup>
The straightforward translation misses the proper semantic encoding, and is not good MathML markup:
<mrow> <mo>(</mo> <mi>x</mi> <mo>+</mo> <mi>y</mi> <mo>+</mo> <mn>2</mn> <mi>z</mi> <msup> <mo>)</mo> <mn>2</mn> </msup> </mrow>
(var $hardstoptokens (table Hard stop tokens )) (var $rightdelimiterstoptokens (table Hard stop tokens Right delimiters )) (procedure (heuristicsubexpression) (var result (resultelement "mrow")) (append result (call piecetomathml)) (append result (call subexprchaintomathml $rightdelimiterstoptokens)) (cond ((and (notnull? (readtoken)) (not (in? (readtoken) $hardstoptokens))) (append result (call piecetomathml)))) (return result))
In many situations, the first step for a computer to understand a language is being able to deduce a parse tree from the input (a list of tokens), in which the substructures in the input are clearly isolated.
Although basic L_{A}T_{E}X parsing can be done without a parse tree — indeed, the T_{E}X system itself does not generate parse trees to digest its input — the vast majority of math formulae and expressions, as used by practitioners of mathematics, are based on nesting substructures. An example has already been shown at the beginning of this chapter.
The heuristic semantic parsing pursued by our parser, in essence, is the deduction of a parse tree for mathematical formulae without being given the exact formal grammar that the formulae are written in. The exact formal grammar, of course, does not exist.
Nevertheless, we can try to construct the required formal grammar. We will not actually be using a parser generator on the formal grammar to write the L_{A}T_{E}X parser, because of problems with efficiency, robustness, and ambiguity. But it is still useful to write down pieces of the formal grammar, as a concise summary of what parse tree (i.e. MathML tree) we are to generate from the input.
Relation             →  Expr | Expr RelationOp Relation
Expr                 →  Term | Term AddOp Expr
Term                 →  Subexpr (MultiplyOp Term | Term)?
Subexpr              →  Identifier | Numeral | ( Relation ) | BigOperation | FunctionApplication
FunctionApplication  →  Identifier ( (Identifier | Numeral | Relation | BigOperation) )
BigOperation         →  BigOp BigOperandExpr
BigOperandExpr       →  Term | BigOperation
AddOp                →  + | − | ⊕ | ∪ | ∨ | …
MultiplyOp           →  ⋅ | ∕ | × | ⊗ | ∩ | ∧ | …
BigOp                →  ∑ | ∫ | ⋃ | ⋂ | …
RelationOp           →  = | ∊ | ⊂ | ⊆ | …
Identifier           →  [a-zA-Z] | α | … | Ω | sin | exp | …
Numeral              →  −? [0-9]* (. [0-9]+)?
The goal is to take expressions containing operators mixed together in a list and turn them into a tree, in which each node is a subexpression whose operator leaf nodes all have the same precedence level. And of course, operators that bind more tightly are to occur nearer to the bottom of the tree (away from the root).
Definition of precedence levels for common operators The general parsing procedure, for infix operators Procedure to parse invisible groups
The general procedure to parse infix operators with precedence is (surprisingly) easy and straightforward.
Consider first a simple case, an input like 3*4*5 + 6*7, where we are to separate the addition terms and the multiplication factors. Suppose that the current token is at 3. We read that in, and process it. Next, we look at the next token, which is *. This tells us that the token that follows (4) is the other operand of the multiplication *. So we accumulate in our buffer: 3*4. Then, we read the next token, the second * operator. This tells us again that whatever follows (5) is an operand. We then accumulate in the buffer: 3*4*5. Finally, we encounter the + operator. Since this is not the same precedence level as the * operators we have been considering, we immediately flush the buffer, and start anew. Then the + is output, and processing at the token 6 progresses as at the first token 3.
Parsing, in general, of more than two precedence levels uses stacked functions. Each level of depth of the function calls handles one level of precedence. The operation of “flushing the buffer” then is just returning from a function call with the buffered result, passing control to the function on top — the caller — which will handle the next precedence level. See the next section for a concrete example.
Here is the actual implementation:
(procedure (collectprecedencegroup operators stoptokens reader)
  (var result (call reader stoptokens))
  (var mrow (nil))
  (while (and (notnull? (readtoken))
              (not (in? (readtoken) stoptokens))
              (in? (readtoken) operators))
    (cond
      ((null? mrow)
        (set mrow (resultelement "mrow" (result)))
        (set result mrow)))
    (append mrow (call piecetomathml))
    (cond
      ((and (notnull? (readtoken)) (in? (readtoken) stoptokens))
        (return result))
      (else (append mrow (call reader stoptokens)))))
  (return result))
The parameter operators is a table of operators at the same given precedence level. The parameter reader is a function object that will be called to parse tokens at the next precedence level (deeper into the tree).
The mrow variable points to an mrow result element which serves as a buffer. But if the buffer is to contain one object (subtree) only, the mrow wrapper will be omitted.
Most of the apparent complexity in the collectprecedencegroup procedure is actually checking when to stop processing. Clearly, the procedure must not loop past the end of input, but it also must stop processing at certain tokens (given by the parameter table stoptokens). For example, the procedure typically stops processing at the } token marking the end of the current block. (Obviously, expression chains should not leak through a T_{E}X syntactic block.)
Compared to parsing precedence levels based on a context-free grammar, our implementation has a serious “win” in that no explicit backtracking of tokens is required. In effect, the backtracking happens by having the function call unwind. Thus the implementation is simplified with good performance characteristics: the time to process the input remains linear in the number of tokens; and the runtime memory required is linear in the nesting depth of the mathematical structures represented by the input markup. Yet, because each parsing level can be implemented by different functions, we still retain a lot of flexibility.
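The stacked-function scheme can be sketched in Python on the example 3*4*5 + 6*7. This is an illustration under simplifying assumptions, not the program's implementation: operands are single tokens, the result tree is a nested list rather than MathML, and the names collect_group and LEVELS are hypothetical. Each recursion depth handles one precedence level, and returning from a call plays the role of flushing the buffer.

```python
def collect_group(tokens, pos, levels, stop_tokens):
    """Parse from tokens[pos] at precedence level levels[0]; return (tree, next_pos)."""
    if not levels:
        # Deepest level: a single operand token (a simplification).
        if pos >= len(tokens) or tokens[pos] in stop_tokens:
            return None, pos
        return tokens[pos], pos + 1
    operators, deeper = levels[0], levels[1:]
    result, pos = collect_group(tokens, pos, deeper, stop_tokens)
    row = None                      # buffer, created only when an operator appears
    while (pos < len(tokens) and tokens[pos] not in stop_tokens
           and tokens[pos] in operators):
        if row is None:
            row = [result]
            result = row
        row.append(tokens[pos])     # the operator itself
        pos += 1
        if pos >= len(tokens) or tokens[pos] in stop_tokens:
            break
        operand, pos = collect_group(tokens, pos, deeper, stop_tokens)
        row.append(operand)
    return result, pos              # "flush the buffer" back to the caller

# Loosest-binding level first.
LEVELS = [{"+", "-"}, {"*", "/"}]
tree, end = collect_group("3 * 4 * 5 + 6 * 7".split(), 0, LEVELS, {"}"})
print(tree)
```

No token is ever read twice, which is the linear-time behavior claimed above.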
This is the hook into subexprchaintomathml that parses subexpressions with precedence levels.
Level 1: relations
(call collectprecedencegroup $relationsprecedencegroup stoptokens
Level 2: addition-like operators
  (lambda (stoptokens) (call collectprecedencegroup $additionprecedencegroup stoptokens
Level 3: multiplication-like operators
    (lambda (stoptokens) (call collectprecedencegroup $multiplicationprecedencegroup stoptokens
Level 4: invisible multiplication-like operators
      collectinvisiblegroup)))))
(var $relationsprecedencegroup $relationsymbols)
(var $additionprecedencegroup (table
  ("+" (nil))
  ("-" (nil))
  ("\\oplus" (nil))
))
(var $multiplicationprecedencegroup (table
  ("*" (nil))
  ("\\times" (nil))
  ("\\cdot" (nil))
  ("/" (nil))
))
(procedure (collectinvisiblegroup stoptokens) (var result (call subexprtomathml)) (var mrow (nil)) (while (and (notnull? (readtoken)) (not (in? (readtoken) stoptokens)) (or (in? (readtoken) $namedidentifiers) (in? (readtoken) $leftdelimiters))) (cond ((null? mrow) (set mrow (resultelement "mrow" (result))) (set result mrow))) (append mrow (resultelement "mo" ("\u2062"))) (cond ((and (notnull? (readtoken)) (in? (readtoken) stoptokens)) (return result)) (else (append mrow (call subexprtomathml))))) (return result))
// This script was automatically generated from a literate source.
// Do not edit this file; look at the literate source instead!
//
// Greasemonkey user script to
// Display LaTeX in Web pages by transforming to MathML
//
// Home page: http://goldsaucer.afraid.org/mathml/greasemonkey/
//
// 
Copyright notice
// 
User script data
Subroutines for result trees
Workaround for Mozilla not supporting mathvariant
The following are basic metadata describing the user script, required by the Greasemonkey extension. Also sets the default Web pages where the user script is to activate.
// ==UserScript==
// @name          Display LaTeX
// @namespace     http://goldsaucer.afraid.org/mathml/greasemonkey/
// @description   Display LaTeX in Web pages by transforming into MathML
// @include       http://goldsaucer.afraid.org/mathml/greasemonkey/
// @include       http://goldsaucer.afraid.org/writings/Display_LaTeX_sandbox
// @include       http://planetmath.org/*
// ==/UserScript==
The MathML XML namespace, needed for creating MathML elements
const mmlns = 'http://www.w3.org/1998/Math/MathML';
Create MathML result element
Append MathML result element to another element’s content
Prepend MathML result element before another child element
Change attribute of result element
Change attribute of result element
Compatibility layer for the Epiphany browser
Greasemonkey configuration
function result_element(tag, num_attrs) {
  var node = document.createElementNS(mmlns, tag);
  var k = 2;
  // The arguments after num_attrs hold num_attrs (name, value) pairs,
  // followed by the child nodes or strings.
  while(num_attrs-- > 0) {
    if(arguments[k+1] != null) {
      node.setAttribute(arguments[k], arguments[k+1]);
    }
    k += 2;
  }
  for(; k < arguments.length; k++) {
    if(arguments[k] != null) {
      if(typeof(arguments[k]) == 'string')
        node.appendChild(document.createTextNode(arguments[k]));
      else
        node.appendChild(arguments[k]);
    }
  }
  return node;
}
function result_element_append(parent, child) {
  if(parent != null && child != null) {
    if(typeof(child) == 'string')
      parent.appendChild(document.createTextNode(child));
    else
      parent.appendChild(child);
  }
}
function result_element_prepend(parent, child, next) {
  if(next == null)
    result_element_append(parent, child);
  else if (parent != null && child != null)
    parent.insertBefore(child, next);
}
function result_set_attr(elem, attr, value) {
  if(elem != null && attr != null) {
    if(value != null)
      elem.setAttribute(attr, value);
    else
      elem.removeAttribute(attr);
  }
}
function result_append_attr(elem, attr, value) {
  if(elem != null && attr != null) {
    // getAttribute takes only the attribute name.
    var old_value = elem.getAttribute(attr);
    if(old_value == null)
      elem.setAttribute(attr, value);
    else
      elem.setAttribute(attr, old_value + value);
  }
}
if(!this.GM_getValue) {
  this.GM_getValue = function(key, value) { return value; };
  this.GM_log = function() {};
}
if(this.GM_registerMenuCommand) {
  GM_registerMenuCommand("Enable native display of math images",
    function() {
      GM_setValue("patchimages", true);
      do_patch_images = true;
      patch_element(document.documentElement);
    });
  GM_registerMenuCommand("Disable native display of math images",
    function() {
      GM_setValue("patchimages", false);
    });
}
Workaround for Mozilla not supporting mathvariant
Unfortunately, Mozilla does not support the mathvariant attribute of MathML 2.0. We work around this by directly replacing uppercase Roman letters with their variant characters.
This process is done in JavaScript and not in the S-expressions minilanguage, because the minilanguage is not powerful enough for this, and the resultant MathML is not portable anyway. (The list of character codes seems to be specific to Mozilla and not standard.)
const char_map = {
  'script': [
    '\uEF35', '\u212C', '\uEF36', '\uEF37', '\u2130', '\u2131', '\uEF38',
    '\u210B', '\u2110', '\uEF39', '\uEF3A', '\u2112', '\u2133', '\uEF3B',
    '\uEF3C', '\uEF3D', '\uEF3E', '\u211B', '\uEF3F', '\uEF40', '\uEF41',
    '\uEF42', '\uEF43', '\uEF44', '\uEF45', '\uEF46' ],
  'fraktur': [
    '\uEF5D', '\uEF5E', '\u212D', '\uEF5F', '\uEF60', '\uEF61', '\uEF62',
    '\u210C', '\u2111', '\uEF63', '\uEF64', '\uEF65', '\uEF66', '\uEF67',
    '\uEF68', '\uEF69', '\uEF6A', '\u211C', '\uEF6B', '\uEF6C', '\uEF6D',
    '\uEF6E', '\uEF6F', '\uEF70', '\uEF71', '\u2128' ],
  'double-struck': [
    '\uEF8C', '\uEF8D', '\u2102', '\uEF8E', '\uEF8F', '\uEF90', '\uEF91',
    '\u210D', '\uEF92', '\uEF93', '\uEF94', '\uEF95', '\uEF96', '\u2115',
    '\uEF97', '\u2119', '\u211A', '\u211D', '\uEF98', '\uEF99', '\uEF9A',
    '\uEF9B', '\uEF9C', '\uEF9D', '\uEF9E', '\u2124' ],
};
// Global flag, so that every uppercase letter in a text node is replaced.
const uppercase_re = /[A-Z]/g;
function fix_mathvariant(node, style) {
  if(node.nodeType == node.TEXT_NODE) {
    if(style != null && style != '' && style in char_map) {
      node.data = node.data.replace(uppercase_re, function(s) {
        // 'A' has character code 65, so this indexes the 26-entry table.
        return char_map[style][s.charCodeAt(0) - 65];
      });
    }
  } else if(node.nodeType == node.ELEMENT_NODE) {
    var new_style = node.getAttribute('mathvariant');
    if(new_style != null && new_style != '')
      style = new_style;
    for(var i=0; i < node.childNodes.length; i++)
      fix_mathvariant(node.childNodes.item(i), style);
  }
}
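The index arithmetic in the substitution can be tried outside the browser. The following is only a minimal sketch: demo_map and substitute_variant are hypothetical names, and the table holds just the double-struck letters that have standard Unicode code points, not the Mozilla-specific codes listed above. Slots left as null keep the original letter.

```javascript
// Hypothetical demo table with 26 slots, one per uppercase letter.
// Only letters with a standard double-struck code point are filled in.
const demo_map = new Array(26).fill(null);
const standard_doublestruck = {
  C: '\u2102', H: '\u210D', N: '\u2115', P: '\u2119',
  Q: '\u211A', R: '\u211D', Z: '\u2124'
};
for (const letter in standard_doublestruck) {
  demo_map[letter.charCodeAt(0) - 65] = standard_doublestruck[letter];
}

function substitute_variant(text, table) {
  // 'A' has character code 65, so (code - 65) is the slot in the table.
  return text.replace(/[A-Z]/g, function (s) {
    const variant = table[s.charCodeAt(0) - 65];
    return variant === null ? s : variant;
  });
}
```

For example, substitute_variant('R', demo_map) gives the double-struck R (U+211D), while lowercase letters and letters with an empty slot pass through unchanged.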
The user script calls patch_element on the root element to patch all LaTeX markup within the document.
Tokenizing T_{E}X input
Supplemental processing of MathML
Procedure to patch an img
Procedure to patch a text node
Procedure to patch a DOM subtree
var do_patch_images = GM_getValue("patchimages", false);
var delayed_patch = GM_getValue("delayedpatch", false);
patch_element(document.documentElement);
The patch_element procedure takes the given node, and replaces all LaTeX text in $ signs occurring within that node. It recursively calls itself to handle the children of the given node.
The procedure can be written more succinctly, without an explicit recursion, as:
var iter = document.createNodeIterator(
    document.documentElement, NodeFilter.SHOW_TEXT, null, true);
var n;
while(n = iter.nextNode()) {
  patch_text(n);
}
However, the DOM node iteration API is not supported by Mozilla, so we have to write the procedure the “long” way, as given below.
function patch_element(node) {
  if(node.nodeType == node.TEXT_NODE)
    patch_text(node);
  else if(node.nodeType == node.ELEMENT_NODE) {
If the current node is a text box control of an HTML form, then do not replace LaTeX markup there. Not only would it not display properly (a text box displays only plain text), but the user may well be editing some LaTeX markup in the text box! Also, do not attempt to replace LaTeX markup in (JavaScript) scripts; the dollar signs used in JavaScript confuse the LaTeX parser.
    if(node.tagName == 'TEXTAREA' || node.tagName == 'textarea'
       || node.tagName == 'INPUT' || node.tagName == 'input'
       || node.tagName == 'SCRIPT' || node.tagName == 'script')
      return;
    if(do_patch_images && (node.tagName == 'IMG' || node.tagName == 'img')) {
      if(!delayed_patch)
        patch_img(node);
      else
        node.addEventListener("click", patch_img, false);
      return;
    }
    var child = node.firstChild;
    while(child) {
      // Save the next sibling first: patching may replace or split child.
      var next = child.nextSibling;
      patch_element(child);
      child = next;
    }
  }
}
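The shape of this traversal can be exercised without a browser. The sketch below uses a hypothetical stand-in for DOM nodes (plain strings for text nodes, objects with tag and children for elements); collect_text and SKIPPED_TAGS are illustrative names and not part of the user script.

```javascript
// Hypothetical stand-in for the DOM: a text node is a string, and an
// element is an object of the form { tag: 'p', children: [...] }.
const SKIPPED_TAGS = ['textarea', 'input', 'script'];

function collect_text(node, out) {
  if (typeof node === 'string') {
    // This is where the real script would call patch_text.
    out.push(node);
    return out;
  }
  // Skip form text boxes and scripts entirely, including their children.
  if (SKIPPED_TAGS.indexOf(node.tag.toLowerCase()) >= 0) return out;
  // Iterate over a copy of the child list, in the same spirit as saving
  // nextSibling before recursing in the real code.
  node.children.slice().forEach(function (child) {
    collect_text(child, out);
  });
  return out;
}

const demo_tree = {
  tag: 'BODY',
  children: [
    '$a$',
    { tag: 'TEXTAREA', children: ['$b$'] },
    { tag: 'p', children: ['$c$'] }
  ]
};
```

Calling collect_text(demo_tree, []) visits the text '$a$' and '$c$' but never the contents of the TEXTAREA.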
It is easy to extend the math patching code so that it also patches the math images produced by the LaTeX2HTML translator, and displays them with the browser’s native MathML renderer instead.
Sometimes the browser’s MathML rendering turns out to look better than the raster images from LaTeX2HTML — especially if the user is reading Web pages at a font size smaller or larger than usual. Sometimes the MathML rendering is worse. So patching images is only done optionally.
But more importantly, by having the MathML translation code activate on LaTeX2HTML pages, we get more opportunities to test and debug our program.
function patch_img(node) {
patch_img can also be called from an event handler; in that case, the argument is an event object rather than a DOM node. We extract the DOM node stored in that event object.
  if(node.currentTarget) node = node.currentTarget;
  var alt = node.getAttribute('alt');
  if(alt == null || /^\\includegraphics|^\$\\displaystyle \\xymatrix/.test(alt))
    return;
  var latex_string = null;
Prefer use of the MATH comment, if present, instead of the ALT attribute
  if(!latex_string && /^\$.+\$$/.test(alt) && !(/\.{3} \.{3}/.test(alt))) {
    latex_string = alt;
  }
  if(latex_string == null) return;
Do the LaTeX-to-MathML translation
  if(mathml == null) return;
  node.parentNode.replaceChild(mathml, node);
}
LaTeX2HTML has a stupid misfeature where it will truncate a long math formula in an ALT attribute. Thus we cannot always recover the original math formula from the ALT attribute alone. But at least LaTeX2HTML does, most of the time, put the full formula in a <!-- MATH ... --> HTML comment occurring before the image. We use the contents of that comment instead, if it is present.
Check for images of formulae produced by LaTeX2HTML
if((node.parentNode.tagName == 'DIV' && node.parentNode.getAttribute('CLASS') == 'mathdisplay')
   || (node.parentNode.tagName == 'SPAN' && node.parentNode.getAttribute('CLASS') == 'MATH')) {
  var parent = node.parentNode;
  var previous = parent.previousSibling;
  const non_whitespace = /[^\s]/;
Skip over whitespace nodes in the DOM
  if(previous && previous.nodeType == node.TEXT_NODE && !non_whitespace.test(previous.data))
    previous = previous.previousSibling;
Sometimes the comment appears in the previous paragraph
  if(previous && previous.nodeType == node.ELEMENT_NODE && previous.tagName == 'P' && previous.lastChild) {
    previous = previous.lastChild;
    if(previous && previous.nodeType == node.TEXT_NODE && !non_whitespace.test(previous.data))
      previous = previous.previousSibling;
  }
Extract the full TeX formula for the image from the comment, if found
  if(previous && previous.nodeType == node.COMMENT_NODE) {
    latex_string = previous.data.replace(/^\s*MATH\s*/, '')
                                .replace(/\s+$/, '');
  }
}
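The preference order between the comment and the ALT attribute can be sketched as a pure function on strings, leaving the DOM navigation aside. The helper name choose_latex_source is hypothetical; the regular expressions are the ones used in the code above, and the "... ..." pattern is taken to be the truncation marker only because that is what the code tests for.

```javascript
// Hypothetical helper. comment_data is the text of the preceding HTML
// comment (null if none was found); alt is the image's ALT text (or null).
function choose_latex_source(comment_data, alt) {
  if (comment_data != null) {
    // Strip the leading "MATH" marker and the surrounding whitespace.
    return comment_data.replace(/^\s*MATH\s*/, '').replace(/\s+$/, '');
  }
  // Fall back to ALT only if it is a complete $...$ formula that does
  // not contain the "... ..." pattern the script treats as truncation.
  if (alt != null && /^\$.+\$$/.test(alt) && !/\.{3} \.{3}/.test(alt)) {
    return alt;
  }
  return null;
}
```

With a comment present, its cleaned contents win; without one, a truncated ALT text yields null and the image is left alone.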
var tokens = new Object();
tokens.list = tokenize_latex_math(latex_string);
tokens.list.push(null);
tokens.index = 0;
var mathml = null;
try {
  var mrow = v_subexpr_chain_to_mathml(tokens, {});
Fix display of variant characters, as described in “Workaround for Mozilla not supporting mathvariant”
  fix_mathvariant(mrow, null);
  mathml = document.createElementNS(mmlns, 'math');
  mathml.setAttribute("latex", latex_string);
  mathml.setAttribute("mathvariant", "normal");
  mathml.appendChild(mrow);
  mathml.addEventListener("click", post_process_mathml, false);
} catch(e) {
  GM_log("Display LaTeX failed with error " + e + " on " + latex_string);
}
function patch_text(node0) {
  var text = node0.nodeValue;
  var results = /\$[^$]+\$|\[tex\](.+?)\[\/tex\]/.exec(text);
  if(results) {
    var latex_string = (results[1] == null ? results[0] : '$'+results[1]+'$');
Do the LaTeX-to-MathML translation
    if(mathml == null) return;
Split up node0 into two nodes, at the point where the first $ sign occurs
    var node2 = node0.splitText(results.index);
Now delete the original LaTeX markup
    node2.deleteData(0, results[0].length);
Make and insert a math element in place of the deleted LaTeX markup
    node2.parentNode.insertBefore(mathml, node2);
There may be more than one $...$ LaTeX block in a single text node, and we processed only one of them, so process the others by a recursive call.
    patch_text(node2);
  }
}
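To see what the regular expression in patch_text matches, the following sketch lists every LaTeX span in a string. It is an illustration under assumptions, not the script's method: find_latex_spans is a hypothetical name, and it loops with a global regular expression instead of splitting DOM text nodes and recursing.

```javascript
// Hypothetical helper: returns the position, length, and normalized
// LaTeX source of every $...$ or [tex]...[/tex] span in the text.
function find_latex_spans(text) {
  const pattern = /\$[^$]+\$|\[tex\](.+?)\[\/tex\]/g;
  const spans = [];
  let m;
  while ((m = pattern.exec(text)) !== null) {
    spans.push({
      index: m.index,
      length: m[0].length,
      // Group 1 is set only for the [tex]...[/tex] form; in that case
      // the contents are rewrapped in $ signs, as patch_text does.
      latex: m[1] === undefined ? m[0] : '$' + m[1] + '$'
    });
  }
  return spans;
}
```

For the input 'Let $x$ and [tex]y^2[/tex].' this reports two spans, at offsets 4 and 12, with sources '$x$' and '$y^2$'. The index and length fields correspond to the arguments that patch_text passes to splitText and deleteData.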
The LaTeX-to-MathML translator can post the generated MathML content to a Web service. This feature is intended for advanced users and developers. For example, the author of this program uses it to activate a separate script on his computer that speaks the MathML to him (voice synthesis), and to collect the MathML into a database for the program test suite.
To use this feature, the user must initialize it manually by setting the Mozilla preference “greasemonkey.scriptvals.http://goldsaucer.afraid.org/mathml/greasemonkey//Display LaTeX.clickposturl” to the desired Web service URL. Then, to post any displayed MathML content in the Web browser, click on the MathML.
It would be better if this functionality were implemented by calling an external program directly, but unfortunately, Greasemonkey user scripts do not have access to the XPCOM interfaces required for this task.
Warning: This feature, if used carelessly, may lead to security or privacy breaches. (Hence, it is not activated by default.)
function post_process_mathml(event) {
  var url = GM_getValue('clickposturl', null);
  if(url == null) return;
  var ser = new XMLSerializer();
  var xhr = GM_xmlhttpRequest({
    method: 'POST',
    url: url,
    headers: {
      'Content-Type': 'text/xml; charset=utf-8',
      'Content-Location': document.location
    },
    data: ser.serializeToString(event.currentTarget),
    onerror: function(details) {
      alert("There was an error processing the request. "
            + "HTTP status code " + details.status + ' ' + details.statusText);
    },
    onload: function(details) {
      window.status = "Successfully posted MathML. Status: "
                      + details.status + ' ' + details.statusText;
    }});
  window.status = "Posting MathML to " + url + "...";
}
Trees
Serializing XML
Tokenizing TeX input
Program
String.prototype.repeat = function(n) {
  return new Array(n+1).join(this);
};
// Global regular expressions are used so that every occurrence is escaped,
// not only the first one.
function xml_escape(s) {
  s = s.replace(/&/g, '&amp;').
        replace(/</g, '&lt;').
        replace(/>/g, '&gt;');
  return s.replace(/[\u0080-\uffff]/g,
    function(x) { return '&#' + x.charCodeAt(0) + ';' });
}
function xml_attr_escape(s) {
  s = s.replace(/&/g, '&amp;').
        replace(/"/g, '&quot;').
        replace(/</g, '&lt;').
        replace(/>/g, '&gt;');
  return s.replace(/[\u0080-\uffff]/g,
    function(x) { return '&#' + x.charCodeAt(0) + ';' });
}
function serialize_mathml(tree, indent_level) {
  var indent_string = ' '.repeat(indent_level);
  if(tree instanceof PlainXMLNode) {
    var start_tag = '<' + tree.tag;
    if(tree.attrs != null) {
      for(var a in tree.attrs)
        start_tag += ' ' + a + '="' + xml_attr_escape(tree.attrs[a]) + '"';
    }
    var empty_tag = start_tag + ' />';
    start_tag += '>';
    var end_tag = '</' + tree.tag + '>';
    if(tree.content.length == 0) {
      print(indent_string + empty_tag);
    } else if(tree.content.length == 1 && typeof(tree.content[0]) == 'string') {
      print(indent_string + start_tag + xml_escape(tree.content[0]) + end_tag);
    } else {
      print(indent_string + start_tag);
      for(var i=0; i < tree.content.length; ++i)
        serialize_mathml(tree.content[i], indent_level+1);
      print(indent_string + end_tag);
    }
  } else if(typeof(tree) == 'string') {
    print(indent_string + xml_escape(tree));
  }
}
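One detail of escaping in JavaScript is worth a small demonstration: String.prototype.replace with a string pattern replaces only the first occurrence, whereas a regular expression with the g flag replaces all of them. The sketch below is illustrative only; escape_text is a hypothetical name, and the ampersand is escaped first so that the entities produced by the later replacements are not escaped again.

```javascript
// Hypothetical helper showing escaping of every special character.
function escape_text(s) {
  return s.replace(/&/g, '&amp;')   // must run first
          .replace(/</g, '&lt;')
          .replace(/>/g, '&gt;')
          .replace(/[\u0080-\uffff]/g, function (x) {
            // Non-ASCII characters become decimal character references.
            return '&#' + x.charCodeAt(0) + ';';
          });
}

// A string pattern only touches the first match:
const first_only = 'a&b&c'.replace('&', '&amp;');
```

Here first_only still contains a raw ampersand after the first one, while escape_text('a&b&c') escapes both. The invisible-times character U+2062 used for implied multiplication comes out as the character reference for code point 8290.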
for(var j=0; j < arguments.length; ++j) {
  var input = arguments[j];
  var tokens = new Object();
  tokens.list = tokenize_latex_math(input);
  tokens.list.push(null);
  tokens.index = 0;
  // Dump the token list inside an XML comment.
  print('<!--');
  for(var i=0; i < tokens.list.length; ++i)
    print('token ' + i + ': ' + tokens.list[i]);
  print('-->');
  var mathml = v_subexpr_chain_to_mathml(tokens, {});
  print('<math xmlns="http://www.w3.org/1998/Math/MathML">');
  serialize_mathml(mathml, 1);
  print('</math>');
}
PlainXMLNode result_element result_element_append result_element_prepend result_set_attr result_append_attr
function PlainXMLNode(tag) {
  this.tag = tag;
  this.content = [];
  this.attrs = {};
}
function result_element(tag, num_attrs) {
  var node = new PlainXMLNode(tag);
  var k = 2;
  // The arguments after num_attrs hold num_attrs (name, value) pairs,
  // followed by the child nodes or strings.
  while(num_attrs-- > 0) {
    if(arguments[k+1] != null) {
      node.attrs[arguments[k]] = arguments[k+1];
    }
    k += 2;
  }
  for(; k < arguments.length; k++) {
    if(arguments[k] != null) {
      node.content.push(arguments[k]);
    }
  }
  return node;
}
function result_element_append(parent, child) {
  if(parent != null && child != null) {
    parent.content.push(child);
  }
}
function result_element_prepend(parent, child, next) {
  if(next == null)
    result_element_append(parent, child);
  else if(parent != null && child != null) {
    for(var i = 0; i < parent.content.length; i++) {
      if(parent.content[i] == next) {
        parent.content.splice(i, 0, child);
        return;
      }
    }
  }
}
function result_set_attr(elem, attr, value) {
  if(elem != null && attr != null) {
    if(value != null)
      elem.attrs[attr] = value;
    else
      delete elem.attrs[attr];
  }
}
function result_append_attr(elem, attr, value) {
  if(elem != null && attr != null) {
    if(elem.attrs[attr] == null)
      elem.attrs[attr] = value;
    else
      elem.attrs[attr] += value;
  }
}
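The calling convention of result_element (a tag, then an attribute count, then that many name/value pairs, then the children) is easiest to see in use. The sketch below repeats the two definitions it needs so that it runs on its own; to_string is a hypothetical compact serializer added only for illustration, and it performs no escaping.

```javascript
function PlainXMLNode(tag) {
  this.tag = tag;
  this.content = [];
  this.attrs = {};
}

// Arguments: tag, attribute count, name/value pairs, then children.
function result_element(tag, num_attrs) {
  var node = new PlainXMLNode(tag);
  var k = 2;
  while (num_attrs-- > 0) {
    // A null value skips the attribute but still consumes the pair.
    if (arguments[k + 1] != null) node.attrs[arguments[k]] = arguments[k + 1];
    k += 2;
  }
  for (; k < arguments.length; k++) {
    if (arguments[k] != null) node.content.push(arguments[k]);
  }
  return node;
}

// Hypothetical one-line serializer, for illustration only (no escaping).
function to_string(tree) {
  if (typeof tree === 'string') return tree;
  var s = '<' + tree.tag;
  for (var a in tree.attrs) s += ' ' + a + '="' + tree.attrs[a] + '"';
  return s + '>' + tree.content.map(to_string).join('') + '</' + tree.tag + '>';
}

// Build the tree for "x + 1".
var sum = result_element('mrow', 0,
  result_element('mi', 1, 'mathvariant', 'normal', 'x'),
  result_element('mo', 0, '+'),
  result_element('mn', 0, '1'));
```

Serializing sum with to_string gives an mrow containing an mi with mathvariant="normal", an mo, and an mn, in that order.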
# This script was automatically generated from a literate source.
# Do not edit this file; look at the literate source instead!
#
import LaTeX2MathMLModule
import sys
import re

class TokenInput:
    def __init__(self, input):
        self.tokens = []
        self.tokens_index = 0
        self.tokenize_latex_math(input)
        self.tokens.append(None)

    Tokenize LaTeX input

for input in sys.argv[1:]:
    t = TokenInput(input)
    sys.stdout.write("<math xmlns='%s'>\n" % LaTeX2MathMLModule.mmlns)
    sys.stdout.write(LaTeX2MathMLModule.v_subexpr_chain_to_mathml(t, {}).toxml("utf-8"))
    sys.stdout.write("\n</math>\n")
import xml.dom.minidom
The MathML XML namespace, needed for creating MathML elements
mmlns = 'http://www.w3.org/1998/Math/MathML'
document = xml.dom.minidom.getDOMImplementation().createDocument(None,None,None)
Create MathML result element
Append MathML result element to another element’s content
Prepend MathML result element before another child element
Change attribute of result element
Change attribute of result element
def result_element(tag, num_attrs, *args):
    node = document.createElementNS(mmlns, tag)
    for i in range(0, num_attrs):
        if args[2*i+1] is not None:
            node.setAttribute(args[2*i], args[2*i+1])
    for i in range(num_attrs*2, len(args)):
        if args[i] is not None:
            if isinstance(args[i], unicode):
                node.appendChild(document.createTextNode(args[i]))
            else:
                node.appendChild(args[i])
    return node

def result_element_append(parent, child):
    if (parent is not None) and (child is not None):
        if isinstance(child, unicode):
            parent.appendChild(document.createTextNode(child))
        else:
            parent.appendChild(child)

def result_element_prepend(parent, child, next):
    if next is None:
        result_element_append(parent, child)
    elif (parent is not None) and (child is not None):
        parent.insertBefore(child, next)

def result_set_attr(elem, attr, value):
    if (elem is not None) and (attr is not None):
        if value is not None:
            elem.setAttribute(attr, value)
        else:
            elem.removeAttribute(attr)
def result_append_attr(elem, attr, value):
    if (elem is not None) and (attr is not None):
        # getAttribute is a bound method; it takes only the attribute name.
        old_value = elem.getAttribute(attr)
        if old_value is None:
            elem.setAttribute(attr, value)
        else:
            elem.setAttribute(attr, old_value + value)