Posted on 2017-Oct-04 at 17:11:44 by Phil
Last update on 2022-Apr-24 at 21:05:00 by Phil
Semantic markup is the practice of tagging text with why that text is there, rather than simply “this is how it looks” (styling or presentation markup). This gives structure to text source, which can be useful in at least two ways:
If you wanted a certain look to, say, your chapter headings, would you rather do the following each time you started a new chapter?
My Chapter Title, right justified, in 15pt Cooper Black
or,
<chap_start>My Chapter Title</chap_start>
In the second case, some sort of “style file” (such as CSS) knows how you want your chapters started and styled. If you don’t like the look of it, you change things in one place, perhaps to a different typeface or a different size. And the markup language keeps track of the chapter numbers for you. You wouldn’t believe how many people think it’s easier to just do it the first way (rather than learn something new)!
If all you’re doing is a one- or two-page memo that will at most be printed out once or twice, and never updated or consolidated into some sort of collection, such manual operations are acceptable. However, for anything beyond that, you should consider a semantic markup setup. It can even be WYSIWYG editing, so long as the element buttons are semantic descriptions and not just styling. That is, buttons for “emphasis”, “citation”, etc. and not for “italic”, “bold”, “underline”, etc. Certainly for books, manuals, journal articles, and the like, markup with semantics is mandatory.
For example, most WYSIWYG editors allow you to designate text as italic or bold (or both). This is bad practice for anything beyond a brief letter or memo. Let’s say you use italics for emphasized text, for titles (citations), and for foreign words. You just wrote a nice technical report with a word processor, and your boss is so impressed that she asks you to submit it to a technical or scientific journal (as an article). The journal bounces back the manuscript with some style suggestions: “we use bold for emphasis, underlines for citations, and a different typeface for foreign words.” If you had written this in a markup language, it would be easy to change the definition (in the style file) of “emphasis” from italic to bold, of “citation” from italic to underlined, and of “foreign” from italic to a different typeface. Alas, with the word processor you are going to have to trudge through the manuscript word-by-word and manually change all italics, after figuring out why you used italics in each place. Fun! If it’s a standard markup (such as HTML or LaTeX), you might be able to simply submit the markup and let the journal or publisher supply the style file. And lest you think this is an exaggeration, I’ve heard of publishers who want typewriter (fixed pitch) style submissions, double-spaced with room for editor’s marks, so they can print them out and count words and line lengths!
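To make the journal example concrete, here is a minimal Python sketch of the “change it in one place” idea. Everything in it is an assumption for illustration: the tag names `emphasis`, `citation`, and `foreign`, the two style tables, and the `render` function are hypothetical, and a real system would use a proper style file (such as CSS) rather than a regular expression.

```python
import re

# Hypothetical style tables: each maps a semantic tag name to a
# presentational HTML tag. Only the table changes between publishers.
MY_STYLE = {"emphasis": "i", "citation": "i", "foreign": "i"}
# "tt" stands in here for "a different typeface".
JOURNAL_STYLE = {"emphasis": "b", "citation": "u", "foreign": "tt"}


def render(source: str, style: dict[str, str]) -> str:
    """Swap each semantic open/close tag for the tag the style assigns."""

    def swap(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2)
        # Tags that the style table does not know about are left untouched.
        if name not in style:
            return match.group(0)
        return f"<{slash}{style[name]}>"

    return re.sub(r"<(/?)(\w+)>", swap, source)


text = "See <citation>Moby Dick</citation> for a <emphasis>long</emphasis> voyage."
print(render(text, MY_STYLE))
print(render(text, JOURNAL_STYLE))
```

The manuscript text is identical in both calls; only the style table differs. That is exactly what cannot be done once three different meanings have been flattened into plain italics.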
Some WYSIWYG word processors (such as MS Word) can give you limited semantic markup (e.g., designate headings for various purposes, at different levels), but you generally cannot export them to flat text files (though sometimes to HTML). Even when you can, they often come loaded up with all sorts of extra crap (font selections, sizes, etc., that are repeated over and over) that you’d really rather not have to deal with. This is not to say that WYSIWYG word processing can’t deliver good, clean markup; it’s just that it’s usually something tacked on after the fact, and it’s not really designed from the ground up to do that. There are almost always some styling controls or tags mixed in that you’ll need to fix, especially if generic “italic” and “bold” etc. stylings are available.
Once you have your (semantic) markup cleanly separated from the text and from the styling, what can you do with it? Well, such text is better for screen readers, as knowing what the text is for can clue the reader in to how to modulate its voice. For instance, emphasized text might be read slower, louder, and at a lower pitch. A citation might have a slight pause before and after it. And a foreign word might be pronounced correctly (if the language used is included somewhere, such as <foreign lang="fr">après-ski</foreign> embedded within English text).
Another use for flat file text source with markup (tags) is to have some processor scan through it looking for certain tags, and extracting that text into a separate file. For instance, find all citation tags to start building a bibliography for your document. Another could be to extract foreign words and phrases to start building a glossary. In both cases, the list could be sorted and manually or automatically looked over to spot possible misspellings and typos, helping to clean up your source. This must all be done manually if all you did was italicize this material.
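As a rough sketch of such a processor, the Python below pulls the contents of one tag out of a source string, removes duplicates, and sorts the result for review. The function name `extract` and the sample text are made up for illustration, and a regular expression is only adequate for simple, non-nested tags; a real tool would use a proper parser.

```python
import re


def extract(source: str, tag: str) -> list[str]:
    """Return the sorted, de-duplicated contents of every <tag>...</tag>."""
    # Allow optional attributes (e.g. lang="fr") on the opening tag.
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}\b[^>]*>(.*?)</{name}>", re.DOTALL)
    # Collapse line breaks and runs of spaces inside each match.
    found = {" ".join(item.split()) for item in pattern.findall(source)}
    return sorted(found, key=str.casefold)


source = (
    "We read <citation>Walden</citation> during the "
    '<foreign lang="fr">après-ski</foreign>, then '
    "<citation>Moby Dick</citation> and <citation>Walden</citation> again."
)
print(extract(source, "citation"))  # starting point for a bibliography
print(extract(source, "foreign"))   # starting point for a glossary
```

Because the list is sorted, near-duplicates such as a misspelled title end up next to each other, which makes them easy to spot.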
For HTML web page markup, putting as much of the styling as possible into CSS leaves leaner (smaller), cleaner HTML text which search engines prefer to something cluttered and bloated with styling markup. For a journal, magazine, or book submission, it becomes much easier to meet their styling guidelines when style information is consolidated into one place.
Posted on 2022-Apr-07 at 09:47:00 by Phil
As an update to the above article, there are times when a bare <i> or <b> etc. might be necessary. For example, if you are giving a genus or species name such as Homo sapiens, it’s supposed to be rendered in italics. If you are doing a lot of taxonomy work, it could be worthwhile to define a <species> tag that resolves to italicizing the word(s), but unless you’ll be using it a lot, it may not be worth the effort.
So, you will probably end up needing to have bare italic, bold, underline, etc. commands for those times that you need them for a one-off case. See this article for (among other things) some further discussion on the topic. The whole point, however, is that you should be in the habit of using the appropriate semantic commands wherever possible (and available), rather than overloading a simple italic or bold command.
Posted on 2024-Jun-02 at 17:46:00 by Phil
There is another sort of markup which may fall somewhere in-between semantic and presentation, but does not appear to yet exist, at least for HTML and CSS. This would be directions on how a voice system should pronounce words and phrases, and not just for foreign words.
I have noticed computer-generated voice-over on many YouTube videos, and something that always tremendously annoys me is the computer voice either mispronouncing unusual words and names, or pronouncing numeric strings as numeric values when it shouldn’t (or vice-versa).
An example would be “…the Messerschmitt Bf 109 earned a fearful reputation…” where the “109” is voiced as “one-hundred-nine” instead of the proper “one-oh-nine”. A markup such as <digits>109</digits> could be used to force a more natural and proper pronunciation. In printed form, it would appear as the simple string “109”.
For long strings of digits, it may be desirable to break (and hyphenate) them when they reach the end of a line, rather than forcing them to remain together as a single unit. This (along with whether to pronounce “0” as “oh” or “zero”) could be controlled by tag attributes and/or CSS.
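As a sketch of what a voice system might do with the contents of such a digits tag, the Python below reads a string one digit at a time. The function name `speak_digits` and its `zero` parameter are hypothetical; the parameter stands in for the tag attribute or CSS setting that would choose between “oh” and “zero”.

```python
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def speak_digits(digits: str, zero: str = "oh") -> str:
    """Spell out a digit string one digit at a time, e.g. for model numbers."""
    names = []
    for ch in digits:
        if ch not in "0123456789":
            raise ValueError(f"not a digit: {ch!r}")
        # The caller decides how "0" is voiced ("oh" or "zero").
        names.append(zero if ch == "0" else DIGIT_NAMES[int(ch)])
    return "-".join(names)


print(speak_digits("109"))               # one-oh-nine
print(speak_digits("109", zero="zero"))  # one-zero-nine
```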
There could be a corresponding <number> tag to reassure the voice unit that, despite the presence of thousands-separator commas, decimal points, etc. (and watch out for different language-dependent settings of these), a string of digits should be naturally read as a single number. For example, “…a total of <number>1,522</number> passengers and crew died…” would print as “1,522”, while it would be voiced as “one-thousand, five-hundred, and twenty-two” rather than the annoying “one [pause] five twenty-two” I hear so often. It is incredible that text-to-voice systems cannot handle such a simple thing, but they can’t.
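Reading such a string as a single number is not hard, as the minimal Python sketch below suggests. The function names and the `thousands_sep` parameter are hypothetical (the parameter stands in for the language-dependent separator setting), the sketch handles only whole numbers below one million, and the exact wording and punctuation of the spoken form are my own choice.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def below_1000(n: int) -> str:
    """Spell out an integer in the range 1..999."""
    words = []
    if n >= 100:
        words.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        tens = TENS[n // 10]
        words.append(tens if n % 10 == 0 else f"{tens}-{ONES[n % 10]}")
    elif n > 0:
        words.append(ONES[n])
    return " ".join(words)


def speak_number(text: str, thousands_sep: str = ",") -> str:
    """Read a digit string with thousands separators as one whole number."""
    n = int(text.replace(thousands_sep, ""))
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0 to 999,999")
    if n == 0:
        return "zero"
    high, low = divmod(n, 1000)
    parts = []
    if high:
        parts.append(below_1000(high) + " thousand")
    if low:
        parts.append(below_1000(low))
    return " ".join(parts)


print(speak_number("1,522"))                     # English-style separator
print(speak_number("1.522", thousands_sep="."))  # e.g. German-style separator
```

Both calls produce the same spoken form, which is the point of letting the markup (or the document language) say which character is the separator.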
As for correcting the pronunciation of words (and especially, names), some sort of <pronounce phonemes="[phoneme list, using ASCII names]"> could work. Such a pronunciation guide should be independent of the language being used. The tag could also indicate primary and secondary stresses (volume of syllables), and might be combined with a guide to unusual or desired word-breaking points (e.g., “re-cord” or “rec-ord”). Perhaps <word sound="phonemes" split="syllable list">?
All content © copyright 2005 – 2024
by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this
site remains under the copyright and license of its owners.
Catskill Technology Services, LLC does not claim copyright over such software.
This page is https://www.catskilltech.com/utils/show.php?link=why-semantic-markup
Last updated Wed, 28 Aug 2024 at 10:52 PM