Why semantic markup?

Posted on 2017-Oct-04 at 17:11:44 by Phil
Last update on 2022-Apr-24 at 21:05:00 by Phil

Semantic markup is the practice of tagging text with its purpose (why that text is there), rather than simply “this is how it looks” (styling or presentation markup). This gives structure to the text source, which can be useful in at least two ways:

  1. Rather than repeating the “how it looks” (appearance) information with every use, it is consolidated into one place, for consistency and for easy changes to the appearance.
  2. The tags can be searched, in order to change or take inventory of particular uses.

If you wanted a certain look to, say, your chapter headings, would you rather do the following each time you started a new chapter?

  1. Give command to skip to the top of a right-hand (odd number) page.
  2. Skip down several lines.
  3. Give the chapter number right justified, in 30pt Cooper Black.
  4. Next line, give the heading My Chapter Title right justified, in 15pt Cooper Black.
  5. Skip down several lines.
  6. Start your first paragraph.

or,

  1. Give markup such as <chap_start>My Chapter Title</chap_start>.
  2. Start your first paragraph.

In the second case, some sort of “style file” (such as CSS) knows how you want your chapters started and styled. If you don’t like the look of it, you change things in one place — perhaps a different typeface, or a different size. And the markup language keeps track of the chapter numbers for you. You wouldn’t believe how many people think it’s easier to just do it the first way (rather than learn something new)!
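
As a rough illustration of the second approach, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the `CHAPTER_STYLE` dictionary stands in for a “style file”, and `render_chapters` stands in for whatever processor reads the markup. It only shows that the look lives in one place and that the processor, not the author, counts the chapters.

```python
import re

# Hypothetical "style file": the one place that says how a chapter start looks.
CHAPTER_STYLE = {
    "number_format": "Chapter {n}",
    "uppercase_title": True,
}

def render_chapters(source: str, style: dict) -> str:
    """Replace each <chap_start>...</chap_start> with a numbered heading."""
    counter = 0

    def heading(match: re.Match) -> str:
        nonlocal counter
        counter += 1  # the processor keeps track of chapter numbers
        number = style["number_format"].format(n=counter)
        title = match.group(1)
        if style["uppercase_title"]:
            title = title.upper()
        return f"{number}\n{title}"

    # Titles are assumed to sit on a single line.
    return re.sub(r"<chap_start>(.*?)</chap_start>", heading, source)
```

Changing `"Chapter {n}"` to, say, `"{n}."` restyles every chapter at once, with no edits to the text itself.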

If all you’re doing is a one- or two-page memo that will be printed out once or twice at most, and never updated or consolidated into some sort of collection, such manual operations are acceptable. For anything beyond that, however, you should consider a semantic markup setup. It can even be WYSIWYG editing, so long as the element buttons are semantic descriptions and not just styling. That is, buttons for “emphasis”, “citation”, etc. and not for “italic”, “bold”, “underline”, etc. Certainly for books, manuals, journal articles, and the like, markup with semantics is mandatory.

For example, most WYSIWYG editors allow you to designate text as italic or bold (or both). This is bad practice for anything beyond a brief letter or memo. Let’s say you use italics for emphasized text, for titles (citations), and for foreign words. You just wrote a nice technical report with a word processor, and your boss is so impressed that she asks you to submit it to a technical or scientific journal (as an article). The journal bounces back the manuscript with some style suggestions: “we use bold for emphasis, underlines for citations, and a different typeface for foreign words.” If you had written this in a markup language, it would be easy to change the definition (in the style file) of “emphasis” from italic to bold, of “citation” from italic to underlined, and of “foreign” from italic to a different typeface. Alas, since you didn’t, you are going to have to trudge through the manuscript word-by-word and manually change all the italics, after figuring out why you used italics in each case. Fun! If it’s a standard markup (such as HTML or LaTeX), you might be able to simply submit the markup and let the journal or publisher supply the style file. And lest you think this is an exaggeration, I’ve heard of publishers who want typewriter-style (fixed pitch, double-spaced) submissions so they can print them out, count words and line lengths, and have room for editor’s marks!
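
To make the journal scenario concrete, here is a small Python sketch. The tag names `emphasis`, `citation`, and `foreign`, the two style tables, and the `apply_style` helper are all hypothetical, invented for this illustration; `tt` is only a stand-in for “a different typeface”. The point is that meeting the journal’s demands means swapping one table, not rereading the manuscript.

```python
import re

# Hypothetical style tables: semantic tag -> presentation tag.
HOUSE_STYLE = {"emphasis": "i", "citation": "i", "foreign": "i"}
JOURNAL_STYLE = {"emphasis": "b", "citation": "u", "foreign": "tt"}

def apply_style(source: str, style: dict) -> str:
    """Rewrite semantic tags as presentation tags according to a style table."""
    names = "|".join(re.escape(name) for name in style)
    pattern = rf"<(/?)({names})>"

    def swap(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2)
        return f"<{slash}{style[name]}>"

    return re.sub(pattern, swap, source)

text = "I <emphasis>never</emphasis> finished <citation>Walden</citation>."
print(apply_style(text, HOUSE_STYLE))
print(apply_style(text, JOURNAL_STYLE))
```

Note that the house style is lossy: once all three purposes have become `<i>`, there is no way to get back from the italics to the reasons for them, which is exactly the word processor problem described above.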

Some WYSIWYG word processors (such as MS Word) can give you limited semantic markup (e.g., designate headings for various purposes, at different levels), but you generally cannot export the result to a flat text file (sometimes you can export to HTML). Even when you can, the output often comes loaded up with all sorts of extra crap (font selections, sizes, etc., that are repeated over and over) that you’d really rather not have to deal with. This is not to say that WYSIWYG word processing can’t deliver good, clean markup; it’s just that it’s usually something tacked on after the fact, and not really designed from the ground up to do that. There are almost always some styling controls or tags mixed in that you’ll need to fix, especially if generic “italic”, “bold”, etc. stylings are available.

Once you have your (semantic) markup cleanly separated from the text and from the styling, what can you do with it? Well, such text is better for screen readers, as knowing what the text is for can clue the reader in to how to modulate its voice. For instance, emphasized text might be read slower, louder, and at a lower pitch. A citation might have a slight pause before and after it. And a foreign word might be pronounced correctly (if the language used is included somewhere, such as <foreign lang="fr">après-ski</foreign> embedded within English text).

Another use for flat file text source with markup (tags) is to have some processor scan through it looking for certain tags, and extracting that text into a separate file. For instance, find all citation tags to start building a bibliography for your document. Another could be to extract foreign words and phrases to start building a glossary. In both cases, the list could be sorted and manually or automatically looked over to spot possible misspellings and typos, helping to clean up your source. This must all be done manually if all you did was italicize this material.
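
Such a scan might look something like the following Python sketch, which uses the standard library’s `html.parser`. The `citation` and `foreign` tag names, the `TagCollector` class, and the `collect` helper are assumptions for this example, not part of any existing tool.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Gather the text found inside every occurrence of one tag."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # greater than 0 while inside a wanted tag
        self.buffer = []    # text pieces of the occurrence being read
        self.found = []     # completed occurrences

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.wanted and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.found.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

def collect(source: str, tag: str) -> list:
    """Return the distinct contents of <tag>...</tag>, sorted for review."""
    parser = TagCollector(tag)
    parser.feed(source)
    parser.close()
    return sorted(set(parser.found), key=str.lower)

doc = ("<p>See <citation>Walden</citation> and <citation>Emma</citation>, "
       "then <citation>Walden</citation> again over "
       "<foreign lang=\"fr\">après-ski</foreign>.</p>")
print(collect(doc, "citation"))
print(collect(doc, "foreign"))
```

The sorted, de-duplicated list is what you would then look over for misspellings, or hand to a bibliography or glossary builder.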

For HTML web page markup, putting as much of the styling as possible into CSS leaves leaner (smaller), cleaner HTML text which search engines prefer to something cluttered and bloated with styling markup. For a journal, magazine, or book submission, it becomes much easier to meet their styling guidelines when style information is consolidated into one place.


Posted on 2022-Apr-07 at 09:47:00 by Phil

As an update to the above article, there are times when a bare <i> or <b> etc. might be necessary. For example, if you are giving a genus or species name such as Homo sapiens, it’s supposed to be rendered in italics. If you are doing a lot of taxonomy work, it could be worthwhile to define a <species> tag that resolves to italicizing the word(s), but unless you’ll be using it a lot, it may not be worth the effort.

So, you will probably end up needing bare italic, bold, underline, etc. commands for those times when you need them for a one-off case. See this article for (among other things) some further discussion on the topic. The whole point, however, is that you should be in the habit of using the appropriate semantic commands wherever possible (and available), rather than overloading a simple italic or bold command.


Posted on 2024-Jun-02 at 17:46:00 by Phil

There is another sort of markup which may fall somewhere in between semantic and presentation, but which does not yet appear to exist, at least for HTML and CSS. This would be directions on how a voice system should pronounce words and phrases, and not just for foreign words.

I have noticed computer-generated voice-over on many YouTube videos, and something that always tremendously annoys me is the computer voice either mispronouncing unusual words and names, or pronouncing numeric strings as numeric values when it shouldn’t (or vice versa).

An example would be “…the Messerschmitt Bf 109 earned a fearful reputation…” where the “109” is voiced as “one-hundred-nine” instead of the proper “one-oh-nine”. A markup such as <digits>109</digits> could be used to force a more natural and proper pronunciation. In printed form, it would appear as the simple string “109”.
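
Here is a minimal Python sketch of what a voice system might do with such a tag. The `<digits>` tag is the proposal above; the `speak_digits` and `voice_digits` helpers, and the `zero` parameter for choosing between “oh” and “zero”, are assumptions made up for this illustration.

```python
import re

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def speak_digits(digits: str, zero: str = "oh") -> str:
    """Read a digit string one digit at a time, e.g. '109' -> 'one-oh-nine'."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a plain digit string: {digits!r}")
    return "-".join(zero if d == "0" else DIGIT_NAMES[int(d)] for d in digits)

def voice_digits(source: str, zero: str = "oh") -> str:
    """Replace every <digits>...</digits> with its digit-by-digit reading."""
    return re.sub(r"<digits>([0-9]+)</digits>",
                  lambda m: speak_digits(m.group(1), zero),
                  source)

print(voice_digits("the Messerschmitt Bf <digits>109</digits> earned"))
# prints: the Messerschmitt Bf one-oh-nine earned
```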

For long strings of digits, it may be desirable to break (and hyphenate) them when they reach the end of a line, rather than forcing them to remain together as a single unit. This (along with whether to pronounce “0” as “oh” or “zero”) could be controlled by tag attributes and/or CSS.

There could be a corresponding <number> tag to reassure the voice unit that, despite the presence of thousands-separator commas, decimal points, etc. (and watch out for different language-dependent settings of these), a string of digits should be naturally read as a single number. For example, “…a total of <number>1,522</number> passengers and crew died…” would print as “1,522”, while it would be voiced as “one-thousand, five-hundred, and twenty-two” rather than the annoying “one [pause] five twenty-two” I hear so often. It is incredible that text-to-voice systems do not handle such a simple thing, but they don’t.
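
As a sketch of the reading that a <number> tag would ask for, the following Python handles whole numbers below one million. The `speak_number` function and its `group_sep` parameter (the thousands separator, which varies by language) are hypothetical, and the hyphen-and-comma style simply copies the example above. Decimal points and larger values are left out to keep it short.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def _below_hundred(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def _below_thousand(n: int) -> list:
    """Return the spoken pieces of 0..999 (an empty list for 0)."""
    hundreds, tail = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + "-hundred")
    if tail:
        parts.append(_below_hundred(tail))
    return parts

def speak_number(text: str, group_sep: str = ",") -> str:
    """Read a whole number below one million as a single value."""
    n = int(text.replace(group_sep, ""))  # separators are dropped, not checked
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0 to 999,999")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(" ".join(_below_thousand(thousands)) + "-thousand")
    parts.extend(_below_thousand(rest))
    if len(parts) > 1 and rest % 100:
        parts[-1] = "and " + parts[-1]  # "and" before the final tens/ones
    return ", ".join(parts)

print(speak_number("1,522"))
# prints: one-thousand, five-hundred, and twenty-two
```

A German-style “1.522” would be passed with `group_sep="."`, which is the kind of language-dependent setting a real tag would need to carry as an attribute.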

As for correcting the pronunciation of words (and especially, names), some sort of <pronounce phonemes="[phoneme list, using ASCII names]"> could work. Such a pronunciation guide should be independent of the language being used. The tag could also indicate primary and secondary stresses (volume of syllables), and might be combined with a guide to unusual or desired word-breaking points (e.g., “re-cord” or “rec-ord”). Perhaps <word sound="phonemes" split="syllable list">?

 

All content © copyright 2005 – 2024 by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this site remains under the copyright and license of its owners. Catskill Technology Services, LLC does not claim copyright over such software.

 

This page is https://www.catskilltech.com/utils/show.php?link=why-semantic-markup

Last updated Wed, 28 Aug 2024 at 10:52 PM
