Trying for Unicode, take 1 (with a bunch of Hebrew on the web tips while I'm on the subject)


This item is about Unicode. If you don't think that Unicode matters, or if you have stayed away because it sounds too technical, I heartily recommend Joel Spolsky's "Unicode and Character Sets" page. Its complete title is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" but don't let that stop you if you aren't a programmer. Much of this applies, in spades, to the rest of us.

I haven't had time to breathe for months. There is a lot of neat stuff that should be noted here and isn't here yet. But I thought I'd mention an especially neat item that killed this afternoon.

Max and Minka have an amazing Yiddish decoder ring on their website (go to www.maxminka.com and click on "yiddish"). This is great for people who have the simplest possible computers and just want to get some decent Yiddish onboard. Unfortunately, to avoid encoding issues, Max made up a backwards, non-standard encoding. Great for one-time use; awkward for turning into a manuscript using commercial fonts.

In the old days, I would have written a quick Python script to move Max's non-standard Yiddish into my own non-standard Yiddish. Now I am working with Unicode and InDesign ME. Must get Unicode. And I regret to say that, after spending hours, I have not been able to write a recognizable Unicode file: one that could be opened in Word with the characters, not the names of the characters, displayed. So I finally moved to the workaround.

I started each file with an HTML header and noted UTF-8 encoding. Then I wrote all of the characters to the file as HTML entities (in brief, take the Unicode hex, convert to decimal, and put into entity format). So, character 05D0, {HEBREW LETTER ALEF}, becomes &#1488;. By going the HTML route, I also had to make line breaks explicit, because HTML ignores the usual ASCII carriage return-line feed. That meant that I wrote a "<br>" after each line of (in this case) poetry.

Ugly, but it works. Here's the simple truth: Writing Unicode files may be complicated, but writing Unicode entities to an HTML file is entirely trivial.
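To show how little is involved, here is a minimal sketch of the entity workaround described above. It assumes Python 3, and the function name `to_entity_html` is made up for illustration; it is not the script I actually ran. It relies on the standard `xmlcharrefreplace` error handler, which turns every non-ASCII character into a decimal entity.

```python
import html


def to_entity_html(text):
    """Return an ASCII-only HTML page for `text`.

    Every non-ASCII character becomes a decimal entity (for example
    U+05D0 becomes &#1488;) and every line is ended with an explicit <br>.
    """
    lines = []
    for line in text.splitlines():
        # Escape &, < and > first so the text cannot be mistaken for markup.
        safe = html.escape(line, quote=False)
        # Replace anything outside ASCII with a numeric character reference.
        ascii_line = safe.encode("ascii", "xmlcharrefreplace").decode("ascii")
        lines.append(ascii_line + "<br>")
    header = ('<html><head><meta http-equiv="content-type" '
              'content="text/html;charset=utf-8"></head><body>')
    return header + "\n" + "\n".join(lines) + "\n</body></html>\n"
```

Because the result contains nothing but ASCII, it can be written to disk with any ordinary text-mode `open()` call and then opened in a browser or word processor.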

There are some lessons worth taking from this. First, if you are composing something using the Hebrew alphabet (Hebrew, Aramaic, Ladino, Yiddish...) it is very important, if only for your own sanity, to make sure that you are using a modern (released in this century) word processor that understands Unicode, and that you have fonts to match. If you are using something like MS Word, there are loads of appropriate fonts that come with it. I believe that both the Mac and Windows environments provide reasonable fonts when the Hebrew resources are loaded (these come with the operating system, but are not loaded by default).

Second, of course, there are nifty tools like Max and Minka's Yiddish transmogrifier page or Raphael Finkel's Yidishe Shraybmashinke (Yiddish Typewriter). Both are fun to use and great solutions for many one-time or rare uses.

And, I guess another lesson is that I still need to learn how to write a file that will be opened by a Unicode-literate application and understood as Unicode. But it is very much worth remembering that there is a shortcut: write to HTML, using nothing but basic ASCII, and let your word processor put it all together.
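For the direct route, one approach that may work is to write the file as UTF-8 with a byte order mark at the front, since many Windows applications treat that mark as a hint that the file is Unicode. The sketch below assumes Python 3 (not what was available when I was fighting with this), the helper name `write_unicode_text` is hypothetical, and I have not confirmed how any particular version of Word treats the result.

```python
def write_unicode_text(path, text):
    """Write `text` to `path` as UTF-8 with a leading byte order mark."""
    # The "utf-8-sig" codec prepends the bytes EF BB BF before the text.
    # newline="\r\n" writes each "\n" as carriage return-line feed.
    with open(path, "w", encoding="utf-8-sig", newline="\r\n") as f:
        f.write(text)
```

A call such as `write_unicode_text("alef.txt", "\u05d0\n")` should produce a five-byte body (two bytes for the alef, two for the line ending) preceded by the three-byte mark.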

Finally, as I think about putting more Hebrew-alphabet material on the web, it occurs to me that there are some tools for making language explicit that need to be considered. In the recent past, one might indicate Hebrew with:

<meta http-equiv="content-type" content="text/html;charset=iso-8859-8">

When I started thinking about coding webpages so that appropriate spiders would crawl the pages and indicate the correct language, I first looked at the ISO codepages. The presumption would be that if you are using the Hebrew character set, then the language of the page must be Hebrew. Of course, as we all know, Hebrew is one of several languages written with the Hebrew alphabet. This becomes somewhat moot when one starts encoding all pages with the charset "utf-8", because then you are telling a browser that any Unicode character, from Armenian to Korean, might be present. So, one must take care to also include a meta language tag. As it happens, such tags exist for several relevant languages. The form is:

<meta http-equiv="content-language" content="en-us">

And here are some relevant codes:

Language       Code
Aramaic        arc
Hebrew         heb  or  he
Judeo-Arabic   jrb
Judeo-Persian  jpr
Ladino         lad
Yiddish        yid  or  yi

(For a complete set, see the W3C's ISO-639 page)
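If the pages are generated by a script, the codes above can live in a small lookup table. This is only a sketch: the names `LANGUAGE_CODES` and `content_language_meta` are invented for illustration, and picking the two-letter code where one exists is my assumption, not something the table requires.

```python
# Language names and codes taken from the table above; where both a
# two-letter and a three-letter code are listed, the two-letter one is used.
LANGUAGE_CODES = {
    "Aramaic": "arc",
    "Hebrew": "he",
    "Judeo-Arabic": "jrb",
    "Judeo-Persian": "jpr",
    "Ladino": "lad",
    "Yiddish": "yi",
}


def content_language_meta(language_name):
    """Return the content-language meta tag for a language in the table."""
    code = LANGUAGE_CODES[language_name]  # raises KeyError if not listed
    return '<meta http-equiv="content-language" content="%s">' % code
```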

For some browsers, I found it necessary to also indicate text direction if I want words on the web in Hebrew-alphabet-languages to display in the proper visual order. Although this can be expressed as a meta tag, I found it worked much better as shown in this example, from the W3C's Language information and text direction page.

<q lang="he" dir="rtl">...a Hebrew quotation...</q>
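The same markup can be produced from a script. Here is a minimal sketch, assuming Python 3; the helper name `wrap_rtl` is hypothetical, and it simply hard-codes dir="rtl" on the assumption that the text passed in is written in the Hebrew alphabet.

```python
import html


def wrap_rtl(text, lang, tag="p"):
    """Wrap `text` in an element that carries lang and dir="rtl" attributes."""
    return '<{tag} lang="{lang}" dir="rtl">{body}</{tag}>'.format(
        tag=tag,
        lang=html.escape(lang, quote=True),  # keep the attribute value safe
        body=html.escape(text, quote=False),  # escape &, < and > in the text
    )
```

For example, `wrap_rtl("\u05e9\u05dc\u05d5\u05dd", "he", tag="q")` yields a `q` element in the same form as the W3C example above.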

I'll have to do some experimenting—I would guess that most of this should be more properly noted in a style sheet so that when you have a paragraph of Hebrew, you would define something like:

<p class="heb" lang="yi"> ... </p>

Where you have a style sheet for "heb" that includes:

p.heb[lang="yi"] {direction: rtl}

where, if I understand CSS correctly, lang is an attribute of the element (matched here by an attribute selector) and direction is a property.

7 Comments

Ari, are you using Mac or Windows or ?

If you are using a Mac, that might explain some of the problems you have been having with Word (what version?). Are you familiar with Mellel? http://www.redlers.com/mellel.html

John, I use both Mac and Windows. In this case, I've primarily been using Windows. I'm not sure what you mean by problems with Word? In this context, Word hasn't been a problem at all - my coding has been a problem (getting Python to write Unicode-understood files). But, as noted, I got around my inadequacy as a programmer by simply writing as HTML. Word is comfortable reading the HTML, putting the contents into its internal format, and from there I can save, or cut and paste into InDesign.

Ah, now I understand. I just picked up on your comment that you had been unable to generate a Unicode file that could be 'opened in Word', and thought the problem was on that end. I hadn't read carefully enough.

If you are using Windows XP or MacOS X and want to type Unicode Hebrew directly, but from time to time switch to Latin characters, then look at http://www.tyndale.cam.ac.uk/Tyndale/TTech/TTech032.htm, which offers a keyboard solution that works well for me using Dreamweaver. I am sorry I am not a coder, so cannot comment on Python scripts...

Cool! I don't have time to test these - hope others will post about their experiences - but it looks like there are some neat fonts/utilities at that site.

Thanks!

Shalom Ari, I came across your page looking to see if anyone has created a patch for InDesign to be able to have reversible text direction functionality without having to purchase the ME edition (yet another expense). I did, however, notice that you were running into the same road blocks I was having a year or so ago. I'm on a Mac, and writing Unicode has never been easier (Baruch Hashem!). The first thing you need to do is trash all of those plug-in things that you've installed trying to get it to work (like the recommended plug-ins for Mellel). With the last two operating systems they only stifle the built-in features of the OS. The next thing you need to do is set up your Hebrew (or take your pick) keyboard input from your "International" preference pane in the System Prefs. I have "Hebrew-QWERTY" selected, since I've never used a standard Hebrew keyboard. Now, from any application, I can use a keystroke to "invoke" the Unicode Hebrew character set. This works better in some applications than others. Native Apple applications (like TextEdit, Pages, etc.) work seamlessly, including right-to-left text orientation. Others, however, are not as good, such as InDesign and Dreamweaver. InDesign works perfectly, EXCEPT for the text direction. In Dreamweaver MX it is extremely difficult to edit characters once deselected, due to the directional encoding, but MX 2004 is really nice. Maybe this will help a little. Maybe not.

No, InDesign does NOT work perfectly with Hebrew except for text direction. There is far more involved in making languages like Hebrew and Arabic, or Korean, Japanese, Chinese, etc. work in a typographically useful form than simple text direction. And there is a patch or plug in - it is called the Middle Eastern version of the program. (Okay, actually, a modified version of the original program with appropriate plug-in.)

If typography doesn't matter, then a simpler application such as Mellel is fine. But once you need the sorts of refinement that InDesign offers, it is a non-trivial problem and that's why there are special versions of the software for some specific languages.

About this Entry

This page contains a single entry by Ari Davidow published on August 13, 2005 11:01 PM.
