Re: Fw: Project Gutenberg

by "Frank Boumphrey" <bckman(at)ix.netcom.com>
Date:	Mon, 7 Feb 2000 19:59:50 -0500
To:	"Arjun Ray" <aray(at)nyct.net>, "HWG Gutenberg DTDs" <hwg-gutenberg-dtds(at)hwg.org>
References:	nyct
> I think you mean encodings.  There's only one character set for XML
> documents.

Yes I do, I stand corrected!

> As a straw proposal for a policy, I'd suggest allowing people to
> *submit* stuff in various encodings, but *storing* them in one of a
> severely restricted number of such encodings.  (I'd go for UTF-8.)
> That involves a programmatic or controlled check-in process, of
> course.

I'm afraid that you lose me here. If I submit a document in one encoding,
say the Gujerati encoding, how do I then store it in another encoding? Also,
I'm not clear whether it is possible to use the tagset of a DTD written in
one encoding to mark up a document in a different encoding, or whether
different DTDs will have to be developed for each encoding.

There is of course no need for all documents to be in the same warehouse. I
would anticipate that it would be possible to find sites for storing the
different encodings if this is a problem.

Pardon my rather simplistic questions, but this is all rather new stuff for
me!

Can you suggest any sites or readings where I and other interested parties
can get up to speed on this subject?

TIA

Frank


----- Original Message -----
From: Arjun Ray <aray(at)nyct.net>
To: HWG Gutenberg DTDs <hwg-gutenberg-dtds(at)hwg.org>
Sent: Monday, February 07, 2000 6:47 PM
Subject: Re: Fw: Project Gutenberg


>
>
> On Mon, 7 Feb 2000, Frank Boumphrey wrote:
>
> > I believe that project gutenberg should be open to all character
> > sets.
>
> I think you mean encodings.  There's only one character set for XML
> documents.
>
> I sympathize with your intent, but I'll argue that agreeing to store
> objects in a (vast) variety of encodings is asking for a maintenance
> nightmare in the longer haul.
>
> > do we have to say anything at all about it in the DTD's?
>
> No.  Actually, this is an old old debate, between the people who say
> if not believe that documents/files/entities/objects/whatnot should be
> self-describing, and the people who argue that out-of-band mechanisms
> are needed/better/inevitable.  (I *think* I know which side I'm on:))
> Text files are cursed by the reality of a multitude of encodings, and
> the basic problem here is that by the time you (the program) are
> reading the document, it could be too late.  This applies to the DTD,
> too.
>
> > Can we not just declare the character set in the XML declaration?
>
> The character *encoding*:)  This is the same as the meta hack, and
> broken for the same reasons.  The XML declaration gets away with a
> compromise - two and only two, such that it isn't too late at the
> point the program reads the declaration.  Allowing arbitrary encodings
> that late won't fly, IMHO.
>
> > The default is UTF-8 anyway, so am I correct in thinking that
> > there is no need to say anything unless the DTD was going to be
> > used to mark up Hebrew or Gujerati?
>
> There may be no need to say anything at all.  Right now, the solution
> will have to be server-side.  It doesn't matter what form the *stored*
> DTD takes as long as it goes over the wire in UTF-8/16.  This could
> apply to files, too - but I'm not sure I like the implications.
> Managing a gazillion transcodings is not a happy server side prospect.
>
> > I do think, however, that we should get a policy on this, and I
> > think Murray and Arjun are the two guys to do this:>).
>
> Oh no, you don't!:)
>
> As a straw proposal for a policy, I'd suggest allowing people to
> *submit* stuff in various encodings, but *storing* them in one of a
> severely restricted number of such encodings.  (I'd go for UTF-8.)
> That involves a programmatic or controlled check-in process, of
> course.
>
>
> Arjun
>
>
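
The check-in step in the straw proposal above (submit in any encoding,
store in UTF-8) can be sketched roughly as follows. This is only an
illustration in Python, not anything the list has agreed on: the function
name to_utf8 and the idea that the submitter declares the encoding by name
are assumptions. The point it shows is that transcoding changes only the
bytes; the characters, including the characters of the tag names, stay the
same.

```python
import codecs


def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Hypothetical check-in helper: transcode submitted bytes to UTF-8.

    `declared_encoding` is assumed to be supplied by the submitter
    (out of band), e.g. "iso-8859-8" for Hebrew.
    """
    # Reject encoding names the codec registry does not know (LookupError).
    codecs.lookup(declared_encoding)
    # Bytes -> characters; raises UnicodeDecodeError if the bytes are not
    # valid in the declared encoding, so bad submissions fail at check-in.
    text = raw.decode(declared_encoding)
    # Characters -> bytes in the single storage encoding.
    return text.encode("utf-8")


if __name__ == "__main__":
    # A Hebrew word inside a tag, as it might arrive in ISO-8859-8.
    document = "<p>\u05e9\u05dc\u05d5\u05dd</p>"
    submitted = document.encode("iso-8859-8")
    stored = to_utf8(submitted, "iso-8859-8")
    # The stored bytes differ from the submitted ones, but decode to the
    # same characters, markup included.
    print(submitted != stored, stored.decode("utf-8") == document)
```

A real check-in process would also have to decide what to do when the
declared encoding is missing or wrong; the sketch simply lets the error
propagate.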
HWG: hwg-gutenberg-dtds mailing list archives, maintained by Webmasters @ IWA