Re: Fw: Project Gutenberg
by "Frank Boumphrey" <bckman(at)ix.netcom.com>
Date: Mon, 7 Feb 2000 19:59:50 -0500
To: "Arjun Ray" <aray(at)nyct.net>, "HWG Gutenberg DTDs" <hwg-gutenberg-dtds(at)hwg.org>
References: nyct
> I think you mean encodings. There's only one character set for XML
> documents.
Yes I do, I stand corrected!
> As a straw proposal for a policy, I'd suggest allowing people to
> *submit* stuff in various encodings, but *storing* them in one of a
> severely restricted number of such encodings. (I'd go for UTF-8.)
> That involves a programmatic or controlled check-in process, of
> course.
I'm afraid that you lose me here. If I submit a document in one encoding,
say the Gujerati encoding, how do I then store it in another encoding? Also,
I'm not clear whether it is possible to use the tagset of a DTD in one
encoding to mark up a different encoding, or whether different DTDs will have
to be developed for each encoding.
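A minimal sketch of what such a transcoding step could look like, assuming
Python and using Hebrew in ISO-8859-8 as a stand-in for the submitted
encoding. The function name transcode_to_utf8 and the sample <title> element
are hypothetical and not part of any Project Gutenberg tooling. What it is
meant to illustrate: the characters and the tags stay the same and only the
bytes change, which is why a single DTD can in principle serve documents that
arrive in different encodings.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the submitted encoding, then re-encode as UTF-8."""
    text = data.decode(source_encoding)  # bytes -> Unicode characters
    return text.encode("utf-8")          # Unicode characters -> UTF-8 bytes


# Hypothetical submission: a four-letter Hebrew word inside a <title> element,
# written with escapes so this source file itself stays plain ASCII.
document = "<title>\u05e9\u05dc\u05d5\u05dd</title>"
submitted = document.encode("iso8859_8")  # what the submitter sends
stored = transcode_to_utf8(submitted, "iso8859_8")  # what the archive keeps

# Same characters and same tags either way; only the bytes differ.
assert stored.decode("utf-8") == submitted.decode("iso8859_8")
assert stored != submitted
```

The check-in process would still need to be told, or to work out reliably,
which encoding a submission is in; the sketch simply assumes that is known.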
There is of course no need for all documents to be in the same warehouse. I
would anticipate that it would be possible to find sites for storing the
different encodings if this is a problem.
Pardon my rather simplistic questions, but this is all rather new stuff for
me!
Can you suggest any sites or readings where I and other interested parties
can get up to speed on this subject?
TIA
Frank
----- Original Message -----
From: Arjun Ray <aray(at)nyct.net>
To: HWG Gutenberg DTDs <hwg-gutenberg-dtds(at)hwg.org>
Sent: Monday, February 07, 2000 6:47 PM
Subject: Re: Fw: Project Gutenberg
>
>
> On Mon, 7 Feb 2000, Frank Boumphrey wrote:
>
> > I believe that project gutenberg should be open to all character
> > sets.
>
> I think you mean encodings. There's only one character set for XML
> documents.
>
> I sympathize with your intent, but I'll argue that agreeing to store
> objects in a (vast) variety of encodings is asking for a maintenance
> nightmare in the longer haul.
>
> > do we have to say anything at all about it in the DTD's?
>
> No. Actually, this is an old old debate, between the people who say
> if not believe that documents/files/entities/objects/whatnot should be
> self-describing, and the people who argue that out-of-band mechanisms
> are needed/better/inevitable. (I *think* I know which side I'm on:))
> Text files are cursed by the reality of a multitude of encodings, and
> the basic problem here is that by the time you (the program) are
> reading the document, it could be too late. This applies to the DTD,
> too.
>
> > Can we not just declare the character set in the XML declaration?
>
> The character *encoding*:) This is the same as the meta hack, and
> broken for the same reasons. The XML declaration gets away with a
> compromise - two and only two, such that it isn't too late at the
> point the program reads the declaration. Allowing arbitrary encodings
> that late won't fly, IMHO.
>
> > The default is UTF-8 any way, so am I correct in thinking that
> > there is no need to say any thing unless the DTD was going to be
> > used to mark up Hebrew or Gujerati.
>
> There may be no need to say anything at all. Right now, the solution
> will have to be server-side. It doesn't matter what form the *stored*
> DTD takes as long as it goes over the wire in UTF-8/16. This could
> apply to files, too - but I'm not sure I like the implications.
> Managing a gazillion transcodings is not a happy server side prospect.
>
> > I do think however that we should get a policy on this, and I
> > think Murray and Arjun are the two guys to do this:>).
>
> Oh no, you don't!:)
>
> As a straw proposal for a policy, I'd suggest allowing people to
> *submit* stuff in various encodings, but *storing* them in one of a
> severely restricted number of such encodings. (I'd go for UTF-8.)
> That involves a programmatic or controlled check-in process, of
> course.
>
>
> Arjun
>
>
HWG: hwg-gutenberg-dtds mailing list archives,
maintained by Webmasters @ IWA