From David Fitch
Dear Paul,
Some comments on your nomenclature plan. To reduce confusion as well
as to simplify databasing, all nomenclatural rules could be reduced to:
1. KEEP SPECIES PREFIXES, BUT MAKE THEM 3 LETTERS LONG.
2. KEEP THE 3-LETTER + INTEGER DESIGNATION FOR ALL GENES, REGARDLESS
OF SPECIES.
The reasons follow:
1. Orthologs will be given the same
name but with a species prefix. For example,
cb-tra-1 is the C. briggsae ortholog
of C. elegans tra-1. In some cases, there will be
paralogs and some confusion; we expect this to be minor compared to
the convenience of orthologs having the same names. For
paralogs, a "dot and number" can be appended to distinguish paralogs,
e.g., hsp-16.1, hsp-16.2.
--Known single orthologous gene pairs should indeed share the same
name. However, there may well be MULTIPLE ORTHOLOGS as well as
multiple paralogs (see attached GIF file "Orthoparalogy.gif"). For
example, Cja-gen-1 might be orthologous not only to Cbr-gen-1, but
also to Cbr-gen-2, but Cbr-gen-1 could be paralogous to Cel-gen-2.
About using decimal points, isn't this format ALREADY USED to
designate alternative splice products? If so, using the same
nomenclature for different genes would be confusing.
I suggest simply using different gene numbers for different genes,
regardless of orthology/paralogy (which can be difficult to
determine). Given the prefix, we will know what species the gene
comes from. As long as the gene identifiers are unique FOR EACH
SPECIES, there should be no problem, if it is also stated upfront that
there is NO IMPLICATION OF ORTHOLOGY/PARALOGY between genes with
similar names in different species. This is simply because otherwise
we would have to establish orthology/paralogy a priori, a task which
will just set up a huge opportunity for massive confusion.
Also, a 2-letter species id code is not sufficient (nor is it
standard). "Cb" could as well refer to C. bovis as to C. briggsae.
The standard used (e.g. in many phylogenetic papers and restriction
enzymes) is a 3-letter code. The first letter (capitalized) is the
first letter of the genus name; the second and third letters (lower
case) are the first and second letters of the species epithet. For
example, "Eco" stands for E. coli (as in EcoRI); "Cbr" would stand for
C. briggsae.
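The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names species_prefix and two_letter_prefix are hypothetical, and the input is assumed to be a plain "Genus species" string. It shows how the 3-letter code separates the two species named above where a 2-letter code does not.

```python
def species_prefix(binomial: str) -> str:
    """Hypothetical helper: capital genus initial + first two letters of the species epithet."""
    genus, species = binomial.split()[:2]
    return genus[0].upper() + species[:2].lower()


def two_letter_prefix(binomial: str) -> str:
    """Hypothetical helper for comparison: genus initial + first letter of the species epithet."""
    genus, species = binomial.split()[:2]
    return genus[0].upper() + species[0].lower()


# The two examples given in the text.
eco = species_prefix("Escherichia coli")
cbr = species_prefix("Caenorhabditis briggsae")

# The 2-letter code cannot tell C. briggsae from C. bovis; the 3-letter code can.
clash = two_letter_prefix("C. briggsae") == two_letter_prefix("C. bovis")
resolved = species_prefix("C. briggsae") != species_prefix("C. bovis")
```

A third letter reduces collisions but cannot rule them out (two species epithets can share their first two letters), so a central registry would presumably still need to check each new prefix.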
2. When a gene is identified in
another species that belongs to a gene class with a clear equivalent
in C. elegans, it should be given the same gene class
name, but with a unique symbol as a postfix to the number. The symbol
will include one or more letters followed by a number. For
example, C. briggsae genes could be
dpy-cb1 OR dpy-B1 OR
dpy-CAENORHABDITISBRIGGSAE000000001 etc. The organism's community
should decide on the exact implementation; this choice will be tracked
by the CGC or WormBase. A species prefix could be added but will be
redundant, e.g., cb-dpy-cb1 OR
cb-dpy-B1 OR ce-dpy-1.
--Yes, any unique identifier (unique within a species) could be used.
Simple integers are a fine code--they work well for C. elegans; they
should work well for other species. I would vote to stick with simple
integers without additional modifiers.
2a. We propose that dpy-cb1 be the
form for C. briggsae since lower case is easier to type. (Paul
Sternberg, Bhagwati Gupta have so far expressed this preference.)
--Mixing numerals with letters might be okay, but integers are simpler
and still unique identifiers. Also, they are easily sorted in simple
databases. In this case, the species prefix must be used, since it
would not be redundant. (Also, it makes sense to have the species
identified before the gene which occurs in that species.)
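To illustrate the databasing point, here is a minimal sketch, assuming names of the form prefix-class-integer (for example Cbr-dpy-2). The pattern and the function parse_gene_name are assumptions made for illustration, not an existing CGC or WormBase format check. Parsing the integer lets names sort numerically, which a plain string sort does not do.

```python
import re

# Assumed format: 3-letter species prefix, 3-letter gene class, integer.
GENE_NAME = re.compile(r"^([A-Z][a-z]{2})-([a-z]{3})-(\d+)$")


def parse_gene_name(name: str) -> tuple[str, str, int]:
    """Split a name into (species prefix, gene class, number); hypothetical helper."""
    match = GENE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a prefix-class-integer name: {name!r}")
    prefix, gene_class, number = match.groups()
    return prefix, gene_class, int(number)


names = ["Cbr-dpy-10", "Cel-dpy-1", "Cbr-dpy-2"]

# Plain string sorting compares character by character, so dpy-10 lands before dpy-2.
string_sorted = sorted(names)

# Sorting on the parsed tuple orders by species, then class, then integer value.
numeric_sorted = sorted(names, key=parse_gene_name)
```

A mixed form such as dpy-cb1 does not match this pattern and would need a separate rule for each species-specific suffix, which is the extra complexity that plain integers avoid.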
3. Gene classes with no equivalent
in C. elegans or other species will be given unique
three-letter-number names.
--fine.
4. For alleles, strains,
polymorphisms, rearrangements, transgenes, and other variants, unique
numbers (unique across all species) will be assigned by the relevant
laboratory using the standard C. elegans
nomenclature. In all cases, a species prefix can be used, but is
redundant. For example, "syIs802" is an integrated
transgene in C. briggsae from the Sternberg
laboratory; it could be referred to as
cb-syIs802. syIs802 will never be used for
something else, especially a C. elegans transgene.
Existing gene classes used in other
species could be retained since there are not too many of them
(e.g., ped), but should be retired if possible.
--Yes, but KEEP THE SPECIES PREFIXES AND DROP THE EXTRA STUFF IN THE
ALLELE OR REARRANGEMENT NAME (again, for simplicity of databasing and
to keep it simple and consistent with current standards).
4a. We propose that the C.
briggsae classes be retired
--NOT NECESSARY. These are already unique gene identifiers.
Actually, these "classes" are sometimes quite arbitrary. For example,
not all alleles of let-7 are lethal. And the same gene may have
different developmental or other roles in different species. As long
as each gene has a unique name in a particular species, there is no
problem. We can always figure out which genes are orthologues,
paralogues, xenologues, etc. by phylogenetic analysis later.
Responsibility for the numbering of a
gene class will reside with the assigning laboratory, unless
transferred by them to WormBase and the
Caenorhabditis Genetics Center. (As in the present practice,
in some cases, if desirable, a small block of numbers can be assigned
to another laboratory.)
--yes.
Notes.
Bird and Riddle (1994 J. Nematol.)
proposed nomenclature for parasitic nematodes. They suggested
following the C. elegans guidelines but with
designations in parallel. It would be desirable for one source
(CGC/WormBase) to enforce uniqueness of lab and allele designations.
--yes. There should be a central clearinghouse for gene nomenclature,
and the CGC is the place to do this.
Philosophy and constraints:
a. From an informatician's
perspective, each genetic entity should have a unique name, and there
should be an authority to maintain uniqueness.
b. From a researcher's perspective,
the names should be easy-to-use and intuitive, and not generate
confusing nicknames (think about what you would write on the side of
your Petri plate). Sub-communities (e.g., those working on
Pristionchus or C. briggsae) would tend to
drop lengthy identifiers.
c. If possible, the names should not
stifle creativity.
d. From a classical geneticist's point
of view, there should be names that can be used for decades before the
molecular identity of a locus is known.
e. From a molecular geneticist's point
of view, orthology should be obvious from the name. There can be
multiple homologs for a gene, and orthology might not be clear,
especially if full genome sequence is not available.
--I ONLY DISAGREE WITH POINT E. THERE MAY WELL BE MULTIPLE
ORTHOLOGUES. BECAUSE THERE IS NO ONE-TO-ONE CORRESPONDENCE BETWEEN
ORTHOLOGOUS GENES, THERE CAN BE NO ONE-TO-ONE CORRESPONDENT
NOMENCLATURE. I think it is okay to abandon this ideal. WormBase
could set up a database that matches orthologous relationships among
genes if necessary.
f. However, the name should not
confuse relationships among genes.
g. Other species names should not
crowd out those in C. elegans.
Uniqueness (a) is the overriding
concern. Ease of use is the second priority. Depending on the
researcher, (b,d) or maximizing (e) and minimizing (f) is more
important.
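The reply to point (e) suggests keeping orthology in a separate database rather than encoding it in names. A minimal sketch of such a table follows, assuming a simple in-memory many-to-many mapping. The class OrthologyTable is hypothetical, and the gene names are the placeholder examples used earlier in this letter, not real loci.

```python
from collections import defaultdict


class OrthologyTable:
    """Hypothetical many-to-many lookup between gene names in different species."""

    def __init__(self) -> None:
        self._links: dict[str, set[str]] = defaultdict(set)

    def add(self, gene_a: str, gene_b: str) -> None:
        # Orthology is symmetric, so record the link in both directions.
        self._links[gene_a].add(gene_b)
        self._links[gene_b].add(gene_a)

    def orthologs(self, gene: str) -> list[str]:
        # .get avoids creating an empty entry for genes with no recorded links.
        return sorted(self._links.get(gene, ()))


table = OrthologyTable()
table.add("Cja-gen-1", "Cbr-gen-1")
table.add("Cja-gen-1", "Cbr-gen-2")

# One gene may have several orthologs in another species.
cja_hits = table.orthologs("Cja-gen-1")
```

Because the relationships live in the table, they can be revised after later phylogenetic analysis without renaming any gene.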
N. B. There are millions of nematode
species.
--Yes. Also, there are several species that will have identical
two-letter prefix designators. Establishing three-letter prefixes
would help prevent this problem.
--
David H. A. Fitch
Associate Professor
Department of Biology
New York University
Main Building, Room 1009
100 Washington Square East
New York, NY 10003
USA
Tel.: (212) 998-8254
Fax: (212) 995-4015
e-mail: david.fitch@nyu.edu
http://www.nyu.edu/projects/fitch/