From David Fitch
Dear Paul,
some comments on your nomenclature plan.  To reduce confusion as well as simplify databasing, all nomenclatural rules could be reduced to:


The reasons follow:

1.  Orthologs will be given the same name but with a species prefix.  For example, cb-tra-1 is the C. briggsae ortholog of C. elegans tra-1.  In some cases, there will be paralogs and some confusion; we expect this to be minor compared to the convenience of orthologs having the same names.  For paralogs, a "dot and number" can be appended to distinguish them, e.g., hsp-16.1, hsp-16.2.

--Known single orthologous gene pairs should indeed share the same name.  However, there may well be MULTIPLE ORTHOLOGS as well as multiple paralogs (see attached GIF file "Orthoparalogy.gif").  For example, Cja-gen-1 might be orthologous not only to Cbr-gen-1, but also to Cbr-gen-2, while Cbr-gen-1 could be paralogous to Cel-gen-2.

About using decimal points, isn't this format ALREADY USED to designate alternative splice products?  If so, using the same nomenclature for different genes would be confusing.

I suggest simply different gene numbers for different genes, regardless of orthology/paralogy (which can be difficult to determine).  Given the prefix, we will know what species the gene comes from.  As long as the gene identifiers are unique FOR EACH SPECIES, there should be no problem, if it is also stated upfront that there is NO IMPLICATION OF ORTHOLOGY/PARALOGY between genes with similar names in different species.  This is simply because otherwise we would have to establish orthology/paralogy a priori, a task which will just set up a huge opportunity for massive confusion.
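The per-species uniqueness rule can be sketched as a small registry.  This is only an illustration: the class name GeneRegistry, its register method, and the gene names used are hypothetical and not part of any existing CGC or WormBase system.

```python
class GeneRegistry:
    """Toy registry: a gene name only has to be unique within its own species."""

    def __init__(self):
        # species prefix -> set of (gene_class, number) pairs already assigned
        self._names = {}

    def register(self, species, gene_class, number):
        key = (gene_class, number)
        taken = self._names.setdefault(species, set())
        if key in taken:
            raise ValueError(f"{species}-{gene_class}-{number} is already assigned")
        taken.add(key)
        return f"{species}-{gene_class}-{number}"


registry = GeneRegistry()
print(registry.register("Cel", "gen", 1))  # Cel-gen-1
# Same class and number in another species is allowed; no orthology is implied.
print(registry.register("Cbr", "gen", 1))  # Cbr-gen-1
```

Registering Cel-gen-1 a second time would raise an error, while Cbr-gen-1 and Cel-gen-1 coexist without any claim about how the two genes are related.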

Also, a 2-letter species id code is not sufficient (nor is it standard).  "Cb" could as well refer to C. bovis as to C. briggsae.  The standard used (e.g. in many phylogenetic papers and restriction enzymes) is a 3-letter code.  The first letter (capitalized) is the first letter of the genus epithet; the second and third letters (lower case) are the first and second letters of the species epithet.  For example, "Eco" stands for E. coli (as in EcoRI); "Cbr" would stand for C. briggsae.
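The three-letter rule, and the collision that a two-letter code produces, can be sketched as follows.  The function names species_code and collisions are hypothetical helpers for illustration; the species are the ones mentioned above.

```python
from collections import defaultdict


def species_code(genus, species, species_letters=2):
    """Capitalized first letter of the genus plus the first letters of the species epithet."""
    return genus[0].upper() + species[:species_letters].lower()


def collisions(names, species_letters):
    """Return every code that is shared by more than one species."""
    groups = defaultdict(list)
    for genus, species in names:
        code = species_code(genus, species, species_letters)
        groups[code].append(f"{genus} {species}")
    return {code: members for code, members in groups.items() if len(members) > 1}


names = [
    ("Caenorhabditis", "briggsae"),
    ("Caenorhabditis", "bovis"),
    ("Escherichia", "coli"),
]

print(species_code("Escherichia", "coli"))         # Eco
print(species_code("Caenorhabditis", "briggsae"))  # Cbr
print(collisions(names, species_letters=1))  # {'Cb': ['Caenorhabditis briggsae', 'Caenorhabditis bovis']}
print(collisions(names, species_letters=2))  # {}
```

A three-letter code makes collisions rarer but cannot rule them out, so a check of this kind would still be worth running whenever a new species is added.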

2. When a gene is identified in another species that belongs to a gene class with a clear equivalent in C. elegans, it should be given the same gene class name, but with a unique symbol as a postfix to the number.  The symbol will include one or more letters followed by a number.  For example, C. briggsae genes could be dpy-cb1 OR dpy-B1 OR dpy-CAENORHABDITISBRIGGSAE000000001 etc.  The organism's community should decide on the exact implementation; this choice will be tracked by the CGC or WormBase.  A species prefix could be added but will be redundant, e.g., cb-dpy-cb1 OR cb-dpy-B1 OR ce-dpy-1.

--Yes, any unique identifier (unique within a species) could be used.  Simple integers are a fine code--they work well for C. elegans; they should work well for other species.  I would vote to stick with simple integers without additional modifiers.

2a.  We propose that dpy-cb1 be the form for C. briggsae since lower case is easier to type.  (Paul Sternberg, Bhagwati Gupta have so far expressed this preference.)

--Mixing numerals with letters might be okay, but integers are simpler and still unique identifiers.  Also, they are easily sorted in simple databases.  In this case, the species prefix must be used, since it would not be redundant.  (Also, it makes sense to have the species identified before the gene which occurs in that species.)
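The sorting point can be illustrated with a short sketch.  It assumes names of the form prefix-class-integer (e.g. Cbr-dpy-10); the gene names below are made up for the example, and the pattern NAME and the helper sort_key are hypothetical.

```python
import re

# Assumed format: three-letter species prefix, lower-case gene class, integer.
NAME = re.compile(r"^([A-Z][a-z]{2})-([a-z]+)-(\d+)$")


def sort_key(name):
    """Split a name into (species, gene class, number) so numbers sort numerically."""
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"not in prefix-class-number form: {name}")
    species, gene_class, number = match.groups()
    return (species, gene_class, int(number))


genes = ["Cbr-dpy-10", "Cbr-dpy-2", "Cel-dpy-1", "Cbr-dpy-1"]

# A plain string sort puts dpy-10 before dpy-2.
print(sorted(genes))                # ['Cbr-dpy-1', 'Cbr-dpy-10', 'Cbr-dpy-2', 'Cel-dpy-1']
# Sorting on the parsed fields gives the expected order.
print(sorted(genes, key=sort_key))  # ['Cbr-dpy-1', 'Cbr-dpy-2', 'Cbr-dpy-10', 'Cel-dpy-1']
```

Because the species prefix comes first, the same key also groups genes by species.  A mixed suffix such as cb1 would need an extra parsing rule for each species' convention.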

3.  Gene classes with no equivalent in C. elegans or other species will be given unique three-letter-number names.  


4.  For alleles, strains, polymorphisms, rearrangements, transgenes, and other variants, unique numbers (unique across all species) will be assigned by the relevant laboratory using the standard C. elegans nomenclature.  In all cases, a species prefix can be used, but is redundant.  For example, "syIs802" is an integrated transgene in C. briggsae from the Sternberg laboratory; it could be referred to as cb-syIs802.  syIs802 will never be used for something else, especially a C. elegans transgene.

Existing gene classes used in other species could be retained since there are not too many of them (e.g., ped), but should be retired if possible. 

--Yes, but KEEP THE SPECIES PREFIXES AND DROP THE EXTRA STUFF IN THE ALLELE OR REARRANGEMENT NAME (again, for simplicity of databasing and to keep it simple and consistent with current standards).

4a.  We propose that the C. briggsae classes be retired.

--NOT NECESSARY.  These are already unique gene identifiers.  Actually, these "classes" are sometimes quite arbitrary.  For example, not all alleles of let-7 are lethal.  And the same gene may have different developmental or other roles in different species.  As long as each gene has a unique name in a particular species, there is no problem.  We can always figure out which genes are orthologues, paralogues, xenologues, etc. by phylogenetic analysis later.

Responsibility for the numbering of a gene class will reside with the assigning laboratory, unless transferred by them to WormBase and the Caenorhabditis Genetics Center.  (As in the present practice, in some cases, if desirable, a small block of numbers can be assigned to another laboratory.)



Bird and Riddle (1994 J. Nematol.) proposed nomenclature for parasitic nematodes.  They suggested following the C. elegans guidelines but with designations in parallel.  It would be desirable for one source (CGC/WormBase) to enforce uniqueness of lab and allele designations.

--yes.  There should be a central clearinghouse for gene nomenclature, and the CGC is the place to do this.

Philosophy and constraints: 

a. From an informatician's perspective, each genetic entity should have a unique name, and there should be an authority to maintain uniqueness. 

b. From a researcher's perspective, the names should be easy-to-use and intuitive, and not generate confusing nicknames (think about what you would write on the side of your Petri plate). Sub-communities (e.g., those working on Pristionchus or C. briggsae) would tend to drop lengthy identifiers. 

c. If possible, the names should not stifle creativity.

d. From a classical geneticist's point of view, there should be names that can be used for decades before the molecular identity of a locus is known.

e. From a molecular geneticist's point of view, orthology should be obvious from the name.  There can be multiple homologs for a gene, and orthology might not be clear, especially if full genome sequence is not available.

--I ONLY DISAGREE WITH POINT E.  THERE MAY WELL BE MULTIPLE ORTHOLOGUES.  BECAUSE THERE IS NO ONE-TO-ONE CORRESPONDENCE BETWEEN ORTHOLOGOUS GENES, THERE CAN BE NO ONE-TO-ONE CORRESPONDENT NOMENCLATURE.  I think it is okay to abandon this ideal.  WormBase could set up a database that matches orthologous relationships among genes if necessary.
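A minimal sketch of such a matching table is given below, kept separate from the names themselves so that one gene can have several orthologs.  The class OrthologyTable is hypothetical and does not describe any actual WormBase schema; the gene names are the illustrative ones used earlier in this message.

```python
from collections import defaultdict


class OrthologyTable:
    """Many-to-many record of orthology, stored independently of gene names."""

    def __init__(self):
        self._links = defaultdict(set)

    def add(self, gene_a, gene_b):
        # Orthology is symmetric, so record the link in both directions.
        self._links[gene_a].add(gene_b)
        self._links[gene_b].add(gene_a)

    def orthologs(self, gene):
        # .get avoids creating empty entries for genes that were never added.
        return sorted(self._links.get(gene, set()))


table = OrthologyTable()
table.add("Cja-gen-1", "Cbr-gen-1")
table.add("Cja-gen-1", "Cbr-gen-2")

print(table.orthologs("Cja-gen-1"))  # ['Cbr-gen-1', 'Cbr-gen-2']
print(table.orthologs("Cbr-gen-2"))  # ['Cja-gen-1']
```

Relationships can then be added or revised as phylogenetic analyses come in, without renaming any gene.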

f. However, the name should not confuse relationships among genes.

g. Other species names should not crowd out those in C. elegans.

Uniqueness (a) is the overriding concern.  Ease of use is the second priority.  Depending on the researcher, either (b, d) or maximizing (e) and minimizing (f) is more important.

N. B.   There are millions of nematode species. 

--Yes.  Also, there are several species that will have identical two-letter prefix designators.  Establishing three-letter prefixes would help prevent this problem.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~    ~       -   -
~ David H. A. Fitch            ~   \  /       /   /
~ Associate Professor          ~    \/       /   /
~ Department of Biology        ~     \      /   /
~ New York University          ~      []   /   /
~ Main Building, Room 1009     ~       \  /   /
~ 100 Washington Square East   ~        \/   /
~ New York, NY  10003          ~         \  /
~ U S A                        ~          \/
~ Tel.:  (212) 998-8254        ~           \
~ Fax:   (212) 995-4015        ~            \
~ e-mail: ~             \