The Atlas website structure

Philippe Dessen (Database Director)
Jean Loup Huret (Editor)
May 2017



I- Main page: 

II-1. Foreword 1: There are various types of items developed in the Atlas:
1- Genes
1-1. Annotated genes (papers/cards written by authors); URLs: [name-of-gene]ID[number]ch[location].html (1 493 papers/cards); and
1-2. Automated cards on genes (more or less like GeneCards); URLs: [name-of-gene].html (28 377 cards);
2- Leukemias (681 annotated papers/cards)
3- Solid tumors (217 annotated papers/cards)
4- Cancer-prone diseases (114 annotated papers/cards)
5- Case reports in hematology (88 papers/cards)
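
The two URL patterns for gene cards can be told apart mechanically. The following is a minimal sketch in Python (the Atlas itself relies on Perl scripts); the function name and the example file names in the comments are hypothetical, and only the two patterns themselves come from the list above.

```python
import re

# Annotated card: [name-of-gene]ID[number]ch[location].html
ANNOTATED = re.compile(
    r"^(?P<symbol>.+?)ID(?P<number>\d+)ch(?P<location>[^/]+)\.html$"
)
# Automated card: [name-of-gene].html
AUTOMATED = re.compile(r"^(?P<symbol>[^/]+)\.html$")


def classify_gene_card(filename):
    """Return (kind, fields) for a gene card file name.

    kind is "annotated", "automated" or "unknown"; fields holds the
    pieces of the file name recovered from the matching pattern.
    """
    match = ANNOTATED.match(filename)
    if match:
        return "annotated", match.groupdict()
    match = AUTOMATED.match(filename)
    if match:
        return "automated", match.groupdict()
    return "unknown", {}


# Hypothetical file names, for illustration only:
#   classify_gene_card("NUP214ID20ch9q34.html")  -> annotated card
#   classify_gene_card("NUP214.html")            -> automated card
```

The annotated pattern is tested first because every annotated file name also satisfies the looser automated pattern.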

All these cards are structured from templates (e.g. the submission form for GENES), with the addition of a HEADER containing tags, or tracking devices, that allow the form to be indexed in different parts of the database (e.g. TRI_PAR_CHROMOSOME -> to which chromosome page? CATEGORY -> to which Cell Biology page?), and of EXTERNAL LINKS (at the bottom of each paper/card).
There are also:
- Deep Insights (traditional papers; 113 Deep Insights),
- Chromosome pages ([chromosome-number].html),
- Chromosome band pages ([band].html),
- Cell biology pages ([category-name]),
- ICD-O pages (International Classification of Diseases for Oncology, WHO/OMS),
- Atlas status (thesaurus of the Atlas, with sub-pages),
- and various other pages (Backpage: Recent papers, Educational items, Genes partners, International cancer programs, etc.; see Main page).

II-2. Foreword 2: Editorial process:
This is an important part, as the Editorial database processing must take it into account. See "Editorial workflow in the Atlas".
Critically, Tables are used 1- to identify each relevant item (Table 1 below); 2- to dialogue with authors (Table 2 below). Examples:




Table 1 (identification of items; column heading: ID Atlas):

    05;00§ tri 5/NHL or chronic Lympho
    05;00§ MDS with isolated del (5q)
    05;00§ del(5)(q32q33) TNIP1/PDGFRB
    … about 1,000 items/lines
    99;99§ Extraosseous plasmacytoma
    99;99§ Florid follicular hyperplasia PTLD

Table 2 (dialogue with authors):

    Florid follicular hyperplasia PTLD | 3rd paper (leuk.) + 1 paper (gene)
    del(5)(q32q33) TNIP1/PDGFRB        | Reminder 2017/06/26; 2017/03/21 "Yes, will have this to you shortly"; Reminder 2016/11/19; Reminder 2016/06/17; 2016/01/17 no deadline ("soon"); 2015/10/14: OK
    del(X)(p22p22) (P2RY8/CRLF2)       | Spontaneous proposal

Note: the tables used to identify each relevant item must be in a one-to-one (bijective) relation with the cards/papers; e.g. 05;00§ MDS with isolated del (5q) / ID 1134 <--> the corresponding card.
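
The bijective relation in this note can be verified automatically. The sketch below is one possible check, not the Atlas implementation: it assumes each table row has been reduced to a (label, Atlas ID) pair and that the IDs of the existing cards are available as a collection. The function name and the returned keys are hypothetical.

```python
def check_bijection(table_rows, card_ids):
    """Compare the identification table with the set of existing cards.

    table_rows: iterable of (label, atlas_id) pairs taken from the table.
    card_ids:   iterable of Atlas IDs of the cards/papers that exist.
    Returns a dict of problems; the relation is bijective only when
    all three lists are empty.
    """
    label_by_id = {}
    duplicates = []
    for label, atlas_id in table_rows:
        if atlas_id in label_by_id:
            # Same ID used by two table rows: the mapping is not injective.
            duplicates.append((atlas_id, label_by_id[atlas_id], label))
        else:
            label_by_id[atlas_id] = label

    cards = set(card_ids)
    table_ids = set(label_by_id)
    return {
        "duplicate_ids": duplicates,
        "rows_without_card": sorted(table_ids - cards),
        "cards_without_row": sorted(cards - table_ids),
    }
```

For example, a table holding only the row ("05;00§ MDS with isolated del (5q)", 1134) checked against the single card ID 1134 yields three empty lists.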

Finally, we also have to format the cards/papers into Word for the "scientific journal" version of the Atlas ("Export word" arrow). This is done with a database other than the one described here: an almost fully operational database in a Microsoft environment (IIS 7 on Windows Server 2008 R2, with SQL Server 2008 R2). However, this database is too rigid and does not allow much biological or bioinformatics development. It must either be modified or be replaced by a new one, preferably open source.

III- Website Structure
III-1. Entities
The main goal at the origin of the project was to present several sets of monographs on Genes, Leukemias, Tumors and Cancer-prone diseases. Database management was not crucial at that time. That is why the Atlas is not a real database (e.g. MySQL etc.) but is organized around a set of structured Cards and numerous relations built through Indexes (generated with Perl scripts).

    Validation of txt files
    Preprocessing for Genes
    Processing of cards in hypertext
    General indexation
    Generation of chromosome pages
    Generation of chromosomal bands
    Tables of status, categories, authors ...
    Interfaces with external data (Mitelman, COSMIC, Entrez Gene, HGNC, UCSC ...)

III-2. Cards processing:
1. Authors send ".doc" files
2. Editing from ".doc" to a structured ".txt" file, with the addition of hyperlinks
3. Validation step
    • Editing of the bibliography in alphabetical order from PMIDs (with a search in PubMed)
    • Correction of special characters following a thesaurus of octal codes
    • Test of blocks and fields
    • Test of correct hyperlinks
4. Transformation into hypertext files
    Using specific scripts
5. For Annotated Genes: addition of specific external links
    (specific management in parallel for the list of genes)
5 bis. For other genes: automatic creation from updated data (genes_g[cn].txt)
6. Addition of internal hyperlinks
    (specific management in parallel)
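
Two of the validation checks in step 3 lend themselves to a short illustration. The sketch below is written in Python for readability (the Atlas uses Perl and bash) and rests on assumptions: a bibliography entry is taken to be a dict with a "citation" string, and the thesaurus is taken to be a mapping from a special character to its octal code written as text. All names are hypothetical.

```python
def sort_bibliography(entries):
    """Return the bibliography entries in alphabetical order of citation.

    entries: list of dicts with at least a "citation" key (assumed shape).
    The comparison ignores case so that "adams" sorts before "Smith".
    """
    return sorted(entries, key=lambda entry: entry["citation"].casefold())


def find_unmapped_characters(text, thesaurus):
    """List the non-ASCII characters of text that the thesaurus lacks."""
    return sorted({c for c in text if ord(c) > 127 and c not in thesaurus})


def to_octal_codes(text, thesaurus):
    """Replace every character found in the thesaurus by its octal code."""
    return "".join(thesaurus.get(c, c) for c in text)


# Assumed thesaurus shape: character -> octal code as text,
# e.g. {"é": "\\351"} (351 is the octal Latin-1 code of "é").
```

Reporting unmapped characters before converting keeps the thesaurus as the single place where new special characters are declared.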

III-3. Organisation of directories
All data is organized in two main directories:
1. "cytatlas" (for management)
2. "chromcancer" (with internet access)

1. cytatlas
    ./Genes0 (for expert txt)
    ./Genes (after txt processing)
    Each directory has some other subdirectories for Images, xxLinks …
    ./Scripts (all bash and perl scripts + reference data)

2. chromcancer
(with subdirectories for Images ..)
    ./Indexbychrom (Chromosomes pages)

III-4. Indexation of Cards 1: script (in cytatlas/Scripts)
Used for re-indexation after new files or new data:
1. Generation of all automatic genes
2. Generation of the main index file for all documents (ObjDB.txt)
3. Generation of a catalog (text file with the information from the HEADER)
4. Generation of some other indexes (ObjDBxx.txt)
5. Transformation of the catalog (and of the "for sale" / "to be written" files) into tables, concatenated in a catalog_full.txt file
6. Indexation of Genes (Geneliste.html), Leukemias (Anomliste.html), etc.
7. Indexation by chromosomes
8. Indexation by authors (different IndxAuthxx.txt / html files in Collab; IndxAuth3.txt is the main index for authors and affiliations)
9. Generation of Categories for Cell Biology items (several files are maintained beforehand in parallel)
10. Generation of status (Genes, Authors, etc.)
11. Generation of Recent (documents from the last 2 years)
12. Generation of COSMIC projects and TCGA/ICGC projects
13. Statistics
Possibility of MySQL indexation for some items (query in the home page)
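
As an illustration of the indexation by chromosomes (step 7), the sketch below groups annotated gene cards by chromosome. It is a stand-in, not the actual script: it reads the chromosome from the [location] part of the file name [name-of-gene]ID[number]ch[location].html, whereas the Atlas indexes from HEADER tags such as TRI_PAR_CHROMOSOME, and it assumes that the location starts with the chromosome (1 to 22, X or Y). Function and file names are hypothetical.

```python
import re
from collections import defaultdict

# File name pattern of annotated gene cards.
CARD = re.compile(
    r"^(?P<symbol>.+?)ID(?P<number>\d+)ch(?P<location>[^/]+)\.html$"
)
# Assumption: the location begins with the chromosome, e.g. "9q34", "Xp22".
CHROMOSOME = re.compile(r"^(\d{1,2}|[XY])")


def index_by_chromosome(filenames):
    """Map each chromosome to the sorted gene symbols located on it."""
    index = defaultdict(list)
    for name in filenames:
        card = CARD.match(name)
        if card is None:
            continue  # not an annotated gene card
        chromosome = CHROMOSOME.match(card["location"])
        if chromosome is not None:
            index[chromosome.group(1)].append(card["symbol"])
    return {chrom: sorted(symbols) for chrom, symbols in index.items()}
```

Each key of the resulting dict corresponds to one chromosome page, and its value to the list of genes to print on that page.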

III-4 bis. Indexation of Cards 2: Generation of external links for all genes
1. Maintenance of 2 specific tables (genes_gc.txt and genes_gn.txt) for genes, with more than 80 pieces of information
All genes in the Atlas are extracted from Entrez Gene (NCBI) by ftp each week (60200 genes) and compared to UCSC genes (refGene.txt file for hg38). Only genes with a genomic location are kept (27580 at this time).
Potential cancer genes are flagged by the presence, in the description or GeneRIF, of any term from the following list:
"cancer","tumour","tumor","neoplasm","metastas","translocation","carcinogen","carcinom","lymphom","oncogen","repair","leukemia","transforming","melanoma","neuroblastoma","sarcoma","adenom","glioma","mitogen","fusion","proliferation","rearrangement","malignan"
External data (from HGNC, UniProt, UCSC, Ensembl, COSMIC, etc.) are processed semi-automatically (an important step that needs to be better formalized).
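
The term-based flagging of potential cancer genes can be sketched as follows. The list of terms is the one given above; the case-insensitive substring match and the function signature are assumptions, since the document does not say how the Perl scripts compare the terms.

```python
# Terms taken from the list above; several are word stems ("metastas",
# "carcinom", "malignan") so that they match every derived form.
CANCER_TERMS = (
    "cancer", "tumour", "tumor", "neoplasm", "metastas", "translocation",
    "carcinogen", "carcinom", "lymphom", "oncogen", "repair", "leukemia",
    "transforming", "melanoma", "neuroblastoma", "sarcoma", "adenom",
    "glioma", "mitogen", "fusion", "proliferation", "rearrangement",
    "malignan",
)


def is_potential_cancer_gene(description, generifs=()):
    """Return True if any term occurs in the description or a GeneRIF.

    description: free-text gene description (e.g. from Entrez Gene).
    generifs:    iterable of GeneRIF sentences for the same gene.
    Matching is a case-insensitive substring search (an assumption).
    """
    text = " ".join([description, *generifs]).lower()
    return any(term in text for term in CANCER_TERMS)
```

Substring matching is deliberately permissive: a term such as "repair" or "fusion" flags many genes that are only loosely related to cancer, which is consistent with the wording "potential" cancer genes.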

III-5. Generation of internal hyperlinks and Cards
script (in cytatlas/Scripts)
1. Generation of internal hyperlinks
    In each card, hyperlinks are defined as a tag with the format: <CC: TXT: xxxxxxxxxxx  ID: yyy>
    TXT content corresponds to the visible text in the hypertext file;
    ID is the Atlas ID of the object;
    A complete file of hyperlinks is generated.
Map of one set towards another (injectivity/surjectivity):

Item            | Internal hyperlinks toward
1 Gene          | n1 Leukemias
                | n2 Solid tumors
                | n3 Cancer-prone
1 Leukemia      | n4 Genes
                | n5 Cancer-prone
1 Solid tumor   | n6 Genes
                | n7 Cancer-prone
1 Cancer-prone  | n8 Genes
                | n9 Leukemias
                | n10 Solid tumors

Item                                 | Hyperlinks toward
Gene NUP214                          | Leukemia t(6;9)(p23;q34) DEK/NUP214
                                     | Leukemia t(9;9)(q34;q34) SET/NUP214
                                     | Leukemia T cell ALL
                                     | Solid Tumor Lung Adenocar. t(9;9)(q34;q34) PRRC2B/NUP214
Gene KIT                             | Leukemia trisomy 4
                                     | Solid Tumor Melanoma
                                     | Cancer Prone Piebaldism
Leukemia t(6;9)(p23;q34) DEK/NUP214  | Gene NUP214
                                     | Gene DEK
Cancer Prone Tuberous sclerosis      | Gene TSC1
                                     | Gene TSC2
                                     | Solid Tumor Renal carcinoma
                                     | Solid Tumor Ependymomas

2. For each card (e.g. Genes), generation of the links from the other types (e.g. AnomLinks, TumorsLinks, etc.) in a hypertext format (to be added when the hypertext cards are generated)
3. Generation of all cards
    ./ (for non cancer genes)
    ./ (for genes potentially cancer)
    ./ (for expertized genes)
    ./ (expertized genes are defined as filename or as standard: GC_symbol.html)
    ./ (for Leukemia, Tumors or Kprones not written - forsale)
    ./, ./, ./, ./, ./, ./
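
The internal hyperlink tag of step 1, <CC: TXT: xxxxxxxxxxx  ID: yyy>, can be extracted and turned into an HTML anchor with a regular expression. This is a minimal sketch under stated assumptions: the Atlas ID is numeric, a tag does not span several lines, and the mapping from an Atlas ID to the URL of its card (url_by_id) is supplied by the caller. The function names and that mapping are hypothetical.

```python
import re
from html import escape

# <CC: TXT: visible text  ID: atlas-id>
CC_TAG = re.compile(r"<CC:\s*TXT:\s*(?P<text>.*?)\s*ID:\s*(?P<id>\d+)\s*>")


def extract_links(card_text):
    """Return the (visible text, Atlas ID) pairs found in a card."""
    return [(m["text"], int(m["id"])) for m in CC_TAG.finditer(card_text)]


def render_links(card_text, url_by_id):
    """Replace each CC tag by an HTML anchor.

    url_by_id maps an Atlas ID to the URL of its card (assumed to be
    built elsewhere). A tag whose ID is unknown is reduced to its
    visible text, so that a missing card never yields a dead link.
    """
    def replace(match):
        url = url_by_id.get(int(match["id"]))
        if url is None:
            return match["text"]
        return '<a href="{}">{}</a>'.format(
            escape(url, quote=True), escape(match["text"])
        )

    return CC_TAG.sub(replace, card_text)
```

Collecting the output of extract_links over all cards gives the raw material for the complete file of hyperlinks and for the link tables between item types.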

III-6. Generation of chromosomal bands
script index_byband2 (in cytatlas/Scripts)
The generation of chromosomal band pages (2 sections, Anomalies and Genes) requires prior processing of different sources of data (Mitelman, COSMIC, FusionDB, TICdb, ChimerDB).
All sources are preprocessed into the same format.
These pages are updated each time a new version of Mitelman (3 per year) or COSMIC (every 3 months) is released.
1. Processing of the Mitelman database
2. Processing of the COSMIC database
3. Integration with other sources

III-7. Other integrations
ICD-O: Topographical Classification (WHO/OMS)
ICD-O: Morphological Classification (WHO/OMS)
Drugs and Therapies
International Cancer Programs
Genomic Data Commons
ICGC Program
TCGA program
IntoGen Portal
OASIS Portal
COSMIC studies
Tumour cell lines

III-8. Atlas website structure statistics


IV- Perspective and Evolution
The Atlas needs to evolve toward a real database with 2 goals:

1. An editorial management system for interactive submission of documents (+++): authors would fill in a submission form directly formatted in the database and submitted to the Editor and/or the Section Editor, who would validate, ask for modifications to, or reject these "ready to use" cards/papers.
2. A structured database for all cards and documents.

This database should use open source free software.

To be more integrated in the new era of cancer genomics, one needs the development of new tools, in particular graphical interfaces, and new integrated data in the domain of cancer cytogenomics, in relation with personalized medicine programs.