Debugging the bug/Mapping Reactions by Gene Predicates

From FreeBio

Methodology

The overall goal is to associate one or more EcoCyc reactions with each iAF1261 reaction using its geneAssociation. Each geneAssociation lists one or more Blattner numbers. Since EcoCyc also uses Blattner numbers, mapping reactions by this method seems promising.

However, the mapping is not straightforward, for two reasons. First, one reaction may be catalyzed by multiple gene products: each gene product may catalyze the reaction independently (isoenzymes), or the products may form an enzyme complex in which all genes are required to catalyze the reaction. Second, each gene product may catalyze multiple reactions.


This requires some design choices about how best to represent the genes of a reaction. One solution is to treat them as a simple set. Another is to treat them as a boolean predicate in which members of a protein complex are conjunctions (AND) and isoenzymes are disjunctions (OR). We decided to approach the problem of automating the mapping procedure from several angles, ranging from the most sensitive (low false negative rate) to the most specific (low false positive rate).


Mapping by Gene Sets

  1. For each gene, determine all EcoCyc reactions catalyzed by its product.
  2. For each gene, determine all iAF1261 reactions catalyzed by its product.
  3. For each iAF1261 reaction, associate all EcoCyc reactions that share at least one gene with it.
  4. Choose the EcoCyc reaction(s) that have the most compounds in common with the iAF1261 reaction.


For example, ACGAptspp (N-Acetyl-D-glucosamine transport via PEP:Pyr PTS) is catalyzed by the following gene products:

 ( B2417  B1101 B2415  B2416 B0679 B2415 B2416 )

Using this approach, the algorithm matches all of the following EcoCyc reactions:



This approach has the advantage of being extremely sensitive, in the sense that if two reactions share any genes in common, then they will be mapped to each other. However, it suffers because it is unable to distinguish between isoenzymes of a reaction and members of a protein complex.
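
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the results on this page: the function name and the four input dictionaries (reaction ID to set of Blattner numbers, and reaction ID to set of compound IDs already translated to a shared namespace) are hypothetical.

```python
from collections import defaultdict


def map_by_gene_sets(iaf_genes, ecocyc_genes, iaf_compounds, ecocyc_compounds):
    """Return {iAF1261 reaction: [best-matching EcoCyc reactions]}.

    All four arguments are hypothetical dicts keyed by reaction ID whose
    values are sets of Blattner numbers (genes) or compound IDs.
    """
    # Step 1: index the EcoCyc reactions by gene.
    ecocyc_by_gene = defaultdict(set)
    for rxn, genes in ecocyc_genes.items():
        for gene in genes:
            ecocyc_by_gene[gene].add(rxn)

    mapping = {}
    for rxn, genes in iaf_genes.items():
        # Steps 2-3: collect every EcoCyc reaction sharing at least one gene.
        candidates = set()
        for gene in genes:
            candidates |= ecocyc_by_gene.get(gene, set())
        if not candidates:
            mapping[rxn] = []
            continue

        # Step 4: keep the candidate(s) with the most compounds in common.
        def shared(candidate):
            return len(iaf_compounds.get(rxn, set())
                       & ecocyc_compounds.get(candidate, set()))

        best = max(shared(c) for c in candidates)
        mapping[rxn] = sorted(c for c in candidates if shared(c) == best)
    return mapping
```

Because genes are treated as a flat set, a single shared gene is enough to make a reaction a candidate, which is exactly why this variant cannot tell isoenzymes from complex members.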

Mapping by assigning EcoCyc genes of a reaction to iAF1261 gene predicates

  1. Represent each iAF1261 geneAssociation as a boolean predicate, where members of an enzyme complex are AND'ed together and independent isoenzymes are OR'ed together.
  2. Represent the genes of each EcoCyc reaction as a set of truth-value assignments (every gene of the reaction is True, every other gene is False).
  3. For each iAF1261 geneAssociation, apply the EcoCyc truth-value assignments to the boolean predicate.
  4. If the predicate evaluates to True, then the EcoCyc reaction is mapped to the iAF1261 reaction.


Under this algorithm, ACGAptspp (N-Acetyl-D-glucosamine transport via PEP:Pyr PTS) has the following geneAssociation:

 ( ( B2417 AND B1101 AND B2415 AND B2416 ) OR ( B0679 AND B2415 AND B2416 ) )


It matches only TRANS-RXN-167 (Transport of N-acetyl-D-glucosamine) and TRANS-RXN-157 (Transport of Beta-D-Glucose). The genes of TRANS-RXN-167 (when converted to Blattner numbers) are:

("B0679","B2416","B2415","B1817","B1818","B1819")

The genes of TRANS-RXN-157 (when converted to Blattner numbers) are:

("B2417", "B1101", "B2416", "B2415", "B1817", "B1818", "B1819")

Both of these gene sets satisfy the boolean predicate, even though the set of genes whose products catalyze ACGAptspp is not identical to the set of genes that catalyze TRANS-RXN-167 or TRANS-RXN-157.
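
To make the evaluation step concrete, here is a minimal Python sketch that parses a geneAssociation string and evaluates it against the gene set of an EcoCyc reaction. It is an illustration only: the `tokenize` and `evaluate` functions are hypothetical names, and the parser assumes the association uses only parentheses, AND, OR, and Blattner numbers, as in the ACGAptspp example above.

```python
import re


def tokenize(association):
    # Split into parentheses and whitespace-delimited words (AND, OR, gene IDs).
    return re.findall(r"\(|\)|[^\s()]+", association)


def evaluate(association, present_genes):
    """Evaluate a geneAssociation with every gene in present_genes set to True."""
    tokens = tokenize(association)
    present = {gene.upper() for gene in present_genes}
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos].upper() == "OR":
            pos += 1
            rhs = parse_and()  # always parse, so the cursor keeps advancing
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos].upper() == "AND":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of gene association")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token == ")" or token.upper() in ("AND", "OR"):
            raise ValueError(f"unexpected token {token!r}")
        return token.upper() in present  # a gene is True if the reaction has it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in gene association")
    return result


ACGAPTSPP = "( ( B2417 AND B1101 AND B2415 AND B2416 ) OR ( B0679 AND B2415 AND B2416 ) )"
TRANS_RXN_167 = {"B0679", "B2416", "B2415", "B1817", "B1818", "B1819"}
TRANS_RXN_157 = {"B2417", "B1101", "B2416", "B2415", "B1817", "B1818", "B1819"}

print(evaluate(ACGAPTSPP, TRANS_RXN_167))  # True: second alternative is satisfied
print(evaluate(ACGAPTSPP, TRANS_RXN_157))  # True: first alternative is satisfied
```

Note that the extra genes B1817, B1818 and B1819 do not affect the result: the predicate contains no negation, so additional True genes can never make it False.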

TRANS-RXN-167 also happens to be the reaction that was matched by Markus's compound matching algorithm. Therefore, we have improved specificity (by reducing the false positive rate) compared to the old gene-matching algorithm without losing sensitivity (worsening the false negative rate) compared with Markus's compound matching algorithm.

Furthermore, initial results seem to indicate this method has assigned at least one EcoCyc reaction to 766 UCSD reactions that were marked as unmapped by Markus's method.

Using Description Logics to map reactions

  1. Represent Blattner numbers as primitive classes.
  2. Represent EcoCyc geneAssociations as classes defined as intersections (AND) and unions (OR) of Blattner numbers.
  3. Represent iAF1261 gene associations as instances of a class defined as intersections (AND) and unions (OR) of Blattner numbers.
  4. Run the reasoner to see which iAF1261 reactions get classified as members of EcoCyc reaction classes.
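
The classification in step 4 can be approximated without a reasoner in one restricted case. The sketch below is a stand-in, not the project's description-logic code: it assumes that the classes are built only from primitive Blattner-number classes using intersection and union (no negation, no disjointness or other axioms). Under that assumption, one class is subsumed by another exactly when each of its AND-alternatives contains all the genes of some AND-alternative of the other. The names `subsumed_by` and `classify`, and the EcoCyc class IDs ECO-A and ECO-B, are hypothetical.

```python
def subsumed_by(specific, general):
    """True if every AND-alternative of `specific` entails one of `general`.

    Both arguments are sets of frozensets: each frozenset is one enzyme
    complex (AND), and the outer set lists the isoenzyme alternatives (OR).
    """
    return all(
        any(alt_general <= alt_specific for alt_general in general)
        for alt_specific in specific
    )


def classify(iaf_associations, ecocyc_classes):
    """Return {iAF1261 reaction: sorted EcoCyc classes it falls under}."""
    return {
        rxn: sorted(
            name for name, cls in ecocyc_classes.items() if subsumed_by(assoc, cls)
        )
        for rxn, assoc in iaf_associations.items()
    }


iaf = {
    "ACGAptspp": {
        frozenset({"B2417", "B1101", "B2415", "B2416"}),
        frozenset({"B0679", "B2415", "B2416"}),
    }
}
# Hypothetical EcoCyc classes, for illustration only.
ecocyc = {
    "ECO-A": {frozenset({"B2415", "B2416"})},
    "ECO-B": {frozenset({"B0679", "B2415", "B2416"})},
}
print(classify(iaf, ecocyc))  # {'ACGAptspp': ['ECO-A']}
```

ECO-B is rejected because the first ACGAptspp alternative does not include B0679, so not every way of catalyzing the iAF1261 reaction satisfies the ECO-B definition. A real reasoner would be needed as soon as the ontology contains anything beyond these plain intersections and unions.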

The code:

File Formats

Shows all genes that have been assigned to at least one reaction in iAF1261, but have not been assigned to any reaction in EcoCyc. Candidate EcoCyc reaction assignments can also be given.

The following files map at least one EcoCyc reaction to an iAF1261 reaction by gene name. This may also help to map some of the remaining unmapped compounds.

Results


We simplified our task by breaking down the unmapped reactions into increasingly fine disjoint sets. There were 4 categories of unmapped reactions:

  1. Unmapped Exchange reactions across the system boundary
  2. Unmapped Diffusion reactions (facilitated by membrane transport proteins)
  3. Unmapped reactions due to one or more unmapped compounds
  4. Unmapped reactions for other reasons

The results of this initial analysis are described below.


Full reaction set

Mapping by Gene Set

Reactions can be downloaded from Media:Map-by-gene-set.zip

  • iAF1261-ecocyc-rxn-mapping.txt: 2381
    • category0-rxns-w-genes.txt: 1919
      • category0-matched-rxns-by-gene-best.txt: 1814
      • category0-unmatched-rxns-w-gene.txt: 105
    • category0-rxns-wo-genes.txt: 462
      • category0-rxns-wo-genes-transport.txt: 66
      • category0-rxns-wo-genes-extracellular.txt: 305
      • category0-rxns-wo-genes-periplasmic.txt: 19
      • category0-rxns-wo-genes-cytoplasmic.txt: 73

Mapping by assigning EcoCyc genes of a reaction to iAF1261 gene predicates

The columns of this file are as follows:

  • abbreviation: UCSD id
  • officialName: IUPAC name
  • equation: UCSD equation
  • geneAssociation: UCSD gene association
  • ecocyc-rxn: EcoCyc reaction Frame
  • analysis:
    • :cpd-mapping: reaction was mapped by matching substrates of the reaction.
    • :gene-predicate-match: reaction was mapped by assigning EcoCyc genes of a reaction to the UCSD gene predicate.
  • ecocyc-equation: EcoCyc equation
  • unmatched-ecocyc-substrates: EcoCyc substrates of a reaction that do not match the UCSD reaction substrates. If the EcoCyc substrate has a map to a UCSD compound, it displays the UCSD ID (which is lower case)
  • unmatched-ucsd-substrates: UCSD substrates of a reaction that do not match the EcoCyc reaction substrates.
  • matched-substrates: EcoCyc substrates of a reaction that match UCSD substrates of the reaction
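
The three substrate columns can be read as simple set operations. The following Python sketch shows one way they could be computed; it is an assumption about the procedure rather than the code that produced the file, and the function name and the `ecocyc_to_ucsd` compound-mapping dictionary are hypothetical.

```python
def compare_substrates(ucsd_substrates, ecocyc_substrates, ecocyc_to_ucsd):
    """Return (matched, unmatched_ecocyc, unmatched_ucsd) as sorted lists.

    ucsd_substrates:   UCSD compound IDs of the iAF1261 reaction
    ecocyc_substrates: EcoCyc frame IDs of the EcoCyc reaction
    ecocyc_to_ucsd:    hypothetical dict mapping EcoCyc frame ID -> UCSD ID
    """
    ucsd = set(ucsd_substrates)
    matched, unmatched_ecocyc = [], []
    matched_ucsd = set()
    for frame in ecocyc_substrates:
        ucsd_id = ecocyc_to_ucsd.get(frame)
        if ucsd_id in ucsd:
            matched.append(frame)
            matched_ucsd.add(ucsd_id)
        else:
            # Show the UCSD ID when a compound mapping exists,
            # otherwise fall back to the EcoCyc frame ID.
            unmatched_ecocyc.append(ucsd_id if ucsd_id is not None else frame)
    unmatched_ucsd = sorted(ucsd - matched_ucsd)
    return sorted(matched), sorted(unmatched_ecocyc), unmatched_ucsd
```

An EcoCyc substrate that appears in unmatched-ecocyc-substrates under a lower-case UCSD ID therefore has a known compound mapping but is simply absent from the UCSD reaction, whereas one shown under its EcoCyc frame ID has no compound mapping at all.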


Media:gene-predicate-matches-no-cpd-match5.xls

  • abbreviation: UCSD ID
  • officialName: IUPAC name
  • formula: UCSD chemical formula
  • charge: UCSD charge
  • casNumber:
  • ecocyc-id: EcoCyc frame id
  • analysis:
    • (:MANUAL :PROTEIN-INSTANCE): 7
    • (:UNMATCHED :AS-OF-RELEASE-10.6): 94
    • (:MANUAL): 50
    • (:MANUAL :METACYC): 65
    • (:MANUAL :I-O-C): 140
    • (:MANUAL :DISPUTE): 4
    • (:MANUAL :POLYMER-SECTION): 1
  • metacyc-id: MetaCyc frame id
  • notes

Media:iAF1261-ecocyc-cpd-mappings.xls Media:ecocyc-class-hierarchy.lisp

Using Description Logics to map reactions

Unmapped Exchange reactions across the system boundary

Category 1 reactions are outside the scope of EcoCyc, but are necessary for flux balance analysis to work. In SBML terms, they would be considered species with boundaryCondition=True. In elementary-mode terms, they are the external metabolites.


Reactions can be downloaded from Media:Map-by-gene-set.zip

Unmapped Diffusion reactions (facilitated by membrane transport proteins)

Category 2 reactions are almost all catalyzed by the outer membrane proteins OmpC, OmpN, OmpF, and OmpE. In all, 219 of those reactions were mapped to an instance of the class RXN0-2481.


Reactions can be downloaded from Media:Map-by-gene-set.zip

Unmapped reactions due to one or more unmapped compounds


233 category 3 reactions were unmapped due to one or more unmapped compounds. Of those, 184 had at least one associated gene, and 49 had none.

Of the reactions that were not enzyme-catalyzed, 14 were transport reactions, 20 were cytoplasmic reactions, 9 were reactions that occurred entirely in the periplasm, and 5 occurred in the extracellular matrix.

Of the 184 reactions that were enzyme-catalyzed, 140 could be mapped to at least 1 EcoCyc reaction catalyzed by one or more of the same genes. The remaining 44 could not be mapped to a specific EcoCyc reaction, but could be mapped to an EcoCyc protein comment, which can help resolve the discrepancy.

Reactions can be downloaded from Media:Map-by-gene-set.zip

Unmapped reactions for other reasons

Reactions can be downloaded from Media:Map-by-gene-set.zip

Substrates of category 4 category4 reactions were mapped to EcoCyc compounds, but the reactions themselves could not be mapped to an EcoCyc reaction. Of these reactions, category4.genes had at least 1 associated gene, and category4.nogenes had none.


Of the reactions that were not enzyme-catalyzed, category4.nogene.transport were transport reactions, category4.nogene.secretion were secretion reactions, category4.nogene.cytoplasmic were cytoplasmic reactions, category4.nogene.periplasmic were reactions that occurred entirely in the periplasm, and category4.nogene.extracellular occurred in the extracellular matrix.

Of the category4.gene reactions that were enzyme-catalyzed, category4.gene.rxns=416 could be mapped to at least 1 EcoCyc reaction catalyzed by one or more of the same genes. category4.gene.comments could not be mapped to a specific EcoCyc reaction, but could be mapped to an EcoCyc protein comment, which can help resolve the discrepancy.