Debugging the bug
Developing complete and consistent models of metabolism for bacterial organisms.
The initial goal of this project is to integrate two highly curated E. coli resources, EcoCyc and iJR904, developed by SRI and the Palsson lab, respectively. We would like this project page to be a repository for discussions, code developed, tools and techniques, and links to useful resources.
Benefits of Integration
Although requiring significant upfront effort, each group stands to gain from an integrated model. One of the primary goals of Biocyc is to "support computational studies of the metabolism, such as design of novel biochemical pathways for biotechnology, studies of the evolution of metabolic pathways, and simulation of metabolic pathways." By integrating with iJR904, this goal will substantially be achieved by enabling simulation of the metabolic pathways described in EcoCyc. In addition, constraint-based models of metabolism act as a rigorous test of the underlying data and data representations, thereby improving the quality of curation efforts.
Similarly, one of the goals of the Palsson group is "building organism specific models [and] providing them to the community". By integrating with EcoCyc, users of the iJR904 model will easily be able to check individual genes, reactions, and metabolites against the literature and compare it with other pathway databases for consistency. In addition, future development of the iJR904 model can directly benefit from the full-time curators of the EcoCyc database without duplication of effort.
One of the goals of the Computational & Mathematical Lab is to perform genome scale thermodynamic analysis of E. coli metabolism. By resolving the metabolites in iJR904 with the atomic structures Biocyc Open Chemical database, the group contribution method to calculate Gibbs free energies can be greatly improved.
Finally, one the goals of the Church lab is to develop a bioinformatics pipeline to go from annotated genomes to metabolic flux models. By integrating iJR904 with EcoCyc, it will significantly increase the rate at which new metabolic flux models can be generated. In addition, by standardizing the representation of the FBA model, additional network debugging, analysis and visualization tools can be brought to bear on the problem.
- Markus and Ingrid (SRI) - creating a mapping for all of the compounds and reactions in the Palsson lab model to the EocCyc database. From here, they will generate a list of reactions, compounds and genes that are in EcoCyc and not the Palsson lab model and vice versa.
- Chris Henry (Computational & Mathematical Biotechnology Laboratory at Northwestern University) - tabulating the delta G values for new compounds and reactions in the model. From here, we will decide on what thermodynamic analysis we want to include in the model.
- Adam Feist and Jennie Reed (SBRG, UCSD) - currently going through a very short list of new genes to add to the model and identifying gaps in the model.
- Gavin Thomas and Jeremy Zucker (York University and Church lab) Put together a document with all the feedback on the model from our mapping and model building in Buchnera.
- Alan Ruttenberg and Jeremy Zucker (BioPAX) Represent the two data sets in OWL and use automated reasoners to perform the integration and identify inconsistencies.
- Bala Ganesan (USU for compound mapping)
- Ian Paulsen (TIGR) Will take a look at the transport reactions with genes and without genes.
The following section describes how Alan and Jeremy used automated reasoners to map compounds and reactions between the Simpheny database and EcoCyc
To identify inconsistencies in the iJR904 E coli compounds, we first assumed functional and inverse functional property restrictions on cas Number and Kegg compound ID. In other words, cas Number and Kegg compound ID's uniquely identify a compound (inverse functional) and every compound has a unique cas ID and Kegg compound ID (functional).
In the first version we assumed that chemical formula was inverse functional. to see how that assumption breaks down. We removed that in the run described here.
Here's what the OWL ontology looks like in the OWL abstract syntax:
Class(a:databaseEntry partial) DatatypeProperty(a:hasID Functional domain(a:databaseEntry) range(xsd:string) ) Class(a:casEntry partial) Class(a:keggEntry partial) SubClassOf(a:casEntry a:databaseEntry) SubClassOf(a:keggEntry a:databaseEntry) Class(a:chemicalStructureDescription partial) Class(a:chemicalFormula partial) SubClassOf(a:chemicalStructureDescription a:chemicalFormula) DatatypeProperty(a:hasFormula Functional domain(a:chemicalFormula) range(xsd:string) ) DatatypeProperty(a:hasCharge Functional domain(a:chemicalFormula) range(xsd:int) ) Class(a:smallMolecule partial) ObjectProperty(a:hasChemicalFormula domain(a:smallMolecule) range(a:chemicalFormula) ) ObjectProperty(a:hasCAS InverseFunctional domain(a:smallMolecule) range(a:casEntry)) ObjectProperty(a:hasCAS Functional) ObjectProperty(a:hasKEGG InverseFunctional domain(a:smallMolecule) range(a:keggEntry)) ObjectProperty(a:hasKEGG Functional) DatatypeProperty(a:hasName domain(a:smallMolecule) range(xsd:string) )
Given these definitions, we now translate the source file into instances. For example, the spreadsheet line:
258 23dhb No "2,3-Dihydroxybenzoate" C7H5O4 -1 303-38-8 C00196 KEGG - C00196 Neutral MF = C7H6O4 SMP CHS
Is translated into the following 5 individuals: An individual for the CAS number and KEGG numbers. Two individuals for the charge -1, and charge 0 formulas, and finally the individual for the small molecule.
Individual(<http://www.biopax.org/xref/cas*303-38-8> type(a:casEntry) value(a:hasID "303-38-8"^^xsd:string)) Individual(<http://www.biopax.org/xref/kegg*c00196> type(a:keggEntry) value(a:hasID "C00196"^^xsd:string)) Individual(<http://www.biopax.org/xref/formula*c7h5o4*-1> type(a:chemicalFormula) value(a:hasFormula "C7H5O4"^^xsd:string) value(a:hasCharge "-1"^^xsd:int)) Individual(<http://www.biopax.org/xref/formula*c7h6o4*0> type(a:chemicalFormula) value(a:hasFormula "C7H6O4"^^xsd:string) value(a:hasCharge "0"^^xsd:int)) Individual(<http://gcrg.ucsd.edu/ijr904/258> type(a:smallMolecule) value(a:hasKEGG <http://www.biopax.org/xref/kegg*c00196>) value(a:hasCAS <http://www.biopax.org/xref/cas*303-38-8>) value(a:hasName "_2,3-Dihydroxybenzoate_"^^xsd:string) )
Notes: In OWL, inverse functional properties can only be asserted on individuals. So we create individuals for each of the things we may use in this way. We want each individual corresponding to a CAS number to be the same individual, so we ensure that the URL for it is the same. For CAS 303-38-8 we generate a URL of http://www.biopax.org/xref/cas*303-38-8
Once the whole spreadsheet is translated, we load it into Pellet, and ask if it is consistent. The answer is no. For example, we get:
Individual http://www.biopax.org/xref/kegg*c02917 has more than one value for the functional property http://karma.med.harvard.edu/wiki/Debugging_the_bug/merge.owl#hasID
To debug this we look at the source for the item which has kegg C02917
Individual(<http://www.biopax.org/xref/cas*4254-14-2> type(a:casEntry) value(a:hasID "4254-14-2"^^xsd:string)) Individual(<http://www.biopax.org/xref/kegg*c02917> type(a:keggEntry) value(a:hasID "C02917"^^xsd:string)) Individual(<http://www.biopax.org/xref/formula*c3h8o2*0> type(a:chemicalFormula) value(a:hasFormula "C3H8O2"^^xsd:string) value(a:hasCharge "0"^^xsd:int)) Individual(<http://gcrg.ucsd.edu/ijr904/1889> type(a:smallMolecule) value(a:hasKEGG <http://www.biopax.org/xref/kegg*c02917>) value(a:hasCAS <http://www.biopax.org/xref/cas*4254-14-2>) value(a:hasName "_(S)-Propane-1,2-diol_"^^xsd:string) )
I search for the cas number 4254-14-2 and find another entry with the same CAS number
Individual(<http://www.biopax.org/xref/cas*4254-14-2> type(a:casEntry) value(a:hasID "4254-14-2"^^xsd:string)) Individual(<http://www.biopax.org/xref/kegg*c02912> type(a:keggEntry) value(a:hasID "C02912"^^xsd:string)) Individual(<http://www.biopax.org/xref/formula*c3h8o2*0> type(a:chemicalFormula) value(a:hasFormula "C3H8O2"^^xsd:string) value(a:hasCharge "0"^^xsd:int)) Individual(<http://gcrg.ucsd.edu/ijr904/2181> type(a:smallMolecule) value(a:hasKEGG <http://www.biopax.org/xref/kegg*c02912>) value(a:hasCAS <http://www.biopax.org/xref/cas*4254-14-2>) value(a:hasName "_(R)-Propane-1,2-diol_"^^xsd:string) )
(R)-Propane-1,2-diol and (S)-Propane-1,2-diol both are give the same CAS number. After some googling it appears that the CAS is for (R)-Propane-1,2-diol, and (S)-Propane-1,2-diol's CAS number is 4254-15-3. In this case the inconsistency has pointed out an error in the source data.
After reviewing a number of the inconsistencies, some of them boil down to the fact that KEGG compound ids sometimes denote classes of molecules, generics. For example, in the source file, six compounds are assigned KEGG ID C03892: pgp181, pgp180, pgp160, pgp141, pgp140, pgp120. We need to properly model this, but for the moment we do a workaround, described in the next section.
The rest of the inconsistencies are because of the source data having errors. In order to manage this, we create a patching mechanism. We need to get the input data to be consistent, at least, before we merge it. Garbage in, garbage out. So we create a patch file. Other than comments (lines starting with a #), these are pairs of lines, with the first being the original spreadsheet line and the second being the replacement. For the (S)-Propane-1,2-diol case, for example it looks like:
# # Repair CAS. 4254-14-2 => 4254-15-3 # 1889 12ppd-S No "(S)-Propane-1,2-diol" C3H8O2 0 4254-14-2 C02917 KEGG - C02917 SMP CHS 1889 12ppd-S No "(S)-Propane-1,2-diol" C3H8O2 0 4254-15-3 C02917 KEGG - C02917 SMP CHS
The full patch file is here
Note that the ability to patch the file before conversion to OWL is a luxury. In the usual case we will be receiving a BioPAX file and need to fix it after the fact. Ideally BioPAX will define a mechanism for sharing and applying such patches so that we can share tools for managing them. One possibility is to use mechanism that SWOOP has, and we will investigate that as a candidate.
Later we'll propose how to model the generics. In order to do that we need to know which compounds are generics. To start getting this information we parse the compound file from KEGG and generate two files:
compound-generics, computed to be any compound that has a formula which contains the string "R" or the string ")n". All of these are generics.
compound-no-formula computed to be any compound that doesn't have any formula. Most of these are generics or markers saying the compound was transferred to the drug database. Consider later parsing drug and snapping those pointers.
Now, when generating the OWL abstract syntax, we don't add the hasKEGG property for compounds which are generics. A log file listing these cases is FBA_model-1-12-06-compounds.log
Mapping Gene Predicates
Compounds as classes
Given the ambiguity due to differing levels of description for compounds, we felt it was safer (and saner) to represent compounds as classes, rather than as individuals. To perform the unification, we converted the manually curated EcoCyc-iJR904 compound mappings performed by Ingrid Kesseler and Markus Krummenacker to the FaCT++ lisp syntax.
Representing Reactions with QCRs
Based on advice received from Alan Rector at the University of Manchester, we decided to represent reactions using Qualified Cardinality Restrictions (QCRs). Although QCR's have been approved for inclusion into OWL 1.1, none of the publicly available reasoners except for FaCT++ are capable of reasoning with QCRs. A QCR representation of a reaction looks like this:
(equal_c |http://www.ecocyc.org/reactions/phosphoglucmut-rxn| (and (atmost 1 left) (atleast 1 left |http://www.ecocyc.org/compounds/glc-1-p|) (atmost 1 right) (atleast 1 right |http://www.ecocyc.org/compounds/alpha-glc-6-p|)))
Note that using exactly instead (atmost ...) and (atleast ...) would have been terser, but for some reason, the FaCT++ reasoner does not consider these to be equivalent restrictions.
Ignoring compartment and hydrogen balance
Because reactions are not balanced in EcoCyc, no attempt was made to unify the reactions based on charge or proton balance. Similarly, we chose to ignore compartment for the purpose of integration. Below are EcoCyc reactions and iJR904 reactions. We also Debugging_the_bug/EcoCyc-iJR904-reaction-map#EcoCyc-reversed reversed the direction of EcoCyc reactions in order to find additional matches.
Mapping Compounds by Xref
The goal of this exercise was to learn how to use OWL reasoners to automagically discover inconsistencies in the data based on constraints we believed to hold true.
OWL files, patches, and matches
With a workaround for generic compounds, and the patches file repairing the errors in the source data, we finally have an OWL file that is consistent: FBA_model-1-12-06-compounds.absowl . Review the FBA_model-1-12-06-compounds.patches to see all the bugs we caught in the source data.
Note that the patches for both these files also include inconsistencies that were found when the files were both loaded into the same ontology. These are marked as such in the comments in those files.
Also note that the class and property definition differ slightly from what is described in the exposition section above, which needs to be updated to the latest.
The combined OWL file is combined-2006-02-22.absowl
The pairs of molecules considered the same by the reasoner are here (tab delimited, two columns) sames-2006-02-22.txt (240 of them)
A list of compounds that are probably the same because their names are similar, stated as sameIndividual assertions is sames-by-name-2006-02-22.absowl. After being reviewed by a curator, this would be considered a mapping file, imported into the combined ontology. (150 of them)
Did the sames-by-name import. Patches have been updated as has sames-by-name-2006-02-23.absowl since one equivalence was due to a typo in a name. The new combined OWL file is combined-2006-02-23.absowl. There are 384 mapped compounds in this version. 12 more inconsistencies were found and patched as a result of this process.
Next steps: Review manual mapping and see where further opportunities for making correspondences are. Start to translate reactions and get matches between them.
Bugs found in Buchnera
Buchnera aphidocola is an endosymbiont of the pea Aphid. Its main claim to fame is that its genome is a proper subset of E. coli. As a result, time spent manually checking an integrated model of Buchnera will directly benefit attempts to merge EcoCyc and iJR904. Below is a list of the bugs found so far:
- apoACP is the same as ACP in EcoCyc, but they are different compounds with different chemical formulas in iJR904
- No biochemical evidence for the AHC reaction in E. coli. It uses the activated methyl cycle to recycle adenosylmethionine using the pfs and luxS gene products (PMID: 11932438)
- FARNESYL-PP in EcoCyc has two CAS numbers, CAS:146-14-5 , CAS:13058-04-3. However, CAS:146-14-5 is the CAS number for FAD. Found by Alan Ruttenberg
Mapping compounds by chemical structure
Mapping Reactions by Gene predicates
Mapping Reactions by Gene Set
Of the 50 we were assigned, we were able to automatically map 25 of them. These are listed in 50.xls We also went ahead and mapped 210 additional reactions. These are in all-hits.xls
The others had specific problems that we've categorized:
Generic compounds in reactions
This was by far the main issue. The pattern is as follows: a reaction from EcoCyc contains a generic compound (with an R group). a reaction from iJR904 contains a specific compound. The generic compound subsumes the specific compound. This was the case for the following clusters: 56 57 60 65 75 78 83 88 91 92 95 98 99 100
We have added the compound taxonomy to our classification system so we can now do the subsumption for many of these. I'll send this out in a follow-up email.
. The following lipopolysaccharide compounds in iJR904 contain comments such as: Ecoli scaled up (50x) of phosphatidylserine with an average 2 fatty acid chain (C1686H3148O100) Checked N JLR 12dgr_EC, cdpdag_EC, cpe_EC, cpg_EC, pa_EC, pe_EC, pg_EC, pgp_EC, ps_EC Reactions that these compounds participate in have fractional stoichiometries:
Orphan enzymes/no evidence for reactions
An interesting case is where a reaction appears in iJR904 that is not catalyzed by any gene product, is not corroborated with any other E. coli database, and for which no citation or reference is given.
In a few cases, a reaction by Palsson describes the overall stoichiometry of several Biocyc reactions.
weirdness in Biocyc representations.
iJR904 genes have additional functions that are not listed in EcoCyc 9.5
- Number of reactions: 2382