User:JeremyZucker

From FreeBio

 Is there something on my chin?

Contents

Background and Interests

  • Genomics: I've been researching genome-scale computational modeling of metabolic networks for the last 4 years.
  • Computing: My undergraduate degrees were in Computer Science and Applied Mathematics, with a minor in Electrical Engineering.
  • Economics: I took a Financial Decision Theory course at MIT. I co-founded a small hedge fund from 1999-2000. Read the Economist weekly.
  • Society: Essay tutor for undergraduate courses in Gender studies, Political Science, Religious Studies.

Goals and Project ideas

One potential outcome of this course is to develop a biotechnology solution to a core socio-economic problem facing the world. I was blown away by the article on Widescale Biodiesel Production from Algae. The beauty of the article is that all the usual criticisms of alternative energy were addressed up-front and convincingly in a well-presented engineering case study. I am very interested in understanding the issues behind the few remaining technical hurdles and seeing if our combined expertise can resolve them. It seems that the economic case for Biodiesel could be fleshed out using the heuristic back-of-the envelope arguments in the article as the basis for a more rigorous model. Furthermore, the environmental impact of a large-scale deployment of Biodiesel could be examined more closely, if any climate models of carbon emissions are freely available. The metabolic engineering of algae is of great interest to me, and according to the DOE Aquatic Species Program, one of the main obstacles to widescale deployment is the high cost of algae production. The final project for this course could be the development of a model that predicts radical increases in production rates using metabolic flux analysis optimization techniques. Of course, in order to do this, we need to develop a model through a metabolic reconstruction, and then carefully debug it through iterative refinement.

Biomass Oil Analysis: Research Needs and Recommendations

The material on research needs and applications has been copied to its own page. Further development of the area should be made there, not here.

June 2004 NREL/TP-510-34796

Since there is a large existing infrastructure for obtaining biodiesel from vegetable oils in the US and Canada, the focus of the report was on how to scale it up to make it competitive with petrodiesel. This can be accomplished by utilizing current petrodiesel distillation technologies to distill the biomass oils, reducing the costs of oil extraction by improving technology, and finding industrial uses for byproducts such as glycerol and glycerine.

The section that was most relevant to our discussion has been reproduced in Closed-loop micro-organism production systems below.

Executive Summary

US Department of Energy (DOE) Office of Energy Efficience and Renewable Energy (EERE) Office of the Biomass Program (OBP)

Goals:

  • Dramatically reduce, or even end, dependence of foreign oil;
  • Spur the creation of a domestic bioindustry

Outcomes:

  1. Establish commercial biorefinery technology by 2010
  2. Commercialize at least four biobased products

Findings:

  • The technology and infrastructure used to make biodiesel can also be used to make various biomass-oil products that can replace comparable products made from petroleum.
  • Biomass-oil production results in a number of coproducts, most notably, glycerin and meals (e.g. corn meal). Investing in new applications for coproducts could make biomass-oil production more profitable, and thus more competitive with petroleum.
  • Processing of biomass-oil feedstocks (canola, etc.) into biodiesel is already highly efficient. It thus is an area of low-priority for research.
  • On the other hand, reducing the cost of feedstocks (such as by increasing their concentration of lipids or by decreasing the cost of sowing and harvesting them) is a worthwhile area for federal investment.
  • The number-one area for improvement is in bio-distillation, or the conversion of biomass oils to hydrocarbons.



Recommendations:

  1. Demonstrate and optimize commercial bio-distillate production (industrial partnership)
  2. Demonstrate and optimize CO2 oil extraction technology (program R&D and/or solicitation)
  3. Develop and optimized fixed base and acid-base esterification catalysts that reduce glycerine refining costs program R&D or soliciitation)
  4. Support industry development of coproducts from glycerol or glycerine (solicitation)
  5. Support industry development of industrial products from meals (solicitation)
  6. Increase oil supplies by developing closed-loop microorganisms production systems (program and solicitation)

Closed-loop micro-organism production systems

Yeasts, molds, fungi, and bacteria can be genetically optimized and used to produce oils in closed manufacturing systems using inexpensive biomass substrates such as crop residues, wood wastes, MSW biomass, ore even pyrolysis oils. The non-oil portions of these organisms can be recycled back into production systems, making them truly closed looped. These organisms offer a couple of key benefits compared to the previous EERE micro algae prgram--major land resources nad water resources are not required and the genetically modified organisms are not exposed to the open environment, wildlife, or accidental release. In addition, many of these organisms do not require sunlight for photosynthesis.

Since closed loop production of micro organisms resembles manufacturing rather than agriculture, it is one feedstock supply research role that might be best suited to DOE. Particularly since DOE has already invested research in some of these areas in the past and has significant body of knowledge to start from. Some inexpensive stage gate analysis and solicitations could be undertaken in the near term to collect information and assess possible pathways for closed loop production of microorganisms. This will lay a foundation for program elements when they become necessary. If these early analyses reveal major benefits (significant oil supplies at exceptionally low costs) then the priority of this program element can be raised and research accelerated.

Yeasts Fat Content % Molds Fat Content %
Candida lipolytica 36 Eutomophtora virulenta 26
Trichosporum cutaneum 45 Aspergillus flavus 28
Candida curvata 58 Pytium ultimum 49
Lipomycens lipferus 63 Fusarium bulbigenum 50
Endomyces vernalis 65 Aspergillus fischeri 53
Rhodotorula glutinis 71 Penicillium lilacinum 56
Mucor circinecelloides 65

Oil Content of Yeasts and Molds

There are a number of novel feedstocks that do not require land--yeasts, bacteria, fungi, and molds. Yeasts and molds can contain up to 70% lipids. Arthrobacter AK 19, a soil bacterium, can produce up to 78.3% of its cell dry matter as oil using short chain hydrocarbons (similar to what is found in pyrolysis oils) as substrates. The presence of lipases (fat splitting enzymes) results in various combinations of mono, di-, and triglycerides and free fatty acids in these oils. The oils contain mostly oleic and palmitic acids similar to other vegetable oils. Fat producing organisms are grown in a carbohydrate rich environment. The fat producing portion of their growth cycle is triggered by depriving them of nutrients, typically nitrogent-based nutrients. Until that point, nitrogen based fertilizer is rquired to promote population growth. Microorganisms are regarded as fat-producers if they contain ATP that enables themto quickly produce acetyl coenzyme A. Conversion rates usign glucose feedstocks will rarely exceed 25% (glucose to fat). The key to using microorganisms for oil production is to identify low cost feedstocks for the microorganisms and optimize their conversion efficiency (italics mine).

Most testing has used glucose and other expensive substrates, which limit economics. Other sugars and cellulosic feedstocks have not been explored in as much detail. Microorganisms utilizing biomass substrates available for $50/ton (2.5 cents/pound) may produce a pound of oil for as little as 10 cents per pound (feedstock cost only) if the entire biomass could be converted. More realistically, a microorganisms may only feed on select fractions of biomass substrates. However, if oil could be extracted without leaving poisonous residues (super critical CO2 extraction, for example), then the substrates could be inoculated again with different types of organisms better suited the remaining portions of the biomass. In a slightly different vein, the lignocellulosic wastes from biomass ethanol production may be suitable substrates for microorganism oil production, eliminating or reducing drying costs. Other food, paper, and animal processing waste streams may be suitable for this approach.

Most microorganisms have not been optimized for fat production, although the EERE invested significant resources in a micro algae program from 1977 through 1996. Micro algae production suffered from a number of failings. First and foremost, the micro algae was grown to produce fat and a byproduct animal feed. The higher the fraction of fat, the higher percentage fo total production costs allocated to the fat. Thus, it was not clear that fat production costs could be minimized through this system, but perhaps oil supplies could be expanded. Internal NREL reports projected that fat production costs from microalgae would be on par with the cost of producing soybean oil. Second, large amounts of land and water are required for these phtosynthesis dependent systems. Some researchers postulated that large tracts of the American Southwest Desert could be converted into raceway systems using underlying brackish water. The environmental impacts of these systems were never fully considered. These open-air systems wuld contain genetically modified micro algae; vunerable to infestation and crosses with wild algae, exposure to migratory bird populations and flash flood events. Last but not least, these systems would need to be located in close proximity to coal fired power plants for easy access to CO2 (to enhance growth) and other pollutants that could be used as nutrients. Insufficient flat land with cheap water (the West has very little "cheap" water), near power plants was readily accessible for these systems.

The concept of using molds, yeasts, fungi and bacteria offers some of the same challenges that micro algae systems posed (nutrients, costs, microorganisms, and system optimization), but do not require large amounts of water (using water recycling) or land. In fact, using genetically modified organisms in a self contained environment offers minimal public safety risks. The carbohydrate fraction of the organisms can even be recycled back into the system to prevent release and to provide key nutrients for growth. Because these organisms are easy to modify, a number of high value chemicals could be coproduced in these systems, hopefully producign enough profit to make future investments in these systems viable.

The research goal for these systems should not be to produce oil at the same cost as petroleum diesel or less (too heroic) but to produce oils at a cost less than the cost of producing soy oil. These opportunities should be explored in more detail.

Metabolic Engineering tutorial

Introduction


From the perspective of a metabolic engineer, single-cell organisms are tiny self-reproducing biochemical reaction systems that act on their own behalf. The goal of metabolic engineering is to get those biochemical reaction systems to act on our behalf.

According to Gregory Stephanopoulos, author of Metabolic Engineering,

 Metabolic Engineering  The rational redesign - 
 typically using genetic engineering - of organisms to 
 meet commercial objectives, by the modification of 
 existing - or the introduction of entirely new - 
 metabolic pathways. 


One of the basic concepts in metabolic engineering is that of the metabolic flux, which can be thought of as the rate at which a particular metabolite pool is produced or consumed.

Metabolic flux analysis(MFA) is a methodology for determining important cellular physiological characteristics from the network stoichiometry and constraints such as thermodynamics, biomass composition, and nutrient media. It differs from other modeling methodologies in that it does not require kinetic parameters, which are typically difficult to measure and unavailable or unreliable for most biochemical reactions.

The first step in metabolic flux analysis is performing a metabolic reconstruction.

Once a model has been constructed, the next step is to debug it.

Once a model has been debugged, MFA can provide important insights into the cell physiological characteristics, such as:

What is a metabolic flux?

Imagine a series of pools connected by waterfalls such as in the following picture:  What is a metabolic flux?

A few observations about this picture:

  • The direction of flow between two connected pools is determined by gravity
  • For each pool of water, the total amount of water flowing into it is equal to the total amount of water flowing out of it.
  • The size of each pool is independent of the amount of water flowing into and out of the pool.

Water flow is a good analogy for a metabolic flux. The size of the pool represents the concentration of a particular metabolite. Each waterfall connecting the pools represents a biochemical reaction. The direction of flow represents a thermodynamic irreversibility constraint on the reaction. Lastly, the metabolite concentration is independent of the size of the fluxes that consume and produce that metabolite.

To be more mathematically precise, consider the following biochemical reactions:

Failed to parse (unknown error): A \rightarrow B ,

At any given time, the rate of formation of Failed to parse (unknown error): B

is equal to the rate of consumption of A, and can be called the rate of reaction, or flux, v

Failed to parse (unknown error): v = -\frac{d[A]}{dt} = \frac{d[B]}{dt}


For reactions of the form

Failed to parse (unknown error): 2 B \rightarrow C + F


the flux is defined as

Failed to parse (unknown error): v = -\frac{1}{2}\frac{d[B]}{dt} = \frac{d[C]}{dt} = \frac{d[F]}{dt}


Nonlinear dynamics reduce to linear fluxes in steady state

Nonlinear dynamics reduce to linear fluxes in steady state

Initially, the introduction of Aext into the figure above causes transient dynamics, which can be modelled using Michaelis-Menten assumptions:

  1. A_{ext} + P_1 \leftrightarrow P_1 A_{ext} \rightarrow A + P_1
  1. Failed to parse (unknown error): A + P_2 \leftrightarrow P_2 A \rightarrow B + P_2


  1. Failed to parse (unknown error): A + P_3 \leftrightarrow P_3 A \rightarrow C + P_3


This leads to the nonlinear dynamical equations below:

\frac{d[A]}{dt} = k_1\frac{[A_{ext}][P_1]}{[A_{ext}] + K_{m,1}} - k_2\frac{[B][P_2]}{[B]+K_{m,2}} - k_3\frac{[C][P_3]}{[C]+K_{m,3}}


However, as time progresses the nonlinear dynamics reach steady state and the system can be approximated with linear fluxes.

Failed to parse (unknown error): \frac{d[A]}{dt} = v_1 - v_2 - v_3 = 0


What is metabolic flux analysis?

Metabolic Flux Analysis (MFA) is a methodology for determining important cellular physiological characteristics from the network stoichiometry, and additional constraints, such as thermodynamics, biomass composition, and nutrient media.

For example, suppose we have the following biochemical reaction network:

What is metabolic flux analysis?


Starting with the metabolic network of an organism, one can divide all the metabolic fluxes into several categories.

  • Uptake fluxes- those fluxes whose net flow is into the cell.

Failed to parse (unknown error): R1: A_{ext} \rightarrow A


Failed to parse (unknown error): R5: E_{ext} \rightarrow E


  • Intracellular fluxes - those fluxes that interconvert between intracellular metabolites

R2: A  \rightarrow  B

Failed to parse (unknown error): R3: A \rightarrow C


Failed to parse (unknown error): R4: B + E \rightarrow 2D


Failed to parse (unknown error): R6: 2 B \rightarrow C + F


R7: C \rightarrow D

  • Export fluxes - those fluxes whose net flow is out of the cell

R8: D \rightarrow D_{ext}

R9: F \rightarrow F_{ext}

This graphical representation can also be represented mathematically as an Mxn matrix where M is the number of metabolites, n is the number of reactions, and the coefficient Failed to parse (unknown error): s_{ij}

is the stoichiometry of the ith metabolite in the Failed to parse (unknown error): j

th reaction. such as the one below:

 Stoichiometric representation

The steady-state assumption states that for each metabolite the sum of the fluxes producing that metabolite is equal to the sum of the fluxes consuming that metabolite. Representing this statement mathematically, \sum_j^n s_{ij}v_j = 0, or in matrix notation, S \cdot \vec \boldsymbol{v} = 0. For our example network, the equations are represented explicitly:

 Stoichiometric representation: more unknowns than equations

Using elementary modes to analyze the feasible space

In our example, note that the stoichiometric matrix is underconstrained because number of columns (unknowns) is greater than the number of rows (equations). Therefore, there will be many flux distributions that satisfy the stoichiometric constraints. If all reactions were reversible, then the flux distributions could be described by any set of linearly independent vectors that span the null space of the stoichiometric matrix. However, thermodynamic irreversibility conditions further constrain the set of flux distributions to a convex subset of the null space known as the feasible space. This means we can only describe our feasible space with positive linear combinations of flux distributions. Furthermore, because each flux represents an enzyme, we would like it if each independent flux distribution contained the smallest number of enzymes that can still achieve a valid steady state. These conditions together describe the set of elementary modes. The set of elementary modes for the example metabolic network are shown below:  Using elementary modes to analyze the feasible space.

Note that removing any enzyme from an elementary mode will result in a flux distribution that is incapable of achieving steady state. In our example metabolic network, we have three independent elementary modes. Furthermore, every valid steady state can be described by a positive linear combination of these elementary modes.

How to predict the wild-type flux distribution

Of all the valid steady state flux distributions in the feasible space, which one does the organism actually choose? If the organism is under selective pressure to grow as fast as possible under a specific nutrient condition, then this optimal growth rate can actually be calculated using Flux Balance Analysis (FBA). To do this, we need two more pieces of information. The first is the set of nutrient conditions. This is represented by maximum uptake rates on the transport reactions. In our example metabolic network, this is represented by a maximum uptake rate of 10 units per hour on flux Failed to parse (unknown error): v_1

to take metabolite A from the outside into the cell. 

The second piece of information is to identify the objective. In general, the cellular objective is to produce another copy of itself. This objective can be approximated by measuring the relative proportions of essential compounds necessary to produce one unit of biomass. The relative proportions are represented as fractional stoichiometries of a biomass reaction where the reactants are the essential compounds, and the product is 1 unit of biomass. For our example metabolic network, the objective is to maximize the production of D. We can see that a maximum uptake rate of 10 units of A per hour results in an organism that is capable of producing metabolite D at 20 units per hour.

Optimal wild-type flux distribution


Note that even if the maximum uptake rate of flux v5 was greater than 10, the stoichiometric constraints on the network prevent metabolite E from being imported into the cell any faster than metabolite A.

How to predict a Mutant flux distributions

Predicting a mutant flux distribution is tricky. There is no reason to suppose that a mutant organism is optimizing its biomass growth.

 Suboptimal flux distribution

Keeping that caveat in mind, we now explore the OptKnock bilevel optimization program described in Costas Maranas's OptKnock paper

 Maximize lipid_production_flux over (boolean vector of gene_knockouts)
 Subject to: (
      Maximize cellular_objective over (real vector of fluxes)
      Subject to: (
          ( 0 <= fluxes ) and
          (fluxes <= real vector of upper_bounds ) and
          (stoichiometric matrix of MbarCyc ) * fluxes == 0 )
      and sum of gene_knockouts <= limit.

Using Opt-knock, we use bi-level optimization to couple the organism's imperative to grow with our own desire to maximize the yield of a particular product. In our example, assume that D is the biomass growth, and that F is the desired product. What genes should we knock out in order to maximize the yield of F? One answer is shown below:

 Opt-knock mutant flux distribution

However, we are really interested in total production rate which is equal to the product of growth (units of biomass per unit time) and yield (ratio of output flux to input flux). Note that if we had control over the relative flux constraints, we could do better by sacrificing a bit of yield to get a better growth rate, leading to an overall higher production rate.

In the above example, by knocking out v3 and Failed to parse (unknown error): v_4 , we achieve a yield of Failed to parse (unknown error): v_9/v_1 = 1/2 . The total production is 5 biomass units per hour * Failed to parse (unknown error): * (5 biomass units ) = 25 units of F per hour. Working backwards, we can ask what is the optimal flux distribution between Failed to parse (unknown error): v_4

and Failed to parse (unknown error): v_6
that would produce the maximum yield of F.

Let Failed to parse (unknown error): x

be the amount of Failed to parse (unknown error): v_2
that is diverted to flux Failed to parse (unknown error): v_6

. Then, v9 = x / 2, and Failed to parse (unknown error): v_8 = 2(v_2 - x) + x/2 . We want to maximize the total production of F, which is defined as the total yield of F times the rate of biomass growth which is given by v9 / v1 * v8 = x / 2(2(v2x) + x / 2). Taking the derivative and setting it to zero results in the following reduction:

\frac{d}{dx}(v_2 x -x^2 + x^2/4) = 0

Failed to parse (unknown error): v_2 - \frac{3}{2}x = 0


Failed to parse (unknown error): x = \frac{2}{3}v_2


Thus,

v_9 = \frac{v_2}{3}

and

v_8 = 2(v_2 - \frac{2}{3}v_2) + \frac{v_2}{3} = v_2

Thus, for an uptake rate of 10 units per hour, then F will be produced at v9 = 10 / 3 = 3.33 units per hour, but its total yield will be Failed to parse (unknown error): v_8 * v_9 = 10*10/3 = 33.33

units per hour.  Understanding the

[MISSING]

Definitions of complexity & randomness


We want a metric that helps us measure complexity and randomness in a way that is congruent with our intuition of complex systems. Some examples of real-world test cases:

  • Autonomous agents (systems that act on their own behalf) should have a high measure
  • Stock markets should have a high measure.
  • Natural language should have a high measure
  • Single-celled organisms should have a high measure
  • Multicellular organisms should have a higher measure
  • Some large-scale software systems should have a high measure
  • Ecosystems should have a high measure
  • Internet traffic should have a high measure.
  • Ideal gases should have a low measure
  • Perfect crystals should have a low measure.
  • White noise should have a low measure
  • A constant pitch tone should have a low measure
  • Stephen Wolfram's Rule 110 should have a low measure
  • Stephen Wolfram's Rule 30 should have a lower measure.
  • irreducible, structureless, incompressible strings should have a low measure
  • repeated patterns of irreducible, structureless, incompressible strings should have a higher measure.
  • Cryptographically secure pseudo-random number generators should have a low measure
  • The digits of pi should have a low measure
  • Quantum randomness should have a low measure
  • Output produced from the logistic equation should have a low measure.


The problem with defining complexity and randomness is that they both already have very precise technical definitions, relating them to algorithmic information theory and Godel's Incompleteness Theorem in surprising and insightful ways.

GregoryChaitin defines the complexity of something as "the size of the smallest program which computes it or a complete description of it"

Irreducible randomness is therefore defined as "those binary strings of length n that require programs of about length n" to compute.

In the complexity measure defined by Crutchfield and Young, they attempt to define a measure of the chaos in a system using statistical mechanics. While this is useful for describing the complexity of different parameters of the logistic equation, our intuitive notion of complexity says that nothing the logistic equation can produce could really be considered complex, simply because the equation that produces it is so simple. We want a measure of complexity that enables us to identify when a system acts as an autonomous agent.

According to Investigations , Stuart Kaufman defines an molecular autonomous agent as a self-reproducing system that is capable of producing at least one thermodynamic work cycle. This definition correctly classifies all living things as autonomous agents, and some artifacts that can only be produced by living beings.

However appealing that definition, it is not rigorous enough for our purposes. We wish to have measure, and this does not make it obvious how one would calculate the "degree of life".

We need a better word than complexity, because it has so many meanings. We choose the word interrelatedness, because we what makes a system complex is that it is composed of many interrelated parts, and that the relations result in a system whose overall description is simpler than the description of each of the parts. Furthermore, the definition should include the notion that complex systems tend to have organization at multiple scales. That is, the interrelatedness measure should be able to take into account measures taken at different resolution levels.

In Towards a mathematical definition of "life", Chaitin incorporates a measure that takes into account multiple scales and mutual information. One first measures the mutual information of the whole ensemble, identifying clustered regions of interrelatedness at every scale. For example, the amount of genetic information in DNA is well approximated by the mutual information of each cell because all the other uncorrelated positions and momenta will cancel each other out. Therefore, there is an important cluster of mutual information whose diameter is approximately the size of a cell. At larger scales, clusters of mutual information at the characteristic sizes of tissues and organs will appear. A complex system is therefore one in which clusters of mutual information exist at all scales. Equivalently, Fourier transforms of complex systems exhibit structure at many different wavelengths. Finally, he is able to show that the resulting measure is capable of assigning higher complexity measures to sequences with bilateral symmetry and large repeated sequences than it does to simple repeated sequences and large random numbers.

Assignments

101 Week 1

See Goals and Project Ideas

101 Week 2

See Definitions of Complexity & Randomness

101 Week 3

See Metabolic Engineering tutorial

101 Week 4

See Biomass Oil Analysis: Research Needs and Recommendations

101 Week 5

101 Week 6

101 Week 7

See Problem set 1

101 Week 8

See Problem set 2

101 Week 14

See www.personalgenome.org API

Problem set 1

Code

#!/usr/bin/perl -w
# Name: Jeremy Zucker
# email: zucker@research.dfci.harvard.edu
# Problem Set: #1
$DNA_seq = "CATTACGATGCATTG ATTTTTCAAAGGAAT GTACTATCGAAATCA CAAGTCGTGGACTAC GGTTTGCAGTGGAGG  AATCGCAGTCTTTGC AGGCTCACGCCTTTC TTGATAAGTCGTTGT TTCAAACGTTTAATT TTCAGGGTGATTCAG ATGGGGATACATATA  TGTTCCAGACGATGA TTTCACCT";
$DNA_seq =~ s/\s+//g;
print "Cleaned up DNA sequence of length " . length($DNA_seq) .":\n$DNA_seq\n";
$RNA_seq = transcribe( $DNA_seq );
print "\nRNA sequence: \n$RNA_seq\n";
print "\nTranslated sequence: \n";
#
for($j = 0; $j<3; $j++) {
 %protein = translate($RNA_seq, $j, length($RNA_seq));
  print "\n\tReading Frame $j:\n" . $protein{"sequence"} . "\n";
  print_histogram( %protein);
}
#
$reverse_compliment = reverse_compliment( $RNA_seq);
print "\nReverse-complemented RNA sequence:\n $reverse_compliment \n";
print "\nTranslation of reverse-complemented RNA sequence:\n";
for($j = 0; $j<3; $j++) {
   %protein = translate($reverse_compliment, $j, length($RNA_seq));
   print "\n\tReading Frame $j:\n" . $protein{"sequence"} . "\n";
   print_histogram( %protein );
}
sub print_histogram {
 my %protein = @_;
 print "\nHistogram\n";
 foreach $amino_acid (sort keys %protein) {
   if(($amino_acid ne "sequence") && ($amino_acid ne "")){
     print "$amino_acid: " . ("*" x $protein{$amino_acid}) . "\n";
   }
 }
}
#
sub reverse_compliment {
 my $sequence = shift;
 $sequence = reverse( $sequence );
 $sequence =~ tr/AUGC/UACG/;
 return $sequence;
}
#
sub translate {
 my ($sequence, $reading_frame, $len) = @_;
 my %protein = ();
   for($i=$reading_frame; $i < $len; $i+=3) {
     $codon = substr( $sequence, $i, 3);
     $amino_acid = translate_codon( $codon );
     $protein{$amino_acid}++;
     $protein{"sequence"} .= $amino_acid;
   }
 return %protein;
}
#
sub transcribe { 
 my $sequence = shift; 
 $sequence =~ s/T/U/gi; 
 return $sequence; 
}
# 
sub translate_codon {
     if ($_[0] =~ /GC./i) {return Ala;}  
# If the codon matches G followed by C followed by anything, return Alanine;
     if ($_[0] =~ /UGC|UGU/i) {return Cys;}  
# If the codon matches U followed by G followed by U or C, return Cysteine
     if ($_[0] =~ /GAC|GAU/i) {return Asp;}  # If the codon matches G followed by A followed by U or C, return Aspartic Acid;
     if ($_[0] =~ /GAA|GAG/i) {return Glu;}  # If the codon matches G followed by A followed by A or G, return Glutamine;
     if ($_[0] =~ /UUC|UUU/i) {return Phe;}  # If the codon matches U followed by U followed by U or C, return Phenylalanine;
     if ($_[0] =~ /GG./i) {return Gly;}      # If the codon matches G followed by G followed by anything, return Glycine;
     if ($_[0] =~ /CAC|CAU/i) {return His;}  # If the codon matches C followed by A followed by U or C, return Histine;
     if ($_[0] =~ /AUA|AUC|AUU/i) {return Ile;}  # If the codon matches A followed by U followed  by A, U or C, return Isoleucine;
     if ($_[0] =~ /AAA|AAG/i) {return Lys;}      # If the codon matches A followed by A followed by A or G, return Lysine;
     if ($_[0] =~ /UUA|UUG|CU./i) {return Leu;}  # If the codon matches U followed by U followed by A or G or if the codon matches C followed by U followed by anything, return Leucine;
     if ($_[0] =~ /AUG/i) {return Met;}          # If the codon matches A followed by U followed by G, return Methionine;
     if ($_[0] =~ /AAC|AAU/i) {return Asn;}      # If the codon matches A followed by A followed by U or C, return Asparagine;
     if ($_[0] =~ /CC./i) {return Pro;}          # If the codon matches C followed by C followed by anything, return Proline;
     if ($_[0] =~ /CAA|CAG/i) {return Gln;}      # If the codon matches C followed by A followed by A or G, return Glutamine;
     if ($_[0] =~ /AGA|AGG|CG./i) {return Arg;}  # If the codon matches A followed by G followed by A or G or if te codon matches C followed by G followed by anything, return Arginine;
     if ($_[0] =~ /AGC|AGU|UC./i) {return Ser;}  # If the codon matches A followed by G followed by C or U or if the codon matches U followed by C followed by anything, return Serine;
     if ($_[0] =~ /AC./i) {return Thr;}          # If the codon matches A followed by C followed by anything, return Threonine;
     if ($_[0] =~ /GU./i) {return Val;}          # If the codon matches G followed by U followed by anything, return Valine;
     if ($_[0] =~ /UGG/i) {return Trp;}          # If the codon matches U followed by G followed by G, return Tryptophan;
     if ($_[0] =~ /UAC|UAU/i) {return Tyr;}      # If the codon matches U followed by A followed by C or U, return Tyrosine;
     if ($_[0] =~ /UAA|UGA|UAG/i) {return "***";}  # If the codon matches U followed by A  followed by A or  G or if the codon matches U followed by G followed by A, return a Stop Codon;
}

output

Cleaned up DNA sequence of length 188: 
  CATTACGATGCATTGATTTTTCAAAGGAATGTACTATCGAAATCACAAGTCGTGGACTACGGTTTGCAGTGGAGGAATCGCAGTCTTTGCAGGCTCACGCCTTTCTTGATAAGTCGTTGTTTCAAACGTTTAATTTTCAGGGTGATTCAGATGGGGATACATATATGTTCCAGACGATGATTTCACCT
RNA sequence: 
CAUUACGAUGCAUUGAUUUUUCAAAGGAAUGUACUAUCGAAAUCACAAGUCGUGGACUACGGUUUGCAGUGGAGGAAUCGCAGUCUUUGCAGGCUCACGCCUUUCUUGAUAAGUCGUUGUUUCAAACGUUUAAUUUUCAGGGUGAUUCAGAUGGGGAUACAUAUAUGUUCCAGACGAUGAUUUCACCU
Translated sequence: 
    Reading Frame 0:
HisTyrAspAlaLeuIlePheGlnArgAsnValLeuSerLysSerGlnValValAspTyrGlyLeuGlnTrpArgAsnArgSerLeuCysArgLeuThrProPheLeuIleSerArgCysPheLysArgLeuIlePheArgValIleGlnMetGlyIleHisIleCysSerArgArg***PheHis
    Reading Frame 1:
IleThrMetHis***PhePheLysGlyMetTyrTyrArgAsnHisLysSerTrpThrThrValCysSerGlyGlyIleAlaValPheAlaGlySerArgLeuSer******ValValValSerAsnVal***PheSerGly***PheArgTrpGlyTyrIleTyrValProAspAspAspPheThr
    Reading Frame 2:
LeuArgCysIleAspPheSerLysGluCysThrIleGluIleThrSerArgGlyLeuArgPheAlaValGluGluSerGlnSerLeuGlnAlaHisAlaPheLeuAspLysSerLeuPheGlnThrPheAsnPheGlnGlyAspSerAspGlyAspThrTyrMetPheGlnThrMetIleSerPro
    Reverse-complemented RNA sequence:
  AGGUGAAAUCAUCGUCUGGAACAUAUAUGUAUCCCCAUCUGAAUCACCCUGAAAAUUAAACGUUUGAAACAACGACUUAUCAAGAAAGGCGUGAGCCUGCAAAGACUGCGAUUCCUCCACUGCAAACCGUAGUCCACGACUUGUGAUUUCGAUAGUACAUUCCUUUGAAAAAUCAAUGCAUCGUAAUG 
Translation of reverse-complemented RNA sequence:
    Reading Frame 0:
 Arg***AsnHisArgLeuGluHisIleCysIleProIle***IleThrLeuLysIleLysArgLeuLysGlnArgLeuIleLysLysGlyValSerLeuGlnArgLeuArgPheLeuHisCysLysPro***SerThrThrCysAspPheAspSerThrPheLeu***LysIleAsnAlaSer***
    Reading Frame 1:
 GlyGluIleIleValTrpAsnIleTyrValSerProSerGluSerPro***LysLeuAsnVal***AsnAsnAspLeuSerArgLysAla***AlaCysLysAspCysAspSerSerThrAlaAsnArgSerProArgLeuValIleSerIleValHisSerPheGluLysSerMetHisArgAsn
    Reading Frame 2:
 ValLysSerSerSerGlyThrTyrMetTyrProHisLeuAsnHisProGluAsn***ThrPheGluThrThrThrTyrGlnGluArgArgGluProAlaLysThrAlaIleProProLeuGlnThrValValHisAspLeu***PheArg***TyrIleProLeuLysAsnGlnCysIleValMet

Problem set 2

Store DNA with 2 bits per base pair instead of 8. Of course, storage is only useful if you can access it, so might as well write some routines that unpack the representation and allow you to scan it. It seems the slice operations in python might be useful here.


Of course, with good compression algorithms, you might be able to get it lower than that.

Code

Output


Example association table usage

For more ideas on scripts to write, see: More ideas on scripts to write

View all associations

#!/usr/bin/perl -w 

use strict; 
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
 
my $group = "a"; 
#my $group = "b"; 
#my $group = "c"; 

my $query = "http://personalgenome.org/101/all-associations.php?group=$group";
              
my $req = HTTP::Request->new(GET => "$query");
$req->header(Accept => 'text/html');
        
# send request to server and get response back
my $res = $ua->request($req);
            
# check the outcome
if ($res->is_success) {
print $res->content;  # for instance
} else {
   print "Error: " . $res->status_line . "\n";
}


Insert Association

Inserts a new association that decides when a genotype has a disease using a secret formula.

#!/usr/bin/perl -w 

use strict; 
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $group = "a";        # uncomment appropriate group
#my $group = "b"; 
#my $group = "c"; 

my $id;                 # unique identifier for association record (int)

my @genotypes = (1, 2, 3, 4);

sub has_disease {
  my $genotype = shift;
  my $diseased = 1;   # id of diseased phenotype (int)
  my $healthy = 2;     # id of healthy phenotype (int)


  if( ( $genotype % 2 ) == 0 ) {
     return $diseased;
  } else {
     return $healthy;
   }
}


for (my $i = 1; $i < 10; $i++) {  
  my $genotype_id = $i;
  my $phenotype_id = has_disease( $genotype_id );
  
  my $insert = 'http://personalgenome.org/101/insert-association.php?'. 
             "group=$group".
             "&id=$i".                  # we use $i as the id
             "&genotype_id=$genotype_id".
             "&phenotype_id=$phenotype_id";

  my $req = HTTP::Request->new(GET => "$insert");
  $req->header(Accept => 'text/html');
        
  # send request to server and get response back
  my $res = $ua->request($req);
            
  # check the outcome
  if ($res->is_success) {
  print $res->content;  # for instance 
  } else {
     print "Error: " . $res->status_line . "\n";
  }

}

PGUtils.pm

Documentation and code available at Personal Genome API

ELSI considerations of the Personal Genome Project

Reading George's article on the Personal Genome Project inspired me to consider some of the Ethical, Legal, and Societal Implications (ELSI) of the PGP. The first thing that people should realize by volunteering to make their genome publicly available is that they will likely be uninsurable for the rest of his or her life. If that risk is acceptable, he or she should also be prepared for a certain kind of celebrity status. After all, as more and more phenotypic data gets matched to your genotype, everyone will have access to more and more of your intimate details, and without your knowledge. If the PGP really starts rolling, then one could imagine signing consent forms on one's driver's license for making one's genome and medical records publicly available in the event of death, similar to the current practice of donating organs. This would greatly aid genomics research, and would have far fewer ELSI concerns.