

Personal information

Jason holding a banana while singing

Hello all, I am a first-year PhD student in Biophysics. On 15 October 2005, I will be awarded an MPhil in the History and Philosophy of Science from Cambridge University, and in 2003 I graduated from Williams College with a degree in Chemistry. This past summer, I worked as an intern for the National Academies in Washington, DC. I have a strong interest in the relation between science and policy, and hope in this class to see what doing science with an active eye toward its social, political, and economic effects is like.

Work for Biophysics 101


For 13 Dec

Perl code that does the same thing as the code below, but works with the database. I had the "library" problems that Shawn mentioned, but couldn't get his solution to them to work. Thus the code has not been run.

For 29 Nov

I wrote Perl code that does the generation tasks specified in Biophysics_101_Tasks. I've written the code with E. coli K-12's chromosome in mind, although the code could be easily edited to accommodate any input chromosome. The annotation file is the kind that KEGG uses, so the code that forbids mutations in start codons ("Level 2") relies on the particular format of those files.
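As a rough illustration of the "Level 2" idea only, here is a minimal Python sketch (not the Perl code linked from this page) that applies random substitutions to a sequence while leaving start codons untouched. The function names, the 0-based list of start-codon positions, and the per-base mutation rate are all assumptions made for the example; it does not parse KEGG annotation files and it ignores genes on the reverse strand.

```python
import random

def protected_positions(start_codon_starts, codon_length=3):
    """Return the set of 0-based indices covered by the given start codons."""
    protected = set()
    for start in start_codon_starts:
        protected.update(range(start, start + codon_length))
    return protected

def mutate(sequence, start_codon_starts, rate, rng):
    """Substitute each unprotected base with probability `rate`.

    `start_codon_starts` is a hypothetical stand-in for what would be
    read from an annotation file.
    """
    protected = protected_positions(start_codon_starts)
    mutated = []
    for index, base in enumerate(sequence):
        if index not in protected and rng.random() < rate:
            # Pick one of the three other bases so the position really changes.
            base = rng.choice([b for b in "ACGT" if b != base])
        mutated.append(base)
    return "".join(mutated)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing output files.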

Input files, code, and output files are in a

Discussion with Shawn about some bugs is here.

For 15 Nov

Made and ran Ubuntu Live CD.

For 03 Nov

My Perl code is [here]. It's called .pdf only because the wiki won't let me upload .txt files.

Ways to code for insertions and deletions

First, some criteria that should guide how we go about coding for things other than substitutions.

  • Generalizability. The coding system should work for full explicit genomes as well as for our abbreviated sequences that indicate differences from a canonical reference genome. It should also work for an alphabet of 00, 01, 10, and 11 as opposed to A, C, G, and T.
  • Economy. The system should be concise. For example, a deletion of a single base pair should not be coded by treating every base pair after the deletion as substituted.
  • Locality. If a bp or some segment of a non-reference genome is queried, we should be able to tell whether that bp is deleted, or is part of an insertion, without having to look at the rest of the contig.
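To make the two-bits-per-base alphabet concrete, here is a small Python sketch. Which base gets which bit pattern is an assumption (the criteria above only name the four codes), and `pack` and `unpack` are hypothetical helper names.

```python
# Assumed assignment of bases to two-bit codes; any fixed assignment would do.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def pack(seq):
    """Pack a DNA string into one integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def unpack(value, length):
    """Recover a DNA string of the given length from a packed integer."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BITS_TO_BASE[(value >> shift) & 0b11])
    return "".join(bases)
```

Because all four two-bit values are already taken by A, C, G, and T, there is no spare code left for a "deleted" or "inserted" marker, so any scheme for insertions and deletions has to live outside the per-base alphabet.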

It may well be that not all of the above criteria can be met by a single system. Here's a proposal that sounds as good as anything else to me:

  • To represent deletions in the non-reference genomes, follow the system we have for substitutions, but instead of indicating with a letter the new base, simply leave that empty. Thus:
1593 A
3905
9483 T

indicates that the nucleotide at 1593 is replaced by an adenine, the nucleotide at 3905 is deleted, and the nucleotide at 9483 is replaced by a thymine.

  • Advantages: This works equally well with the ACGT system as with the two-bits-per-base system. The two-bits-per-base system has no more room for new "characters". Also, you don't have to know where a deletion starts or stops to know that you're in the middle of one.
  • This has the disadvantage that a deletion of length N will have N entries. An alternative scheme, which would indicate the beginning of a deletion and the end of a deletion, would have to introduce new symbols, and would also not tell you that you're at a deleted bp if you query somewhere in the middle.
  • To represent insertions in the non-reference genome, simply replace the nucleotide of the unmutated genome at one end of the insertion with a string containing that nucleotide and the inserted bases. Thus:
4632 TAGTCCTTGAGGA

indicates that whatever was at position 4632 is replaced by TAGTCCTTGAGGA. By convention we might say that we should substitute the bp at the beginning of the insertion, so in this case, the reference genome would have a T at position 4632, and the insertion is of AGTCCTTGAGGA between positions 4632 and 4633.
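The proposal above can be sketched in Python as follows. This is only an illustration under stated assumptions: positions are 1-based, each entry is a line of "position [bases]", and the names `parse_diffs`, `query`, and `apply_diffs` are made up for the example. An empty replacement means a deletion, and a replacement longer than one base means an insertion anchored at that position.

```python
def parse_diffs(text):
    """Parse lines of 'position [bases]' into {position: replacement}."""
    diffs = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        position = int(fields[0])
        # A position with nothing after it encodes a deletion.
        diffs[position] = fields[1] if len(fields) > 1 else ""
    return diffs

def query(diffs, reference, position):
    """Classify one 1-based reference position using only its own entry."""
    if position not in diffs:
        return ("unchanged", reference[position - 1])
    replacement = diffs[position]
    if replacement == "":
        return ("deleted", "")
    if len(replacement) == 1:
        return ("substituted", replacement)
    return ("insertion", replacement)

def apply_diffs(diffs, reference):
    """Rebuild the full non-reference sequence from the reference and diffs."""
    return "".join(
        diffs.get(position, base)
        for position, base in enumerate(reference, start=1)
    )
```

With a toy reference of `ACGTACGT` and the entries `2 T`, `4`, and `6 CAA`, `apply_diffs` yields `ATGACAAGT`. Note that `query` never looks at neighbouring entries, which is the locality property; the price is one entry per deleted base, as discussed above.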

Microbial Genomics

For 06 Dec

As George mentioned in class last Thursday, there is far greater genetic variability within a prokaryote species than within, say, humans. This makes a bacterial equivalent of HapMap rather hard, although Howard Ochman kindly responded to an email I sent him, and pointed me in the right direction, toward MLST (multilocus sequence typing) databases. These databases store information that has been assembled by researchers working on a handful (sevenish to a few dozen) of genes in certain well-studied species and strains of bacteria. I have found two general ones, and one that deals specifically with pathogenic E. coli.
- General, hosted at Oxford: [1]
- General, hosted at Imperial College: [2]
- Pathogenic E. coli, hosted at Michigan State: [3]

Presentation for 22 Nov, on the KEGG and BioCyc databases.


Papers reviewed and found for later review

  • Bioremediation readings. Examined two articles on bioremediation that pertain to our research into biofuels and commented on their relevance and implications. Fact-checked various quantitative claims. (REVIEWED)

For 1 Nov

More updates on fixing links in Briggs's Paper. The article that discusses the cost of making algae ponds has moved, to here[4].

Misc. work

Comment on Matthew's contribution on projected energy use.

Moved material on my and Jeremy's pages to separate Biomass-oil_Needs_and_Recommendations page.

Added to Morten's metrics for biofuels.

Comments on the bioremediation article and how it relates to metabolic engineering:

I read the Nature paper[5] on bioremediation, mentioned in 101 Tasks. Much of the discussion there could also apply to engineering a population to produce biomass oil. Bioremediation engineering attempts to maximize the net input of a substance; biomass-production engineering attempts to maximize the net output of a substance. The mathematics in the introduction to metabolic engineering that Jeremy presented is symmetric with regard to inputs versus outputs, and so it could be used for bioremediation just as well as for biomass production.
Important points as they relate to biomass oil production:

  • Not just the organism but also the environment can be manipulated (37).
  • Gene transcription is one of many factors controlling bioremediation rates (38).
  • A complex environment requires a holistic view of metabolism (40).
  • Environmental proteomics (42).

Could several kinds of microorganisms be employed as a team, to regulate factors in the metabolism of oil-producing algae? Some of these factors may be metabolites themselves, such as dissolved oxygen, carbon dioxide, and sources of N, P, and S, and some may be more "environmental" factors such as salinity, pH, etc. A metabolic flux analysis and subsequent engineering on the system to maximize the production of a certain product needs to consider not only the organism and its environment but also auxiliary organisms.

For 4 Oct: Current and potential problems with biodiesel, political hurdles that can be lowered by technical advances, non-algal sources of biodiesel

Here's a page of links to articles or abstracts of articles that discuss problems with biodiesel. UPDATE: People have shuffled this page around, and the links are now circular. I'm not sure where this all has gone. Here's an old copy of the page.

Below is some brainstorming as to drawbacks of biodiesel and its production that might not have been considered by Briggs[6]. Potential solutions follow each problem, labeled "Solution:".

  • Political Problems:
    • Algae farms are culturally alien.
    • Vulnerability to terrorism: It's probably easier to pollute a pond than it is to pollute a field. Solution: Diffuse production, so that no single pond is critical to the energy infrastructure.
    • Inferior acceleration of diesel engines turns some buyers off. American auto makers have used most of the improvements in engine efficiency for power rather than for fuel efficiency. Solution: Improvements in acceleration in diesel engines. Can diesel be combined with turbo technology? Also, a hybrid-diesel that could get a lot of power out of the battery.
  • Economic Problems:
    • How water-intensive are the ponds? In many areas of the U.S. (and the rest of the world), water supplies are getting tighter. Water may be more expensive than Briggs's implicit estimates. Solution: Build algae farms where water is more plentiful. Develop technologies to reduce water use or to recycle it somehow.
    • Areas near the algae farms may be unpleasant to live and work in, reducing property values.
    • Chicken-and-egg problem of a high demand for biodiesel cars and the availability of the fuel at service stations.

For 29 Sept

Added thoughts to the project ideas page.

Personalized Medicine

Papers reviewed and found for later review

Misc. work

For 9 Nov

Posted link to DOE's Chromosome viewer to get an idea of which interesting genes are located on particular chromosomes.

For 23 Oct

Comment on the potential Wikipedia entry for personalized medicine.

For 12 Oct

Proposal of Personalizedness Quotient to assess how personalized a treatment is.


November 1

Preliminary stuff

For 29 Sept: Assignment of complexity and entropy to test cases

I've made an Excel document that is my quick take on how complex and entropic certain entities are.

For 27 Sept: Definitions of complexity and randomness

There seem to be at least two criteria for a "good definition" of a term. One is that the definition of a term should match our intuitive understanding of the concepts behind the term. Thus, the way to check the quality of the definition is to see whether the definition includes some things that seem like they should be excluded, or excludes some things that seem like they should be included. What "seems" appropriate, however, of course depends on the community using the term. Thus, "average Joes" in North America would not approach a forest/field of bamboo and say, "look at the grass!", but for botanists it would be perfectly natural to call bamboo a grass.

Another criterion is a more teleological way of defining terms. A definition is a tool, and like any other tool, it should be fashioned in a way that maximizes its usefulness. There abound countless features of a tool that make it useful, but an off-the-cuff list might include:

  • Ease of use
  • Generalizability/universality
  • Economy/efficiency
  • Adaptability

And zeroing in more on definitions relevant to this class, a good tool offers:

  • Quantifiability
  • Computability
  • Correspondence to something observable (both theoretically and practically)

The second criterion seems to be the one more often used in science, although there are surely instances where the first criterion has influenced scientific terminology, for better or for worse. I lack the background to fully understand the details of Crutchfield and Young's paper, but from what I can understand, their measure of complexity is better than anything I can think up, at least in terms of the second criterion.

I'll try to come up with some original stuff, or at least original criticisms of existing definitions, before the next class meeting. --Jleith 16:14, 26 Sep 2005 (EDT)

Moving to trying to define random, I think really the only way to go here is with something that somehow captures both the thermodynamic and the information-theoretic understandings of entropy. Crutchfield and Young evidently do something like this.

A problem with defining randomness in terms of predictability, which Morten suggests on his User: page,

Trying to define what is random: a collection of elements is random when the correlation of the elements behavior/value is unpredictable - however, the properties of the total system may be straight forward to describe. Ex.: A random string of numbers between 1 and 9. There should be no correlation between the number at position a and a+k, but the average value of all the numbers will be appr. 5.

is that an entity's ability to predict something depends on the specific content of the information it possesses. Even if we modify this to say that, okay, a string of numbers coming out of some source is random if the information about what the next number will be does not exist until the number comes out, which would certainly satisfy unpredictability, it may still be that the source is biased in what numbers it outputs. Imagine something like a Schroedinger's-Cat contraption in which incoming photons are sent not just up or down, but up, down, left, and right, and the cat dies on an up, down, or right, and lives on a left. Which way the photon goes is random, but we can still predict whether the cat will be dead or alive when we open the box.
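Both points can be checked numerically. The Python sketch below is only an illustration: it uses the standard library's pseudorandom generator as a stand-in for a truly random source (which it is not, being deterministic), and the sample size, the lag of 3, and the function name `lag_correlation` are arbitrary choices for the example. It estimates the mean and the correlation between positions a and a+k for digits 1 through 9, and then the fraction of four-way photon outcomes that kill the cat.

```python
import random

def lag_correlation(values, k):
    """Sample correlation between values[i] and values[i + k]."""
    n = len(values) - k
    xs, ys = values[:n], values[k:]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # seeded pseudorandom stand-in for a random source

# Morten's example: digits 1..9 with no correlation between positions.
digits = [rng.randint(1, 9) for _ in range(100_000)]
mean_digit = sum(digits) / len(digits)   # expected to be close to 5
corr = lag_correlation(digits, 3)        # expected to be close to 0

# Four-way cat contraption: the cat lives only when the photon goes left.
directions = [rng.choice(["up", "down", "left", "right"]) for _ in range(100_000)]
dead_fraction = sum(d != "left" for d in directions) / len(directions)
```

If the four directions are equally likely, always guessing "dead" is right about three times in four, even though no individual photon's direction can be predicted better than chance.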

Taking an idea from Shawn, we might imagine a random-number generator as a system, and someone trying to predict its output as the adversary. Thus, the adversary knows everything there is to know about the system. It seems to me that the only way the adversary could not be able to predict better than chance the output of the machine is for the machine to be non-deterministic, which in turn requires that its operation include some process whose observable outcome requires a quantum-mechanical description.

So far, we have a necessary condition for random generation. I'd tentatively venture to say that it's also a sufficient condition for random generation that the adversary not be able to predict each output of the system better than chance. Is random generation, however, the same as "randomness"? Random generation, as I have described it at least, relates to the process of generating strings, whereas randomness is generally thought of as an intrinsic property of a string. I'd be inclined to argue that the idea of randomness as a sort of "state property" of a string is misleading.

We could just say that randomness is a less jargony-sounding way of saying entropy. This may or may not be very enlightening. But the word "random", to me at least, sounds like it has more to do with unpredictability than with disorder. Also, I'm not sure there's a useful concept of randomness that is distinct from entropy.

Consider: Suppose we have a string of numbers that is the output of some chaotic but deterministic system. We also, by chance, have the same output from a truly random (that is, non-deterministic) system. The strings of numbers are indistinguishable; they differ only in their means of generation. If we happen to have complete knowledge of the chaotic system, we could write a program to generate it that would be shorter than the string itself (I'm pretty sure the gist of this sentence is correct; corrections, computer scientists?). It's an identifiable feature of the string that it has some order in it and thus is not maximally entropic, but it's not an identifiable feature that the string was generated randomly or deterministically. I propose, therefore, that we reserve "random" for processes and "(maximally) entropic" for data, records, states, etc.
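The "short program" idea is what computer scientists call Kolmogorov complexity: the length of the shortest program that outputs a given string. A small Python sketch of the thought experiment follows, using the logistic map as an assumed example of a chaotic but deterministic system; the seed, the parameter r = 4, and the thresholding at 0.5 are choices made for illustration, not anything specified above. The entropy function measures only single-symbol frequencies, so it is a crude proxy for the entropy of the string.

```python
import math
from collections import Counter

def logistic_bits(seed, length, r=4.0):
    """Generate a bit string from the logistic map x -> r * x * (1 - x)."""
    x = seed
    bits = []
    for _ in range(length):
        x = r * x * (1.0 - x)
        bits.append("1" if x >= 0.5 else "0")
    return "".join(bits)

def shannon_entropy(string):
    """Empirical Shannon entropy in bits per symbol, from symbol frequencies."""
    counts = Counter(string)
    total = len(string)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

first = logistic_bits(0.1234, 10_000)
second = logistic_bits(0.1234, 10_000)
bits_per_symbol = shannon_entropy(first)
```

The two calls return the identical string, so the few lines of `logistic_bits` plus the seed are a complete description of 10,000 symbols, whatever the frequency-based entropy estimate turns out to be. Nothing in the string itself records whether it came from this function or from a non-deterministic source.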

So, to conclude, randomness is a property of systems that generate output. What we might be inclined to call the randomness of the output itself really is thermodynamic or information-theoretic entropy. The concept of randomness as a property of an output really isn't distinct from the concept of entropy. --Jleith 23:23, 26 Sep 2005 (EDT)

22 Sept

Discussed the initial "human experiment". 101_Human_Experiment_1#Comments