From FreeBio


Biophysics 101 Assignments

101 Week 1

About Me

101 Week 2

Definitions of Randomness and Complexity

101 Week 3

Issues Facing Personalized Medicines

Stem Cell Basics

101 Week 4

Review of “Systems Biology, Proteomics, and the Future of Health Care: Toward Predictive, Preventative, and Personalized Medicine”

101 Week 5

Personalized Medicine Project Ideas

101 Week 6

Cost of Medical Misdiagnosis

101 Week 7

Review of “Systematic Survey Reveals General Applicability of ‘Guilt-by-Association’ within Gene Coexpression Networks”

101 Week 8

Perl Environment

Review of "Deeper into the genome commentary on HapMap milestone and where to go next"

101 Week 9

Presentation on ENCODE and OMIM

101 Week 10

Aligned Genomes and Statistically Significant Mutations

Aligned Genomes and Statistically Significant Mutations

#read the reference genome file
open(my $refgen_fh, "<", "newrefseq.txt") or die "Cannot open file1\n";

my $refgen = "";
my $toalign = "";

#make refgen into string

while (<$refgen_fh>) {
    $refgen .= $_;
}
close($refgen_fh);


#print "Refgen: $refgen";

#make toalign into string as well

open(my $toalign_fh, "<", "mut_seq_new.txt") or die "Cannot open file2\n";

while (<$toalign_fh>) {
    $toalign .= $_;
}
close($toalign_fh);


#print "To Align: $toalign";

#get rid of any extraneous spaces/line breaks
#(chomp would only remove a single trailing newline, so strip all whitespace)
$refgen =~ s/\s+//g;
$toalign =~ s/\s+//g;

my $length_refgen = length($refgen);
my $length_toalign = length($toalign);

print "The length of refgen is $length_refgen\n";

print "The length of toalign is $length_toalign\n";

if ($length_refgen == $length_toalign) {
    print "The lengths of the Reference Genome and the Genome to be Aligned are equal\n";
}
else {
    die "These are not copy genomes! Their lengths do not match!\n";
}

#Assign string to characters in an array

#@refgen = split(//, $refgen);
#@toalign = split(//, $toalign);

my $aligned = "";

#compare every base pair of $refgen to $toalign using a for loop with if statements;

#in $aligned: if the bp is the same, record a '*'; else, record the bp from $toalign. I don't, however, think this is a good system. Maybe the differing base pairs should be stored as numerical values instead? I know we were discussing this in class, so other ideas?

my $count =0;
my $refgenbp;
my $toalignbp;

#loop over positions 0 .. length-1 and compare one base at a time
for ($count = 0; $count < $length_refgen; $count++) {
    $refgenbp = substr($refgen, $count, 1);
    $toalignbp = substr($toalign, $count, 1);
    if ($refgenbp eq $toalignbp) {
        $aligned .= '*';
    }
    else {
        $aligned .= $toalignbp;
    }
}

#print out $aligned;
print "Aligned Sequence: $aligned\n";
my $length_aligned = length($aligned);
print "Length aligned: $length_aligned\n";
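
The comment in the script asks whether there is a better way to store the differences than a string of '*' characters. One possible alternative, offered only as a sketch, is to record just the positions that differ, together with the reference base and the observed base. The function name find_mismatches and the short test sequences are made up for illustration and are not part of the assignment files.

```perl
use strict;
use warnings;

# Return a list of [position, reference base, observed base], one per mismatch.
# Positions are 0-based; both strings must have equal length.
sub find_mismatches {
    my ($ref, $seq) = @_;
    die "Sequences differ in length\n" unless length($ref) == length($seq);
    my @mismatches;
    for my $pos (0 .. length($ref) - 1) {
        my $r = substr($ref, $pos, 1);
        my $s = substr($seq, $pos, 1);
        push @mismatches, [$pos, $r, $s] if $r ne $s;
    }
    return @mismatches;
}

# Hypothetical short sequences, for illustration only
for my $m (find_mismatches('ACGTACGT', 'ACGAACGC')) {
    printf "pos %d: %s -> %s\n", @$m;
}
```

A list like this stays small when the two genomes are nearly identical, and the positions can be counted or compared across several genomes later on.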

I wrote some code (predominantly for Level 1) and brainstormed ideas for Level 2 of the analysis. I would like to get input on these before I write the code.

Here's the code (it's in PDF format).

And here's the Powerpoint.

Presentation on ENCODE and OMIM

Here's my presentation on two existing genomic databases: ENCODE and OMIM. The presentation tends to focus more on ENCODE, as I found this resource more useful for our projects than OMIM.

Ok, I uploaded the file in PDF format, but can't seem to link to it here. It can be found under uploaded files.

Review of "Deeper into the genome commentary on HapMap milestone and where to go next"

Take-away Points:

Current Status of HapMap

  • Categorized 1 million SNPs in 269 people
    • Identified a subset of the estimated 3 million common SNPs in genotyping assays
    • Identified 'tag' SNPs, which mark common haplotypes (DNA chunks)
    • However, most of these SNPs don’t directly alter gene product/function
    • But these SNPs can still serve as tools in identifying disease-causing genes
  • The power of SNPs: A 2004 study showed that as few as 75 SNPs could be used to identify an anonymous patient
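
To get a feel for why so few SNPs can single out one person, here is a back-of-the-envelope sketch. It assumes an idealized case in which every SNP has three equally likely genotypes (AA, Aa, aa) and all SNPs are independent, so k SNPs give at most 3**k distinct profiles. The function name and the population figure are assumptions for illustration. Real SNPs are neither equally frequent nor fully independent, so in practice more SNPs are needed than this lower bound suggests.

```perl
use strict;
use warnings;

# Smallest k such that 3**k is at least the population size
# (idealized: 3 equally likely genotypes per SNP, all SNPs independent).
sub min_snps_for_population {
    my ($population) = @_;
    my $k = 0;
    my $profiles = 1;
    while ($profiles < $population) {
        $profiles *= 3;
        $k++;
    }
    return $k;
}

my $population = 6.5e9;   # assumed rough world population, for illustration
printf "Idealized minimum: %d SNPs\n", min_snps_for_population($population);
```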

Current HapMap Scope

  • HapMap tools work best when the functional variants involved occur in the diseased population with a frequency greater than 5%


  • When disease-causing functional variants fall below 5% in the diseased population, approaches aided by HapMap lose power

Future Trends

  • Build a functional-variant database by using medical resequencing (MRS) targeted to the coding regions of all 20,000 known human genes, plus an additional segment of the presumed promoter (gene-control) region of each
    • Medical Resequencing (MRS): a technique to identify disease-causing genes by comparing suspected genes from patients and the control group.
  • This project would require improved technologies for MRS
    • Namely, significantly decreased costs

Other useful resources:

My Perl Environment

For the purposes of writing/testing Perl programs, here is a description of what my computer is capable of:

  • Windows XP 2002
  • 1.5 GHz Intel Processor
  • 496 MB RAM
  • 80 GB

And my Perl Environment:

  • I'm currently using Notepad and logging in through fas' Secure CRT

Review of “Systematic Survey Reveals General Applicability of ‘Guilt-by-Association’ within Gene Coexpression Networks”

Review of Paper

This paper describes a study by the authors to quantify the breadth of Guilt-by-Association (GBA) genes in Gene Ontology (GO). The researchers performed coexpression measurements across multiple genes using microarrays and then assigned coexpression P-values to all possible pairs of genes using the hypergeometric distribution. The authors used these data to generate a network of coexpression relations between pairs of genes in an organism (and this can even be extended across organisms). The authors conclude that “by combining coexpression measurements across multiple genes in the module, there is a systematic and reproducible signature of functional association” and that guilt-by-association is an important factor.
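
As a reminder of what a hypergeometric P-value is, here is a generic sketch. It is not the paper's exact procedure. The setup is assumed for illustration: out of N genes in total, K carry some annotation; we pick n genes (for example, the genes coexpressed with a gene of interest) and ask how likely it is to see k or more annotated genes among them by chance. The function names and the example numbers are hypothetical.

```perl
use strict;
use warnings;

# log(n!) by direct summation; adequate for modest counts
sub log_factorial {
    my ($n) = @_;
    my $sum = 0;
    $sum += log($_) for 2 .. $n;
    return $sum;
}

# log of the binomial coefficient "n choose k"
sub log_choose {
    my ($n, $k) = @_;
    return log_factorial($n) - log_factorial($k) - log_factorial($n - $k);
}

# P(X >= k) when drawing n items without replacement from N items,
# K of which are "successes"
sub hypergeom_upper_tail {
    my ($N, $K, $n, $k) = @_;
    my $max = $K < $n ? $K : $n;
    my $p = 0;
    for my $i ($k .. $max) {
        next if $n - $i > $N - $K;   # not enough "failures" to fill the draw
        $p += exp(log_choose($K, $i)
                + log_choose($N - $K, $n - $i)
                - log_choose($N, $n));
    }
    return $p;
}

# Hypothetical numbers: 1000 genes, 50 annotated, 20 drawn, 5 or more annotated
printf "P-value: %.3g\n", hypergeom_upper_tail(1000, 50, 20, 5);
```

A small P-value here would mean that the overlap is unlikely to be due to chance alone, which is the sense in which coexpressed genes are taken to be functionally associated.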

This study is clearly applicable to our personalized medicines project as it can add an additional parameter with which to predict the effect of a drug or environmental factor. We can consider/quantify the effect of a given gene expression taking into account changes in other GBA genes using some existing tools.

Definitions (from Wikipedia)

Guilt-by-Association Genes: Coexpressed genes, whose function is controlled by a common functional module.

Gene Ontology- A trio of controlled vocabularies that are being developed to aid the description of the molecular functions of gene products, their placement in and as cellular components, and their participation in biological processes.

Related Resources

Gene Ontology- A number of tools that may be of use in comparing similarities and differences across organisms (I’m not too sure which ones may be of use, but I thought I’d put the resource up anyway).

Medical Misdiagnosis

Moved to Biophysics 101 Project page--JeremyZucker 20:56, 19 January 2006 (EST)

Personalized Medicine Project Ideas

In the paper I reviewed last week (“Systems Biology, Proteomics, and the Future of Health Care: Toward Predictive, Preventative, and Personalized Medicine”), the authors basically proposed the following: The future of health care lies in preventative medicines, which will necessarily require modeling and understanding of the body as a dynamic system instead of the relatively snapshot-like method currently used in drug development.

This appears to be a reasonable prediction in my opinion. So, perhaps we can construct the framework for a website integrating lifestyle data (including dietary habits, sleeping habits, and other environmental factors – we would have to figure out which environmental factors are of greatest use) along with an individual’s personal genome to predict and prevent future diseases (as the personalized medicine people have been discussing).

Ideally, a drop of blood will be obtained from the patient. Because this sample should contain the relative amounts of the chemicals, molecules and proteins present in the body, analysis of it should be sufficient to determine how these chemicals will interact with DNA.

In order to do this, we would basically need a highly complex logic diagram. Such a diagram would take in all of the data and, depending on the relative presence of certain proteins, chemicals and other molecules (from the lifestyle data), analyze the combined effect of these entities on the expression of the DNA. It would thus provide better insight into the likelihood of a given disease. Logic diagrams for some genes in sea urchins and yeast have recently been constructed.
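
To make the idea of a logic diagram concrete, here is a toy sketch in which each gene's expression is an on/off function of a few input factors. Everything in it is hypothetical: the gene names (geneA, geneB), the input factors (activator, repressor, high_fat_diet) and the rules themselves are invented for illustration and do not come from the yeast or sea urchin models.

```perl
use strict;
use warnings;

# Each rule maps input factors (1 = present, 0 = absent) to gene ON/OFF.
# Gene names, factor names and rules are illustrative only.
my %rules = (
    geneA => sub { my ($in) = @_; $in->{activator} && !$in->{repressor} },
    geneB => sub { my ($in) = @_; $in->{activator} || $in->{high_fat_diet} },
);

# Apply every rule to the given inputs and return a hash of gene => 0/1
sub predict_expression {
    my ($inputs) = @_;
    my %state;
    for my $gene (sort keys %rules) {
        $state{$gene} = $rules{$gene}->($inputs) ? 1 : 0;
    }
    return \%state;
}

my $state = predict_expression({ activator => 1, repressor => 0, high_fat_diet => 0 });
print "$_: ", ($state->{$_} ? "ON" : "OFF"), "\n" for sort keys %$state;
```

A real model would need many more inputs and graded rather than on/off values, but the structure (inputs, rules, predicted expression) would be similar.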

Along the same lines, we could design a web tool which allows the user to input data online and get a prediction.

Immediate steps to take (please add any others)

1. Analyze the models that have already been constructed. What are their strengths, weaknesses? Links to the yeast paper and the sea urchins paper.

2. Determine the most important environmental factors affecting gene expression (perhaps some combination of these on which to base our logic diagram).

I think these two should immediately be looked into, and we can go from there.

Review of “Systems Biology, Proteomics, and the Future of Health Care: Toward Predictive, Preventative, and Personalized Medicine”

Review of Paper

The central point of this paper can be summarized by the following: The future of health care lies in preventative medicines, which will necessarily require modeling and understanding the body as a dynamic system instead of the relatively snapshot-like method currently used in drug development.

The authors predict that health care will undergo a “paradigm shift within the next two decades.” Namely, health care will progress from being reactive to being able to probabilistically predict and prevent ailments. Ideally, a large number of input factors, such as environmental cues, protein-protein interactions, mRNA-protein interactions, minute changes in pH, etc., will be fed into a powerful multiparameter analysis that outputs a probabilistic model for the onset of a condition. However, this vision requires simultaneous advancement and increasing overlap in the areas of computational modeling and biological understanding (the latter being dependent on advances in nanotechnology, microfluidics (handling small fluid sample sizes, as is typically the case for bodily fluids), and proteomics (the inter- and intra-actions of highly complex proteins)).

Existing Framework:

  • Foundation for a genetic database created by the human genome project
  • High-throughput instruments
  • Internet as a means of disseminating/sharing data
  • Ever-increasing computational power

Biggest current weaknesses:

  • Lacking significantly in our complete understanding of biological processes (such as a firm understanding of protein interactions).

Toward this idealization of health care, the authors envision the creation of a logic diagram to explain the relatively complex expression patterns of biological molecules. One such example from the paper is provided below. Clearly, such a feat demands immense computational power and large data sets. But it is certainly an interesting idea.


Some Thoughts & Questions

  • Reading this paper reminded me of the issue of randomness we discussed towards the beginning of class. Although the authors suggest a theoretically powerful model, how practical is it? If one attempts to make predictions based on millions of readily changeable variables, then do the probabilistic outcomes really give any useful information?
  • “There has been an overwhelmingly large amount of data generated, but these data are fragmented in different databases, with high error rates estimated” – Could we do some sort of a project to address this issue? What is the possibility of creating a centralized database? Could we do it?
  • “There is no simple way to compare different interaction data sets. Synchronizing these databases will facilitate efforts to exploit this information…must be an immediate priority in the field…” The authors go on to write that the Human Proteome Organization developed such a system in 2002. I’d be interested in seeing how well this database is working, what some flaws are, etc.

Issues Facing Personalized Medicines

Can we create a centralized database of biological data to accelerate the progress of pharmacogenomics through better dissemination of information?

Pharmacogenomics is highly dependent on large amounts of data, including the precise location of specific genes, the controlled expression of these genes, and the effects of single alleles on various traits. Every day, research around the world generates huge quantities of data toward answering these questions. However, availability of this information remains a challenge.

  • How would intellectual property considerations be taken into account?
  • What incentive would an individual/research group have in sharing their hard-earned results?
  • How would the validity of the data be determined/monitored?

What is the current/future economic viability of pharmacogenomic drugs?

Sure, the concept of personalized medicines catered to one’s unique genome is revolutionary. But, the economic reality is that treatment affordability remains a key issue. Currently, the cost of sequencing an average mammalian genome is $10 - $50 million. Clearly, this cost must be significantly slashed before personalized medicines become practical. Towards this end, NIH is currently spending $39 million to bring the cost of sequencing a human genome to less than $1000.

  • What is the projected cost of developing/producing personalized medication likely to be?
  • By how much is it expected to fall 5/10/50 years from now?
  • Will people actually be willing to pay as much for the medication as pharmaceutical companies are anticipating?

Stem Cells

The most comprehensive, although relatively basic, resource regarding stem cells is the NIH’s (National Institutes of Health’s) Stem Cell Information site. I think this resource will be especially helpful if you have absolutely no prior knowledge of the scientific basis of stem cells. Below, I have pulled out some of the key points, which should give you a basic idea, especially if you don’t have time to go through the website yourself. More details are of course available on the website.

Stem Cell Basics

What are Stem Cells?

  • All stem cells—regardless of their source—have three general properties:
    • They are capable of dividing and renewing themselves for long periods.
    • They are unspecialized
    • They can give rise to specialized cell types.

Two fundamental questions of greatest interest are:

  1. Why can embryonic stem cells proliferate for a year or more in the laboratory without differentiating, but most adult stem cells cannot?
  2. What are the factors in living organisms that normally regulate stem cell proliferation and self-renewal?

Types of Stem Cells

  • Embryonic stem cells - Derived from embryos that develop from eggs which have been fertilized in vitro—in an in vitro fertilization clinic—and then donated for research purposes with informed consent of the donors. They are not derived from eggs fertilized in a woman's body.
  • Adult/Somatic stem cell - An undifferentiated cell, found among differentiated cells in a tissue or organ, that can renew itself and can differentiate to yield the major specialized cell types of the tissue or organ. The primary roles of adult stem cells in a living organism are to maintain and repair the tissue in which they are found. Some scientists now use the term somatic stem cell instead of adult stem cell.

The Potential of Stem Cells

  • Generation of cells and tissues that could be used for cell-based therapies
  • Test new drugs for safety on differentiated cells generated from human pluripotent cell lines.
  • Possibility of a renewable source of replacement cells and tissues to treat diseases including Parkinson's and Alzheimer's diseases, spinal cord injury, stroke, burns, heart disease, diabetes, osteoarthritis, and rheumatoid arthritis.
    • e.g., in people who suffer from type I diabetes, the cells of the pancreas that normally produce insulin are destroyed by the patient's own immune system.
    • e.g., it may be possible to generate healthy heart muscle cells in the laboratory and then transplant those cells into patients with chronic heart disease.

What still needs Work?

To be useful for transplant purposes, stem cells must be reproducibly made to:

  • Proliferate extensively and generate sufficient quantities of tissue.
  • Differentiate into the desired cell type(s).
  • Survive in the recipient after transplant.
  • Integrate into the surrounding tissue after transplant.
  • Function appropriately for the duration of the recipient's life.
  • Avoid harming the recipient in any way.

Also, to avoid the problem of immune rejection, scientists are experimenting with different research strategies to generate tissues that will not be rejected.

Existing Federal Regulations

  • Federal v. Private Funding:

Our knowledge of stem cells is still in its early stages. Although the concept holds much potential, there are currently gaping holes in our understanding of the exact molecular mechanisms of stem cells, and thus stem cells are still predominantly confined to the research domain. This proves a large barrier to acquiring funds for stem cell research. Generally, venture capitalists invest their money based on the potential for large returns. However, because the practical application of stem cells (e.g., in the form of personalized medicines) is still a somewhat abstract concept, investing in research does not provide the desired returns. Normally, the government serves as a significant source of funding, especially for cutting-edge, abstract research; however, federal funding for embryonic stem cell research has been sharply restricted in the United States.

Recent Developments

A giant leap in stem cell research was made this past August with the development of hybrid stem cells. Scientists at Harvard University led by Dr. Kevin Eggan were successful in converting adult skin and bone cells into embryonic stem cells by fusing adult cells with pluripotent embryonic stem cells. The potential implications of this breakthrough are huge, most notably the prospect of more federal funding for stem cell research, as this method could mitigate ethical concerns. Although minor technical kinks such as extracting the excess adult DNA remain, these are not expected to pose a significant barrier. The complete study is available in the August 2005 issue of Science magazine.

Randomness and Complexity

Philosophical Approach

I won’t attempt to redefine randomness or complexity here, because I think Morten has already provided a fairly good overview. However, I’ve tried to integrate randomness & complexity through a more philosophical lens. I feel that a system appears random because it is highly complex. So I believe that randomness and complexity are actually highly intertwined through a causal relationship.

Philosopher Pierre-Simon Laplace is quoted as having said, "If we imagine an intellect which at any given moment knew all the forces that animate Nature and the mutual positions of the beings that comprise it…nothing could be uncertain and the future just like the past would be present before its eyes." If this is the case, then perhaps it is our limited math and science that make highly complex events appear to be so-called ‘random events’.

However, if the “beings” in our universe are infinite but time is finite, then accounting for all of their properties is theoretically impossible. Regardless of scientific advances, we will never be able to account for these unlimited influences. Moreover, even the most sophisticated mathematical models may be unable to predict some extremely complex phenomena, such as human emotions. Such unpredictability would perpetuate uncertainty.

A lot of work on this subject has been done at the University of Maryland’s top ranked Chaos Department.

Models of Randomness & Complexity

If randomness is truly random, then no models will be able to predict with certainty. In order to attempt to quantify randomness, let’s suppose that it is an amalgamation of complex mathematical formulas. So, here are some resources on randomness modeling I found of interest:

This paper is a cross between psychology and randomness modeling in how people discover structure embedded in noise (it may be of especial importance to Jeff).

A paper attempting to model randomness (or rather to minimize its effects), this time in a biological context.

A little dated, but interesting nonetheless.

About Me


I am a sophomore at Harvard College concentrating in Biochemistry and Economics. I am especially interested in this class because I intend to eventually pursue a Business or Law degree (maybe even both), which I hope to apply to better decision-making within the scientific realm. You can contact me at