Biophysics 101 Tasks


Association studies of HapMap data

  • Reproduce or create an association study of HapMap data. For example, a study of age-related macular degeneration (AMD) used HapMap data.
  • What amino acid substitutions are in the HapMap data?
  • What aa subs are associated with an OMIM disease? (Null hypothesis is yes)
  • What HapMap aa subs are in OMIM? (Null hypothesis is no)
  • What single-nucleotide changes (snc) are in a child but not in either parent? (Null hypothesis is few or none)
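
As a rough illustration of the last question, the sketch below compares a child's genotype calls against both parents. The data layout (a dict from position to the set of alleles observed) and the positions used are hypothetical and are not the HapMap file format; positions that are not genotyped in both parents are skipped rather than counted as new.

```python
def novel_in_child(child, mother, father):
    """Return {position: alleles} seen in the child but in neither parent.

    Each argument maps a position to the set of alleles called there
    (a hypothetical layout chosen for this sketch).
    """
    novel = {}
    for pos, alleles in child.items():
        # Without calls from both parents we cannot say the allele is new.
        if pos not in mother or pos not in father:
            continue
        unexplained = alleles - (mother[pos] | father[pos])
        if unexplained:
            novel[pos] = unexplained
    return novel


# Made-up trio for illustration only.
child = {101: {"A", "G"}, 202: {"C"}, 303: {"T"}}
mother = {101: {"A"}, 202: {"C"}, 303: {"T", "C"}}
father = {101: {"A"}, 202: {"C", "T"}}

print(novel_in_child(child, mother, father))
```

In real data most hits from a comparison like this are likely to be genotyping errors, which is one reason the null hypothesis of "few or none" is a useful check.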

Choose a task below and put your name by it

  1. Upload build 34 into our db. Mark
  2. Put the HapMap data into our db. (what does strand column mean?) Chiki
  3. Create an API to access the data. Jeremy
  4. OMIM interface to db. Jeremy
  5. Need an Amino Acid and conservation API.
  6. Position in codon. (0-3)
  7. How well conserved (Since mouse)? (high hanging fruit)

Statistical tests

Matthew used relative risk to identify (correctly) the index of the mutated base associated with the mutant phenotype. The question is whether this is an appropriate statistical method, or whether something like a case-control or cohort study design would be better.
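
To make the comparison concrete, here is a minimal sketch that computes both relative risk and the odds ratio from the same 2x2 table. Relative risk is the natural measure when carriers and non-carriers are followed as a cohort; when samples are chosen by phenotype (case-control), the odds ratio is the measure usually reported. The counts are invented for illustration and do not come from Matthew's analysis.

```python
def relative_risk(carrier_cases, carrier_total, other_cases, other_total):
    """Risk of the phenotype among carriers divided by risk among non-carriers."""
    return (carrier_cases / carrier_total) / (other_cases / other_total)


def odds_ratio(carrier_cases, carrier_total, other_cases, other_total):
    """Odds of the phenotype among carriers divided by odds among non-carriers."""
    a = carrier_cases                   # carriers with the phenotype
    b = carrier_total - carrier_cases   # carriers without it
    c = other_cases                     # non-carriers with the phenotype
    d = other_total - other_cases       # non-carriers without it
    return (a * d) / (b * c)


# Hypothetical counts: 30 of 40 carriers and 10 of 60 non-carriers show the phenotype.
rr = relative_risk(30, 40, 10, 60)
or_ = odds_ratio(30, 40, 10, 60)
print(f"relative risk = {rr:.2f}, odds ratio = {or_:.2f}")
```

The two measures agree closely only when the phenotype is rare, so the choice matters for simulated data with high mutation rates.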

Analysis Ideas (December 13, 2005)

  • Association table linking genotype IDs to phenotype IDs.
  • Subpopulations:
    • 1. Classify mutations according to significant associations with phenotypes.
    • 2. Given a new genotype, assign to a phenotype class.
  • Trace genealogy.

Reference materials

Nov 29 Programming Tasks

Group pairings

  • Jeffrey/Matthew (human) - Team B
  • Mark/Chiki (human)
  • Jason/Morten (bacterial)

File formats

  • Reference genome - use an existing format mentioned in class for now (nonbinding suggestion)
    • GFF: an Exchange Format for Feature Description. GFF is a format for describing genes and other features associated with DNA, RNA and Protein sequences. The current specification can be found here
  • Copy genome - come up with a way to store only the differences (position and base change) based on reference genome (inter-group collaboration is encouraged here, since we will likely settle on a common format)
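
One possible shape for the copy-genome format is sketched below: a list of (position, new base) pairs relative to the reference. This is only a starting point for discussion, not an agreed format. It assumes 0-based positions and handles substitutions only; insertions and deletions would need a richer record.

```python
def diff_genome(reference, copy):
    """Return [(position, new_base), ...] where the copy differs from the reference."""
    if len(reference) != len(copy):
        raise ValueError("this sketch handles substitutions only")
    return [(i, c) for i, (r, c) in enumerate(zip(reference, copy)) if r != c]


def apply_diff(reference, diffs):
    """Rebuild the full copy genome from the reference plus its difference list."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


reference = "ACGTACGTAC"
copy = "ACGAACGTTC"
diffs = diff_genome(reference, copy)
print(diffs)                               # [(3, 'A'), (8, 'T')]
print(apply_diff(reference, diffs) == copy)
```

Storing the diffs as one "position, base" pair per line of text would keep the files easy to inspect and to exchange between groups.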

Genome Generation Tasks

  • Level 1
    • Input: Take a reference genome as input (you can start with a single chromosome, or section of a chromosome if the genome is too big)
    • Output A: Generate 100 copy genomes with a realistic mutation rate
    • Output B: Generate 100 copy genomes with an unrealistic, HIGH mutation rate
    • Output C: Generate 100 copy genomes with a realistic mutation rate, except at a specific 1% of positions, where the mutation rate is increased to 20%
  • Level 2
    • Input: Reference genome/chromosome, and a form of annotation (e.g. HapMap data)
    • Output: Generate 100 more "realistic" copy genomes, which keep the original mutation rate and also make sense in the context of the input annotation (e.g., SNPs should be consistent with common haplotypes, and if you delete an important start codon, the genome shouldn't be considered viable).
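
The Level 1 outputs could be produced along the lines of the sketch below. It assumes independent per-base substitutions only (no insertions or deletions). The rates 0.001 and 0.1 are placeholders chosen for illustration, not values given in class, and the function names are made up for this example.

```python
import random

BASES = "ACGT"


def mutate(reference, rate, hot_positions, hot_rate, rng):
    """Return a copy of the reference with independent per-base substitutions."""
    out = []
    for i, base in enumerate(reference):
        p = hot_rate if i in hot_positions else rate
        if rng.random() < p:
            # Substitute with one of the other bases, chosen uniformly.
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def make_copies(reference, n, rate, hot_fraction=0.0, hot_rate=0.2, seed=0):
    """Generate n copy genomes; returns (copies, set of hot positions)."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    n_hot = int(len(reference) * hot_fraction)
    hot = frozenset(rng.sample(range(len(reference)), n_hot))
    copies = [mutate(reference, rate, hot, hot_rate, rng) for _ in range(n)]
    return copies, hot


reference = "ACGT" * 250                            # stand-in for a chromosome section
output_a, _ = make_copies(reference, 100, 0.001)    # placeholder "realistic" rate
output_b, _ = make_copies(reference, 100, 0.1)      # placeholder HIGH rate
output_c, hot_c = make_copies(reference, 100, 0.001, hot_fraction=0.01)
print(len(output_c), "copies,", len(hot_c), "hot positions")
```

Level 2 would add a filter on top of this, for example rejecting copies that break an annotated start codon, which the sketch does not attempt.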

Genome Parsing/Analysis Tasks

  • Level 1
    • Input: Reference genome and 1 copy genome
    • Output: Copy genome in reference genome format
  • Level 2
    • Input: Reference genome and 100 copy genomes
    • Output: List of statistically significant mutation spots


  • Keep in mind that we will scale up these numbers soon. Small genome sections and few copy genomes are just so we have something tractable to discuss on Tuesday.
  • High mutation rates are so we can keep the number of copy genomes small and still be able to detect the mutated positions
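
For the Level 2 analysis task, one simple approach (an assumption here, not a method prescribed in class) is a one-sided binomial test at each position against the background mutation rate, with a Bonferroni correction for the number of positions tested. The sketch assumes the copy genomes have already been expanded to full sequences of the same length as the reference.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hot_spots(reference, copies, background_rate, alpha=0.05):
    """Return [(position, mutated_count, p_value)] for significant positions."""
    n = len(copies)
    threshold = alpha / len(reference)  # Bonferroni correction
    spots = []
    for pos, ref_base in enumerate(reference):
        k = sum(1 for c in copies if c[pos] != ref_base)
        p_value = binom_tail(k, n, background_rate)
        if p_value < threshold:
            spots.append((pos, k, p_value))
    return spots


# Toy data: position 4 is mutated in 8 of 20 copies.
reference = "ACGTACGTAC"
mutant = reference[:4] + "G" + reference[5:]
copies = [mutant] * 8 + [reference] * 12
spots = hot_spots(reference, copies, background_rate=0.01)
print([(pos, k) for pos, k, _ in spots])   # [(4, 8)]
```

With few copy genomes and a low background rate, a single mutation at a position is not significant, which is the reason for using high mutation rates in the first round.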

Nov 22 Presentation Topics

Use --~~~~ to sign up for a topic. Email us if you think of something better that you'd like to do. --smd 09:09, 18 November 2005 (EST)

  • Human genome resources background: UCSC browser[1], OMIM[2] --Cgupta 09:45, 18 November 2005 (EST)
  • HapMap[3],SNPs --Kaganov 14:33, 18 November 2005 (EST)
  • Microbial genome resources: KEGG[4], MetaCyc[5] --Jleith 15:24, 20 November 2005 (EST)
  • Simulated genomes
  • data representation: 2-bit and difference-map encoding
  • exact match searching, MatthewMeisel 23:00, 21 November 2005 (EST)
  • point changes and rearrangement detection
  • prioritization of mutations most likely to impact function (pos or neg) (PolyPhen) - Morten

Perl Exercises

  • write a Perl script to read in a DNA sequence from a file and print the following
    • original sequence
    • length of sequence
    • reverse complement (antisense)
    • GC content (%)
    • positions of any EcoRI restriction sites

You can write separate scripts for each task, or put everything in the same script.
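
The exercise asks for Perl; the sketch below uses Python only to show the logic of each step, so treat it as a reference for checking your own script rather than a solution to hand in. It assumes the file holds plain sequence text or a single FASTA record, and it reports EcoRI sites (GAATTC) as 0-based start positions. The function names are made up for this example.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_sequence(path):
    """Read a plain-text or single-record FASTA file into one string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Percentage of bases that are G or C."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def ecori_sites(seq):
    """0-based start positions of every GAATTC occurrence."""
    return [m.start() for m in re.finditer(r"(?=GAATTC)", seq.upper())]


seq = "AAGAATTCGGCC"   # use read_sequence("your_file.txt") for real input
print("sequence:          ", seq)
print("length:            ", len(seq))
print("reverse complement:", reverse_complement(seq))
print("GC content (%):    ", gc_content(seq))
print("EcoRI sites:       ", ecori_sites(seq))
```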

If anyone wants to see a snapshot of some interesting genes on their chromosome, the DOE's Office of Science has a handy "chromosome viewer"[6] --Jleith 13:26, 9 November 2005 (EST)

Download data from

  • Choose a human chromosome you plan to use (for example, as input for the program in the exercise above)
    • write a Perl script that reads the ASCII chromosome data and encodes the chromosome in memory (2 bits per basepair)
    • output this binary description as a file
    • read in the binary description from a file
    • output the ASCII description again
    • convince us your program is correct!
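
The round trip described above can be sketched as follows, again in Python to show the logic rather than as the Perl script the exercise asks for. The file layout (an 8-byte length header followed by the packed bases) is made up for this example and is not the UCSC .2bit format. The sketch also assumes the sequence contains only A, C, G and T; real chromosome files contain N runs and lowercase masking that a 2-bit code alone cannot represent.

```python
import struct

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq):
    """Pack a sequence at 4 bases per byte; returns (length, packed bytes)."""
    data = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        data[i // 4] |= CODE[base] << (2 * (i % 4))
    return len(seq), bytes(data)


def unpack(length, data):
    """Recover the ASCII sequence from its packed form."""
    return "".join(BASE[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))


def write_2bit(path, seq):
    """Write the length as an 8-byte little-endian header, then the packed bases."""
    length, data = pack(seq)
    with open(path, "wb") as out:
        out.write(struct.pack("<Q", length))
        out.write(data)


def read_2bit(path):
    """Read a file written by write_2bit and return the ASCII sequence."""
    with open(path, "rb") as handle:
        (length,) = struct.unpack("<Q", handle.read(8))
        return unpack(length, handle.read())


length, packed = pack("ACGTA")
print(length, list(packed))                # 5 [228, 0]
print(unpack(length, packed) == "ACGTA")
```

Storing the length explicitly matters because the last byte is padded with zero bits, which would otherwise decode as extra A bases. Comparing the decoded output against the original file, byte for byte, is one way to convince us the program is correct.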

Please work together on the wiki—you do not need to be the sole author of the programs you run—but you should run programs yourself on "your" chromosome.

Papers to review

  • Please take information from this section and move it to your user page (or wherever you feel is appropriate). After you have finished your review (a few paragraphs), be sure to mention what you have done (and where you did it) in your report of weekly activities for the class. Remember to keep Biophysics 101 Background up-to-date also.
  • Feel free to add papers here (for other people to review.)

Personalized Medicine