Biophysics 101 Software

From FreeBio


Personal Genome API


use PGUtils;

Put this at the top of your perl program, and put the file somewhere that @INC can find it. Then you will be able to access all the functions described below.


$sequence = substitute( $sequence, $position, $base_pair ); 


  • Takes a sequence,
  • a position (1-based index), and
  • base pair.

Returns the sequence with the base pair substituted.


$contig = get_contig_of_genotype($genotype_id, $contig_id, $group );


  • genotype_id,
  • contig_id
  • group

Returns the full sequence of the contig for that genotype.


@substitutions = get_substitutions_of_contig( $contig_id, $genotype_id, $group );
foreach $sub (@substitutions) {
   print $sub->{id} . "\t" . $sub->{contig_id} . "\t" . 
       $sub->{genotype_id} . "\t" . $sub->{position} . "\t" . $sub->{bp} . "\n";


  • contig_id
  • genotype_id,
  • group.

Returns the substitutions as an array of hashes.


 $sequence =  get_sequence_of_contig($contig_id, $group);


  • contig_id
  • database group,

Returns the sequence associated with that contig_id.


 @genotypes = get_genotypes_of_phenotype( $phenotype_id, $group );
 foreach $association (@genotypes) {
     print $association->{id} . "\t" . $association->{genotype_id} . "\n";


  • phenotype_id
  • a database group,

Returns all genotype_ids that have that pheontype_id as an array of hashes.


 @phenotypes = get_phenotypes_of_genotype( $genotype_id, $group ); 
 foreach $association (@phenotypes) {
     print $association->{id} . "\t" . $association->{phenotype_id} . "\n";


  • genotype_id and
  • a database group,

Returns all phenotype_ids that have that geontype_id as an array of hashes.


 @contigs = get_all_contigs_of_genotype($genotype_id, $group);
 foreach $contig (@contigs) {
   print "$contig\n";


  • genotype_id and
  • a database group,

Returns a list of contig_id's.


 $boolean =  has_sequence($genotype_id, $sequence, $group);


  • genotype_id
  • a sequence,
  • group database

Returns true if the sequence matches any subsequence of any contig, false otherwise.


 $subseq = get_sequence($genotype_id, $contig_id, $begin, $end, $group);


  • genotype_id,
  • contig_id,
  • begin and end positions,
  • database group

Return the subsequence of that genotype's contig sequence from $begin to $end-1.


 @result = result( $url );


  • a url,

Returns the csv values as an array of hashes, where each hash is keyed by the column name.


 $request = request($url);


  • a url,

Returns the csv values as a big string.


#!/usr/bin/perl -w

package PGUtils;
use strict;
require Exporter;
use LWP::UserAgent;
use CSV::Template;
our @ISA = qw(Exporter);
our @EXPORT = qw(substitute

sub substitute {
  my ($sequence, $ref, $value) = @_;
  substr( $sequence, $ref-1, 1) = $value;
  return $sequence;

# Given a genotype_id, returns the full sequence of the contig for that genotype.
sub get_contig_of_genotype{
  my ($genotype_id, $contig_id, $group) = @_;
  my $contig = get_sequence_of_contig( $contig_id, $group );
  my @res = get_substitutions_of_contig( $contig_id, $genotype_id, $group );
  for my $substitution ( @res ) {
    $contig = substitute( $contig, $substitution->{"position"}, $substitution->{"bp"} );
  return $contig;

sub get_substitutions_of_contig {
  my ($contig_id, $genotype_id, $group) = @_;
  my $url = "$group&contig_id=$contig_id&genotype_id=$genotype_id";
  return result(  $url ) ;

sub get_sequence_of_contig {
  my ($contig_id, $group) = @_;
  my $url = "$group&id=$contig_id";
  my @result =  result( $url );
  my $res = $result[0];
  return $res->{"sequence"};

# Given a Chromosome, create refrence contigs and put into contigs table.

# Given a Chromosome, create substituions for this chromosome against the refrence chromosome.

# Given a contig_id, sequence and a genotype_id, computes the substitutions against the reference contig_id and stores it in the substitutions table.
sub  put_contig_of_genotype{
  my ( $contig_id, $genotype_id, $sequence ) = @_;

# Given a phenotype_id, returns all genotype_ids that have that pheontype_id
sub get_genotypes_of_phenotype {
  my $phenotype_id = shift;
  my $group = shift;
  my $url = "$group&phenotype_id=$phenotype_id";
  return result( $url );

sub get_phenotypes_of_genotype {
  my $genotype_id = shift;
  my $group = shift;
  my $url = "$group&genotype_id=$genotype_id";
  return result( $url );

sub get_all_contigs_of_genotype {
  my ($genotype_id, $group) = @_;
  my $url = "$genotype_id&group=$group";
  my %contigs = ();
  foreach my $result (result( $url ) ) {
    if( !exists($contigs{$result->{"contig_id"}}) ) {
      $contigs{$result->{"contig_id"}} = 0;
  return keys %contigs;
# Given a genotype_id and a sequence, returns true if the sequence matches, false otherwise.
sub has_sequence {
  my ($genotype_id, $sequence, $group) = @_;
  foreach my $contig_id (get_all_contigs_of_genotype($genotype_id, $group) ) {
    my $contig_sequence = get_contig_of_genotype( $genotype_id, $contig_id, $group );
    if ($contig_sequence =~ m/\Q$sequence/i ) {
      return 1;
  return 0;

sub get_sequence {
 my ($genotype_id, $contig_id, $begin, $end, $group)  = @_;
 my $sequence = get_contig_of_genotype( $genotype_id, $contig_id, $group );
 return substr( $sequence, $begin, $end - $begin );  

### utilities to call server
## Utilities

sub result {
  my $url = shift;
  my $lines = request( $url );
  my ($db, $headers, @lines)  = split( "\n", $lines );
  my @results = ();
  my ($cols, $header) = split( "=", $headers);  
  my @column_header = split( ',', $header );

  foreach my $line (@lines ) {
    my %row = ();
    my @columns = split( ',', $line );
    #print join("\t", @columns) . "\n";
    for my $i (0..$#column_header) {
      $row{$column_header[$i]} = $columns[$i];
    push @results, \%row;
  #print join("\n", @results);
  return @results;

sub request {
  my ($url) = @_;
  my $ua = LWP::UserAgent->new;
  my $req = HTTP::Request->new(GET => "$url");
  $req->header(Accept => 'text/html');
  my $res = $ua->request($req);
    print $res->content;
    return  $res->content;
  } else {
    die "Error: " . $res->status_line . "\n";



=head1 NAME

PGUtils - A convenience library for accessing tables in

  use PGUtils;

Relevant Bioperl modules

Bio::PopGen for calculating Population Genetics

In order to calculate some interesting statistics on the Personal Genome data, I started looking into the Bio::PopGen BioPerl package.

Mapping the Bio::PopGen Object model to the 101 database schema

Bio::PopGen Object 101 Database
Bio::PopGen::Individual genotypes
Bio::PopGen::Genotype substitutions
Bio::PopGen::Genotype->marker_name() substitutions.position
Bio::PopGen::Genotype->individual_id() substitutions.genotype_id
Bio::PopGen::Genotype->get_Alleles() substitutions.bp

Note that we may want to refactor our database schema so that it conforms with the fields that are expected by the Bio::PopGen classes. For example, many genotypes will have the same marker, so it is probably a good idea to normalize the database to reflect that fact before we move too far forward.

It should be quite straightforward to populate the Bio::PopGen objects using the library.

Performing Population statistics with Bio::PopGen::Statistics

Now the fun begins. Given a Bio::PopGen::Population of Bio::PopGen::Individuals, you can calculate some Bio::PopGen::Statistics. If you decide to add some Bio::PopGen::Markers, you can then calculate some Bio::PopGen::PopStats. Finally, you can run a Bio::PopGen::Simulation by deciding how to Bio::PopGen::Simulation->addMutations().

Bio::PopGen::IO for reading and writing population data

Since this module can handle CSV format, it may be worthwhile creating a php script that produces data in the form Bio::PopGen::IO expects.

If we want to use actual HapMap data, there does not appear to be an IO module for it, but perhaps we could write one.

Finally, SNP data in prettybase format can be read into a Bio::PopGen::IO object.

Bio::Phenotype for handling of sophisticated phenotype data

At present, we are using very simple phenotypes. We may want to keep it that way. However, we discussed in class the idea of using more sophisticated Phenotypes, such as measured fluxes. We may also want to use real phenotypic data, such that we can actually map our genotypes to real diseases. Our database schema must be flexible enough to accommodate both microbial and human phenotypes Our database schema must also be able to assign phenotypes to combinations of markers.

After investigating the object model for the Bio::Phenotype module, I had several observations.

  • The object model has support for OMIM, MeSH, and biochemical Measures
  • It is possible to define your own genotypes and assign them to a phenotype.
  • It is possible to define your own measures and assign them to a phenotype.
  • As our database schema evolves, it would probably benefit us to adopt the Bio::Phenotype Object model.

--JeremyZucker 00:55, 18 December 2005 (EST)

Chinese Wall code

Random DNA sequence generator


Matthew's relative risk algorithm

2 bit DNA encoding

Matthew's second shot at 2-bit DNA encoding

Hotspot location algorithms

Matthew's Hotspot location algorithms

Aligned Genomes and statistically significant mutations

Chiki's Aligned Genomes and Statistically significant mutations