User:MatthewMeisel

From FreeBio

Linked genes?

On Mark's suggestion, I've created a short program that can be used to search for linked genes (actually SNPs) on different chromosomes. It uses the HapMap genotypes (redundant filtered) as the genotypic data.

The core code can be placed in a nested loop to search through as many SNPs as desired. Note, of course, that the running time of the program grows as n², where n is the number of SNPs on each chromosome that the program is testing.

Each comparison of two SNPs does the following. It takes as its input the genotypes of some number of individuals (95 in this case). It categorizes each of the individuals into one of nine categories: homozygous at one locus and homozygous at the other, homozygous at one locus and heterozygous at the other, homozygous at one locus and homozygous for the other allele at the other, etc. These correspond to the final variables $XX, $XY, $XZ, etc., that are reported at the end of the program.

In the course of programming, however, I realized that there's a problem in analyzing the raw data. FindNuc uses a simple relative risk calculation, but that isn't possible here because of the possibility of heterozygosity. The challenge is this: when an individual is heterozygous at a certain locus, is there any way of telling, a priori, which allele (or alleles) is dominant? I don't know the answer, which is why the program just reports the nine possible values without further analysis.

The code is below:

#!/usr/bin/perl -w 

use strict;



####################################
# Begin nested loops
####################################

my $x;
my $y;

for ($x = 0; $x < 20; $x++) {
	for ($y = 0; $y < 20; $y++) {


####################################
# Parse a line from each chromosome, and read genotypes into hashes
####################################

my $linein;

my $numIndivs = 95;	# dependent on HapMap input file

my $filenameA;
my $filenameB;

my $chrAaa;
my $chrAab;
my $chrAbb;

my $chrBaa;
my $chrBab;
my $chrBbb;

my %chrAgenos;
my %chrBgenos;

my $pos;
my $limit;

my $c;

$filenameA = "<genotypes_chr20_CEU.b35.txt";
$filenameB = "<genotypes_chr21_CEU.b35.txt";

open(CHR_A, $filenameA);

for ($c = 0; $c < $x+1; $c++) {
	<CHR_A>;	# skip the header line and the first $x SNP lines
}

$linein = <CHR_A>;
$linein = $linein . " ";	# append an extra space for parsing purposes

$pos = 0;

# skip 1 word
for ($c = 0; $c < 1; $c++ ) {

	$limit = index($linein," ",$pos);
	# print( substr($linein,$pos,$limit - $pos) . "\n");
	$pos = $limit + 1;

}

# parse second word for nucleotides
for ($c = 0; $c < 1; $c++ ) {

	my $temp;

	$limit = index($linein," ",$pos);
	$temp = substr($linein,$pos,$limit - $pos);
	$pos = $limit + 1;

	$chrAaa = substr($temp,0,1) . substr($temp,0,1);
	$chrAab = substr($temp,0,1) . substr($temp,2,1);
	$chrAbb = substr($temp,2,1) . substr($temp,2,1);

}

#skip 9 words
for ($c = 0; $c < 9; $c++ ) {

	$limit = index($linein," ",$pos);
	# print( substr($linein,$pos,$limit - $pos) . "\n");
	$pos = $limit + 1;

}

# parse 95 genotypes
for ($c = 0; $c < $numIndivs; $c++ ) {


	$limit = index($linein," ",$pos);
	$chrAgenos{$c} = substr($linein,$pos,$limit - $pos);
	# print($chrAgenos{$c} . " ");
	$pos = $limit + 1;

}

close(CHR_A);


open(CHR_B, "<", $filenameB) or die "can't open $filenameB: $!";

for ($c = 0; $c < $y+1; $c++) {
	<CHR_B>;	# skip the header line and the first $y SNP lines
}

$linein = <CHR_B>;
$linein = $linein . " ";	# append an extra space for parsing purposes

$pos = 0;

# skip 1 word
for ($c = 0; $c < 1; $c++ ) {

	$limit = index($linein," ",$pos);
	# print( substr($linein,$pos,$limit - $pos) . "\n");
	$pos = $limit + 1;

}

# parse second word for nucleotides
for ($c = 0; $c < 1; $c++ ) {

	my $temp;

	$limit = index($linein," ",$pos);
	$temp = substr($linein,$pos,$limit - $pos);
	$pos = $limit + 1;

	$chrBaa = substr($temp,0,1) . substr($temp,0,1);
	$chrBab = substr($temp,0,1) . substr($temp,2,1);
	$chrBbb = substr($temp,2,1) . substr($temp,2,1);

}

#skip 9 words
for ($c = 0; $c < 9; $c++ ) {

	$limit = index($linein," ",$pos);
	# print( substr($linein,$pos,$limit - $pos) . "\n");
	$pos = $limit + 1;

}


# parse 95 genotypes
for ($c = 0; $c < $numIndivs; $c++ ) {


	$limit = index($linein," ",$pos);
	$chrBgenos{$c} = substr($linein,$pos,$limit - $pos);
	# print($chrBgenos{$c} . " ");
	$pos = $limit + 1;

}

close(CHR_B);






####################################
# Calculate correlation values.
# Note that NN genotypes are skipped.
####################################
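# X, Y, and Z stand for the genotypes aa, ab, and bb at the first SNP;
# the second letter does the same for the second SNP. So, e.g., $YZ
# counts individuals heterozygous at SNP A and homozygous for the
# second allele at SNP B.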

my $XX = 0;
my $XY = 0;
my $XZ = 0;
my $YX = 0;
my $YY = 0;
my $YZ = 0;
my $ZX = 0;
my $ZY = 0;
my $ZZ = 0;

for ($c = 0; $c < $numIndivs; $c++) {

	my $tempAgeno = $chrAgenos{$c};
	my $tempBgeno = $chrBgenos{$c};

	if (($tempAgeno eq $chrAaa) && ($tempBgeno eq $chrBaa)) { $XX++; }
	if (($tempAgeno eq $chrAaa) && ($tempBgeno eq $chrBab)) { $XY++; }
	if (($tempAgeno eq $chrAaa) && ($tempBgeno eq $chrBbb)) { $XZ++; }

	if (($tempAgeno eq $chrAab) && ($tempBgeno eq $chrBaa)) { $YX++; }
	if (($tempAgeno eq $chrAab) && ($tempBgeno eq $chrBab)) { $YY++; }
	if (($tempAgeno eq $chrAab) && ($tempBgeno eq $chrBbb)) { $YZ++; }

	if (($tempAgeno eq $chrAbb) && ($tempBgeno eq $chrBaa)) { $ZX++; }
	if (($tempAgeno eq $chrAbb) && ($tempBgeno eq $chrBab)) { $ZY++; }
	if (($tempAgeno eq $chrAbb) && ($tempBgeno eq $chrBbb)) { $ZZ++; }

}

print($XX . "\n");
print($XY . "\n");
print($XZ . "\n");
print($YX . "\n");
print($YY . "\n");
print($YZ . "\n");
print($ZX . "\n");
print($ZY . "\n");
print($ZZ . "\n");




####################################
# An aborted attempt to calculate a single relative
# risk value through a 3x3 matrix determinant
####################################

my $determinant;


#$determinant = ($XX * $YY * $ZZ) + ($XY * $YZ * $ZX) + ($XZ * $YX * $ZY) - ($XZ * $YY * $ZX) - ($YZ * $ZY * $XX) - ($ZZ * $XY * $YX);
#print("det is " . $determinant);




####################################
# An aborted attempt to calculate four relative risk values.
# "dominant dominant" assumes that the first
#	listed nucleotide on each chromomsome 
#	is dominant, etc.
####################################

my $rrDD;
my $rrDR;
my $rrRD;
my $rrRR;

# DD case:

$rrDD = ($XX + $XY + $YX + $YY) * $ZZ - ($XZ + $YZ) * ($ZX + $ZY);
$rrDR = ($XX + $YX) * ($ZY + $ZZ) - ($XY + $XZ + $YY + $YZ) * $ZX;
$rrRD = ($XX + $XY) * ($YZ + $ZZ) - ($YX + $YY + $ZX + $ZY) * $XZ;
$rrRR = $XX * ($YY + $YZ + $ZY + $ZZ) - ($XY + $XZ) * ($YX + $ZX);

#print($rrDD . "\n");
#print($rrDR . "\n");
#print($rrRD . "\n");
#print($rrRR . "\n");




####################################
# nested loop closed braces
####################################

	}
}

FindNuc: a relative risk algorithm

Pasted below is a program that takes as its input tables of contigs, substitutions, and associations. (In this case, the only associations it can handle are a "yes" or a "no" value for each genotype.) The output is the nucleotide position that has the greatest relative effect on the phenotype.

For each position p, the program queries the database to tabulate four values:

  • A, the number of genotypes that are wildtype (match the reference sequence) at position p and have phenotype 0
  • B, the number of genotypes that are wildtype at position p and have phenotype 1
  • C, the number of genotypes that are mutated (do not match the reference sequence) at position p and have phenotype 0
  • D, the number of genotypes that are mutated (do not match the reference sequence) at position p and have phenotype 1

The relative risk ratio r is calculated for that position p, where r is the relative probability that the nucleotide at p is responsible for the phenotypic change. The formula for r (matching the code below) is r = AD - CB, the cross-product difference of the 2x2 table.
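For instance, with hypothetical counts A = 40, B = 10, C = 5, D = 45 at some position (a mutation strongly associated with phenotype 1), r = 40 × 45 - 5 × 10 = 1750; a position with no association, say A = B = C = D = 25, gives r = 625 - 625 = 0.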

Finally, FindNuc outputs the position p that gives the highest relative risk ratio r.

#!/usr/bin/perl -w 

use strict; 
use LWP::UserAgent;





####################################
# Request and process contigs
####################################

my $ua = LWP::UserAgent->new;

my $query_id;
my $group; 

my $query;
my $req;
my $res;

my $refseq = "";	# start empty; the parse loop appends to it
my $temprefseq;

my $resLocal;
my $thirdLine;

$query_id = 1;
$group = "b";

#
# construct URL of contig query
#

$query = 'http://personalgenome.org/101/all-contigs.php?'.
	"group=$group";

$req = HTTP::Request->new(GET => "$query");
$req->header(Accept => 'text/html');

#        
# send request to server and get response back
#

$res = $ua->request($req);

#            
# check the outcome
#

if ($res->is_success) {
	#print $res->content . "\n\n";  # for instance
	$temprefseq = $res->content;
} else {
	print "Error: " . $res->status_line . "\n";
}

#
# Parse the contigs
#

open(REFINPUT, "+<", \$temprefseq);

<REFINPUT>;
<REFINPUT>;

my $c;
my $pos;
my $limit;
my $nextLetter;

my $numMutants;
my $mutRate;

$c = 0;
$pos = 0;

while ($thirdLine = <REFINPUT>) {

	#print $thirdLine . "\n";

	$limit = index($thirdLine,",",$pos);
	$pos = $limit + 1;

	#
	# Take all the characters of the sequence field, up to the next comma.
	#

	while ( ($nextLetter = substr($thirdLine,$pos,1)) ne "," ) {
		$refseq .= $nextLetter;
		$pos++;
	}

}

#print $refseq . "\n";;

close(REFINPUT);






####################################
# Request and process associations
####################################

#
# Initialize all hash table values to zero (pheno = 0 implies wildtype).
# This code should use the hash-specific forall function
# (or equivalent) at some point in the future.
#

my %pheno;

my %wildGwildP;
my %wildGmutP;
my %mutGwildP;
my %mutGmutP;

my $constMutants = 100;

my $reflength;

$reflength = length($refseq);

for ($c = 0; $c < $reflength; $c++ ) {
	$wildGwildP{$c} = $wildGmutP{$c} = $mutGwildP{$c} = $mutGmutP{$c} = 0;
}

for ($c = 0; $c < $constMutants; $c++ ) {
	$pheno{$c} = "0";
}


#
# construct URL of association query
#

$query = 'http://www.personalgenome.org/101/all-association.php?'. 
	"group=$group"; 
             
#print "query $query\n";
          
    
$req = HTTP::Request->new(GET => "$query");
$req->header(Accept => 'text/html');

#        
# send request to server and get response back
#

$res = $ua->request($req);

#            
# check the outcome
#

if ($res->is_success) {
	#print $res->content . "\n\n";  # for instance
	$resLocal = $res->content;
} else {
	print "Error: " . $res->status_line . "\n";
}

#
# Parse the associations
#

my $id;
my $genotype_id;
my $contig_id;
my $position;
my $bp;

my $phenotype_id;


open(ASSOCINPUT, "+<", \$resLocal);

<ASSOCINPUT>;
<ASSOCINPUT>;

$c = 0;
$pos = 0;

while ($thirdLine = <ASSOCINPUT>) {

	$pos = 0;

	$phenotype_id = "";

	#print $thirdLine . "\n";

	$limit = index($thirdLine,",",$pos);
	$id = substr($thirdLine,$pos,$limit - $pos);
	#print "id is " . $id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$genotype_id = substr($thirdLine,$pos,$limit - $pos);
	#print "genotype_id is " . $genotype_id . "\n";
	$pos = $limit + 1;

	#
	# Finish the parsing by taking all the characters to the end of the line.
	#

	while ( ($nextLetter = substr($thirdLine,$pos,1)) ne "\n" ) {
		$phenotype_id .= $nextLetter;
		$pos++;
	}

	#print "phenotype_id is " . $phenotype_id . "\n";

	if ($phenotype_id eq "23") {
		$pheno{$genotype_id} = "1";	# mutant
	}

}

close (ASSOCINPUT);

#for ($c = 0; $c < $constMutants; $c++ ) {
#	print $c . ": " . $pheno{$c} . "\n";
#}



####################################
# Request and process substitutions
####################################


#
# construct URL of substitution query
#

$query = 'http://personalgenome.org/101/all-subst.php?'. 
	"group=$group"; 
             
#print "query $query\n";
          
    
$req = HTTP::Request->new(GET => "$query");
$req->header(Accept => 'text/html');

#        
# send request to server and get response back
#

$res = $ua->request($req);

#            
# check the outcome
#

if ($res->is_success) {
	#print $res->content . "\n\n";  # for instance
	$resLocal = $res->content;
} else {
	print "Error: " . $res->status_line . "\n";
}


#
# Parse the substitutions
#

$c = 0;

$numMutants = 0;

open(INPUT, "+<", \$resLocal);

<INPUT>;
<INPUT>;

while ($thirdLine = <INPUT>) {

	$pos = 0;
	$bp = "";	# reset each line; the .= below would otherwise accumulate

	#print $thirdLine . "\n";

	$limit = index($thirdLine,",",$pos);
	$id = substr($thirdLine,$pos,$limit - $pos);
	#print "id is " . $id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$genotype_id = substr($thirdLine,$pos,$limit - $pos);
	#print "genotype_id is " . $genotype_id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$contig_id = substr($thirdLine,$pos,$limit - $pos);
	#print "contig_id is " . $contig_id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$position = substr($thirdLine,$pos,$limit - $pos);
	#print "position is " . $position . "\n";
	$pos = $limit + 1;

	#
	# Finish the parsing by taking all the characters to the end of the line.
	#

	while ( ($nextLetter = substr($thirdLine,$pos,1)) ne "\n" ) {
		$bp .= $nextLetter;
		$pos++;
	}

	#print "genotype_id is " . $genotype_id . "\n";

	if ($pheno{$genotype_id} eq "0") { $mutGwildP{$position}++; }
	if ($pheno{$genotype_id} eq "1") { $mutGmutP{$position}++; }

	$numMutants++;

}

close(INPUT);

#
# Do meaty stuff here.
#

my $temphighrelrisk;
my $highnuc;

my $highrelrisk = 0;

for ($c = 0; $c < $reflength; $c++ ) {

	$wildGwildP{$c} = $constMutants - $mutGwildP{$c};
	$wildGmutP{$c} = $constMutants - $mutGmutP{$c};

	$temphighrelrisk = $wildGwildP{$c} * $mutGmutP{$c} - $mutGwildP{$c} * $wildGmutP{$c};

	if ($temphighrelrisk > $highrelrisk) {

		print "old risk was " . $highrelrisk . ", ";
		$highrelrisk = $temphighrelrisk;
		$highnuc = $c;
		print "new risk is " . $highrelrisk . " at nucleotide " . $highnuc . "\n";

	}

}


print "highnuc is " . $highnuc . "\n";

Hotspot locating on database contigs and substitutions

Also as a zip file.

#!/usr/bin/perl -w 

use strict; 
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;


 
my $query_id;
my $group; 

my $query;
my $req;
my $res;

my $resLocal;
my $thirdLine;

my $id;
my $genotype_id;
my $contig_id;
my $position;
my $bp;

my $c;
my $pos;
my $limit;
my $nextLetter;

my $refseq = "";	# start empty; the parse loop appends to it
my $temprefseq;

my $reflength;
my $linecounter;
my %Acount;
my %Ccount;
my %Gcount;
my %Tcount;
my $nonmutbase;
my $numMutants;
my $mutRate;

my $expecMut;
my $expecNonMut;
my $chisq;
my $hotposition;
my $nonMutTotal;

my $Acounttemp;
my $Gcounttemp;
my $Ccounttemp;
my $Tcounttemp;

$query_id = 1;
$group = "b";



#
# construct URL of contig query
#

$query = 'http://personalgenome.org/101/all-contigs.php?'.
	"group=$group";

$req = HTTP::Request->new(GET => "$query");
$req->header(Accept => 'text/html');

#        
# send request to server and get response back
#

$res = $ua->request($req);

#            
# check the outcome
#

if ($res->is_success) {
	#print $res->content . "\n\n";  # for instance
	$temprefseq = $res->content;
} else {
	print "Error: " . $res->status_line . "\n";
}




#
# construct URL of substitution query
#

$query = 'http://personalgenome.org/101/all-subst.php?'. 
	"group=$group"; 
             
#print "query $query\n";
          
    
$req = HTTP::Request->new(GET => "$query");
$req->header(Accept => 'text/html');

#        
# send request to server and get response back
#

$res = $ua->request($req);

#            
# check the outcome
#

if ($res->is_success) {
	#print $res->content . "\n\n";  # for instance
	$resLocal = $res->content;
} else {
	print "Error: " . $res->status_line . "\n";
}



#
# Parse the contigs
#

open(REFINPUT, "+<", \$temprefseq);

<REFINPUT>;
<REFINPUT>;

$c = 0;
$pos = 0;

while ($thirdLine = <REFINPUT>) {

	#print $thirdLine . "\n";

	$limit = index($thirdLine,",",$pos);
	$pos = $limit + 1;

	#
	# Take all the characters of the sequence field, up to the next comma.
	#

	while ( ($nextLetter = substr($thirdLine,$pos,1)) ne "," ) {
		$refseq .= $nextLetter;
		$pos++;
	}

}

#print $refseq . "\n";;

close(REFINPUT);


#
# Initialize all base mutation counter hashes to be 0.
# This code should use the hash-specific forall function
# (or equivalent) at some point in the future.
#

$reflength = length($refseq);

for ($c = 0; $c < $reflength; $c++ ) {
	$Acount{$c} = $Ccount{$c} = $Tcount{$c} = $Gcount{$c} = 0;
}



#
# Parse the substitutions
#

$c = 0;

$numMutants = 0;

open(INPUT, "+<", \$resLocal);

<INPUT>;
<INPUT>;

while ($thirdLine = <INPUT>) {

	$pos = 0;
	$bp = "";	# reset each line; without this, $bp accumulates and the base counts below are wrong

	#print $thirdLine . "\n";

	$limit = index($thirdLine,",",$pos);
	$id = substr($thirdLine,$pos,$limit - $pos);
	#print "id is " . $id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$genotype_id = substr($thirdLine,$pos,$limit - $pos);
	#print "genotype_id is " . $genotype_id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$contig_id = substr($thirdLine,$pos,$limit - $pos);
	#print "contig_id is " . $contig_id . "\n";
	$pos = $limit + 1;

	$limit = index($thirdLine,",",$pos);
	$position = substr($thirdLine,$pos,$limit - $pos);
	#print "position is " . $position . "\n";
	$pos = $limit + 1;

	#
	# Finish the parsing by taking all the characters to the end of the line.
	#

	while ( ($nextLetter = substr($thirdLine,$pos,1)) ne "\n" ) {
		$bp .= $nextLetter;
		$pos++;
	}

	#print "bp is " . $bp . "\n";

	if ($bp eq "A") { $Acount{$position}++; }
	if ($bp eq "G") { $Gcount{$position}++; }
	if ($bp eq "C") { $Ccount{$position}++; }
	if ($bp eq "T") { $Tcount{$position}++; }

	$numMutants++;

}

close(INPUT);

#
# Do chi-square calculations.
#


$mutRate = 1/200;
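# Null model: each position mutates at rate $mutRate, with mutations
# split evenly among the three non-reference bases. So each mutant base
# at a position has expected count ($mutRate/3)*$numMutants, the rest of
# the records are expected non-mutant, and the four cells (three mutant
# bases plus non-mutant) give a chi-square statistic with 3 degrees of freedom.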

for ($c = 0; $c < $reflength; $c++) {

	# save a few calls to the hash

	$Acounttemp = $Acount{$c};
	$Gcounttemp = $Gcount{$c};
	$Ccounttemp = $Ccount{$c};
	$Tcounttemp = $Tcount{$c};

	#print "Acount" . $c . " is " . $Acounttemp . ", ";
	#print "Gcount" . $c . " is " . $Gcounttemp . ", ";
	#print "Ccount" . $c . " is " . $Ccounttemp . ", ";
	#print "Tcount" . $c . " is " . $Tcounttemp . ". \n";

	#print $c;

	$expecMut = $mutRate / 3 * $numMutants;
	$expecNonMut = $numMutants - (3 * $expecMut);

	$nonmutbase = substr($refseq, $c, 1);

	#print "non mutant base is " . $nonmutbase . "\n";
	#print "expected mutants is " . $expecMut . "\n";
	#print "expected nonmuts is " . $expecNonMut . "\n";


	if ($nonmutbase eq "A") { 
		$nonMutTotal = $numMutants - $Gcounttemp - $Ccounttemp - $Tcounttemp;
		$chisq = ($nonMutTotal - $expecNonMut)**2 / $expecNonMut 
			+ ( ($Gcounttemp - $expecMut)**2 
			+   ($Ccounttemp - $expecMut)**2 
			+   ($Tcounttemp - $expecMut)**2 ) / $expecMut;
	}
	if ($nonmutbase eq "G") {
		$nonMutTotal = $numMutants - $Acounttemp - $Ccounttemp - $Tcounttemp;
		$chisq = ($nonMutTotal - $expecNonMut)**2 / $expecNonMut 
			+ ( ($Acounttemp - $expecMut)**2 
			+   ($Ccounttemp - $expecMut)**2 
			+   ($Tcounttemp - $expecMut)**2 ) / $expecMut;
	}
	if ($nonmutbase eq "C") {
		$nonMutTotal = $numMutants - $Acounttemp - $Gcounttemp - $Tcounttemp;
		$chisq = ($nonMutTotal - $expecNonMut)**2 / $expecNonMut 
			+ ( ($Acounttemp - $expecMut)**2 
			+   ($Gcounttemp - $expecMut)**2 
			+   ($Tcounttemp - $expecMut)**2 ) / $expecMut;
	}
	if ($nonmutbase eq "T") {
		$nonMutTotal = $numMutants - $Acounttemp - $Gcounttemp - $Ccounttemp;
		$chisq = ($nonMutTotal - $expecNonMut)**2 / $expecNonMut 
			+ ( ($Acounttemp - $expecMut)**2 
			+   ($Gcounttemp - $expecMut)**2 
			+   ($Ccounttemp - $expecMut)**2  ) / $expecMut;
	}

	#
	# compare to table of values with p = 0.05. for 3 df, value is 7.82.
	#

	#print "loop " . $c . "\n";

	if ($chisq >= 7.82) {
		print "The chi-square value is " . $chisq . ". There is a significant mutation at " . $c . ".\n";
	}
	#else {
	#	print "The chi-square value is " . $chisq . ". There is not a significant mutation at " . $c . ".\n";
	#}

}

Revised hotspot locating: simple hash implementation

An updated hotspot locator that uses hash tables.

Copy format to reference format, hotspot locating

The Nov. 29 assignment.

A second shot at 2-bit DNA encoding

This is a second attempt at what would more properly be called an encoding algorithm from ASCII to binary.

Two programs are written, encode.pl and decode.pl. Encode.pl takes an ASCII file as input (the file name is hard-coded) and outputs binary code in a file called encoded_sequence.es. Decode.pl takes a file called encoded_sequence.es as its input (again, the name is hard-coded) and converts it back to ASCII code.
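The core round trip, in sketch form (this assumes a straight A=00, C=01, G=10, T=11 mapping and uses Perl's vec for the bit packing; the actual scripts in the ZIP may differ):

#!/usr/bin/perl -w

use strict;

# assumed mapping: A=00, C=01, G=10, T=11
my %enc = (A => 0, C => 1, G => 2, T => 3);
my @dec = ('A', 'C', 'G', 'T');

my $seq = "ACGTTGCA";
my $len = length($seq);

# encode: two bits per base, packed into a byte string via vec()
my $packed = "";
for (my $i = 0; $i < $len; $i++) {
	vec($packed, $i, 2) = $enc{substr($seq, $i, 1)};
}

# view the raw 0s and 1s of the packed bytes
print unpack("B*", $packed) . "\n";

# decode: read the same 2-bit fields back out; note that the original
# length must be stored somewhere, since the last byte may carry padding
# (a likely source of extra or missing bases at the ends)
my $decoded = "";
for (my $i = 0; $i < $len; $i++) {
	$decoded .= $dec[vec($packed, $i, 2)];
}
print $decoded . "\n";	# prints ACGTTGCA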

The program is successful on the shorter test sequence that is included in this ZIP file, except that it adds a few base pairs at the beginning and drops a few at the end. This, of course, is a serious problem, and I plan to work on debugging it. My more immediate concern, however, is that decode.pl fails on the larger test file, although encode.pl seems to encode it successfully.

One serious difficulty I've encountered is determining, when decode.pl fails, which program is at fault, since a faulty encoding would lead to a faulty decoding. It would be helpful to find some program that lets me view the 0s and 1s of a binary file. Any suggestions?

Compressing DNA sequence files

A first attempt at a compression algorithm from ASCII to binary. But I have no way of telling whether it worked, because I can't figure out how to output the binary variable $out. Thoughts?
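A sketch of what might work, assuming $out holds the packed byte string the script builds (the file name and stand-in data here are just for illustration):

#!/usr/bin/perl -w

use strict;

# stand-in for the packed binary string built by the real script
my $out = pack("B*", "0001101100011011");

# write the raw bytes; binmode keeps newline translation from mangling them
open(OUT, ">", "test_compressed.bin") or die "can't write: $!";
binmode(OUT);
print OUT $out;
close(OUT);

# view the 0s and 1s of the binary string directly
print unpack("B*", $out) . "\n";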

"Hello DNA" program

...as a zip file.

Carbon sequestration and competition

According to the Wikipedia page for the carbon cycle [1], there are about 43 trillion tons of carbon circulating throughout the atmosphere, oceans, and biosphere. Each year, 5.5 billion tons of carbon (about 0.013%) are added to the atmosphere through fossil fuel combustion and cement production. Additionally, 200 million tons of carbon (about 1/2,000th of 1%) are removed through sedimentation of deep ocean carbon.

Currently, there is substantial research into sequestering excess atmospheric carbon as a way of mitigating global warming and ocean acidification. (For example, see the Princeton Carbon Mitigation Initiative [2].) But if a long-term energy solution for the planet will rely on photosynthetic organisms, then I believe there may be an opposite risk: once fossil fuels run dry, the net flux of carbon will be from the carbon cycle into the earth's crust, essentially sequestered permanently. This would deprive the photosynthetic organisms (and the entire carbon cycle) of sufficient carbon dioxide to create biodiesel fuels, either because the atmospheric CO2 concentration isn't high enough to sustain efficient photosynthesis, or because there isn't enough net carbon in the carbon cycle to meet the earth's energy needs.

I have had difficulty locating literature that discusses the possibility of a long-term loss of carbon from the carbon cycle, and it's possible that it's not a long-term concern. (After all, for millions of years, there was presumably no net increase of the amount of carbon in the carbon cycle, since there was no fossil fuel combustion. If the amount of carbon in the carbon cycle decreases to a certain point, could there be a dynamic equilibrium between sedimentation and some opposite process?) There is a hypothesis that sedimented carbon will eventually be converted back into fossil fuels (and that this is how fossil fuels formed in the first place), in which case fossil fuels could again be mined and combusted -- but we would have to know the relative rates of sequestration and conversion to fossil fuels.

While we've discussed the effects of the limited amount of sunlight on the earth -- there may be competition for sunlight between biodiesel algae and agricultural crops -- there may also be a limited amount of carbon that constrains total bioproductivity on earth. In other words, there may also be competition for carbon between biodiesel algae and agricultural crops. This, again, is somewhat speculative, and a more careful analysis would require data on the efficiency of both biodiesel algae and agricultural crops in terms of stored useful energy per gram of carbon (as well as stored useful energy per quantity of light used).

This kind of analysis may also be important with regard to nitrogen, sulfur, phosphorus, or any other element necessary for life.

Current and future energy use

...on the energy usage page.

A scale of personalized medicine

...on the Personalized Medicine Scale page.

Energy transitions in history

I've browsed a few secondary sources on the subject of the major energy transitions in human history, particularly the transitions from wood to coal and from coal to oil. I've assembled a list of what I'll call "truths" that I believe accurately describe the effects of both transitions, and then I reexamine these truths to see if they might give us an idea of what we could expect from another major change in our energy source: from oil to algae biodiesel.

  • Once an energy source is well established, humans have shown a rather extreme reluctance to switch energy sources. Transitions between energy sources are exceedingly slow, both within nations and worldwide.
As I was trying to find a few books on the shelves in Widener Library, I was amazed at the rather extreme number of books, almost all of them published within the last 30 years, touting the benefits of energy technologies that we might be able to agree, as a group, have little long-term viability today, namely wood and coal. (There are dozens of proposals about firewood in particular, claiming that pushing firewood usage is the answer to rural energy needs in developing countries.) Even after millennia of using both sources, which pollute heavily and are increasingly scarce, there seems to be a large contingent (including many politicians) who want to continue using older fuels.
Coal was first burned in England in Roman times, but it was used only intermittently until the thirteenth century and did not come into widespread use until the early seventeenth century. By 1876, half of the U.S.'s energy came from firewood, although over three-quarters came from coal just 25 years later (Barbara Freese, "Coal: A Human History," 137). In 2003, coal still provided the U.S. with 23% of its energy (oil and natural gas together provide 67%) [3]. And even though China is the world's second-largest petroleum consumer today, coal still provides China with two-thirds of its energy needs [4]. Similarly, oil had been discovered and used sparingly for millennia before the 20th century: oil deposits near Mesopotamia were known and used as early as 3000 BCE (Harold Williamson and Arnold Daum, "The American Petroleum Industry," vol. 1, 4).
  • Concerns about pollution tend to concern the immediate, not long-term, effects.
Widespread coal burning in England did not begin until the thirteenth century, and even then its usage was sparse. But by the early fourteenth century, laws were already being passed limiting the burning of coal. This was not because of long-term concerns about climate change (as we consider today); it was simply that the smoke given off locally by the burning of coal was so noxious as to make city life increasingly miserable. There was little concern at the time about strip mining, runoff from mines, or long-term atmospheric change.
  • New technologies drive the demand for new fuels, but new fuels also generate new technologies.
No special technology is required to burn coal, but the ability to create an iron furnace undoubtedly helped coal catch on as a viable fuel. The most widely applied use of the early steam engines, for example, was pumping water out of coal mines, and steam technology was used only for this purpose for several decades (Freese 59-60). But improvements in the engine, along with the widespread availability of coal, led to the creation of Watt's steam engine, which in turn led to the development of the myriad mechanizations of the industrial revolution. In the 19th-century U.S., lamp technology was developing faster than the oils that might supply the lamps, which in turn drove the innovation of new oil mining and refining techniques (Williamson vol. 1, 27-42), and later the development of the internal combustion engine.
  • There have often been battles over the control of energy sources.
This point sounds rather obvious, but I believe it's still worth noting. For example, when King Henry VIII dissolved the monasteries in England in the 1530s, he had several reasons for doing so, not the least of which was the Church's refusal to grant him an annulment. However, the Church also owned the best coal fields in England, and the Crown benefitted substantially from taking over these lands (Freese 28-29). And, of course, 20th- and 21st-century history has several examples of military actions where countries have tried to gain control of oil-rich regions. (Without mentioning any post-2000 wars, the example of Iraq invading Kuwait in 1990 suffices.)

Now I'll briefly consider how these points might apply to widespread biodiesel production by algae, and attempt to analyze how each of them may or may not continue to hold true.

  • Once an energy source is well established, humans have shown a rather extreme reluctance to switch energy sources. Transitions between energy sources are exceedingly slow, both within nations and worldwide.
If algae biodiesel is shown to be viable, how quickly would we try it on a larger scale? Given past reluctance to move away from established energy technologies, I'm going to guess that the progress of biodiesel algae will be most affected by the scarcity of oil. (Then again, I suppose this is where we, as scientists, should put our lobbying if we believe it's a better alternative now.)
  • Concerns about pollution tend to concern the immediate, not long-term, effects.
While it acknowledges possible problems, the UNH paper makes only cursory mention of the long-term effects of massive algae ponds. Increased evaporation into the atmosphere? Changing of desert climates? Destruction of natural habitats? Mutant algae? As with fossil fuels, it may take hundreds of years for us to discover detrimental effects of the technology.
  • New technologies drive the demand for new fuels, but new fuels also generate new technologies.
This speaks to the question of infrastructure in developing algae biodiesel. On the consumer end, all the technology seems to be in place: internal combustion engines, filling stations, pipelines, etc. This is in sharp contrast to coal and oil -- when they were first discovered, there was virtually no way to use them efficiently, or at least as efficiently as we use them today. Optimistically, biodiesel algae may even drive the production of new technologies that consume it, as coal and oil did. But unlike fossil fuels, which are relatively easy to produce (you dig a mine or a well), production of algae biodiesel may remain a problem, and so the start-up costs of producing algae biodiesel, far higher than those of digging a mine, may limit biodiesel production.
  • There have often been battles over the control of energy sources.
This, I believe, will depend on the price of creating and maintaining algae ponds. If it can be done cheaply, it will be far cheaper for a nation to produce its own ponds than to invade other nations for theirs. However, I worry that the unequal distribution of sunlight around the globe may lead some colder nations to take more drastic steps to ensure their energy supply.

Complexity and Randomness

I'll cautiously adopt Jason's definition of a "good definition" -- that is, one that is both intuitive and useful -- in trying to define randomness and complexity. I'm not sure my proposed definition meets either criterion, but I'll give it a stab anyway.

I, like Morten, believe that complexity has to do with predictability, so let's try to define predictability mathematically. To explore this, I'll use the example of trying to find the next number in a sequence.

Given the sequence 0, 0, 0, 0, 0, 0, we can guess the next number will be 0. Given the sequence 0, 1, 2, 3, 4, 5, we can guess the next number will be 6, because the difference between each adjacent pair of numbers is constant. Given the sequence 0, 1, 3, 6, 10, 15, we can guess that the next number will be 21, because the difference between each adjacent pair of numbers is a second sequence for which the difference between each adjacent pair of numbers is constant.

And so on. The above example is mathematically trivial in that sequences of this sort can all be written as polynomial functions, but each sequence requires one more level of analysis than the sequence before.
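A quick sketch of that counting for the polynomial case (the subroutine names are mine, just for illustration):

#!/usr/bin/perl -w

use strict;

# returns 1 if every element of the list equals the first
sub all_equal {
	my @a = @_;
	for (my $i = 1; $i < @a; $i++) {
		return 0 if $a[$i] != $a[0];
	}
	return 1;
}

# Count the rounds of adjacent differences ("levels of analysis")
# needed before a sequence becomes constant, then predict the next
# term by summing the last entry of each difference row.
# Only terminates for polynomial sequences like the ones above.
sub levels_and_next {
	my @seq = @_;
	my $levels = 0;
	my @lasts = ($seq[-1]);

	until (all_equal(@seq)) {
		my @diff;
		for (my $i = 1; $i < @seq; $i++) {
			push(@diff, $seq[$i] - $seq[$i-1]);
		}
		@seq = @diff;
		push(@lasts, $seq[-1]);
		$levels++;
	}

	my $next = 0;
	for (my $i = 0; $i < @lasts; $i++) {
		$next += $lasts[$i];
	}
	return ($levels, $next);
}

my ($levels, $next) = levels_and_next(0, 1, 3, 6, 10, 15);
print "$levels levels of analysis; predicted next term is $next\n";
# prints: 2 levels of analysis; predicted next term is 21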

So perhaps one can define complexity by counting the levels of analysis needed to predict what comes next. In the "ordered" example of the dots from Thursday's lecture, I claim that you need only two levels of analysis to determine what the next column of dots would look like (first, looking at the alternating rows, and second, taking the "difference" between rows to see that the "difference" between columns is constant). In the "complex" example of dots patterned as a sideways, quasi-Sierpinski triangle, it would take several levels of analysis, perhaps taking the difference of different rows a dozen or more times, before any coherent rule could be developed, and even then, difference analysis alone might not be sufficient. By this (or any other) definition, no amount of analysis on the random pattern would yield an answer of what the next column would look like.

I'm not sure you can pound out a more formal definition of levels of analysis in all settings, let alone sequences of numbers. I'd claim that predicting the workings of a computer program from source code requires substantially fewer levels of analysis than predicting what a frog will look like from its DNA alone. And then there is a problem of determining a cut-off between order and complexity, and complexity and randomness, which itself is part of the original problem. But I do think this problem should be approached in quantitative terms as much as possible.

Background

Pondering my navel (and the Charles) from Lowell House bell tower.

I'm a chemistry concentrator and junior at Harvard College. I have fierce Currier pride. Email me if you'd like: meisel *(at)* fas.harvard.edu.