MESSAGE
DATE | 2016-10-27 |
FROM | Christopher League
|
SUBJECT | Re: [Learn] Phylogenetics educational links
|
From learn-bounces-at-nylxs.com Thu Oct 27 16:09:40 2016 Return-Path: X-Original-To: archive-at-mrbrklyn.com Delivered-To: archive-at-mrbrklyn.com Received: from www.mrbrklyn.com (www.mrbrklyn.com [96.57.23.82]) by mrbrklyn.com (Postfix) with ESMTP id 6EBF4161312; Thu, 27 Oct 2016 16:09:40 -0400 (EDT) X-Original-To: learn-at-nylxs.com Delivered-To: learn-at-nylxs.com Received: from liucs.net (contrapunctus.net [174.136.110.10]) by mrbrklyn.com (Postfix) with ESMTP id CA8F9160E77 for ; Thu, 27 Oct 2016 16:09:37 -0400 (EDT) Received: from localhost (unknown [148.4.40.11]) by liucs.net (Postfix) with ESMTPSA id AF1C5E08E for ; Thu, 27 Oct 2016 16:09:35 -0400 (EDT) From: Christopher League To: learn-at-nylxs.com In-Reply-To: <8a528f28-6298-2fe8-0a01-8899000d0244-at-mrbrklyn.com> References: <8a528f28-6298-2fe8-0a01-8899000d0244-at-mrbrklyn.com> User-Agent: Notmuch/0.21 (http://notmuchmail.org) Emacs/25.1.1 (x86_64-unknown-linux-gnu) Date: Thu, 27 Oct 2016 16:09:30 -0400 Message-ID: <8760odd05h.fsf-at-contrapunctus.net> MIME-Version: 1.0 Subject: Re: [Learn] Phylogenetics educational links X-BeenThere: learn-at-nylxs.com X-Mailman-Version: 2.1.17 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0074659448==" Errors-To: learn-bounces-at-nylxs.com Sender: "Learn"
--===============0074659448== Content-Type: multipart/signed; boundary="===-=-="; micalg=pgp-sha256; protocol="application/pgp-signature"
--===-=-= Content-Type: multipart/mixed; boundary="=-=-="
--=-=-= Content-Type: multipart/alternative; boundary="==-=-="
--==-=-= Content-Type: text/plain Content-Transfer-Encoding: quoted-printable
Here's a set of introductory slides on inference of phylogenetic trees.
Based on what I've learned today, "small" parsimony algorithms like Fitch and Sankoff rely on labeling a *given* tree shape (aka topology), so we already have to know (or hypothesize) the ancestral relationships. The algorithm just determines which labels (mutations) to assign to interior nodes. That's really unsatisfying to me.
But the "big" parsimony techniques have to search the entire tree space, which is ENORMOUS. The problem is formally NP-hard. Now -- people do manage to solve (or approximately solve) NP-hard problems every day by using piles of dirty tricks. When it comes to searching gigantic spaces, those dirty tricks are the classic techniques of artificial intelligence.
So the lecture slides get into techniques like greedy algorithms, hill-climbing, simulated annealing, genetic algorithms, etc. Anyway, there could be a lot of meat here that fits under the heading of AI + phylogenetics. It's much more accessible stuff than trying to automatically interpret 3D model data to take measurements of maxillary bones.
CL
Ruben Safir writes:
> http://telliott99.blogspot.com/2010/03/fitch-and-sankoff-algorithms-for.h= tml > > "The Fitch algorithm considers the sites (or characters) one at a time. A= t each tip in the tree, we create a set containing those nucleotides (state= s) that are observed or are compatible with the observation. Thus, if we se= e an A, we create the set {A}. If we see an ambiguity such as R, we create = the set {AG}. Now we move down the tree [away from the tips]. In algorithmi= c terms, we do a postorder tree traversal. At each interior node we create = a set that is the intersection of sets at the two descendant nodes. However= , if that set is empty, we instead create the set that is the union of the = two sets at the descendant nodes. Every time we create such a union, we als= o count one change of state." > > > https://en.wikipedia.org/wiki/Non-parametric > Nonparametric statistics are statistics not based on parameterized famili= es of probability distributions. They include both descriptive and inferent= ial statistics. The typical parameters are the mean, variance, etc. Unlike = parametric statistics, nonparametric statistics make no assumptions about t= he probability distributions of the variables being assessed. The differenc= e between parametric models and non-parametric models is that the former ha= s a fixed number of parameters, while the latter grows the number of parame= ters with the amount of training data.[1] Note that the non-parametric mode= l does, counterintuitively, contain parameters: the distinction is that par= ameters are determined by the training data in the case of non-parametric s= tatistics, not the model. > > https://en.wikipedia.org/wiki/Fitch-Margoliash_algorithm > Distance-matrix methods > > Distance-matrix methods of phylogenetic analysis explicitly rely on a mea= sure of "genetic distance" between the sequences being classified, and ther= efore they require an MSA (multiple sequence alignment) as an input. Distan= ce is often defined as the fraction of mismatches at aligned positions, wit= h gaps either ignored or counted as mismatches.[1] Distance methods attempt= to construct an all-to-all matrix from the sequence query set describing t= he distance between each sequence pair. From this is constructed a phylogen= etic tree that places closely related sequences under the same interior nod= e and whose branch lengths closely reproduce the observed distances between= sequences. Distance-matrix methods may produce either rooted or unrooted t= rees, depending on the algorithm used to calculate them. They are frequentl= y used as the basis for progressive and iterative types of multiple sequenc= e alignment. The main disadvantage of distance-matrix methods is their inab= ility to efficiently use=20 > information about local high-variation regions that appear across multip= le subtrees.[2] > > > --=20 > So many immigrant groups have swept through our town > that Brooklyn, like Atlantis, reaches mythological > proportions in the mind of the world - RI Safir 1998 > http://www.mrbrklyn.com=20 > > DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002 > http://www.nylxs.com - Leadership Development in Free Software > http://www2.mrbrklyn.com/resources - Unpublished Archive > http://www.coinhangout.com - coins! > http://www.brooklyn-living.com > > Being so tracked is for FARM ANIMALS and and extermination camps, > but incompatible with living as a free human being. -RI Safir 2013 > _______________________________________________ > Learn mailing list > Learn-at-nylxs.com > http://lists.mrbrklyn.com/mailman/listinfo/learn
--==-=-= Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
1.0, user-scalable=3Dyes">
Here=E2=80=99s a set of introductory slides on inference of phylogenetic= trees.
pdf" class=3D"uri">http://www.cs.otago.ac.nz/cosc348/phylo/Lecture14_PhyloO= ptim.pdf
Based on what I=E2=80=99ve learned today, =E2=80=9Csmall=E2=80=9D parsim= ony algorithms like Fitch and Sankoff rely on labeling a given tre= e shape (aka topology), so we already have to know (or hypothesize) the anc= estral relationships. The algorithm just determines which labels (mutations= ) to assign to interior nodes. That=E2=80=99s really unsatisfying to me.
But the =E2=80=9Cbig=E2=80=9D parsimony techniques have to search the en= tire tree space, which is ENORMOUS. The problem is formally NP-hard. Now = =E2=80=93 people do manage to solve (or approximately solve) NP-hard proble= ms every day by using piles of dirty tricks. When it comes to searching gig= antic spaces, those dirty tricks are the classic techniques of artificial i= ntelligence.
So the lecture slides get into techniques like greedy algorithms, hill-c= limbing, simulated annealing, genetic algorithms, etc. Anyway, there could = be a lot of meat here that fits under the heading of AI + phylogenetics. It= =E2=80=99s much more accessible stuff than trying to automatically interpre= t 3D model data to take measurements of maxillary bones.
CL
Ruben Safir ruben-at-mrbrklyn.com= writes:
http://telliott99.blogspot.com/2010/03/fitch-and-sankoff-algorithms-for.= html
=E2=80=9CThe Fitch algorithm considers the sites (or characters) one at = a time. At each tip in the tree, we create a set containing those nucleotid= es (states) that are observed or are compatible with the observation. Thus,= if we see an A, we create the set {A}. If we see an ambiguity such as R, w= e create the set {AG}. Now we move down the tree [away from the tips]. In a= lgorithmic terms, we do a postorder tree traversal. At each interior node w= e create a set that is the intersection of sets at the two descendant nodes= . However, if that set is empty, we instead create the set that is the unio= n of the two sets at the descendant nodes. Every time we create such a unio= n, we also count one change of state.=E2=80=9D
https://en.wikipedia.org/wiki/Non-parametric Nonparametric statistics ar= e statistics not based on parameterized families of probability distributio= ns. They include both descriptive and inferential statistics. The typical p= arameters are the mean, variance, etc. Unlike parametric statistics, nonpar= ametric statistics make no assumptions about the probability distributions = of the variables being assessed. The difference between parametric models a= nd non-parametric models is that the former has a fixed number of parameter= s, while the latter grows the number of parameters with the amount of train= ing data.[1] Note that the non-parametric model does, counterintuitively, c= ontain parameters: the distinction is that parameters are determined by the= training data in the case of non-parametric statistics, not the model.
https://en.wikipedia.org/wiki/Fitch-Margoliash_algorithm Distance-matrix= methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a me= asure of =E2=80=9Cgenetic distance=E2=80=9D between the sequences being cla= ssified, and therefore they require an MSA (multiple sequence alignment) as= an input. Distance is often defined as the fraction of mismatches at align= ed positions, with gaps either ignored or counted as mismatches.[1] Distanc= e methods attempt to construct an all-to-all matrix from the sequence query= set describing the distance between each sequence pair. From this is const= ructed a phylogenetic tree that places closely related sequences under the = same interior node and whose branch lengths closely reproduce the observed = distances between sequences. Distance-matrix methods may produce either roo= ted or unrooted trees, depending on the algorithm used to calculate them. T= hey are frequently used as the basis for progressive and iterative types of= multiple sequence alignment. The main disadvantage of distance-matrix meth= ods is their inability to efficiently use information about local high-vari= ation regions that appear across multiple subtrees.[2]
=E2=80=93 So many immigrant groups have swept through our town that Broo= klyn, like Atlantis, reaches mythological proportions in the mind of the wo= rld - RI Safir 1998 http://www.mrbrklyn.com
DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002 http://www.nylxs.= com - Leadership Development in Free Software http://www2.mrbrklyn.com/reso= urces - Unpublished Archive http://www.coinhangout.com - coins! http://www.= brooklyn-living.com
Being so tracked is for FARM ANIMALS and and extermination camps, but in= compatible with living as a free human being. -RI Safir 2013 ______________= _________________________________ Learn mailing list Learn-at-nylxs.com http:/= /lists.mrbrklyn.com/mailman/listinfo/learn
--==-=-=--
--=-=-=--
--===-=-= Content-Type: application/pgp-signature; name="signature.asc"
-----BEGIN PGP SIGNATURE-----
iQEcBAEBCAAGBQJYEl76AAoJEGuLsz1PMbCLTawIAJwrk9T8w1h5LZFyWh8W+bYC L0PdcTaqztW5afgz4wyBNE+Cyf60cCYGXkje6WFSQs+hIZ2UcnU5rGKKIPzyK74G HiXsMYMyaxpzBUdYmk9qRRks/no4zsDW5Va/f6g8SE36l1+cvHZv8Jj2eKIfXFkc hZZ7csdz9IzbOW5jkVBlvCidaHOxZRk4gdbLaY8l4iI8Bc/XQv2p/MLepsI6NkMV WcqbzUMIi8UsF+mtwj/gAsB0weYiGAbmXoIc7gCyFDQZXdqqaKgX7kOOTRv8OPJl VXPZWEVIDAe2BwOlnJ/T0WfwUUNIboyI6d8tfn7sOo94bxCiNkC+IjZjzIjML4Q= =9OrT -----END PGP SIGNATURE----- --===-=-=--
--===============0074659448== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline
_______________________________________________ Learn mailing list Learn-at-nylxs.com http://lists.mrbrklyn.com/mailman/listinfo/learn
--===============0074659448==--
|
|