Friday, May 27, 2016

Fining all possible subsets of a given set

Three possible way to create all subsets of a given set is shown below:

1. Using Binary:
lst = ['a', 'b', 'c', 'd', 'e']
def interpret_combination(s):
 s = s[::-1] #reversing the string to match the index positions with the given list
 if not '1' in list(s): # all zeros
  return []
 else:
  indices_of_ones = [i for i, x in enumerate(list(s)) if x == '1'] # find the indices of 1s only
  l = list(s) #convert to list as you can't operate on strings
  for item in indices_of_ones:
   l[item] = lst[item] # replace the one's with the characters on the list
  l = [x for x in l if x != '0'] # replace the zeros
  return ''.join(map(str, l)) # join them to make a string again
all_subsets = []

for i in xrange(2 ** len(lst)): # until 2^(lenght of list)
 l1 = len(str(bin(i))[2:]) # binary representation without the preceding 0b
 l2 = len(lst)
 s = (l2 - l1)*'0' + str(bin(i))[2:] # prepending the zeros in front to make it more readable
 all_subsets.append(interpret_combination(s)) 

print all_subsets

2. Recursive:
def adder(elem, elem_set): # copy elem with each element of elem_set
 exp_lst = []
 for item in elem_set:
  exp_lst.append(elem + item)
 return exp_lst

def all_subsets(s): 
 if len(s) is 0:
  return ''
 elif len(s) is 1:
  return ['', s[0]]
 else:
  return all_subsets(s[:-1]) + adder(s[len(s)-1], all_subsets(s[:-1]))  

print all_subsets(['a','b', 'c', 'd'])  

3. Iterative:
lst = ['a', 'b', 'c', 'd', 'e']
sub_arr = []
sub_arr_tmp = []

for item in lst:
 for elem in sub_arr:
  sub_arr_tmp.append([item + elem])
 sub_arr.append(item)
 if len(sub_arr_tmp) is 0:
  sub_arr.append('')
 else:
   for e in sub_arr_tmp:
    if(isinstance(e, list)):
     sub_arr.append(e[0])
 del sub_arr_tmp[:]

print set(sub_arr)
print len(set(sub_arr))



Tuesday, May 3, 2016

Finding Gaussian Distribution Parameters Using Expectation Maximization Matlab

Expecation Maximization is a powerful approach to estimate parameters of statistical distributions. It consists of two steps, i) Expectation step, where the likelihood of the dataset is estimated using the current set of parameters. ii) Maximization step, where the parameters are recalculated based on the likelihood found in the expectation step.

Let's see an example: Suppose we have a dataset which has only one feature, namely, age. Suppose the data was taken from a country where most of the people are young and from another country where most of them are old. We are interested in finding out the mean and variance of these two distributions, assuming that they were normally distributed. 

Say our dataset looks like this: 
$[25, 27, 27, 28, 28, 30, 100, 105, 105, 107, 107, 108]$ . We shall implement the EM algorithm to find out the means and variances.
clc;clear all;
% sample dataset
d = [25 27 27 28 28 30 100 105 105 107 107 108];

%initial guesses of the two means and variances
mu1 = 10;
mu2 = 50;
sigma1 = 10;
sigma2 = 10;

% initial responsibilities
p1 = .5;
p2 = .5;

%until converges

for k = 1:100
    
% start expectation step    

v1 = p1 * normpdf(d, mu1, sqrt(sigma1)); 
v2 = p2 * normpdf(d, mu2, sqrt(sigma2));

r1 = v1 ./ (v1 + v2); %divide ith element of v1 by the sum of ith element of both v1 and v2
r2 = v2 ./ (v1 + v2) ;%divide ith element of v2 by the sum of ith element of both v1 and v2

% end expectation step    

% start maximization step

mu1 = sum(r1 .* d) / sum(r1); %recalculate mu1, weighted average of the datapoints
mu2 = sum(r2 .* d) / sum(r2); %recalculate mu2, weighted average of the datapoints

sigma1 = sum((d-mu1) .* (d-mu1) .* r1) / sum(r1); %recalculate weighted variance first distribution
sigma2 = sum((d-mu2) .* (d-mu2) .* r2) / sum(r2); %recalculate weighted variance second distribution

p1 = sum(r1) / length(d); % recalculate responsibilities, total weights normalized by number of data points
p2 = sum(r2) / length(d);

% end maximization step

end

mu1 % 27.5000, mean of the first distribution
mu2 % 105.3333, mean of the second distribution
sigma1 % 2.2500, variance of the first distribution
sigma2 % 6.8889, variance of the second distribution