Thursday, February 18, 2016

Chuck-a-Luck: A fine game involving probability

I stumbled onto this game while reading a fantastic book, 'Innumeracy' by John Allen Paulos. The rules of the game are simple: you play against a host. The host has three dice numbered 1-6. You are asked to choose a number in that range; let's say you have chosen 4. The host then rolls the three dice at the same time. Now there are four outcomes:

1. If all three dice come up with 4, you get 3 dollars.
2. If two of the dice come up with 4, you get 2 dollars.
3. If only one die comes up with 4, you get 1 dollar.
4. Otherwise (none of the dice came up with 4), you pay the host 1 dollar.

Now, what do you think your chances are in this game? Obviously, since it is played in casinos, the owners would not allow this game if there were no profit in it for them. So even if the game seems quite winnable, you should recalculate your chances.

Here is the expected outcome of this game:

expected outcome (EO) = sum(each way you can win * its reward) - sum(each way you can lose * its cost)

EO = P(all three dice show 4) * 3 + P(exactly two dice show 4) * 2 + P(exactly one die shows 4) * 1 - P(no die shows 4) * 1

= (3C3 * (1/6)^3 * (5/6)^0) * 3 + (3C2 * (1/6)^2 * (5/6)^1) * 2 + (3C1 * (1/6)^1 * (5/6)^2) * 1 - (3C3 * (5/6)^3 * (1/6)^0) * 1

= 3/216 + 30/216 + 75/216 - 125/216 = -17/216 ≈ -0.08

As you can see, the outcome is quite counterintuitive: on average, you lose about 0.08 dollars every time you play the game.
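
As a quick check of this figure, here is a small brute-force sketch that enumerates all 6 * 6 * 6 = 216 equally likely rolls and computes the exact expected payout for a chosen number:

from itertools import product

def payout(choice, roll):
    hits = roll.count(choice)
    return hits if hits > 0 else -1  # 1, 2 or 3 dollars won, or 1 dollar lost

choice = 4
total = sum(payout(choice, roll) for roll in product(range(1, 7), repeat=3))
print(total / 216.0)  # -17/216, about -0.079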

Following is a simulation of the gameplay. It shows how many games you can play before you lose all of your money to the cold-hearted probabilistic eventuality:

'''
code is also hosted at https://github.com/hasanIqbalAnik/probabilities/blob/master/monty_hall_simul.py
'''
import random
def dice_roll():
    return random.randint(1, 6)

balance = 10  # you start with 10 dollars
count_games = 0
choice = random.randint(1, 6)  # the number you bet on, fixed for the whole session

while balance > 0:
    outcomes = [dice_roll(), dice_roll(), dice_roll()]
    if outcomes.count(choice) == 3:
        balance = balance + 3
    elif outcomes.count(choice) == 2:
        balance = balance + 2
    elif outcomes.count(choice) == 1:
        balance = balance + 1
    else:
        balance = balance - 1
    count_games = count_games + 1

print balance  # 0 :(
print count_games  # sample output, 74 games
This roughly matches the probability calculation: losing about 0.08 dollars per game on average, 10 dollars should last somewhere around 125 games, although any single run can be far shorter or longer (the sample run above lasted only 74).
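
To see the long-run average rather than one unlucky run, the simulation above can be wrapped in a function (called play_until_broke here just for illustration) and repeated many times:

import random

def dice_roll():
    return random.randint(1, 6)

def play_until_broke(start=10):
    balance = start
    count_games = 0
    choice = random.randint(1, 6)
    while balance > 0:
        hits = [dice_roll(), dice_roll(), dice_roll()].count(choice)
        balance += hits if hits > 0 else -1
        count_games += 1
    return count_games

trials = 10000
print(sum(play_until_broke() for _ in range(trials)) / float(trials))  # should average out near 10 / (17/216), i.e. about 127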

Monday, November 9, 2015

HITS Algorithm Calculation Example


The equations for the HITS algorithm are:

a = L'h (L' denoting transpose of adjacency matrix L)
h = La

where a and h denote the authority score and hub score vectors respectively.

The following Python code computes one round of these updates for a small example graph, defined by the adjacency matrix below (entry [i][j] = 1 means node i links to node j; here nodes 1, 2 and 3 all link to node 4, and node 2 also links to node 5):

import numpy as np

adjacency_mtx = np.matrix([
  [0, 0, 0, 1, 0],
  [0, 0, 0, 1, 1],
  [0, 0, 0, 1, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]
 ])
hub_score = np.matrix([  # initial guess
  [2],
  [2],
  [2],
  [2],
  [2]
 ])
authority_score = np.matrix.transpose(adjacency_mtx) * hub_score  # a = L'h
hub_score = adjacency_mtx * authority_score                       # h = La

print authority_score
print hub_score

#output: authority_score first, then hub_score
[[0]
 [0]
 [0]
 [6]
 [2]]

[[6]
 [8]
 [6]
 [0]
 [0]]

This matches our intuition that node 4 has higher authority than node 5, and that node 2 is a better hub than nodes 1 and 3.
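
The full algorithm repeats these two updates and normalizes the scores after every round until they settle. Here is a minimal sketch of that iteration (normalizing by the sum of the scores; dividing by the Euclidean norm works just as well):

import numpy as np

L = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0]
], dtype=float)

h = np.ones(5)       # start with equal hub scores
for _ in range(50):  # plenty of rounds for a graph this small
    a = L.T.dot(h)   # a = L'h
    h = L.dot(a)     # h = La
    a = a / a.sum()  # keep the scores from blowing up
    h = h / h.sum()

print(a)  # authority: node 4 highest, node 5 next, the rest zero
print(h)  # hub: node 2 highest, nodes 1 and 3 equal, nodes 4 and 5 zero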

Tuesday, November 3, 2015

Maximum-Weight Independent Set and Traceback in a Path Graph

The following code finds the maximum total weight of an independent set in a path graph (given as a list of vertex weights) and traces back which vertices were chosen:

def wis(arr):
    # A[i] holds the maximum weight of an independent set of the first i vertices
    n = len(arr)
    if n == 0:
        return [0], []
    A = [0] * (n + 1)
    A[1] = arr[0]
    for i in xrange(2, n + 1):
        A[i] = max(A[i - 1], A[i - 2] + arr[i - 1])
    # trace back through A to recover which vertices were picked
    chosen = []
    i = n
    while i >= 1:
        if i == 1 or A[i - 2] + arr[i - 1] >= A[i - 1]:
            chosen.append(arr[i - 1])
            i -= 2
        else:
            i -= 1
    chosen.reverse()
    return A, chosen

res, trace = wis([5, 7, 100, -15, 3, 2])
print res[len(res) - 1]  # 108
print trace  # [5, 100, 3]

Saturday, October 10, 2015

Beauty of Dynamic Programming


Simply printing the 37th term of the Fibonacci sequence using naive recursion took me about 10 seconds (8 GB RAM, Core i5). I tried the 50th term and gave up after waiting for what felt like an eternity.
def fib(n):
 if n <= 2:
  f = 1
 else:
  f = fib(n - 1) + fib(n - 2)  # recomputes the same subproblems exponentially many times
 return f

print fib(37) # 10 sec, 24157817
But,
def fibo(n):
 memo = {}  # each term is computed once and stored
 for k in range(1, n + 1):
  if k <= 2:
   f = 1
  else:
   f = memo[k - 1] + memo[k - 2]
  memo[k] = f
 return memo[n]

print fibo(100) # instant, 354224848179261915075

Even the 100th term comes out instantly!
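
The same speedup can also be had while keeping the recursive shape of the first version: cache every result the first time it is computed (top-down memoization). A minimal sketch:

def fib_memo(n, cache={}):
    # the shared default dict keeps results across calls
    if n <= 2:
        return 1
    if n not in cache:
        cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
    return cache[n]

print(fib_memo(100)) # 354224848179261915075, also instant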

Wednesday, July 1, 2015

Artificial Neural Network in Octave: Backpropagation to predict test scores

This post uses the same code as Programming Assignment 4 (week 5) of the online course:
https://www.coursera.org/learn/machine-learning/
As Backpropagation is a mathematically complex algorithm, using a simpler dataset and reviewing each step along the way would give us a better intuition. That is the goal of this post.

Here, we shall be taking the following steps.

1. Generate some random data points. Let's say we are going to take the GRE. As part of the preparation, we take two preliminary tests of the same format, say powerprep1 (p1) and powerprep2 (p2). We will try to predict the final score based on these two. I have written a function that generates as many test scores as needed, in a random fashion:
https://gist.github.com/hasanIqbalAnik/6aa2af7138595d2ba85d

Here, p1 and p2 make up our input matrix X, and the final scores are Y. These are the first two and the last columns of the data matrix returned by our data-generating function (a rough sketch of such a generator follows the example table below). For example:
P1 P2 Final
317 318 319
305 306 307
302 303 303
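
For illustration only, here is a rough Python sketch of what such a score generator could look like (the actual gist linked above may differ): each row gets two practice scores around a common base and a final score near their average, clipped to the 260-340 GRE range.

import random

def generate_scores(n):
    # hypothetical generator, just to show the shape of the data
    rows = []
    for _ in range(n):
        base = random.randint(280, 335)
        p1 = min(340, max(260, base + random.randint(-5, 5)))
        p2 = min(340, max(260, base + random.randint(-5, 5)))
        final = min(340, max(260, (p1 + p2) // 2 + random.randint(-3, 5)))
        rows.append([p1, p2, final])
    return rows

for row in generate_scores(3):
    print(row)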


2. Structure of our neural network: for simplicity, it has 3 layers: 1 input layer (3 nodes, including bias), 1 hidden layer (6 nodes, including bias), and 1 output layer.

The full code is available here:
https://gist.github.com/hasanIqbalAnik/bd51dbf3e91550c69620

Now, to fit this dataset into the code, we need to take care of just the following things:
  • Our num_labels would be 340.
  • We do not have pre-initialized Theta1 and Theta2 as we did in the assignment, so we need to initialize them randomly from the beginning (see the sketch below).
  • Handle lambda carefully: a higher value of lambda means less overfitting but higher bias, and vice versa.
The rest can be left as it is. The prediction performance will depend on a number of things: how your data is distributed, the number of hidden layers, how you handle bias and variance, and so on.
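
For the random initialization, the usual trick from the course assignment is to draw every weight uniformly from a small symmetric interval around zero so that the hidden units start out different from each other. A NumPy sketch of that idea (the assignment does the equivalent in Octave; the 0.12 below is just a commonly used small epsilon):

import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon_init=0.12):
    # weight matrix for a layer with l_in inputs plus a bias unit and l_out outputs,
    # drawn uniformly from [-epsilon_init, epsilon_init] to break symmetry
    return np.random.rand(l_out, l_in + 1) * 2 * epsilon_init - epsilon_init

Theta1 = rand_initialize_weights(2, 5)    # 2 inputs (p1, p2) -> 5 hidden units
Theta2 = rand_initialize_weights(5, 340)  # 5 hidden units -> 340 output labels
print(Theta1.shape)  # (5, 3)
print(Theta2.shape)  # (340, 6)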

Friday, June 26, 2015

Natural Language Processing with Python: Chapter 6 Exercise Answers

#ex1
#Too descriptive

#ex 2
import random
import nltk
from nltk.corpus import names
from pprint import pprint as p

def gender_features(word):
    return {
        'suffix1': word[-1:],
        'suffix2': word[-2:],
        'startswith': word[0].lower(),
        'length':len(word),
        'first2char':word[0:2].lower(),
        'containsyn':'yn' in word
        }

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# non-overlapping splits: first 500 names for testing, the next 1000 for dev-testing, the rest for training
test_set = nltk.classify.apply_features(gender_features, labeled_names[:500])
devtest_set = nltk.classify.apply_features(gender_features, labeled_names[500:1500])
train_set = nltk.classify.apply_features(gender_features, labeled_names[1500:])

classifier = nltk.NaiveBayesClassifier.train(train_set)

p(nltk.classify.accuracy(classifier, test_set)) # 81% accuracy
classifier.show_most_informative_features(5)



Tuesday, June 23, 2015

Natural Language Processing with Python: Chapter 2 Answers


import pprint
from nltk.corpus import wordnet as wn
import nltk
from nltk.corpus import *
from nltk.corpus import state_union
from matplotlib import pylab

def p(x):
    pprint.pprint(x)

#ex 1
sent = "The quick brown fox jumps over the lazy dog"
phrase = sent.split()
phrase.append('extra words') #addition
for w in phrase:
   p(phrase.index(w)) #indexing
p(phrase*2) #multiplication
p(phrase[2:5]) #slicing
p(sorted(map(lambda x: x.lower(),phrase)))


#ex 2
austen = gutenberg.words('austen-persuasion.txt')
p(len(map(lambda x: x.lower(),austen)))
p(len(set(map(lambda x: x.lower(),austen))))


#ex 3
brown_categories = brown.categories()
p(brown.words(categories = [brown_categories[1], brown_categories[2]])[100:])

#ex 4
def tabulate(cfdist, words, categories):
    print '%-16s' % 'Category',
    for word in words:
        print '%6s' % word,
    print
    for category in categories:
        print '%-16s' % category,
        for word in words:
            print '%6d' % cfdist[category][word],
        print

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in state_union.fileids()
    for word in state_union.words(fileid))

tabulate(cfd, ['men', 'women', 'people'], state_union.fileids())

#ex 5
p(wn.synset('book.n.01').part_holonyms())
p(wn.synset('book.n.01').substance_holonyms())
p(wn.synset('book.n.01').member_holonyms())
p(wn.synset('book.n.01').part_meronyms())
p(wn.synset('book.n.01').substance_meronyms())
p(wn.synset('book.n.01').member_meronyms())

#ex 6
# Circular translations could result in inaccuracies, even errors. So when translating from one
# language to another and then back, comparing against other languages could help reduce imperfections.

#ex 7
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
p(emma.concordance('however'))
brn = nltk.Text(nltk.corpus.brown.words('ca01'))
p(brn.concordance('however'))
cht = nltk.Text(state_union.words(state_union.fileids()[0]))
p(cht.concordance('however')) #mistake in the usage of however!


#ex 8
cfd = nltk.ConditionalFreqDist(
    (fileid, name[0])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()


#ex 9
#?

#ex 10
#?

#ex 11
#?

#ex 12
cmd  = nltk.corpus.cmudict.dict()
p(len(set(map(lambda x: x.lower(),cmd)))) #123455
ctr = 0
for k in cmd.keys():
    if(len(cmd[k]) > 1):
        ctr = ctr+1
p(ctr) # 9241

#ex 13
ctr = 0
als = list(wn.all_synsets('n'))
for w in als:
    if(len(w.hyponyms()) == 0):
        ctr = ctr + 1
p(float(ctr) / len(als)) #0.7967119283931072

#ex 14
def supergloss(s):
    res = s.definition() +"\n"
    for w in s.hyponyms():
        res += ' '+ str(w) + ' '+ w.definition() + " \n"
    for w in s.hypernyms():
        res += ' '+ str(w) + ' '+ w.definition() + " \n"
    return res
p(supergloss(wn.synset('dog.n.01')))

#ex 15
fd = nltk.FreqDist(w.lower() for w in brown.words())
p([w for w in fd if fd[w] >= 3])

#ex 16
def lexical_diversity(text):
    return len(text) / float(len(set(text)))
cfd = nltk.ConditionalFreqDist(
    (category, lexical_diversity(nltk.Text(brown.words(categories=category))))
    for category in brown.categories())
cfd.tabulate()


#ex 17
def fifty_most(words):
    content = [w for w in words if w.lower() not in stopwords.words('english')]
    return nltk.FreqDist(content).most_common(50)

#ex 18
def fifty_most_bigrams(words):
    content = [w for w in words if w.lower() not in stopwords.words('english')]
    return nltk.FreqDist(nltk.bigrams(content)).most_common(50)

#ex 19
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
words = ['people', 'earth', 'country', 'science', 'sports', 'space', 'love']
cfd.tabulate(conditions=genres, samples=words)

#news category mostly deals with people and country, religion does not talk about science at all, sports is mostly a hobby, romance of course talks about love. Humor, as usual does not give much of a shit about anything.

#ex 20
#section?

#ex 21
d = cmudict.dict()
def nsyl(word):
  return [len([y for y in x if y[-1].isdigit()]) for x in d[word.lower()]]


#to be updated