Monday, November 9, 2015

HITS Algorithm Calculation Example


The update equations for the HITS algorithm are:

a = L'h (L' denoting the transpose of the adjacency matrix L)
h = La

where a and h denote the authority score and hub score vectors, respectively.

The following Python code computes the hub and authority scores for a small example graph, encoded by the adjacency matrix below: nodes 1, 2 and 3 all link to node 4, and node 2 also links to node 5.

import numpy as np

# adjacency matrix L: entry [i, j] = 1 if node i links to node j
adjacency_mtx = np.matrix([
  [0, 0, 0, 1, 0],
  [0, 0, 0, 1, 1],
  [0, 0, 0, 1, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]
 ])
hub_score = np.matrix([ # initial guess for the hub scores
  [2],
  [2],
  [2],
  [2],
  [2]
 ])
# one (un-normalized) HITS update
authority_score = adjacency_mtx.T * hub_score # a = L'h
hub_score = adjacency_mtx * authority_score   # h = La

print authority_score
print hub_score

#output:
[[0]
 [0]
 [0]
 [6]
 [2]]

[[6]
 [8]
 [6]
 [0]
 [0]]

This matches our intuition that node 4 has higher authority than node 5, and that node 2 is a better hub than nodes 1 and 3.
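
This is only a single update step. The full HITS algorithm repeats the two updates and normalizes the scores on each round until they settle. Here is a minimal sketch of that iterative version, reusing adjacency_mtx from above (the function name, iteration count and normalization choice are mine, not from the original post):

import numpy as np

def hits(L, iterations=20):
    # L: adjacency matrix; start both score vectors at all ones
    n = L.shape[0]
    a = np.ones((n, 1))
    h = np.ones((n, 1))
    for _ in range(iterations):
        a = L.T * h     # authority update: a = L'h
        h = L * a       # hub update:       h = La
        a = a / a.sum() # normalize so the scores stay bounded
        h = h / h.sum()
    return a, h

authority, hub = hits(adjacency_mtx)
print(authority)
print(hub)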

Tuesday, November 3, 2015

Maximum Weight Independent Set and Traceback in a Graph

The following code finds the maximum weight of an independent set in a path graph, given as an array of vertex weights, along with a traceback of the chosen elements:

def wis(arr):
 if(len(arr) == 0):
  return [0], {}
 A = [] # A[i] holds the best total weight achievable using the first i elements
 B = {} # B maps a best weight to the elements that achieve it (the traceback)
 for i in xrange(len(arr)+1):
  A.insert(i, 0)
 A[0] = 0
 A[1] = arr[0]
 B[arr[0]] = [arr[0]]
 for i in xrange(2, len(arr)+1):
  # either skip arr[i-1], or take it together with the best of the first i-2
  A[i] = max(A[i-1], A[i-2]+arr[i-1])
  try:
   if(A[i-2]+arr[i-1] > A[i-1]):
    B[A[i]] = [B[A[i-2]], arr[i-1]]
  except KeyError:
   B[A[i-2]] = 0
   B[A[i]] = [B[A[i-2]], arr[i-1]]
 return A, B

res, trace = wis([5, 7, 100, -15, 3, 2])
print res[len(res)-1]        # 108, the maximum total weight
print trace[res[len(res)-1]] # nested list of the chosen elements: 5, 100 and 3
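
The dictionary-based traceback above returns the chosen elements as a nested list. A more conventional way to recover them is to walk the DP table A backwards once it is filled; here is a minimal sketch of that idea (the function name is mine, not from the original code):

def wis_trace(arr):
    if not arr:
        return 0, []
    # A[i] = best total weight achievable using the first i elements
    A = [0] * (len(arr) + 1)
    A[1] = arr[0]
    for i in range(2, len(arr) + 1):
        A[i] = max(A[i - 1], A[i - 2] + arr[i - 1])
    # walk backwards: arr[i-1] was taken exactly when skipping it is not optimal
    chosen, i = [], len(arr)
    while i >= 1:
        if A[i] == A[i - 1]:
            i -= 1 # arr[i-1] is not in the optimal set
        else:
            chosen.append(arr[i - 1])
            i -= 2 # arr[i-1] is taken, so its neighbour is skipped
    return A[len(arr)], chosen[::-1]

print(wis_trace([5, 7, 100, -15, 3, 2])) # (108, [5, 100, 3])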

Saturday, October 10, 2015

Beauty of Dynamic Programming


Simply printing the 37th term of the Fibonacci sequence using naive recursion took me about 10 seconds (8 GB RAM, Core i5). I tried to find the 50th term and was waiting for what felt like an eternity.
def fib(n):
 if n<=2:
  f = 1
 else:
  f = fib(n-1) + fib(n-2)
 return f

print fib(37) #10sec, 24157817
But,
def fibo(n):
 fib = {}
 for k in range(1, n+1):
  if k<=2:
   f = 1
  else:
   f = fib[k-1] + fib[k-2]
  fib[k] = f
 return fib[n]
 

print fibo(100)#instant, 354224848179261915075

Even the 100th term comes out instantly!
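
The same speed-up is also available while keeping the recursive shape, by caching each computed term in a dictionary; a small sketch along those lines (not from the original post):

def fib_memo(n, cache={}):
    # the shared default dict acts as a cache across calls
    if n in cache:
        return cache[n]
    f = 1 if n <= 2 else fib_memo(n - 1) + fib_memo(n - 2)
    cache[n] = f
    return f

print(fib_memo(100)) # 354224848179261915075, also instant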

Wednesday, July 1, 2015

Artificial Neural Network in Octave: Backpropagation to predict test scores

This post uses the same code as Programming Assignment 4 (week 5) of the online course:
https://www.coursera.org/learn/machine-learning/
As backpropagation is a mathematically involved algorithm, using a simpler dataset and reviewing each step along the way gives us better intuition. That is the goal of this post.

Here, we shall be taking the following steps.

1. Generate some random data points. Let's say we are going to take the GRE. As part of the preparation, we take two preliminary tests of the same format, say powerprep1 (p1) and powerprep2 (p2), and we try to predict the final score from these two. I have written a function to generate as many such test scores as needed, in a random fashion (a rough sketch of the idea is shown after this list):
https://gist.github.com/hasanIqbalAnik/6aa2af7138595d2ba85d

Here, p1 and p2 make up our input matrix X, and the final scores are Y. These are the first two columns and the last column of the data matrix returned by our data-generating function. For example:
P1 P2 Final
317 318 319
305 306 307
302 303 303


2. Structure of our neural network: for simplicity, it has 3 layers: 1 input layer (3 nodes, including bias), 1 hidden layer (6 nodes, including bias) and 1 output layer.
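
As a rough idea of what such a data-generating function can look like, here is an illustrative Python sketch with made-up score ranges and noise (this is not the code in the gist above):

import numpy as np

def generate_scores(m, low=290, high=331, noise=3):
    # made-up model: each student has an underlying "ability" score, and
    # p1, p2 and the final score scatter around it by a little random noise
    ability = np.random.randint(low, high, size=m)
    p1 = ability + np.random.randint(-noise, noise + 1, size=m)
    p2 = ability + np.random.randint(-noise, noise + 1, size=m)
    final = ability + np.random.randint(-noise, noise + 1, size=m)
    return np.clip(np.column_stack((p1, p2, final)), 260, 340)

print(generate_scores(3)) # three rows of [p1, p2, final]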

The full code is available here:
https://gist.github.com/hasanIqbalAnik/bd51dbf3e91550c69620

Now, to fit this dataset into the code, we need to take care of just the following things:
  • Our num_labels would be 340.
  • We do not have the pre-initialized Theta1 and Theta2 that we had in the assignment, so we need to initialize them randomly from the beginning (a small sketch of this step follows below).
  • Handle lambda carefully: a higher value of lambda results in less overfitting but higher bias, and vice versa.
The rest can be left as it is. The prediction performance depends on a number of things, like how the data is distributed, the number of hidden layers, and how bias and variance are handled.
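
On the random initialization point: Theta1 and Theta2 only need small random values to break symmetry. A rough sketch of that step, written here in Python/NumPy rather than Octave (the epsilon value and the layer sizes are assumptions based on the structure described above):

import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon_init=0.12):
    # weights uniformly in [-epsilon_init, epsilon_init], plus a column for the bias unit
    return np.random.rand(l_out, l_in + 1) * 2 * epsilon_init - epsilon_init

Theta1 = rand_initialize_weights(2, 5)   # 2 inputs (p1, p2) -> 5 hidden units
Theta2 = rand_initialize_weights(5, 340) # 5 hidden units -> 340 output labels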

Friday, June 26, 2015

Natural Language Processing with Python: Chapter 6 Exercise Answers

#ex1
#Too descriptive

#ex 2
import random
import nltk
from nltk.corpus import names
from pprint import pprint as p

def gender_features(word):
    return {
        'suffix1': word[-1:],
        'suffix2': word[-2:],
        'startswith': word[0].lower(),
        'length': len(word),
        'first2char': word[0:2].lower(),
        'containsyn': 'yn' in word
        }

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
featuresets = [(gender_features(n), g) for (n, g) in names]

# keep the test, devtest and training sets disjoint
test_set = nltk.apply_features(gender_features, names[:500])
devtest_set = nltk.apply_features(gender_features, names[500:1000])
train_set = nltk.apply_features(gender_features, names[1000:])

classifier = nltk.NaiveBayesClassifier.train(train_set)

p(nltk.classify.accuracy(classifier, test_set)) # around 0.8 accuracy with these features
classifier.show_most_informative_features(5)
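
# Sketch of a follow-up (not part of the original answer): the devtest_set above is
# otherwise unused, so check its accuracy and inspect the misclassified names to
# guide new features.
p(nltk.classify.accuracy(classifier, devtest_set))
errors = []
for (name, tag) in names[500:1000]:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
p(sorted(errors)[:20]) # a sample of the misclassified devtest names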



Tuesday, June 23, 2015

Natural Language Processing with Python: Chapter 2 Answers


import pprint
from nltk.corpus import wordnet as wn
import nltk
from nltk.corpus import *
from nltk.corpus import state_union
from matplotlib import pylab

def p(x):
    pprint.pprint(x)

#ex 1
sent = "The quick brown fox jumps over the lazy dog"
phrase = sent.split()
phrase.append('extra words') #addition
for w in phrase:
   p(phrase.index(w)) #indexing
p(phrase*2) #multiplication
p(phrase[2:5]) #slicing
p(sorted(map(lambda x: x.lower(),phrase)))


#ex 2
austen = gutenberg.words('austen-persuasion.txt')
p(len(map(lambda x: x.lower(),austen)))
p(len(set(map(lambda x: x.lower(),austen))))


#ex 3
brown_categories = brown.categories()
p(brown.words(categories = [brown_categories[1], brown_categories[2]])[100:])

#ex 4
def tabulate(cfdist, words, categories):
    print '%-16s' % 'Category',
    for word in words:
        print '%6s' % word,
    print
    for category in categories:
        print '%-16s' % category,
        for word in words:
            print '%6d' % cfdist[category][word],
        print

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in state_union.fileids()
    for word in state_union.words(fileid))

tabulate(cfd, ['men', 'women', 'people'], state_union.fileids())

#ex 5
p(wn.synset('book.n.01').part_holonyms())
p(wn.synset('book.n.01').substance_holonyms())
p(wn.synset('book.n.01').member_holonyms())
p(wn.synset('book.n.01').part_meronyms())
p(wn.synset('book.n.01').substance_meronyms())
p(wn.synset('book.n.01').member_meronyms())

#ex 6
# Circular translations could introduce inaccuracies, or even outright errors. So when
# translating from one language to another and back, comparing against other languages
# could help reduce the imperfections.

#ex 7
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
p(emma.concordance('however'))
brn = nltk.Text(nltk.corpus.brown.words('ca01'))
p(brn.concordance('however'))
cht = nltk.Text(state_union.words(state_union.fileids()[0]))
p(cht.concordance('however')) #mistake in the usage of however!


#ex 8
cfd = nltk.ConditionalFreqDist(
    (fileid, name[0])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()


#ex 9
#?

#ex 10
#?

#ex 11
#?

#ex 12
cmd  = nltk.corpus.cmudict.dict()
p(len(set(map(lambda x: x.lower(),cmd)))) #123455
ctr = 0
for k in cmd.keys():
    if(len(cmd[k]) > 1):
        ctr = ctr+1
p(ctr) # 9241

#ex 13
ctr = 0
als = list(wn.all_synsets('n'))
for w in als:
    if(len(w.hyponyms()) == 0):
        ctr = ctr + 1
p(ctr / float(len(als))) # float division, otherwise Python 2 truncates to 0; 0.7967119283931072

#ex 14
def supergloss(s):
    res = s.definition() +"\n"
    for w in s.hyponyms():
        res += ' '+ str(w) + ' '+ w.definition() + " \n"
    for w in s.hypernyms():
        res += ' '+ str(w) + ' '+ w.definition() + " \n"
    return res
p(supergloss(wn.synset('dog.n.01')))

#ex 15
# all words that occur at least three times in the Brown Corpus
fd = nltk.FreqDist(brown.words())
p([w for w in fd if fd[w] >= 3])

#ex 16
def lexical_diversity(text):
    return len(text) / len(set(text))
cfd = nltk.ConditionalFreqDist(
    (category, lexical_diversity(nltk.Text(brown.words(categories=category))))
    for category in brown.categories())
cfd.tabulate()


#ex 17
def fifty_most(words):
    stop = set(stopwords.words('english')) # build the stopword set once, not per word
    content = [w for w in words if w.lower() not in stop]
    return nltk.FreqDist(content).most_common(50)

#ex 18
def fifty_most_bigrams(words):
    stop = set(stopwords.words('english'))
    content = [w for w in words if w.lower() not in stop]
    return nltk.FreqDist(nltk.bigrams(content)).most_common(50)

#ex 19
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
words = ['people', 'earth', 'country', 'science', 'sports', 'space', 'love']
cfd.tabulate(conditions=genres, samples=words)

#news category mostly deals with people and country, religion does not talk about science at all, sports is mostly a hobby, romance of course talks about love. Humor, as usual does not give much of a shit about anything.

#ex 20
#section?

#ex 21
d = cmudict.dict()
def nsyl(word):
    # count the stress digits (0/1/2) at the end of each phoneme to get syllable counts
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
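# quick illustrative check (word chosen arbitrarily): one count per listed pronunciation
p(nsyl('natural'))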


#would be updated