Google Code Jam – Helper classes

Google Code Jam is a programming competition, where you have to write a solution to a problem, that can then process a file containing sample input data. The competition is performed under time constraints, and therefore anything that can be written in advance is a help.

The coderunner module contains 2 classes FileHelper & CodeJamRunner.

CodeJamRunner

The CodeJamRunner class is a helper for running specific solutions for a CodeJam problem. It handles all the standard code used to solve a problem:

  • Reading in an input file
  • Writing the data to the output file
  • Checking test data, if the problem is being run in test mode.

The class exposes one method run(), the first 2 parameters are methods:

  1. data_builder: the method that should be used to build data structures from the input file. This method should return a single data object.
  2. solve_case: the method that should be used to solve a single input case (most problems have multiple input cases in a single input file). This method should return a string that is the solution for this case. Note: the CodeJamRunner automatically prepends the case number (as per the standard codejam solution format.

The final 2 parameters help construct the input file name. They are problem_name and problem_size (the default structure of an input file name is X-SIZE.input e.g. A-large.input). The CodeJamRunner, concatenates the 2 halves appropriately with the joining ‘-‘. So in this instance call the method with problem_name=’A’ and problem_size=’large’. Note: If the size parameter is omitted the runner will assume that it is a test case, and use the test data for this problem.

FileHelper

Code Jam input has a numnber of standard input formats. The FileHelper class simplifies the process of extracting the data in that format and putting it into the appropriate data structures. Example methods include:

  • get_int()
  • get_ints() – returns a list of ints
  • get_grid() – returns a 2d list of floats

The data_builder has a FileHelper object passed as it’s first argument, with access to the appropriate input file.

An Example using the CodeJamRunner to solve a CodeJam Problem

Let’s assume:

For the input there is one line, containing 3 ints: height width and depth.
For the output, just return the volume of the object.

Clearly this is a very simplistic problem, but should provide an example of how to use the two classes in the module.

from codejam.utils.codejamrunner import CodeJamRunner

class Dynam(object):pass

def solver(data):
	"""Calculate and return the volume of the cube using the data object."""
	return data.h * data.w * data.d

def data_builder(f):
	"""Extract the data from the input file f, and return an object containing the data. 
	
	The Dynam class is a regular class, and allows the properties to be stored on the object in 
	dot notation.
	"""

    data = Dynam()
    data.h, data.w, data.d = f.get_ints()

    return data


cjr = CodeJamRunner()
# run the CodeJamRunner, with our data_builder and solver methods, for problem A.
cjr.run(data_builder, solver, problem_name = "A")
	

If you would like to download the coderunner module source, it can be found on my github. Please note that it comes bundled with other CodeJam utilities that I have written – to be written about later.

Reading csv files

This post is a Python solution to a programming challenge from Programming Praxis. The challenge is to read a csv file and convert it into an html table.

On first inspection, this seems quite a simple task; Read in the file, replace line breaks and comma’s with the appropriate tr and td tags:

output = ''.join(['<html><body><table><tr><td>', 
                  sample_csv.replace(',', '</td><td>').replace('\n', '</td></tr><tr><td>'), 
                  '</td></tr></table></body></html>'])

The above code works well on simple csv’s, but doesn’t take into acccount the variety of csv formats. firstly there are a range of delimiting characters that can be used, tabs, or pipes are common alternatives to comma’s. Often the first line of data is column headings which would be good to wrap in th tags rather than simple td’s. Finally fields like addresses, which are likely to contain comma’s and therefore cause problems with delimiting are usually provided surrounded by quotes.

Fortunately python already has a library built to handle such quirks, aptly named csv. Within the module there is a useful Sniffer class, that uses a number of heuristics to assess the format of the csv, and also will identify whether it has a first row header or not. It also handles quoted fields correctly.

Final Code:


import csv


def html_page_generation(f):
    def generate_page(*args, **kwargs):
    
        with open('output.html', 'w') as html_out:
            html_out.write(''.join(['''<!DOCTYPE html><html><body>''', 
                                   f(*args, **kwargs    ), 
                                   '''</body></html>''']))
       
    return generate_page
    
    
@html_page_generation
def read_csv(file_name):
    
    mark_up = ['<table>']
    with open(file_name, 'rb') as csvfile:
        try:
            dialect = csv.Sniffer().sniff(csvfile.read())
            csvfile.seek(0)
            has_header = csv.Sniffer().has_header(csvfile.read())
            csvfile.seek(0)
            simple_reader = csv.reader(csvfile, dialect, 
                                       skipinitialspace=True)
        except:
            has_header = False
            csvfile.seek(0)
            # use the default csv settings if non can be sniffed
            simple_reader = csv.reader(csvfile, skipinitialspace=True)
            
        
        if has_header:
            mark_up.append('<tr><th>')
            mark_up.append('</th><th>'.join(simple_reader.next()))
            mark_up.append('</th></tr>')
            
            
        for row in simple_reader:
            mark_up.append('<tr><td>')
            mark_up.append('</td><td>'.join(row))
            mark_up.append('</td></tr>')
    
    mark_up.append('</table>')
    return ''.join(mark_up)
    
while True:
    file_name = raw_input('Please enter the name of the csv file to be '
                          'converted, or q to quit.\n')
    if file_name == 'q':
        break
      
    try:
        read_csv(file_name)
    except IOError:
        print "File %s not found. Please try again." % file_name

The full source code including sample data can be found on bit bucket

Narcissistic Numbers

A narcisistic number is a number where the sum of the nth powers of the digits
of an n-digit number is equal to the number. For instance, 153 is a
narcissistic number of length 3 because 1^3 + 5^3 + 3^3 = 1 + 125 + 27 = 153
or 1^4 + 6^4 + 3^4 + 4^4 = 1634 for a 4 digit number.

The Challenge set on Programming Praxis is to find all the narcissistic numbers. We are given a little help, in that although 61*(9^61) < 10^60 (meaning there are no narcissistic numbers greater than 10^61, we are also told that the largest narcissistic number is 115132219018763992565095597973971522401 (n=39).

Starting with a naive implimentation

My first approach iterated through the numbers from the shortest number to be found, up to the longest.

A problem with range.

Using range() to generate the list of numbers to iterate over soon caused problems. When the program started to calculate values for n=10, there was a rather nasty out of memory exception. range() in python 2.7 creates the full list in one go, rather than a generator. The xrange method should be used instead. NOTE: python 3.x returns a generator for either range or xrange, so this would be less of a problem there. In the end, a while loop seemed like the most simple option.

		
def generate_power_list(power):
    return [i**power for i in range(0,10)]
  
  
def find_narcissistic_numbers_naive(min_length, max_length):
    for number_length in range(min_length, max_length):

        power_dict = generate_power_dictionary(number_length)
        
        max_number = 10 ** number_length
        number = 10** (number_length -1)
        while number < max_number:
            
            value = 0
            for digit in str(number):
                value += power_dict[digit]
            
            if value == number:
                logging.debug('narcissistic %s ' % number)
            
            number += 1
		

With the range fix, the method works correctly, but performance already became an issue as soon as digit length reached 11-12 – given the performance was O(10^n), getting to n=39 was going to take some time! Clearly we need something that isn’t O(n^10) if we are going to find all narcissistic numbers.

Unique Digit combinations

For any unique combination of digits, there could clearly only ever be one narcissistic number (the digits would only add up to one value). The number of permutations for selecting unique combinations of digits is based on the r-topic (or figurate) numbers, found in Pascals Triangle (in our case, as there are 10 possible digits, so we use the 10th diagonal).

Pascals triangle with the 10th diagonal highlighted.

There is a marked improvement on the order of the algorithm, and with that I started on a first implimentation of a recursive algorithm that would use combinations of digits, rather than each number.

Recursive solution

In this solution each recursion handles a single digit of the array of digits being used, and tries all appropriate combinations of that digit.


    
def execute_recursive(digits, number_length):
    index = len(digits)
    if digits:
        number = digits[-1]
    else:
        number = 0
    results = []
    digits.append(number)    
    
    if len(digits) < number_length:
        while number < 10:
            results += execute_recursive(digits[:], number_length)
            number += 1
            digits[index] = number
        
    else:
        while number < 10:
            digit_value = sum_digits(digits)
            if check_numbers_match_group(digit_value, digits):
                results.append(digit_value)
                logging.debug(digit_value)
                
            number += 1
            digits[index] = number
            
    return results

    
def find_narcissistic_numbers(min_length, max_length):
    for number_length in range(min_length, max_length):
        digits = []
        t_start = time.clock()
        results = execute_recursive(digits, number_length)
        print 'duration: %s for number length: %s' %(time.clock() - t_start, number_length)

		

This improved performance drastically, taking 1 minute to find all narcissistic numbers with n=19.

Code optimisations

Performance still needs to be improved, in order to reach n=39 in a reasonable time. Time for some code optimisations…

Lists instead of dictionaries

In the first Naive implimentation, I used a dictionary, to cache the values of the powers of a digit for a given number length. Given that digits are all integers, I had been rather slow in not using lists, and simply using the digit as an index. As well as providing faster access to the result, it also meant that we no longer had to do a number of costly string conversions to dictionary keys. This change had a significant performance improvement of about 33%.

Narcissistic number check

In the base version, when checking that a number matched the digits, we iterated through each digit type, to ensure that there were the same number of each type. In this version we have added the optimisation of checking the digit length is correct before doing the full check.

I expected that this would have more of an effect on small number lengths, because as number length increases, there will tend to be more numbers in the middle of the distribution. This was somewhat upheld by the results:

  1. n=16: 11.5% improvement
  2. n=19: 9.8% improvement
def check_numbers_match_group(number, digits):
    number_search = str(number)
			
	# new in v1.3
    if len(number_search) != len(digits):
        return False
    
    for digit in digit_list:
        if number_search.count(digit[0]) != digits.count(digit[1]):
            return False
    
    return True
			

String Conversions

The function to check if a set of digits still used a number of string conversions, as the method gets called on each iteration, removing these conversions improved performance by 6-7%. Lesson learned – don’t use string conversions anywhere performance sensitive – unsurprisingly!

Caching the digits sum

Up to now we have been using a sum_digits() function that calculates the sum
for each combination of digits. However, particularly for long numbers this
has significant repetition, as most of the time, it is only the last number
that has changed. Version 1.5, keeps a running total, of the
n-1, n-2…n-(n-1) sums, so that it only needs to recalulate the sum for digits that have changed.

This made a huge performance improvement for 19 length digits, 30% and this will increase in impact as the number length increases.

The end of recursion

The way the recursive algorithm is structured, means that the code has to create and execute lots of functions, one for every 10 permutations. Step one is to remove all the recursion, and use an array for the digits, and use a couple of counters to work out which digit should be changed next.

			      
def execute_loops(number_length):

    results = []
    digits = [0]*number_length
    segment_sums = [0]*number_length
    index_limit = column_marker = number_length - 1

    while True:
        while digits[column_marker]==9:
            column_marker -= 1
            if column_marker < 0:	
                return results
        
        digits[column_marker] = digits[column_marker]+1
        
        while column_marker < index_limit:
            if column_marker == 0:
                segment_sums[column_marker] =  power_lists[number_length][digits[column_marker]]
            else:
                segment_sums[column_marker] = segment_sums[column_marker-1] + power_lists[number_length][digits[column_marker]]
                
            digits[column_marker + 1] = digits[column_marker]
            column_marker += 1
        
        new_val = segment_sums[column_marker-1] + power_lists[number_length][digits[column_marker]]
        
        if check_numbers_match_group(new_val, digits):
            results.append(new_val)
            logging.debug(new_val)
         
			

Removal of recursion, improved performance by a further 30%. From a time of over 60s for n=19 at the start of the optimisations, the program has now been improved to run in 17s for n=19, an improvement of 70%.

What next

Although the algorithm has been sped up noticeably, there are still a couple more things that could improve performance.

  • Threading – Currently the program only uses one core, and watching in task manager, seeing that the CPU is only working at 25%, makes me cry a little in side. The main problem here will be dividing the work up evenly between each thread.
  • Pruning – as we saw at the start of the post, as n approaches 61, it gets hard to make numbers big enough to be narcissistic numbers. At n=40, if the digits are only 0 or 9, then there needs to be at least 7 9’s in order to create a number larger than 10**39. Therefore there we can skip combinations with more than 33 0’s, meaning that we save ourselves 48,000 iterations. As n increases, the contribution of other digits to the total reduces (9**40 > 100 * (8**40)), so there are other combinations of digits that will simply be too small even to be considered.
  • Magic – As mentioned in the original post Dik Winter calculated all narcissistic numbers in 1985, which took 30 minutes then (equivalent to about 1s today) – I think there is clearly a much more efficient algorithm for finding these numbers if that is the case, fun for a rainy day!
If you are interested in getting a copy of the code, it can be found on bitbucket at: https://bitbucket.org/GSmith/narcissistic-numbers. It contains a revision, for most of the steps in this post, along with timing results for each version of the code.

Twitter Bootstrap – compiling on windows.

I had tried a couple of  the tools for compiling less on windows, but none seemed to compile bootstrap out of the box.

Instead I found Sylvain’s post which was exactly what I was after.

It was a great post, and I only had to work around one issue:

Step 2.3 is slightly wrong, although the binary is called lessc, in npm it is just less. Therefore the command should be: npm install less (remove the final c)

WordPress – Tribe Events Calendar

I have been using the tribe events calendar plugin for wordpress – which is excellent.

I have been customising the list view, and was trying to add an additional link to the event.

Elsewhere on the page, there is a method tribe_get_event_link() which seemed a pretty promising option to get the link. This was being used in the following context:

<?php the_title('<h2 class="entry-title" itemprop="name"><a href="' . tribe_get_event_link() . '" title="' . the_title_attribute('echo=0') . '" rel="bookmark">', '</a></h2>'); ?>

However this didn’t work – I assume the method works within the context of the the_title method.

If the get_ is removed from the method name – it seems that it works.

tribe_event_link()

For other methods and properties with a get_ in the name, they also work when the get is removed. Please excuse the rather poor technical knowledge of php – still seems a world away from python / C#.

Creating paged documents in a web environment

As a web developer, it is an ingrained design pattern (with plenty of accompanying tools) to generate mark up which is then rendered in a clients browser.

There are however, often situations when working on web projects where it is desirable to output documents / books / or other paper based information from an existing database of data. In many cases much of the business logic already exists to extract the data from the database, however turning your beautifully crafted html into something that looks good on paper, or as an attachment to an email, particularly breaking the content so that it spans page breaks gracefully is not always easy.

There are a number of ways of achieving this:

  • Print style sheets – quick and dirty, and rarely handles page breaks well.
  • PDF converters. These appear to have improved significantly since I last had to do this. Programs of particular note are:
    • Prince xml This is an expensive proprietary application, but it does have a free version for non commercial use, which you could use for testing. It does have good integration with web programming languages, so it is easy to generate files from within your web application.
    • Aspose – I used this as a .net application, but I think there are at least Java implimentations too. The project I was working on had very high expectations in terms of the quality and flexibility of output, but it did appear clumsy to get it to do what we wanted. This may be unfair and out of date.
    • There are lots of others – this SO post suggests some others.
  • Use the new generation of iOS ebook publishers. The Baker and Laker frameworks allow html5 web pages to be converted into books on the iPhone and iPad, with more devices coming soon. I think this would need some hacking to get this working, but there is definitely some potential here. I am not yet sure about conversion to pdf, but this could be a useful intermediate step for a print style sheet, or there could be a conversion to pdf available.
  • LaTeX – This is a mark up language specifically for documents, with all the features required for making well laid out and formatted documents. It looks as if it will require some time investment to get started, and create something worthwhile. This in depth tutorial looks like a good starting point.

There are a number of options, I will try some out and see what happens.

MySql-python: Pre Compiled Installer

Setting up the mysql python library requires a C compiler.

There is a built in compiler with visual studio, however configurations took some time, and still wasn’t making too much headway.

Instead you can download a pre compiled version of the library targeted at windows. There are probably more recent versions available with a quick Google at time of reading, however I found mine at Codegood.

This saved me a huge amount of time – thanks  Ioannis!