Paper Keyboard Tracker
A program that uses computer vision and a convolutional neural network (CNN) to detect keys drawn on paper and track the fingers pressing them.
As the final project of the course 5TF078 Deep Learning – Methods and Applications, I created a program that lets the user draw a keyboard on a piece of paper and then type on it via finger tracking, using a combination of a convolutional neural network (CNN) and computer vision. The project was highly appreciated by the teacher and gave me a deeper understanding of the implementation and limitations of machine learning and computer vision. However, due to the course's time constraints, I never got around to implementing the finger tracking (something I'll do soon).
The project came with an 11-page report in Swedish, available in the GitHub repository under the folder called “Rapport” (link above). For those of you who don't speak Swedish, a summary follows.
Technology & Flow
Three core technologies were used: Python as the implementation language, OpenCV for computer vision and image processing, and TensorFlow as the framework for the CNN. These technologies were combined into the flow shown in the picture below.
The first two boxes (blue and red) show how OpenCV takes the webcam feed, performs image processing, and detects regions of interest (ROIs); additional logic then isolates the character within each ROI and scales it to the input size of the CNN by reading the value of each pixel. The third box (yellow) contains the CNN, which classifies each ROI and returns a probability for each class. Finally, the green box represents the user interface, where each ROI is drawn on a canvas showing the camera feed together with its classification.
CNN
Due to limited resources while training the model, transfer learning was not an option (though it was thoroughly explored), so the model had to be created from scratch, using TensorFlow with Keras Tuner for hyperparameter optimization. The model also had to be rather small while handling inputs of 112x112 px. This is larger than the images in the training dataset, the balanced split of EMNIST (Extended MNIST), but the larger input allows a better reading of the contents of each ROI.
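A from-scratch Keras model of this kind might look like the sketch below. The 47 output classes match EMNIST Balanced, and the 112x112 input matches the ROI size; the filter counts and dense width are illustrative guesses, not the values Keras Tuner actually settled on in the report.

```python
from tensorflow import keras

def build_model(input_size=112, num_classes=47):
    """Small from-scratch CNN for classifying drawn characters.

    47 classes matches the EMNIST Balanced split; the layer sizes
    here are assumptions for illustration, not the tuned model.
    """
    model = keras.Sequential([
        keras.layers.Input(shape=(input_size, input_size, 1)),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
```

Since EMNIST images are 28x28, they would need to be upscaled to 112x112 before training so the model sees the same input size as the live ROIs.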
The final model used is displayed in the figure below. It ended up rather simple, landing at a size of only 9.40 MB and a precision of 90.4%. This relatively low figure was largely due to the similarities between the letter “I” and the digit “1”, and between the letter “O” and the digit “0”, as well as mislabeled samples (errors in the dataset).
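One way such systematic confusions could be handled downstream (this is a suggestion, not something the report implements) is to merge the probabilities of confusable classes before picking a label, since on a keyboard the visual distinction between “O”/“0” and “I”/“1” matters less than detecting the key at all. The label list here is a hypothetical four-class subset, not the full EMNIST mapping:

```python
# Illustrative subset of class labels and a confusion-merging rule.
LABELS = ["0", "1", "I", "O"]
CONFUSABLE = {"I": "1", "O": "0"}  # fold each letter into its look-alike digit

def predict_label(probs):
    """Merge probabilities of confusable classes, then take the argmax."""
    merged = dict(zip(LABELS, probs))
    for letter, digit in CONFUSABLE.items():
        merged[digit] += merged.pop(letter)
    return max(merged, key=merged.get)

# "1" and "I" together (0.30 + 0.25) outweigh "0" and "O" (0.20 + 0.25).
label = predict_label([0.20, 0.30, 0.25, 0.25])  # → "1"
```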
Software
To use the program, an interface was required that could show the feed (with and without filters, ROIs, and predictions) and let the user change the image-processing settings to help the CNN. It was built with Python's Tkinter; the result can be seen below.
On the left side of the image are the available settings, and on the right side the feed and the predictions. Below the feed are the images exactly as the CNN receives them.