Machine Learning in Reality

Taggun and Machine Learning

Machine Learning (ML) is marketing hype. I’m allergic to software companies that over-emphasise the use of ML in their product. However, ML can be an effective tool for solving real problems in software. In Taggun, I use a Naive Bayes classifier to classify keywords and stop words when extracting key information from a receipt, such as the merchant name. It is only one of many techniques that I use to achieve over 80% accuracy in receipt scanning. Other, not-so-sexy techniques used in Taggun include rule-based algorithms, regular expressions, statistical analysis and hours of manual work to perform regression testing and tweak the algorithms to optimise accuracy. ML is really just one of many ingredients.
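To make the idea concrete, here is a minimal sketch of a Naive Bayes token classifier in plain Python. This is not Taggun's actual code, and the training tokens are made up for illustration; it only shows the shape of the technique: count tokens per class, then pick the class with the highest log prior plus smoothed log likelihood.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesKeywordClassifier:
    """Minimal multinomial Naive Bayes for labelling receipt tokens
    as 'keyword' (e.g. part of a merchant name) or 'stopword'."""

    def __init__(self):
        self.class_counts = Counter()              # label -> examples seen
        self.token_counts = defaultdict(Counter)   # label -> token frequencies
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (token, label) pairs
        for token, label in examples:
            self.class_counts[label] += 1
            self.token_counts[label][token] += 1
            self.vocab.add(token)

    def classify(self, token):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # log prior + log likelihood with Laplace (add-one) smoothing
            score = math.log(count / total)
            score += math.log(
                (self.token_counts[label][token] + 1)
                / (sum(self.token_counts[label].values()) + len(self.vocab))
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data: tokens labelled by hand.
clf = NaiveBayesKeywordClassifier()
clf.train([("cafe", "keyword"), ("mart", "keyword"),
           ("total", "stopword"), ("gst", "stopword"),
           ("total", "stopword")])
print(clf.classify("total"))   # -> stopword
print(clf.classify("cafe"))    # -> keyword
```

In a real pipeline you would train on thousands of labelled tokens and combine the classifier's verdict with the rule-based signals mentioned above.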

Machine Learning is for Everyone

I think there is a misconception about implementing Machine Learning. Most developers think ML is out of reach for our regular day-to-day software development: that it requires big and powerful servers to churn through huge data sets, costs big dollars, and is only useful for solving big problems like self-driving cars and poker-playing bots. That might be true for a lot of the scientific breakthroughs we read about in the news; but it certainly does not negate the fact that ML can easily be added to our tool set to solve real business problems.

ML is so simple, even your web browser can do deep learning

Not every problem requires big and powerful servers with GPU-accelerated processors to train. Here is real proof of how you can use your web browser to perform deep learning with a Convolutional Neural Network to classify 60,000 images with over 80% accuracy.

Just add the library and start hacking

Here are some simple open-source libraries that let any developer take advantage of ML easily. Most of them require fewer than 5 lines of code to start training on a set of data.
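As an example of how little code it takes, here is a sketch using scikit-learn (one such library); the training sentences and labels below are made up for illustration. The core of it — vectorise, fit, predict — really is about five lines.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data set: four labelled review snippets.
texts = ["great food and service", "loved this place",
         "terrible wait times", "awful and overpriced"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()            # turn text into word-count vectors
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)    # train a Naive Bayes classifier
print(model.predict(vectorizer.transform(["great place"])))  # -> ['positive']
```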

Domain Knowledge and Clean Data Wins

Here are some tips to achieve the best results when training your model for business problems:

  • If the training data is in article format, like news, resumes or song lyrics, use some good old Natural Language Processing (NLP) techniques to tokenise and stem the articles before you feed them in to train your network model. A consistent set of inputs prevents a corrupted model.
  • Apply different weighting or bias (of importance or relevance) to different sections of an article. For example: the start of a resume should carry more weight when classifying it, because the summary of the candidate is usually placed at the beginning. Another example: keywords in title and h1 elements of a web page should carry more weight than those in other elements like p or div.
  • Start small. Rather than training one model for everything under the sun, I suggest training multiple models with smaller segments of the data sets. Invest your time in automating the pipeline to read input data from the database, train/retrain the model and then save the network for consumption.
  • ML is really just one of the tools when you want to offer a solution to a user. A combination of other tools, like UX design, rule-based algorithms and NLP, can really 10x the accuracy and effectiveness of ML without spending too much time and money, or ending up stuck with a super-smart model that no one uses.
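The first two tips above can be sketched in a few lines of plain Python. The crude suffix-stripping stemmer below stands in for a real one (e.g. NLTK's PorterStemmer), and the section weights are made-up values purely for illustration: title tokens count three times as much as body tokens.

```python
import re

def tokenise(text):
    # Lower-case and split on anything that is not a letter.
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    # Toy suffix stripping; a real stemmer (e.g. NLTK's PorterStemmer)
    # handles far more cases and edge conditions.
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def weighted_tokens(sections):
    # sections: list of (text, weight) pairs. Give the resume summary,
    # or a page's title/h1 text, a higher weight than the body.
    counts = {}
    for text, weight in sections:
        for tok in map(crude_stem, tokenise(text)):
            counts[tok] = counts.get(tok, 0.0) + weight
    return counts

features = weighted_tokens([
    ("Senior Data Engineer", 3.0),    # title section: weight 3
    ("Building data pipelines", 1.0)  # body section: weight 1
])
print(features)  # 'data' scores 4.0 (title hit + body hit)
```

The resulting weighted counts can then be fed into any classifier, such as the Naive Bayes examples above.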