About

I teach data science and analytics courses for undergraduate and postgraduate (master's) students, previously in a computer science school and now in a business school. The materials in the following lists are my main references for teaching and also for research. They can serve as directed study materials for students who would like to build up their knowledge and skills in data analytics and machine learning. I will keep updating the lists. Because my own knowledge is limited, the contents are biased towards certain fields and topics, and many good materials may be missing. If you think I have left out something that should be included, contributions are welcome, so please let me know.

DSML Theory

Probability, Statistics and Linear Algebra

I studied mathematical finance and statistics in my postgraduate studies, and have been self-studying machine learning since my PhD research. The following are the fundamental mathematics materials that I have read and that I use in my study and teaching. Probability and statistics play a significant role. Note, however, that many topics covered by these books are not widely used in data science and machine learning.

  • Walter Rudin. Principles of Mathematical Analysis, McGraw-Hill, 3rd Edition, 1976.
  • David Freedman. Statistical Models: Theory And Practice, Cambridge University Press, 2nd Edition, 2009.
  • Morris DeGroot and Mark Schervish. Probability and Statistics, Pearson, 4th Edition, 2013.
  • George Casella and Roger Berger. Statistical Inference, 2nd Edition, 2002.
  • Sheldon Ross. Introduction to Probability Models, Academic Press, 10th Edition, 2009.
  • Geoffrey Grimmett and David Stirzaker. Probability and Random Processes, Oxford University Press, 2001.
  • Alexander Mood, Franklin Graybill and Duane Boes. Introduction to the Theory of Statistics, McGraw-Hill, 3rd Edition, 1974.

The following two books provide a good exposition of the essential mathematics of machine learning. I recommend them for those who want to study the theory of machine learning algorithms:

Machine Learning and Data Mining (Introductory Level)

I recommend the following two books for students from a business background who are not yet familiar with data science and machine learning and want to grasp the big picture.

  • Pedro Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, Allen Lane, 2015.

The following books are easy to follow. I recommend them to students from a business background as first books for studying machine learning and data science. I use several materials from Gareth James's book in my teaching at business schools.

Machine Learning and Data Mining (Intermediate and Advanced Level)

The books in the following list are suitable for undergraduates, postgraduates, PhD students and experienced researchers. In my research and teaching I draw extensively on the books by David MacKay, Christopher Bishop, Kevin Murphy, Yaser Abu-Mostafa, Hang Li, and Zhihua Zhou.

DSML Practice

It is very difficult to compare data programming tools without knowing:
  • What do you plan to do?
  • How much are you willing to invest (mainly your time), and what reward do you expect?
  • Who do you work with, and to whom do you want to present and share your work?

I have used Python, R, Matlab and Microsoft Azure Machine Learning Studio for years in my research and teaching. The following notes reflect my own experience.

Python

I mainly use Python in my research. It is a high-level, object-oriented, general-purpose programming language. It is easy to learn and quite fast, and it has a large number of machine learning packages and a large body of code available online. I believe the latter two are the main reasons why Python has been so successful for machine learning and data analytics. The following are introductory Python materials for those who have not used it before.

R

In my research, I like to use R for quick descriptive analytics and visualisation of experimental results. I first came to R from S-Plus when I took statistics courses many years ago. My experience of using R was not pleasant at that time, so I switched to Matlab and Mathematica for a couple of years until I discovered RStudio and ggplot2. R was developed mainly for statistical computing, but in recent years it has expanded into data science and machine learning. I am a big fan of the R packages developed by Hadley Wickham, which have significantly improved my experience of using R. I therefore strongly recommend his series of R books:

Other good R books include:

  • Julia Silge, David Robinson. Text Mining with R, O'Reilly, 2017.
  • Robert Kabacoff. R in Action: Data Analysis and Graphics with R, Manning Publications, 2015.
  • W. John Braun, Duncan J. Murdoch. A First Course in Statistical Programming with R, Cambridge University Press, 3rd Edition, 2021.
  • Tianyuan Huang (黄天元). R语言数据高效处理指南 (A Guide to Efficient Data Processing in R), Peking University Press, 2019.
  • Deborah Nolan, Duncan Lang. Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving, CRC, 2015.

Matlab

Matlab was once my favourite research tool. It is perhaps the most successful commercial software for mathematical programming. It is very powerful, has a user-friendly interface (debugging is easy and the generated figures are editable), and is very good at simulating and modelling systems. Matlab has the File Exchange, although it is not as popular as the Python and R communities. A Matlab licence can be costly, though many universities and companies purchase licences each year for their students, staff and researchers. I found the following two books very helpful when I used and taught Matlab for data analytics.

  • Wendy Martinez, Angel Martinez. Computational Statistics Handbook with Matlab, 3rd Edition, CRC, 2015.
  • Jaan Kiusalaas. Numerical Methods in Engineering with Matlab, 2nd Edition, Cambridge University Press, 2012.

Microsoft Azure Machine Learning Studio

Some of my business school courses are for students who have not received much mathematical or programming training. Their primary goal is to quickly apply popular machine learning algorithms to business analytics. Azure Machine Learning Studio can therefore be an ideal tool. It allows users to build and deploy machine learning models in a simple way by connecting basic modules. R, Python and Jupyter notebooks can also be used in Microsoft Azure Machine Learning Studio to create customised functions and analytics. Note that Machine Learning Studio (classic) is scheduled to be retired by 31 August 2024, with users transitioning to Azure Machine Learning.

© Bowei Chen 2024