About
I teach data science and analytics courses for undergraduate and postgraduate (master's)
students, previously in a computer science school and now in a business school. The materials in the
following lists are the main references for my teaching and my research. They can also serve as
directed study materials for students who would like to build a portfolio of data analytics and
machine learning knowledge and skills. I will keep updating the lists. Because my knowledge is
limited, the contents are biased towards certain fields and topics, and many good materials may not
be included. If you think I have left out something that should be included in the following lists,
contributions are welcome; please let me know.
DSML Theory
Probability, Statistics and Linear Algebra
I studied mathematical finance and statistics as a postgraduate and have self-studied machine
learning since my PhD research. The following are the fundamental mathematics materials that I read
and use in my study and teaching. Probability and statistics play a significant role. However, it
should be noted that many topics covered by these books are not widely used in data science and
machine learning.
- Walter Rudin. Principles of Mathematical Analysis, McGraw-Hill, 3rd Edition, 1976.
- David Freedman. Statistical Models: Theory And Practice, Cambridge University Press, 2nd
Edition, 2009.
- Morris DeGroot and Mark Schervish. Probability and Statistics, Pearson, 4th Edition, 2013.
- George Casella and Roger Berger. Statistical Inference, 2nd Edition, 2002.
- Sheldon Ross. Introduction to Probability Models, Academic Press, 10th Edition, 2009.
- Geoffrey Grimmett, David Stirzaker. Probability and Random Processes, Oxford University Press,
2001.
- Alexander Mood, Franklin Graybill, Duane Boes. Introduction to the Theory of Statistics, 3rd
Edition, McGraw-Hill, 1974.
The following two books provide a good exposition of the essential mathematics of machine learning. I
recommend them for those who want to study the theory of machine learning algorithms:
Machine Learning and Data Mining (Introductory Level)
I recommend the following two books for students with a business background who do not yet know what
data science and machine learning are and want to grasp the big picture.
- Pedro Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will
Remake Our World, Allen Lane, 2015.
The following books are easy to follow. I recommend them to students with a business background as a
first book for studying machine learning and data science. I use several materials from the book by
Gareth James et al. in my teaching at business schools.
- Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques, Morgan Kaufmann,
3rd Edition, 2011.
- Nong Ye. Data Mining: Theories, Algorithms, and Examples, CRC, 2014.
- Sandro Skansi. Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence,
Springer, 2018.
- Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical
Learning: with Applications in R, Springer, 2013.
Machine Learning and Data Mining (Intermediate and Advanced Level)
The books in the following list are suitable for undergraduates, postgraduates, PhD students and
experienced researchers. In my research and teaching I draw extensively on the books by
David MacKay, Christopher Bishop, Kevin Murphy, Yaser Abu-Mostafa, Hang Li, and Zhihua Zhou.
- David MacKay. Information Theory, Inference and Learning Algorithms, Cambridge University
Press, 2003.
- Charu Aggarwal. Data Mining: The Textbook, CRC, 2015.
- Mohammed Zaki and Wagner Meira. Data Mining and Analysis: Fundamental Concepts and Algorithms,
Cambridge University Press, 2014.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2nd Edition, Springer, 2011.
- Simon Rogers and Mark Girolami. A First Course in Machine Learning, CRC, 2nd Edition, 2016.
- Christopher Bishop. Pattern Recognition and Machine Learning, Springer, 2007.
- Kevin Murphy. Probabilistic Machine Learning: An Introduction, MIT Press, 2022.
- Kevin Murphy. Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023.
- David Barber. Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
- Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar. Foundations of Machine Learning, MIT
Press, 2nd Edition, 2018.
- Ethem Alpaydin. Introduction to Machine Learning, MIT Press, 3rd Edition, 2014.
- Yaser Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. Learning From Data, 2012
[[Link](http://amlbook.com/)]
- Hang Li. 统计学习方法 (Statistical Learning Methods), 2nd Edition, Tsinghua University Press,
2019.
- Zhihua Zhou. 机器学习 (Machine Learning), Tsinghua University Press, 2016.
- Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning, MIT Press, 2016.
- Zhihua Zhou. Ensemble Methods: Foundations and Algorithms, CRC, 2012.
- Richard Sutton, Andrew Barto. Reinforcement Learning: An Introduction, MIT
Press, 2nd Edition, 2018.
- Carl Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning, MIT Press,
2006.
DSML Practice
It is difficult to compare data programming tools without knowing:
- What do you plan to do?
- How much are you willing to invest (mainly your time), and what reward do you expect?
- Who do you work with, and with whom do you want to present and share your work?
I have used Python, R, Matlab and Microsoft Azure Machine Learning Studio for years in my research
and teaching. The following reflects my own experience.
Python
I mainly use Python in my research. It is a high-level, object-oriented, general-purpose programming
language. It is easy to learn and quite fast, and it has many machine learning packages and a large
body of code available online. I think the latter two are the main reasons why Python has been
so successful for machine learning and data analytics. The following are
introductory Python materials for those who have not used it before.
- Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython,
O'Reilly, 2012.
- Peter Harrington. Machine Learning in Action, Manning Publications, 2012.
- Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly,
2016.
- Sebastian Raschka. Python Machine Learning, Packt Publishing, 2015.
- Sheppard. Introduction to Python for Econometrics, Statistics and Data Analysis,
University of Oxford Lecture Notes, 2014.
R
In my research, I like to use R for quick descriptive analytics and visualisation of experimental
results. I first came to R from S-Plus when I took statistics courses many years ago. My experience
of using R was not pleasant at that time, so I switched to Matlab and Mathematica for a
couple of years until I discovered RStudio and ggplot2. R was developed mainly for statistical
computing, but it has expanded into data science and machine learning in recent years. I am a big fan
of the R packages developed by Hadley Wickham, which have
significantly improved my experience of using R. Therefore, I strongly recommend his series of R books:
Other good R books include:
- Julia Silge, David Robinson. Text Mining with R, O'Reilly, 2017.
- Robert Kabacoff. R in Action: Data Analysis and Graphics with R, Manning Publications, 2015.
- W. John Braun, Duncan J. Murdoch. A First Course in Statistical Programming with R, Cambridge
University Press, 3rd Edition, 2021.
- Tianyuan Huang (黄天元). R语言数据高效处理指南 (A Guide to Efficient Data Processing in R),
Peking University Press, 2019.
- Deborah Nolan, Duncan Lang. Data Science in R: A Case Studies Approach to Computational
Reasoning and Problem Solving, CRC, 2015.
Matlab
Matlab was my favourite tool in my research. It is perhaps the most successful commercial software
in mathematical programming. It is very powerful, has a user-friendly interface (debugging is easy
and the generated figures are editable), and is very good at simulating and modelling systems. Matlab
has the File Exchange, although it is not as popular as the Python and R communities. A
Matlab licence can be costly, though many universities and companies purchase licences each
year for students, staff and researchers. I found the following two books very helpful when I used
and taught Matlab for data analytics.
- Wendy Martinez, Angel Martinez. Computational Statistics Handbook with Matlab, 3rd Edition, CRC,
2015.
- Jaan Kiusalaas. Numerical Methods in Engineering with Matlab, 2nd Edition, Cambridge University
Press, 2012.
Microsoft Azure Machine Learning Studio
Some of my business school courses are for students who have not received much mathematical
and/or programming training. Their primary goal is to quickly apply popular machine learning
algorithms to business analytics. For them, Azure Machine Learning Studio can be an
ideal tool. It allows users to build and deploy machine learning algorithms in a simple way by
connecting basic modules. R, Python and Jupyter notebooks can also be used in Microsoft Azure
Machine Learning Studio to write customised functions and analytics. It should be noted that
Machine Learning Studio (classic) will be retired by 31 August 2024, with users transitioning to
Azure Machine Learning.