How should I start learning Data Science skills?

Author: Kamil Wais
Published: September 24, 2023

On numerous occasions, I’ve been approached with queries on how to begin acquiring Data Science skills. Should you dive into a stack of books, enroll in online courses, pursue post-graduate studies, learn from English content or in your native language, or perhaps invest in paid workshops? Or maybe you’re considering self-studying from the plethora of free content available online.

Now, there are even more possibilities than there were when I started with Data Science. Even back then, I found myself struggling to make the right decision and find the optimal path. Let’s review my journey first to see how one might struggle to find an ideal learning path.

My journey

Looking back, I remember being unsure where to start my journey in data science. During my major in sociology, I had taken some rudimentary classes involving SPSS, but it wasn’t until after my Ph.D. that I truly dove into the field. I undertook a postgraduate program in Statistical Methods in Business, which consisted of computer workshops using SAS software. This experience helped me explore various areas of statistics and their applications in different domains.

However, I found myself heavily reliant on proprietary software that required expensive licenses. This was a significant hurdle for someone like me who was self-funding my learning journey. Fortunately, our instructors also recommended exploring R - a statistical programming language that was not only powerful but also freely available. Though I was exposed to many Data Science methods already, I realized that I needed a deeper understanding and a better tool. 

I began learning R by completing my first MOOC on the Coursera platform in 2012. It was called Statistics One and was taught with R by Andrew Conway, a Senior Lecturer at Princeton University. Over the next few years, I completed more than one hundred MOOCs (yes, that’s a lot!), focusing primarily on topics related to Data Science and Computer Science. Here’s a brief list of selected subjects I’ve dived into:

  • Statistical inference
  • Databases
  • Social and economic networks
  • Gamification
  • Machine learning
  • Big data
  • Recommender systems
  • Social surveys and methods and statistics in social sciences
  • Evaluating social programs
  • Measuring causal effects
  • Process Mining
  • Programming
  • Interpretation of randomized clinical trials
  • Business Analytics
  • Smart Cities
  • Internet of Things
  • Semantic web technologies
  • Software product management
  • Data Science Ethics
  • Data mining
  • Linked Data
  • Agent-Based Modeling
  • Bayesian Statistics
  • User Experience
  • Data visualization
  • Structural Equation Modeling
  • Forecasting

It’s not like I started out knowing that R would end up as my primary programming language. Initially, I dabbled in multiple courses across various languages and technologies, such as R, Python, Java (specifically for Android), Tableau, HTML, CSS, JavaScript, AngularJS, and NodeJS. I even undertook a few projects using Excel and Visual Basic. One of the first MOOCs I enrolled in was the renowned Machine Learning course by Andrew Ng, Director of the Stanford Artificial Intelligence Lab and a Professor at Stanford University. This course largely revolved around Matlab or its free alternative, Octave. Of course, there are many programming languages I haven’t had the opportunity to revisit. However, this experimentation phase was essential and expected. It allowed me to explore different paths until I realized my keen interest in R.

At the inception of my learning journey with R, I found several books exceptionally helpful. My educational background in SPSS and SAS allowed me to more easily make a connection with the book “R for SAS and SPSS Users” by Robert A. Muenchen. This book served as a springboard for my transition, providing a wealth of comprehensive, yet easily digestible information. 

Moreover, as a non-native English speaker, I have found enormous value in resources available in my native language. One example is Przewodnik po pakiecie R by Przemysław Biecek. First published in 2008, it was the first book written entirely in Polish and dedicated solely to teaching the R programming language.

I understand the challenges one might face using learning materials in a second language, especially at the beginner level. Therefore, depending on your comfort level and proficiency in English, you might want to start with resources in your native language. However, I strongly recommend switching to English content as soon as possible. To enhance your skills and situate yourself at the forefront of data science, you’ll benefit significantly from the vast reservoir of resources available in English.

Nowadays, I work with R professionally full-time in an international environment, and that wouldn’t be possible without knowing English. Some might even say that:

Of all programming languages, English is the most important one, as with this language you can learn all others.

Was my learning path the best one? Not necessarily. I spent a great deal of time exploring various aspects and potential avenues. Deciding which skills, technologies, or programming languages will offer the highest return on investment is always a challenging task. However, in hindsight, it’s clear that some areas were indeed more worthwhile and vital to understand than others. It isn’t necessarily about specific tools or programming languages; instead, it’s about grasping fundamental concepts, understanding more complex common problems, or recognizing patterns.

The possibilities, the job market, and the technologies are constantly changing. So please view my journey as a source of inspiration, rather than a formula for success with data science. Everyone’s learning path will - and probably should - be somewhat different. However, I guarantee you’ll experience an exploration phase and a subsequent specialization phase, and after some time, the cycle of exploration and specialization will likely start over. The crucial point here is to grasp and understand the most fundamental underlying concepts that can be transferred between tools, technologies, programming languages, and domains.

Why do people often find it difficult to start learning Data Science skills?

Today, learning Data Science skills on your own may seem more challenging than ever before. Why is this so? Predominantly because of the surfeit of resources now available, both free and paid. These include a plethora of MOOCs, short online courses, books, and e-books, not to mention numerous blog posts and YouTube videos. Consequently, anyone wishing to take the first tentative step into Data Science is faced with an almost paralyzing array of options.

Additional challenges include language barriers, especially relevant for non-native English speakers. Furthermore, there’s an avalanche of new terminology that might appear as an entirely new dialect to newcomers, regardless of their linguistic background. 

Amidst this complexity, a big question arises concerning the choice of the first suitable programming language for learning. Should one opt for R, Python, or perhaps, Julia, or some other language altogether? Moreover, if up to this point you’ve only worked on platforms like Excel, SAS, or SPSS, the thought of learning a full-fledged scripting language may seem intimidating. Especially for those who’ve only written macros before, the transition could be quite overwhelming. 

The dilemmas don’t stop there. For instance, should you learn how to conduct analysis on a cloud-based platform or learn how to do it locally on your personal computer? The former provides independence from local IT infrastructure but may involve some costs. On the other hand, the local solution only requires a personal laptop, presumably already available, but this approach necessitates an understanding of how to set up your environment and install and configure all the necessary software. 

I grappled with these issues myself, so I’m in a position to share some first-hand insights and lessons learned which assisted me in progressing.

What are the options and what are their pros and cons?

Should I learn from YouTube videos?

There’s an abundance of video tutorials available on YouTube, encompassing a wide range of Data Science topics. These resources are easily accessible and free, which sounds like a dream for any aspiring learner. But here’s the catch—it can be rather challenging to find the right tutorials. The problem isn’t scarcity, it’s the overwhelming volume of content! 

Without knowing who the authors of these videos are, or what domain and teaching experience they have, you can easily end up spending your time on low-quality or difficult-to-understand content. As a novice in this field, it’s not always obvious what you need to look for. From my personal experience, trying to stitch together coherent learning from randomly selected videos can feel like a wild goose chase.

So, I wouldn’t recommend embarking on your Data Science journey by randomly watching videos in no particular order. Trust me on this: having a structured, step-by-step learning path is much more beneficial.

Should I learn from MOOCs and other online courses?

There is a myriad of online courses available on various platforms, ranging from short, quick ones to intensive multi-week programs. Many are paid, but it’s not necessarily wise to dive into every paid course that piques your interest. On my journey, I’ve completed many extensive online courses that required a considerable investment of both money and time. However, I always made sure to do thorough research beforehand to ensure the course would provide valuable knowledge and skills.

Luckily, there is also a wealth of so-called Massive Open Online Courses (MOOCs) on platforms like Coursera or edX. Typically, these platforms allow you to “audit” courses for free. This means you can access the course content, such as videos, without any payment, though if you want to tackle assessments and earn a course certificate at the end, there will be a fee. It’s important to note that earning a certificate is not mandatory. You can certainly add it to your LinkedIn profile or CV, but always remember that what truly matters is the knowledge and skills you’ve gained. Focus on that first.

One major advantage of MOOCs is the high-quality educational content they offer. This content is expertly curated by some of the world’s best lecturers from prestigious universities or experienced professionals from leading industry companies. With the plethora of MOOCs available today, it can be quite a task to find the ones that truly resonate with your learning path. Moreover, determining the appropriate sequence to take them could be equally challenging. That’s where R4Good.Academy steps in. 

Should I learn R or Python to work with data?

Let’s start with something simple: if you’re asking which programming language to start with, it probably doesn’t matter at this point.

You’ve got options and plenty of them. Start learning R first, then Python, or flip them around—it can be your call. There’s no rule saying you can’t learn both, or indeed, start learning others like Julia, C++, Rust, or Go while you’re at it. Looking to flex your skills in building analytical dashboards? It’s quite likely you’ll need to pick up CSS, HTML, and JavaScript. But here’s the kicker: the final choice isn’t cast in stone. It’s subject to what projects you’ll be working on, which industry you’re aiming for, what companies you’re interested in, or what tech they’ve been using previously. 

However, there’s a word of caution I need to share: juggling too many programming languages or internet technologies at the same time could leave you feeling overwhelmed and lost. It’s generally a wise move to start with just one. If you’re trying to make that choice, look for a language that is free, popular, and adept at solving a wide range of data-related problems. If you’re like me and you don’t come from a Computer Science background, I’d recommend starting with R.

Is learning R from journals and official software documentation a viable option?

At the beginning? Probably not. While there are excellent sources such as the R Journal and the Journal of Statistical Software, not to mention the official documentation of R and the individual documentation for every R package, they’re not necessarily newbie-friendly. To be clear, this content is incredibly current: if there’s a new tool like an R package out there, you can bet you’ll find some documentation for it. Jump in and you can start using it right away. The hurdle, however, is that this kind of documentation can be a minefield for beginners. We’re not talking about IKEA-style assembly instructions here; it’s complex, technical information with minimal examples. It assumes a certain level of prior knowledge, something you might not have as a beginner.

Should I learn R and Data Science from books?

Books can often be the ideal learning resource for beginners, especially if they’re specifically crafted with beginners in mind. More often than not, books are known to contain high-quality content. This is because they are meticulously written by experienced authors and rigorously scrutinized by reviewers. Thus, they serve as a reliable source of information when you are starting your journey in data science.

The first issue with traditional books is that they can be pricey. Thankfully, many books are available as low-cost e-books or free PDFs, and even some pricey printed books have a free HTML version on the web.

The second issue with books is that they aren’t updated as frequently as the official documentation of R packages, and translated editions tend to lag even further behind. While it’s certainly not poor practice to begin with books in your native language, diving into original English publications might prove to be an even better approach. This principle often applies across all forms of content:

Mastering the material in its original language can offer you an upper hand over those who must wait for the translation.

Should you pursue Data Science at a university?

Nowadays, many universities offer a variety of interesting Data Science programs. You can find fully online programs, postgraduate studies, and more. While these programs can be quite costly, they may provide a significant return on investment if they deliver high-quality content. However, if you’re a true beginner, be prepared for a steep learning curve - it’ll feel like trying to drink water from a fire hydrant. You’ll face a plethora of content that you’ll need to understand each week. Achieving this is possible; trust me, I’ve done it. But I suggest a different approach: prepare yourself before jumping in. Start by familiarizing yourself with the topics, and invest time in self-studying relevant books or enrolling in MOOCs.

What are our recommendations?

At R4Good.Academy, we provide lists of recommended resources that are optimal for starting your journey into various Data Science topics. These lists will be updated over time, because it takes time to read the books and audit the online courses, and new ones are emerging all the time. Rest assured, I only recommend content that I can vouch for - it should be of high quality, truly useful, understandable, and worth your precious time. Additionally, the recommended content is either free or low in cost to ensure easy accessibility for everyone.

Lists like that, especially long ones, often fail to give you a clear study order. Thus, in the PDF First Steps in Learning Data Science Skills, which you can get after subscribing to the R4Good.Academy newsletter, you’ll find a concise list of books and online courses that I recommend studying in a specific order. This PDF will also be updated if anything new and worthwhile emerges.

As a general philosophy, I recommend resources that allow you to dive into the Data Science world through what I call a gentle introduction. You study at your own pace, starting with carefully selected books and moving on to longer online courses, and eventually gain some real-world experience through side projects.

Even the best list of resources won’t help you if you don’t start adopting the proper mindset. The mindset you need includes:

  • Time Commitment: It’s important to make a regular and sustained commitment to learning. This means setting aside blocks of time for study regularly. Reading a single book or completing a single online course is not going to provide all the skills you desire. However, every step brings you closer to your goal. It’s a continuous process that requires dedication over the long haul. Commit to it.

  • Keep Learning: Do so even if you don’t understand most of the things at the beginning. That’s normal. Be gentle with yourself. Be patient. As you learn more and put what you’re learning into context, you’ll eventually be able to connect the dots. So, continue exploring.

  • Be Open: There’s so much promise in being open to new knowledge and to the possibility that things might not immediately make sense but that, given enough time and effort, you can grasp even complex concepts. I distinctly remember an instance when I was teaching a class. One student, in particular, heaved a deep sigh, looking completely frustrated, and announced, “I will never understand statistics.” Now, we were not discussing complex statistical models at this point of the lecture. We were merely discussing basic elements like the mean and median of data. Initially, I was taken aback, but then I decided to view this situation from a different perspective. I couldn’t possibly be such a bad teacher that a student felt so lost in a basic concept, could I? At that moment I understood that a closed mind, due to fear or lack of commitment, can be closed even to the simplest ideas.
    Don’t ever label yourself in a limiting way. While learning, it is essential to remember that understanding comes with time and patience. Complex concepts may seem daunting now, but with consistent effort, you might develop a deep understanding, or at least an intuition for them. So my advice? Be patient, be persistent. View the things you don’t understand today as interesting concepts that you just haven’t grasped yet.

So, you’re aware of the mindset you should adopt, and there will be a plethora of resources available to assist you. However, at R4Good.Academy, I aim to go a step further. I’ll prepare a concise course that will help you begin with something easily achievable yet significant. You’ll learn how to craft a professional-looking CV (using our dedicated template for PDF documents) with RStudio IDE, Quarto, and Markdown, without any statistical programming. Moreover, you’ll learn how to set up the entire R environment locally on your computer or laptop. This way, you can carry it with you wherever you go and use it whenever required.

Why should you start with this particular project? Typically, learning Data Science requires a substantial amount of time and dedication to accomplish even the simplest tasks on your own. However, before we delve into data, statistics, and programming, it’s beneficial to first learn something you’ll inevitably have to understand anyway — setting up your working environment with R and preparing different kinds of reports. These newly acquired skills will be immediately applicable to practical, real-world purposes such as creating your professional CV, or any PDF document for that matter. As you progress further in your journey, you’ll learn how to seamlessly integrate outputs from your data analysis into such documents or reports.

If you’re interested, subscribe to the R4Good.Academy newsletter. I’ll inform you as soon as this free educational content is available and ready to use.