Hands-on Tutorials

Use Pipelines to streamline your data science project right now!

Photo by Myriam Jessier on Unsplash

Most data science projects (I’m tempted to say all of them) require a certain level of data cleaning and preprocessing to get the most out of the machine learning models. Some common preprocessing steps or transformations are:

a. Imputing missing values

b. Removing outliers

c. Normalising or standardising numerical features

d. Encoding categorical features

Scikit-learn has a bunch of classes that support this kind of transformation, such as StandardScaler in the preprocessing module and SimpleImputer in the impute module.
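As a rough illustration of how these transformers can be chained, here is a minimal sketch that assumes a toy DataFrame with one numerical and one categorical column. The column names and values are hypothetical, and this is only one possible arrangement, not necessarily the workflow the article goes on to build.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: "age" has one missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "city": ["Taipei", "London", "London", "Taipei"],
})

# Numerical column: fill missing values with the median, then standardise.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column to its own transformation.
# sparse_threshold=0 asks for a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,
)

# One scaled "age" column followed by one column per city.
features = preprocess.fit_transform(df)
print(features.shape)
```

Keeping every step inside one object means the same fitted transformations can later be applied to new data with `preprocess.transform(...)`.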

A typical, simplified data science workflow would look like


Git initiation, rename, stash, reset, and rebase

Beyond the ordinary workflow, these are some of the Git practices that I have found very useful at work

Photo by Fotis Fotopoulos on Unsplash

Prerequisite

1. Initiate Git in your project and publish it to GitHub

You might have been working on something on your own for a while, and one day you need to share it or cooperate with others. Maybe your supervisor wants to have a look at what you’ve been doing, maybe there are new recruits who are gonna share your workload, or you simply want to share your work! This practice consists of 2 parts: initiating Git and publishing your project to GitHub.


Git Question 101

This is probably the most-asked question among Git users

Photo by mari lezhava on Unsplash

The community has long debated whether we should use merge or rebase.

Some people would say merge is better because it preserves the most complete working history. Others would argue rebase is neater, which makes the reviewer’s life easier and more efficient. This article will explain the differences between merge and rebase and the benefits of each.

Fundamentally, merge and rebase serve the same purpose: to integrate changes from one branch (sometimes multiple branches) into another. They are most commonly used when you want to integrate the latest master or develop branch before…


Go through the Git workflow step by step with hands-on practice (code example)

Photo by Jefferson Santos on Unsplash

Following the previous story, An Intro to Git and Github for Beginners, today we’re gonna get our hands dirty. This story is going to walk you through the Git workflow with practical code examples. Since GitHub is the most popular website for hosting Git repositories, we’ll use it in the examples. We’ll be working with Git’s command-line interface, so GitHub CLI is not covered.

Prerequisite

Clone the repository from GitHub

There are 2 ways to get a repository, create…


Lesson no. 1 of Git and GitHub

To finally understand what the heck those engineers are talking about!

Photo by Lorenzo Herrera on Unsplash

Engineer A: “Hey, have you merged your branch to develop yet?”

Engineer B: “No, I’m waiting for my previous PR to be merged so I can proceed. But I’ve already staged all my work and pushed.”

Engineer A: “Ok, I’ll review and merge it later. Don’t forget to rebase!”

Engineer B: “Thanks, I almost forgot that. In the meantime, I’ll work on another branch.”

You: “Huh? Were you speaking English? So what’s the progress now??”

Does it sound familiar? This is exactly what happened to me when I first met these high-tech-sounding terms. …


See how we combat the pandemic from a statistical perspective

Photo by Edwin Hooper on Unsplash

To test or not to test, that is a statistical question.

In this global battle with the pandemic, Taiwan has done a marvelous job keeping its citizens safe and healthy, with only 850 confirmed cases and 7 deaths in total to date (17/01/2021). For a small island only a strait away from China, where the virus originated, these numbers are nothing short of incredible. Apart from recognising the pandemic much sooner than the rest of the world, Taiwan has never imposed compulsory COVID-19 testing on international arrivals. As a matter of fact, not many countries have imposed this policy.

Why…


An easy trick with Python’s built-in database, SQLite, to make your data manipulation more flexible and effortless.

Photo by William Iven on Unsplash

Pandas is a powerful Python package for wrangling your data. However, have you ever encountered a task that just makes you think ‘if only I could use a SQL query here!’? I personally find it particularly annoying when it comes to joining multiple tables and extracting only the columns you want in pandas. For example, say you’d like to join 5 tables. You can absolutely do this with a single query in SQL. But in pandas, you have to merge 4 times: a+b, (a+b)+c, ((a+b)+c)+d,…. What’s worse, every time you merge, pandas keeps all the columns, even though you probably only need one…
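The article’s own approach is cut off above, so the following is only a sketch of one way the idea could work: load DataFrames into an in-memory SQLite database and pull back just the columns you need with a single query. The table names, column names, and values are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical tables standing in for the ones you would otherwise merge.
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "product_id": [100, 101, 100],
})
products = pd.DataFrame({"product_id": [100, 101], "product": ["pen", "ink"]})

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
tables = {"customers": customers, "orders": orders, "products": products}
for table_name, frame in tables.items():
    frame.to_sql(table_name, conn, index=False)

# One query joins all three tables and selects only the wanted columns.
query = """
SELECT o.order_id, c.name, p.product
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
JOIN products AS p ON p.product_id = o.product_id
ORDER BY o.order_id
"""
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```

The result comes back as an ordinary DataFrame, so the rest of the analysis can continue in pandas.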


This article provides a step-by-step tutorial on connecting to Azure SQL Server using Python on Linux.

Photo by Markus Winkler on Unsplash


After creating an Azure SQL Database/Server, you can find the server name on the overview page.


Hands-on Tutorials

How to use Resample in Pandas to enhance your time series data analysis

Photo by Jiyeon Park on Unsplash

When it comes to time series analysis, resampling is a critical technique that allows you to flexibly define the resolution of the data you want. You can either increase the frequency, for example by converting 5-minute data into 1-minute data (upsampling, more data points), or go the other way around (downsampling, fewer data points).

Quoting the documentation, resample is a “Convenient method for frequency conversion and resampling of time series.”
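To make the two directions concrete, here is a minimal sketch on a made-up 5-minute series. The values are hypothetical, and the forward-fill and mean used here are just one choice each; other fill or aggregation methods may suit your data better.

```python
import pandas as pd

# Hypothetical 5-minute series with four observations (00:00 to 00:15).
index = pd.date_range("2021-01-01 00:00", periods=4, freq="5min")
series = pd.Series([10.0, 20.0, 30.0, 40.0], index=index)

# Upsample to 1-minute resolution, carrying the last known value forward.
upsampled = series.resample("1min").ffill()

# Downsample to 10-minute resolution, averaging the points in each bin.
downsampled = series.resample("10min").mean()

print(len(upsampled))  # 16 points, one per minute from 00:00 to 00:15
print(downsampled)     # two bins with means 15.0 and 35.0
```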

In practice, there are 2 main reasons to use resample.



Jupyter notebook, previously known as IPython notebook, is one of the most popular IDEs for data science projects. You can put code, visualisations, notes, images, and comments together in one place to enhance readability and communication. The following are some tricks I found pretty useful, and wish I’d known earlier, after working on a number of data science/analysis projects.

1. Notebook width adjustment

When you open a notebook, it doesn’t use the full width by default. It only utilises around 50% of the screen. …

James Ho

No matter what your knowledge is, there’s always someone in the world who wants to know. So share it!
