Most data science projects (as keen as I am to say all of them) require a certain level of data cleaning and preprocessing to make the most of machine learning models. Some common preprocessing steps or transformations are:
a. Imputing missing values
b. Removing outliers
c. Normalising or standardising numerical features
d. Encoding categorical features
Scikit-learn has a bunch of classes that support this kind of transformation, such as StandardScaler in the preprocessing module and SimpleImputer in the impute module.
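As a rough illustration of items a, c and d above, here is a minimal sketch on made-up toy data. The arrays, the step names and the choice of mean imputation are my own assumptions for the example, not taken from any real project.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numerical features with one missing value (hypothetical data).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Chain the steps so they always run in the same order:
# fill missing values with the column mean, then standardise each column.
numeric_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_clean = numeric_pipeline.fit_transform(X)

# Toy categorical feature (hypothetical data), one-hot encoded.
colours = np.array([["red"], ["blue"], ["red"]])
encoded = OneHotEncoder().fit_transform(colours).toarray()

print(X_clean)
print(encoded)
```

After fitting, every numerical column has zero mean and unit variance, and the missing entry has been filled in before scaling.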
A typical and simplified data science workflow would look like
You might have been working on something on your own for a while. Then one day you need to share it or collaborate with others. Maybe your supervisor wants to have a look at what you’ve been doing, maybe there are new recruits who are gonna share your workload, or you simply want to share your work! This practice consists of 2 parts: initialising Git and publishing your project to GitHub.
The community has long debated whether we should use merge or rebase.
Some people would say merge is better because it preserves the most complete working history. Others would argue rebase is neater, which makes the reviewer’s life easier and more efficient. This article will explain the differences between merge and rebase and the benefits of using each.
Fundamentally, merge and rebase serve the same purpose: to integrate changes from one branch (sometimes multiple branches) into another. They are most commonly used when you want to integrate the latest master or develop branch before…
Following the previous An Intro to Git and Github for Beginners, today we’re gonna get our hands dirty. This story is going to walk you through the Git workflow with practical code examples. Since GitHub is the most popular website for hosting Git repositories, we’ll be using it for the examples. We’ll be working with Git’s command-line interface, so GitHub CLI is not covered.
Engineer A: “Hey, have you merged your branch to develop yet?”
Engineer B: “No, I’m waiting for my previous PR to be merged so I can proceed. But I’ve already staged all my work and pushed.”
Engineer A: “Ok, I’ll review and merge it later. Don’t forget to rebase!”
Engineer B: “Thanks, I almost forgot that. In the meantime, I’ll work on another branch.”
You: “Huh? Were you speaking English? So what’s the progress now??”
Does it sound familiar? This is exactly what happened to me when I first met these high-tech-sounding terms. …
To test or not to test, this is a statistical question.
In this global battle with the pandemic, Taiwan has done a marvelous job keeping its citizens safe and healthy, with only 850 confirmed cases and 7 deaths in total to date (17/01/2021). For a small island only a strait away from China, where the virus originated, the number is nothing short of incredible. Apart from recognising the pandemic much sooner than the rest of the world, Taiwan has never imposed compulsory COVID-19 testing on international arrivals. As a matter of fact, there aren’t many countries that have imposed this policy.
Pandas is a powerful Python package for wrangling your data. However, have you ever encountered some tasks that just make you think ‘if only I could use a SQL query here!’? I personally find it particularly annoying when it comes to joining multiple tables and extracting only the columns you want in pandas. For example, say you’d like to join 5 tables. You absolutely can do this with only one query in SQL. But in pandas, you have to merge 4 times: a+b, (a+b)+c, ((a+b)+c)+d, …. What’s worse, every time you merge, pandas will keep all columns, even though you probably only need one…
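To make the SQL side of that contrast concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The three tables, their columns and their rows are hypothetical, made up purely for illustration. The point is only that a single query can join several tables and return just the columns you ask for.

```python
import sqlite3

# In-memory database with three small hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, product_id INTEGER);
    CREATE TABLE products (product_id INTEGER, title TEXT, price REAL);
    INSERT INTO customers VALUES (1, 'Ann', 'Taipei'), (2, 'Bob', 'London');
    INSERT INTO orders VALUES (10, 1, 100), (11, 2, 101);
    INSERT INTO products VALUES (100, 'Keyboard', 30.0), (101, 'Mouse', 15.0);
""")

# One query joins all three tables and keeps only two columns.
query = """
    SELECT c.name, p.title
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN products AS p ON p.product_id = o.product_id
    ORDER BY o.order_id
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

The same result in pandas would take two chained merges followed by a column selection, and each extra table adds another merge.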
This article provides a step-by-step tutorial on connecting to Azure SQL Server using Python on Linux.
After creating an Azure SQL Database/Server, you can find the server name on the overview page.
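As a hedged sketch of where that server name ends up, the helper below assembles an ODBC-style connection string. Everything here is an assumption for illustration: the function name, the placeholder credentials, and the choice of the "ODBC Driver 17 for SQL Server" driver, which needs to be installed on the Linux machine separately. The resulting string is the kind you would pass to a library such as pyodbc.

```python
def build_connection_string(server, database, username, password,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string for an Azure SQL database.

    Hypothetical helper: all argument values are supplied by the caller.
    """
    return (
        f"DRIVER={{{driver}}};"        # driver name wrapped in literal braces
        f"SERVER=tcp:{server},1433;"   # server name from the overview page
        f"DATABASE={database};"
        f"UID={username};"
        f"PWD={password};"
        "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
    )


# Placeholder values only; replace them with your own.
conn_str = build_connection_string(
    server="myserver.database.windows.net",
    database="mydb",
    username="myuser",
    password="mypassword",
)
print(conn_str)

# With pyodbc installed, the connection itself would look like:
#   import pyodbc
#   conn = pyodbc.connect(conn_str)
```

In real projects, read the password from an environment variable or a secrets store rather than hard-coding it.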
When it comes to time series analysis, resampling is a critical technique that allows you to flexibly define the resolution of the data you want. You can either increase the frequency, such as converting 5-minute data into 1-minute data (upsample, increase in data points), or go the other way around (downsample, decrease in data points).
Quoting the documentation, resample is a “Convenience method for frequency conversion and resampling of time series.”
In practice, there are 2 main reasons for using resample.
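Here is a minimal sketch of both directions on a made-up 5-minute series. The values and the start date are hypothetical. Averaging when downsampling and forward-filling when upsampling are just one possible choice each; summing or interpolating would work equally well depending on the data.

```python
import pandas as pd

# Six readings taken every 5 minutes (hypothetical data).
index = pd.date_range("2021-01-01 00:00", periods=6, freq="5min")
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index)

# Downsample: 5-minute data -> 15-minute data, averaging each bucket.
downsampled = series.resample("15min").mean()

# Upsample: 5-minute data -> 1-minute data, carrying the last
# known value forward into the newly created timestamps.
upsampled = series.resample("1min").ffill()

print(downsampled)
print(len(upsampled))
```

Downsampling always needs an aggregation (mean, sum, max and so on), while upsampling needs a rule for filling the new, initially empty timestamps.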
Jupyter notebook, previously known as IPython notebook, is one of the most popular IDEs for data science projects. You can put code, visualisations, notes, images, and comments all together to enhance readability and communication. The following are some tricks I’ve found pretty useful, and wish I’d known earlier, after working on a number of data science/analysis projects.
1. Notebook width adjustment
When you open a notebook, it doesn’t come full width by default. It will only utilise around 50% of the screen. …
No matter what your knowledge is, there’s always someone in the world who wants to know it. So share it!