Part 1 of SQL tricks to make your analytics work more efficient

Data scientists and analysts should know SQL; in fact, all professionals working with data and analytics should. To some extent, SQL is an underrated skill for data science because it is taken for granted as a necessary yet uncool way of extracting data from a database to feed into pandas or the {tidyverse}, the fancier ways to wrangle your data.


However, with massive amounts of data being collected and churned out every day across industries, as long as the data reside in a SQL-compliant database, SQL is still the most efficient tool to help you investigate, filter, and aggregate…

Be prepared: these goofy, corny jokes will crack you up 😁

It is well-known that being a data scientist means being a programmer, a data engineer, and a business analyst all in one. Although our job responsibilities vary, all data scientists share one thing: WE CODE, just as developers do!

In fact, people (stakeholders) relate us to programmers in the sense that we can magically solve data-related problems by sitting in front of a computer and typing on a keyboard, just as developers do with programming problems!

What’s more important? Well, data scientists share a goofy sense of…

Here is how to avoid them

In my previous mini-series, SQL Tricks Every Data Scientist Should Know, I shared 10 SQL tips to level up your analytics work. In today’s post, I would like to take a different perspective and discuss 3 common SQL mistakes that I have seen newer data scientists make. Knowing, and hence avoiding, these errors is just as important as collecting more tips and tricks, since writing well-performing SQL queries should be a priority for every data scientist/analyst.


When it comes to errors, there are primarily two types. The first type is referred to as Syntax Errors, which, as the name suggests, occur when we…


Sharing more Pandas tips to level up your data manipulation

My last article, 6 Lesser-Known Yet Awesome Tricks in Pandas, has hopefully given you a flavor of efficient coding in Pandas. Continuing with this topic, let’s explore more cool Pandas tips and tricks, in case you don’t know them already.

This blog works with the same Dairy products table, scraped directly from the Wikipedia page Table of food nutrients using the HTML trick we discussed in the previous blog. The table shows the food names, measures/grams, and the corresponding caloric values.
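As a refresher, the HTML trick boils down to `pd.read_html`, which parses every `<table>` element it finds and returns a list of DataFrames. A minimal, self-contained sketch using a tiny inline HTML snippet as a stand-in for the Wikipedia markup (the real call would pass the page URL instead):

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for the Wikipedia page: read_html parses every <table>
# in the HTML and returns a list of DataFrames, one per table.
html = """
<table>
  <tr><th>Food</th><th>Measure</th><th>Grams</th><th>Calories</th></tr>
  <tr><td>Butter</td><td>1 tbsp</td><td>14</td><td>100</td></tr>
  <tr><td>Cheddar cheese</td><td>1 oz</td><td>28</td><td>115</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # list of DataFrames
dairy = tables[0]                      # the (only) table here
```

On the live page, `pd.read_html("https://en.wikipedia.org/wiki/Table_of_food_nutrients")` returns one DataFrame per list on the page; the dairy table's index in that list is something you'd inspect rather than assume.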


Tricks I wish I knew sooner to get more value out of Pandas

As the most popular Python library for analytics, Pandas is a big project that offers various data manipulation and processing capabilities. It is probably no exaggeration to say that data scientists, myself included, use Pandas on a day-to-day basis in our work.

This blog is Part 1 of a mini-series dedicated to sharing my top 10 lesser-known yet favorite features in Pandas. Hopefully, you can walk away with some inspiration to make your own code more robust and efficient.

The dataset for this mini-series is from the Table of food nutrients, a Wikipedia page containing 16 tabular lists for…

A useful guide on long-to-wide and wide-to-long transformations

Think of this scenario: you just finished a SQL query producing summary statistics for the stakeholders. Let’s take the following table, returned from your query, as an example,

(Table: long data format)

This is longitudinal data tracking the number of firearm permit checks by month and state. The raw data are publicly accessible, and more background is linked here. Your project is a time-series analysis of the changes in firearm permits, both longitudinally across time and horizontally across states. …
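In Pandas, the two directions of this transformation are `pivot` (long to wide) and `melt` (wide to long). A minimal sketch on a toy version of the permit data; the column names here are made up for illustration:

```python
import pandas as pd

# Toy long-format data: one row per (month, state) pair.
long_df = pd.DataFrame({
    "month": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "state": ["Alabama", "Alaska", "Alabama", "Alaska"],
    "permit": [25000, 4500, 22000, 4100],
})

# Long -> wide: one row per month, one column per state.
wide_df = long_df.pivot(index="month", columns="state", values="permit")

# Wide -> long again with melt (reset_index first so 'month' is a column).
back_to_long = wide_df.reset_index().melt(
    id_vars="month", var_name="state", value_name="permit"
)
```

The wide form is what stakeholders usually want to read; the long form is what time-series tooling usually wants to consume, which is why you end up converting in both directions.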

A practical guide to import unstructured text/images data

As part of a data science/analytics team, you’ll probably encounter many file types to import and analyze in Python. In an ideal world, all our data would reside in cloud-based databases (e.g., SQL, NoSQL) that are easy to query and extract from. In the real world, however, we rarely get neat tabular data. Also, if we need additional data (structured or unstructured) to augment the analysis, we will inevitably be working with raw data files that come in different formats.
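The structured/unstructured split shapes how you read each file. A minimal sketch with the standard library, using made-up in-memory blobs in place of real files: a CSV parses into records, while an unstructured text file is just lines you process yourself.

```python
import csv
import io

# Hypothetical raw inputs standing in for files on disk.
csv_blob = "name,grams\nButter,14\nMilk,244\n"
text_blob = "line one\nline two\n"

# Structured: DictReader maps each row to a dict keyed by the header.
rows = list(csv.DictReader(io.StringIO(csv_blob)))

# Unstructured: no schema, so you get raw lines to parse however you need.
lines = text_blob.splitlines()
```

With real files you would pass `open(path, newline="")` to `csv.DictReader` instead of a `StringIO`; the shape of the code stays the same.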


Recently, my team started a project that, as the first step, involves integrating raw data files in formats .csv…

Part 2 of getting more out of SQL to step up your analytics work

In my previous blog, 6 SQL Tricks Every Data Scientist Should Know, I discussed the most useful SQL functionalities in my daily work as a data scientist. In this blog, Part 2 of the mini-series, I will continue to share SQL tips to help you level up your analytics/reporting efficiency.

Today, we will be working with a new toy table, shown below, containing multiple data types. For demonstration purposes, this exercise is implemented in MS SQL Server 2017,

(Table: toy data table, with variable definitions)
1. ROW_NUMBER() to return a subset of rows

The ROW_NUMBER() function in SQL creates a unique incrementing integer value to…
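The common pattern is to number rows within each partition and then filter on that number, e.g. to keep only the latest row per group. A minimal sketch of the idea, run here through Python's `sqlite3` (SQLite ≥ 3.25 supports the same `ROW_NUMBER() OVER (...)` syntax as SQL Server); the table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('alice', '2021-01-05', 30.0),
  ('alice', '2021-02-10', 45.0),
  ('bob',   '2021-01-20', 12.5),
  ('bob',   '2021-03-01', 60.0);
""")

# ROW_NUMBER() numbers rows within each customer, newest order first;
# the outer filter keeps only each customer's latest order.
rows = conn.execute("""
SELECT customer, order_date, amount FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer
           ORDER BY order_date DESC
         ) AS rn
  FROM orders
)
WHERE rn = 1;
""").fetchall()
```

In SQL Server the derived table would additionally need an alias (`) AS t`), but the window-function clause itself is identical.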

Animated visualizations without coding in D3.js


The coronavirus, or COVID-19, has spread across several continents in less than two months, and has definitely been inciting public panic and a pandemic of fear! Every news outlet on the planet is covering the coronavirus, the stock market is melting down, and people are allowing anxiety to dictate their behaviors. Certainly, I am no expert in virus genetics or epidemiology by any means, but as far as I am concerned, the rational approach is to take the necessary precautions of staying at home and practicing social distancing until we get through this together!

Rather than planning for the worst-case scenario while…

I have coded up the correct way so that you don’t have to

Since Keras 2.0, the F-score, precision, and recall evaluation metrics have been removed. However, when it comes to imbalanced classification problems, they are the desired model performance measures. If this concept sounds unfamiliar, the linked paper provides a good explanation of the accuracy paradox and the precision-recall curve. As a building block for my series of posts on tackling imbalanced datasets in neural networks, this post focuses on implementing the F1-score metric in Keras and discusses what to do and what not to do.
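Before touching Keras, it helps to pin down what the metric actually computes. A framework-free sketch in NumPy of binary F1 from global counts (the Keras version wraps the same arithmetic in backend ops); the function name and `eps` smoothing term are my own choices here:

```python
import numpy as np

def f1_score(y_true, y_pred, eps=1e-7):
    """Binary F1 computed from global TP/FP/FN counts.

    Note: F1 is not linear in its counts, so averaging per-batch F1
    values is misleading; compute it over the whole dataset instead.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)
```

The `eps` term only guards against division by zero when a class never appears; it is the non-linearity noted in the docstring that makes the naive batch-wise Keras implementation go wrong.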

First Attempt: custom F1-score metric

According to the Keras documentation, users can pass custom metrics at the model compilation step. Easy…

Yi Li

Data Scientist
