Interpolation may sound like a fancy mathematical exercise, but in many ways, it is much like what machine learning does.

- Start with a **limited set of data points** relating multiple variables
- Interpolate (basically, create a model)
- Construct a new function that can be used to **predict** any future or new…
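The analogy above can be sketched in a few lines. This is a minimal illustration using NumPy's piecewise-linear `np.interp` (the data values here are made up for the example):

```python
import numpy as np

# A limited set of data points relating two variables
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])  # y = x**2, sampled

# "Interpolate" = build a model from the points (here, piecewise linear)
# and "predict" at a point we never observed
y_new = np.interp(2.5, x, y)
print(y_new)  # 6.5 — linear blend between (2, 4) and (3, 9)
```

Just like a trained model, the interpolant answers queries at inputs it was never given, with accuracy limited by how well the sampled points cover the space.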

When we do sophisticated statistical analyses or build complex machine learning models, we often forget that the most likely source of the data was a plain text file, read from a disk drive, streamed over an internet connection, or parsed from an HTML page.

This is a fact. Numeric data, used in…

Comma-separated values (CSV) is the most widely used flat-file format in data analytics. It is simple to understand and work with, and CSV files perform decently in small-to-medium data regimes. However, as we progress towards larger datasets, there are some excellent reasons to move towards file formats…
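The simplicity of CSV is exactly why it is everywhere: a few lines of the standard library are enough to turn the raw text into structured records. A minimal sketch (the payload here is invented for illustration):

```python
import csv
import io

# A small CSV payload, as it might arrive from disk or over the network
raw = "name,score\nalice,0.91\nbob,0.87\n"

# DictReader maps each row to a dict keyed by the header line
rows = list(csv.DictReader(io.StringIO(raw)))

# Note that every value arrives as a string; numeric use requires an
# explicit conversion — one of the hidden costs of flat text formats
scores = [float(r["score"]) for r in rows]
print(rows[0]["name"], scores)  # alice [0.91, 0.87]
```

That per-value string parsing (and the lack of any schema or compression) is a large part of why binary columnar formats pull ahead as data grows.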

We all love good and comprehensive documentation when we use a new library (or re-use our favorite one for the millionth time), don’t we?

Imagine how you would feel if they took away all the docs from the Scikit-learn or TensorFlow websites. You would feel pretty powerless, wouldn’t you?

Documentation is…

Sounds like a catchy title? Well, what we really mean by that term is **arbitrary-precision computation**, i.e., breaking away from the **restriction of 32-bit or 64-bit arithmetic** that we are normally familiar with.
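To make the idea concrete, here is a small sketch using Python's standard-library `decimal` module, which lets you dial the working precision up to whatever your problem needs:

```python
from decimal import Decimal, getcontext

# Ordinary 64-bit float arithmetic accumulates representation error
print(0.1 + 0.2)  # 0.30000000000000004

# Arbitrary precision: ask for 50 significant digits
getcontext().prec = 50
result = Decimal("0.1") + Decimal("0.2")
print(result)  # 0.3 — exact, because 0.1 and 0.2 are exact in decimal
```

Libraries like `mpmath` take the same idea much further, offering arbitrary-precision transcendental functions, but the escape from fixed-width arithmetic is the common theme.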

Here is a quick example.

You have some data points. Numeric, preferably.

And you want to find out **which statistical distribution they might have come from**. Classic statistical inference problem.

There are, of course, rigorous statistical methods to accomplish this goal. But, maybe you are a busy data scientist. Or, a busier software engineer who…
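One quick-and-dirty approach (a sketch, not a substitute for rigorous inference): fit a handful of candidate distributions with SciPy and rank them by the Kolmogorov–Smirnov statistic. The candidate set and the sample data below are chosen purely for illustration:

```python
import numpy as np
from scipy import stats

# Some numeric data points — here, secretly drawn from a normal distribution
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=5000)

# Fit each candidate by maximum likelihood, then score the fit with the
# K-S statistic (smaller = empirical CDF closer to the fitted CDF)
candidates = {"norm": stats.norm, "expon": stats.expon}
results = {}
for name, dist in candidates.items():
    params = dist.fit(data)
    ks_stat, _ = stats.kstest(data, name, args=params)
    results[name] = ks_stat

best = min(results, key=results.get)
print(best)  # "norm" wins, as expected
```

This is exactly the kind of shortcut a busy practitioner reaches for: crude compared to formal goodness-of-fit testing, but often good enough to narrow the field.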

We want to train an AI agent or model that can do something like this,

A little more specifically, we want to train an AI agent (or model) to identify/classify time-series data by:

- low/medium/high variance
- anomaly frequencies (*little or high fraction of anomalies*)
- anomaly scales (*are the anomalies too far from…*)

As I wrote in my highly-cited article, “*a synthetic dataset is a repository of data that is generated programmatically. So, it is not collected by any real-life survey or experiment. …*
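A generator for such labeled training series might be sketched as follows. The function name and its parameters (`variance`, `anomaly_frac`, `anomaly_scale`, matching the three properties listed above) are hypothetical, invented for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_series(n=500, variance=1.0, anomaly_frac=0.02, anomaly_scale=8.0):
    """Synthetic time series: Gaussian noise plus injected anomalies.

    variance       -> controls low/medium/high variance of the base signal
    anomaly_frac   -> fraction of points that are anomalous
    anomaly_scale  -> how far the anomalies sit from the normal range
    """
    series = rng.normal(0.0, np.sqrt(variance), size=n)
    n_anom = int(n * anomaly_frac)
    idx = rng.choice(n, size=n_anom, replace=False)
    series[idx] += anomaly_scale * rng.choice([-1.0, 1.0], size=n_anom)
    return series, idx

series, idx = make_series()
print(len(idx), series.shape)  # 10 anomalies in a 500-point series
```

Because the generator controls variance, anomaly fraction, and anomaly scale directly, every generated series comes with free, exact labels — the core appeal of synthetic data for this task.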

Dask is a feature-rich, easy-to-use, flexible library for parallelized computing in Python. It is specifically optimized and designed for data science and analytics workloads.

In most common scenarios, Dask comes to the rescue when you are dealing with large datasets that would have been tricky (if not downright impossible) to…
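The core idea is lazy, chunked computation: Dask splits a large array into blocks, builds a task graph, and only executes it when you call `.compute()`. A minimal sketch:

```python
import dask.array as da

# A 10,000 x 10,000 array of uniform random numbers, split into
# 1,000 x 1,000 chunks — nothing is materialized in full at once
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# This only builds a task graph; no work happens yet
result = (x + x.T).mean()

# .compute() triggers the (parallel, chunk-by-chunk) execution
print(float(result.compute()))  # ~1.0, the mean of two uniform(0, 1) draws
```

The same chunked, lazy model extends to `dask.dataframe` (a Pandas-like API) and `dask.delayed` (arbitrary Python functions), which is what makes the library useful across data science workloads.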

As data scientists, all of us have been there.

We are given a large Pandas DataFrame and asked to check some relationships between various fields in the columns — in a **row-by-row fashion**. It could be some logical operation or some sophisticated mathematical transformation on the raw data.

Essentially, it…
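A tiny sketch of that situation, with made-up columns: the row-by-row version uses `df.apply(..., axis=1)`, which calls a Python function once per row, next to the vectorized equivalent that usually runs orders of magnitude faster:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 5, 1]})

# Row-by-row: is field "a" smaller than field "b" in each row?
# Explicit, but slow — one Python-level call per row
flags = df.apply(lambda row: row["a"] < row["b"], axis=1)

# Vectorized equivalent: one column-level operation, no per-row overhead
flags_vec = df["a"] < df["b"]

print(flags.tolist())  # [True, True, False]
```

For a three-row frame the difference is invisible; for millions of rows, the per-row Python overhead of `apply` is precisely the pain point the passage above is describing.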