Data science is a vast field. Nowadays, companies gather user data to make useful decisions from it. Python is the language most extensively used for data science; in my view, data science-related tasks are its primary use. Python provides excellent tools and APIs that make these tasks easy to execute. However, before discussing the tools that Python offers, we should first understand what data science is, what its branches are, and what steps a data science workflow involves.
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science encompasses the fields of data mining and big data.
Data science is further divided into subfields like:
1. Image Processing
2. Natural Language Processing
3. Data mining
4. Big Data
Image processing involves manipulating images and extracting information from them. Natural language processing involves tasks such as speech recognition. Data mining and big data involve extracting information from a sea of data.
The steps involved in data science are the following:
Data Acquisition:
Acquiring the data is the first step in data science. The data is not always available in local storage; sometimes it is distributed across different databases. In such a situation, we gather the data from all of the sources.
Moreover, in the worst case, we do not even have direct access to the data. It is present on the web, but no database access is given; in such a situation, we have to scrape the data from the web and store it locally.
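To give a feel for scraping, here is a minimal sketch that pulls cell values out of an HTML table using only Python's standard library. The HTML snippet and its city names are made up for illustration; in practice you would first fetch the page with urllib.request.urlopen or the requests library.

```python
from html.parser import HTMLParser

# Made-up HTML snippet standing in for a fetched web page, so the
# example is self-contained and needs no network access.
html = """
<table>
  <tr><td>Lahore</td><td>34</td></tr>
  <tr><td>Karachi</td><td>36</td></tr>
</table>
"""

class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell in the page."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # ['Lahore', '34', 'Karachi', '36']
```

Once the values are extracted like this, they can be stored locally, for example in a CSV file or a pandas DataFrame.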
Data Wrangling:
Data wrangling, also known as data munging, is the process of cleaning the data. It is essential after acquiring the data because if the foundation is weak, the building will collapse. In this step, we remove all the data that is not necessary for our task and clean the remainder into a form our model can accept.
For example, if we have many images of cats and dogs, we first have to categorize all of the images and label them. After categorization, we resize all of the images to the same dimensions, and finally we convert the colored images to grayscale so that our model does not take too much time. This last step is not always necessary; sometimes we keep the color images, as in the case of classifying apples and oranges.
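The grayscale conversion mentioned above can be sketched with plain NumPy. The 4x4 "image" here is random data standing in for a real photo, which you would normally load with OpenCV (cv2.imread) or Pillow:

```python
import numpy as np

# Hypothetical 4x4 RGB "image" filled with random pixel values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Standard luminosity weights for converting RGB to grayscale:
# the three color channels collapse into one brightness channel.
weights = np.array([0.299, 0.587, 0.114])
gray = (image @ weights).astype(np.uint8)

print(image.shape, gray.shape)  # (4, 4, 3) (4, 4)
```

Dropping from three channels to one cuts the input size by two-thirds, which is exactly why grayscale conversion speeds up training.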
Removing Outliers:
First, let's understand what an outlier is with an example. Suppose we have two years of weather data that follows a clear trend, but on one day in July the temperature drops to 12 degrees centigrade; that day is an outlier, and it can distort the predictive model. In such a case, we remove the outlier, or replace it with an average, so that our predictive model is not affected.
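One common way to drop such an outlier with pandas is a z-score filter. The temperatures below are made up, with 12 standing in for the July outlier from the example; a cut-off of 2 is used here because the sample is tiny, though 3 is a more usual default:

```python
import pandas as pd

# Hypothetical July temperatures in degrees centigrade; 12 is the outlier.
temps = pd.Series([33, 34, 35, 12, 36, 34, 35])

# z-score: how many standard deviations each value sits from the mean.
z = (temps - temps.mean()) / temps.std()

# Keep only the values whose z-score is within the cut-off.
cleaned = temps[z.abs() < 2]
```

An alternative, as the text notes, is to replace the outlier with an average (for example `temps.mean()`) instead of dropping it.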
After cleaning the data and removing the outliers, we reduce the dimensionality of the data so that the computation cost stays low; otherwise, the model will take too much time to make a prediction.
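A standard dimensionality-reduction technique is principal component analysis (PCA); scikit-learn has a ready-made PCA class, but the idea can be sketched with NumPy's SVD on made-up data:

```python
import numpy as np

# Hypothetical dataset: 100 samples, each with 5 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# PCA via SVD: centre the data, then project it onto the top 2
# principal components, cutting the feature count from 5 to 2.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T

print(X_reduced.shape)  # (100, 2)
```

The model then trains on the two projected features instead of the original five, which lowers the computation cost as described above.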
Selecting the Algorithm for Training:
The selection of a machine learning algorithm is essential, and it depends on the type of data. For example, if we want to classify images, we might go for KNN or SVM; if we want to predict numeric data such as house prices, we go for linear regression or random forest regression techniques.
There are lots of modeling algorithms, and selecting one depends on the type of data.
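To make the KNN choice above concrete, here is a minimal k-nearest-neighbours sketch on made-up 2-D points; in real work you would reach for scikit-learn's KNeighborsClassifier rather than writing it by hand:

```python
import numpy as np

# Made-up training data: four 2-D points in two classes (0 and 1).
train_X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
train_y = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    # Distance from x to every training point, then a majority vote
    # among the labels of the k closest points.
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

print(knn_predict(np.array([5.5, 5.0])))  # predicts class 1
```

The same "nearest examples vote" idea scales up to image classification once each image is flattened into a feature vector.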
Python offers a variety of tools for data science-related tasks; some are as follows:
- Pandas for data wrangling and data visualization.
- NLTK for natural language processing
- OpenCV for image processing
- SkLearn for machine learning algorithms
- TensorFlow and Keras for deep learning
So, we will discuss pandas in detail and see how we can use it to achieve our ultimate goals.
Installation of Pandas:
Installing pandas is very easy; just type:
pip install pandas
or, for Python 3:
pip3 install pandas
We can use pandas in any editor, but it is good to use the Jupyter notebook because Jupyter makes it very easy to clean the data. It stores variables in memory, so we do not have to rerun all the code again and again. There are two ways to install a Jupyter notebook.
1. First, Anaconda provides all the libraries and tools, and the Jupyter notebook comes built in with Anaconda. To install Anaconda, visit the Anaconda website and download the latest version for your operating system.
2. The second way is to install the Jupyter notebook through the Python package manager by writing pip3 install jupyter.
How to launch Jupyter Notebook:
To launch Jupyter, go to the directory where you want to start it, open a terminal there, and write the command jupyter notebook.
In the above image, I went to the folder named data science, opened a terminal there, and wrote the command jupyter notebook. It then redirects to the browser and gives an interface like this:
Now click on the New button, then click to create a Python 3 file; it will automatically create a new file, and we can test it by running a print statement.
Each gray box in the above image is called a cell, and once a cell is executed, there is no need to execute it again; all the data will be stored in memory. Now we will import pandas to check whether the pandas module is installed or not:
So, when I import pandas, it shows me this message. That message may not pop up in your program; it is just a warning, so there is nothing to worry about. We are all set to dig into pandas.
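A quick smoke test in a fresh cell confirms the installation: if the import succeeds, pandas is installed, and the version string tells you which release you are running. The tiny animal DataFrame is just illustrative data:

```python
import pandas as pd

# Printing the version confirms which pandas release is installed.
print(pd.__version__)

# Build a tiny DataFrame and inspect its shape: 2 rows, 2 columns.
df = pd.DataFrame({"animal": ["cat", "dog"], "count": [3, 5]})
print(df.shape)  # (2, 2)
```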