In this section, we will see how to implement pandas code in Python. First, we will see how to load data into pandas.
Input the File:
Pandas supports many file types for loading data. A user can load a CSV file, an Excel file, a JSON file, a SQL database, or a website URL that contains tabular data. We will see some of them below:
The following is the code to load a CSV file:
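The load step might be sketched as follows. The original reads the full Titanic CSV from disk; the file name and column names here are assumptions, and a tiny inline sample keeps the sketch self-contained:

```python
import pandas as pd
from io import StringIO

# In practice you would read the Titanic file from disk, e.g.:
#   df = pd.read_csv("titanic.csv")
# Here a small inline sample (guessed Kaggle-style columns) stands in
# for the real file so the example runs on its own.
csv_data = StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    "1,0,3,Braund Mr. Owen,male,22\n"
    "2,1,1,Cumings Mrs. John,female,38\n"
)
df = pd.read_csv(csv_data)
print(df)
```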
In the first line, we import pandas as pd to shorten the module name. Then we create a DataFrame into which we load the Titanic data, and finally we show that DataFrame. The output is as follows:
It shows all the data from the CSV file in a proper table. This is one of the reasons for pandas' popularity: it displays the data in a well-formatted manner that is easy to understand.
Load HTML data:
Suppose we have a website like W3Schools and we want to extract the tabular data from it. The website looks like this:
The following code is executed to extract the data from these two tables:
However, if we execute the above code, the following output is generated:
It returns a list that contains all the tables present on the web page. So we have to index into the list to get only the first table, and our code will look like this:
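A sketch of these two steps. With a live page you would pass the URL string directly (the W3Schools page used in the original is not specified here, so inline HTML keeps the sketch self-contained); note that `read_html` needs an HTML parser such as lxml installed:

```python
import pandas as pd
from io import StringIO

# With a live site you would write, e.g.:
#   tables = pd.read_html("https://www.w3schools.com/...")
# (the exact page is an assumption).
html = """
<table>
  <tr><th>Tag</th><th>Description</th></tr>
  <tr><td>table</td><td>Defines a table</td></tr>
  <tr><td>tr</td><td>Defines a row</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # a list of DataFrames, one per <table>
first = tables[0]                      # index the list to get the 1st table
print(first)
```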
Moreover, the output can be seen as follows:
Similarly, the second table output looks like this:
Similarly, we can import other formats like JSON, SQL, etc. into pandas. Even a text file in which the data is separated by some delimiter can be imported into pandas.
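For a delimiter-separated text file, `read_csv`'s `sep` parameter does the job; a minimal sketch (the pipe delimiter and column names are just an illustration):

```python
import pandas as pd
from io import StringIO

# Stand-in for a text file whose values are separated by "|".
text_data = StringIO("Name|Age\nAlice|30\nBob|25\n")

df = pd.read_csv(text_data, sep="|")  # sep tells pandas which delimiter to split on
print(df)
```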
DataFrame Head and Tail Functions:
The head function returns a number of rows from the start of the DataFrame. By default, it gives only five rows, but we can pass a parameter to get as many as we want. Similarly, the tail function returns rows from the end. The following is the implementation:
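A sketch of the head call, using a toy 891-row frame standing in for the Titanic data:

```python
import pandas as pd

# Toy frame with 891 rows, like the Titanic dataset.
df = pd.DataFrame({"PassengerId": range(1, 892)})

print(df.head())    # default: the first 5 rows
print(df.head(10))  # the first 10 rows, index 0 through 9
```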
The output is as follows:
The index runs from 0 to 9.
Now we will see the tail function:
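The tail call can be sketched the same way:

```python
import pandas as pd

# Toy frame with 891 rows, like the Titanic dataset.
df = pd.DataFrame({"PassengerId": range(1, 892)})

tail10 = df.tail(10)  # the last 10 rows: index 881 through 890
print(tail10)
```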
The output is as follows:
Now we can see the index runs from 881 to the last index. This is how we can check whether the data has loaded or not.
For indexing in pandas, there is a function called iloc. It takes two parameters: rows and columns. The row and column parameters can further be sliced with a colon.
Now we will see getting the data in different scenarios:
Getting rows 10 to 15 with all columns:
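A sketch of this selection on a small toy frame (the columns are assumed; note the end of an iloc slice is exclusive, so `10:15` returns rows 10 through 14):

```python
import pandas as pd

# Small stand-in frame with Titanic-style columns.
df = pd.DataFrame({
    "Name": [f"Passenger {i}" for i in range(20)],
    "Sex": ["male", "female"] * 10,
    "Age": range(20),
})

subset = df.iloc[10:15, :]  # rows 10..14; the bare ":" keeps every column
print(subset)
```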
In the above, we can see that iloc takes two parameters: the first is the slice from 10 to 15, and the second, after the comma, is empty (a bare colon), which means we want all of the columns. The output that is generated is as follows:
Getting Name and Sex of the first ten rows:
To get only the Name and Sex columns from the table, for only the first ten rows, the slicing is as follows:
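A sketch of this slice. In the usual Kaggle Titanic layout, Name and Sex sit at column positions 3 and 4; that layout is an assumption here:

```python
import pandas as pd

# Stand-in frame mimicking the Kaggle Titanic column order.
df = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Survived": [0, 1] * 10,
    "Pclass": [3] * 20,
    "Name": [f"Passenger {i}" for i in range(1, 21)],
    "Sex": ["male", "female"] * 10,
})

subset = df.iloc[0:10, 3:5]  # first ten rows, column positions 3 and 4
print(subset)
```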
The output that is generated as a result is as follows:
Getting a Column:
We can also get columns by their header value. For multiple columns, we have to give a list of column names as the parameter. The code is as follows:
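Selecting a single column by its header might look like this (toy data, assumed columns):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Sex": ["female", "male"],
    "Age": [30, 25],
})

names = df["Name"]  # a single column selected by its header
print(names)
```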
The output is as follows:
Also, for multiple columns, the following modification has to be made:
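For several columns, the modification is to pass a list of headers; a sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Sex": ["female", "male"],
    "Age": [30, 25],
})

pair = df[["Name", "Sex"]]  # a list of headers selects several columns at once
print(pair)
```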
The output is as follows:
Renaming a Column:
Sometimes it is necessary to change a column name; it is also part of data cleaning. Sometimes a header name is so long that it is hard to memorize and to spell correctly. The rename function takes a dictionary with the existing name as the key and the new name as its value. The Titanic dataset has the column Sex, and we want to rename it to gender. To do so, the following is the method:
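The rename step might be sketched as follows (toy data standing in for the Titanic frame):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice"], "Sex": ["female"]})

# rename takes a dict of {existing name: new name}.
df = df.rename(columns={"Sex": "gender"})
print(df.columns)
```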
Now if we call df.columns, the following output will be generated:
As we can see, Sex has now been replaced by gender.
Fetching Data According to Conditions:
Under this heading, we will see how to extract only the desired information. For that purpose, we have to apply conditional operators like equals, greater than, less than, not equals, etc.
Let’s take an example from the Titanic dataset. We want to extract all the rows where the Sex is male. For that, we apply the condition as follows:
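The filter can be sketched with boolean masking on a toy frame (assumed columns):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 1],
})

# The comparison builds a boolean mask; df[mask] keeps only matching rows.
males = df[df["Sex"] == "male"]
print(males)
```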
The output is as follows:
The output generated is given in the above figure. We can see that every row contains male as the sex.
We can further extend the conditions by combining them with an operator. Now we will extract all the rows in which the sex is male and the person survived the incident.
Following is the code to implement such a scenario:
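A sketch of the combined condition; each condition is wrapped in parentheses and joined with `&`:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 1],
})

# Parentheses are required around each condition when combining with &.
male_survivors = df[(df["Sex"] == "male") & (df["Survived"] == 1)]
print(male_survivors)
print(len(male_survivors))  # counting the filtered rows answers "how many?"
```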
Now the output that is generated is as follows:
Now the question arises: why do we have to do this? The answer is that if the problem is to count how many males survived the Titanic incident, then after all that filtering we can count the number of rows, and from that we can answer what percentage of men survived the incident.
In machine learning, most algorithms will not accept string elements, so we have to convert them into numbers before feeding the data to an algorithm. Suppose we have to convert male into 1 and female into 0; we have to apply an if condition, and the following steps are involved:
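One common way to express the if condition is an inline if/else applied to every element; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"]})

# Inline if condition applied element by element: male -> 1, otherwise 0.
df["Sex"] = df["Sex"].apply(lambda s: 1 if s == "male" else 0)
print(df)
```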
The output that is generated is as follows:
Applying a User-Defined Function on the Data Set:
The functionality that we implemented using the if condition can also be performed by writing a user-defined function and then applying it to all of the elements of the data set:
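The same conversion with a named user-defined function passed to apply (the helper name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"]})

def encode_sex(value):
    # Hypothetical helper: map male -> 1, anything else -> 0.
    if value == "male":
        return 1
    return 0

df["Sex"] = df["Sex"].apply(encode_sex)
print(df)
```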
This will give the same result as the program using the if condition.
However, why do we have to use this? What is the reason for doing it this way when we could also write a loop? The reason is that these functions are much faster than the conventional way. The difference cannot be observed when the data is small, but it has a significant impact when the data is enormous; apply makes the operation easy and fast to execute.
Getting a Description of the Data:
Getting a description of the complete data before doing anything else is a good idea, and pandas makes it very easy. It gives an overall description of what type of data is present. The description can be obtained with the describe function.
The describe function tells the overall count of the data, the mean, the standard deviation, etc.
Following is the way to get the description of data:
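The describe call can be sketched on a small numeric frame (toy values; the real tutorial runs it on the full Titanic data):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26, 35], "Fare": [7.25, 71.28, 7.92, 53.1]})

# describe reports count, mean, std, min, quartiles, and max per column.
summary = df.describe()
print(summary)
```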
This information is not so important when the data is categorical, but when the data is continuous, as in the case of a company's opening and closing stock data over 5 years, this description makes the data much more straightforward to understand.
We can check whether there are any null elements in pandas. We can count the null elements and then decide whether to remove them or replace them with some appropriate value. Following is the code to check whether there are any null values in the data:
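The null check might be sketched as follows; chaining `isnull()` with `sum()` gives a per-column null count (toy data with assumed Titanic-style columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, np.nan, "C85"],
})

null_counts = df.isnull().sum()  # number of nulls in each column
print(null_counts)
```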
The output that is generated is as follows:
So here we can see there are 177 null Age values, and the Cabin column has 687 null rows. We can either drop the null values or replace them with some other value.
First, we will see how to drop the rows:
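A sketch of dropping rows with `dropna` (toy data; on the real Titanic frame this shrinks 891 rows to 183):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, np.nan, "C85"],
})

cleaned = df.dropna()  # drops every row containing at least one null value
print(len(df), len(cleaned))
```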
Now the output will be:
Now we can see that the length before dropping null values was 891 rows, and after removing them the length of the data frame becomes 183. This is bad: we removed a major portion of the data. In such a scenario, we can replace the values instead:
We drop the Cabin column because it has too many null values; then only the Age column is left with NA values, and we fill those NA values in the data frame with zero. After this, the output will be as follows:
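These two steps can be sketched as follows (toy data; column names assumed from the Titanic layout):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, np.nan, "C85"],
})

df = df.drop("Cabin", axis=1)  # axis=1: drop a column, not a row
df = df.fillna(0)              # remaining NA values (here in Age) become 0
print(df)
```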
Now there are no null values. Notice that we passed the axis parameter when dropping the column: the axis refers to rows and columns. Axis 0 refers to rows, and axis 1 refers to columns. It tells pandas along which axis the user wants to drop.
In the above example, we filled the NA values of all the data columns. To fill the NA values of one specific column, the following method is applied:
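A sketch of filling a single column: only Age is filled, and nulls elsewhere are left untouched (toy data, assumed columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Name": ["Alice", "Bob", None],
})

# Fill NA values only in the Age column; Name keeps its missing value.
df["Age"] = df["Age"].fillna(0)
print(df)
```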