Aritra Chakraborti
4 min read · Oct 25, 2020


Web Scraping ball by ball data from ESPN CricInfo

In this article, I demonstrate how you can scrape data from the ESPN Cricinfo website. Once you have gone through it, you should be able to apply the same approach to most static webpages of your choice.

The process is simple, and you can learn it in a day. There are three steps:

1. Plan what data you would like to extract from the website

2. Find the HTML elements in the webpage that hold the data

3. Write a few simple lines of code to extract the data to your system

Step 1: Plan what data you would like to extract from the website

In my case, I want to extract selected information from the commentary of a T20 IPL cricket match: runs, ball number, extras and so on. The picture below shows roughly how I want the data extracted from a T20 match commentary on ESPN Cricinfo to look.

Step1 picture

Step 2: Find the HTML elements in the webpage that hold the data

Let us open a match commentary: https://www.espncricinfo.com/series/8048/commentary/1216492/mumbai-indians-vs-chennai-super-kings-1st-match-indian-premier-league-2020-21

Each over in the webpage is presented as below:

As per the Step 1 picture, this is the information I would like to extract:

· Ball number of the over

· Result of the ball

· Bowler and Batsman names

Now, let us find the element that holds the ball number of an over. Follow the steps below:

Press F12 to open the browser's developer tools and see the HTML content of the webpage. The Elements tab shows all the HTML elements embedded in the page.

Now hover over the ball number on the page, then right-click and select 'Inspect'.

You will see a blue highlighted row in the Elements window.

This is our element of focus: <span class="match-comment-over">. If you expand it, you will see that it holds the ball number. The page uses this element for every ball of every over; you can verify that by inspecting any other ball.
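To see how BeautifulSoup picks out this element, here is a minimal sketch. The HTML snippet is hand-written for illustration and only mimics the real page: the wrapping <div> and the values 0.1 and 0.2 are assumptions, not copied from Cricinfo.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (assumption: made-up values).
sample_html = """
<div>
  <span class="match-comment-over">0.1</span>
  <span class="match-comment-over">0.2</span>
</div>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every <span> whose class attribute includes the given name.
over_spans = soup.find_all("span", class_="match-comment-over")

# Keep only the visible text of each element.
ball_numbers = [span.get_text(strip=True) for span in over_spans]
print(ball_numbers)  # ['0.1', '0.2']
```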

Step 3: Write a few simple lines of code to extract the data to your system

Now let us extract every ball of all the overs from the webpage to our local system. We need the BeautifulSoup package to extract the information from the HTML page, and the pandas package to store the extracted information in a DataFrame.

Here is how to begin:

1. Save the HTML webpage to your local machine

2. Open the saved file and parse it using BeautifulSoup

3. Iterate through the soup content to find all the occurrences of the element of our focus

4. Append the value of the element for every occurrence to a list

5. Store the extracted data in a pandas dataframe

Once you have run these steps, your list contains all the ball numbers.
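The five steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the exact code from the GitHub repo: the function names extract_ball_numbers and balls_to_dataframe, the column name ball and the file name commentary.html are placeholders chosen for illustration, and the saved page is assumed to be UTF-8 encoded.

```python
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def extract_ball_numbers(html_path):
    """Return the text of every <span class="match-comment-over"> in a saved page."""
    # Step 2: open the saved file and parse it (UTF-8 is an assumption).
    html = Path(html_path).read_text(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")

    # Steps 3 and 4: visit every occurrence and append its text to a list.
    balls = []
    for span in soup.find_all("span", class_="match-comment-over"):
        balls.append(span.get_text(strip=True))
    return balls


def balls_to_dataframe(balls):
    """Step 5: store the extracted values in a one-column DataFrame."""
    return pd.DataFrame({"ball": balls})
```

Calling extract_ball_numbers("commentary.html") on the page you saved in step 1 and passing the result to balls_to_dataframe should give one row per ball.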

You can repeat the same process to scrape the result of the ball and the short commentary line.

Once I have extracted the information, I store it in a pandas DataFrame by assigning each list as a column.
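Assigning lists as columns might look like the sketch below. The three lists hold made-up values standing in for scraped data, and the column names ball, result and short_commentary are illustrative assumptions. The length check is an extra safeguard: if one list is shorter than the others, the rows would no longer line up ball by ball.

```python
import pandas as pd

# Made-up values standing in for the scraped lists (assumption).
ball_numbers = ["0.1", "0.2", "0.3"]
results = ["1", "0", "4"]
short_commentary = [
    "Bowler A to Batsman B, 1 run",
    "Bowler A to Batsman C, no run",
    "Bowler A to Batsman C, FOUR runs",
]

# Every list must describe the same balls in the same order.
lengths = {len(ball_numbers), len(results), len(short_commentary)}
if len(lengths) != 1:
    raise ValueError("Scraped lists have different lengths; rows would not line up")

# Assign each list as a column of the DataFrame.
df = pd.DataFrame()
df["ball"] = ball_numbers
df["result"] = results
df["short_commentary"] = short_commentary

print(df)
```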

Once you have scraped all the information you need, you can organize it as per the Step 1 picture using common Python functions and the re package.
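As a rough illustration of that last step, the sketch below uses re to split a short commentary line into bowler, batsman and outcome, and a plain string split to separate the over from the ball. It assumes each line follows the pattern "Bowler to Batsman, outcome" and that ball numbers look like "19.6"; the helper names and sample strings are hypothetical, so adjust the pattern to whatever your scraped text actually looks like.

```python
import re

# Assumed layout of a short commentary line: "<bowler> to <batsman>, <outcome>".
COMMENT_PATTERN = re.compile(
    r"^(?P<bowler>.+?) to (?P<batsman>.+?), (?P<outcome>.+)$"
)


def parse_short_commentary(line):
    """Return bowler, batsman and outcome as a dict, or None if the line does not match."""
    match = COMMENT_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def split_ball_number(ball):
    """Split a ball number such as '19.6' into (over, ball within the over)."""
    over, ball_in_over = ball.split(".")
    return int(over), int(ball_in_over)


parsed = parse_short_commentary("Bowler A to Batsman B, FOUR runs")
print(parsed["bowler"], "|", parsed["batsman"], "|", parsed["outcome"])
print(split_ball_number("19.6"))  # (19, 6)
```

Returning None for lines that do not match keeps the script from crashing on commentary entries that are not actual deliveries.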

If you would like to see how to organize the extracted information as per the Step 1 picture, or have any other related question, let me know in the comments section.

You can find the updated ball-by-ball details of all IPL 2020 matches here: https://www.kaggle.com/aritrachakraborti/ipl-2020-ball-by-ball-data

You can find the entire code in my GitHub repo: https://github.com/aritra-explorer/My-Projects

Thanks for going through my article.
