The Convergence Blog

The Convergence is sponsored by Data-Mania
… it’s just another way we’re giving back to the data community from which we sprung.

The Convergence - An online community space that's dedicated to empowering operators in the data industry by providing news and education about evergreen strategies, late-breaking data & AI developments, and free or low-cost upskilling resources that you need to thrive as a leader in the data & AI space.

Creating Datasets: A Reproducible 9-Step Process & Coding Demo

Data-Mania Writer's Guild

Creating datasets is a foundational step in online research. This process is essential for uncovering hidden patterns and trends, quantifying results, and supporting informed decision-making.


Well-documented datasets enhance research reproducibility and foster collaboration among researchers and organizations. Moreover, a well-structured dataset can feed directly into advanced techniques like machine learning.


In essence, creating datasets is the key to extracting valuable, quantifiable insights from the extensive field of online information, contributing to the credibility and advancement of research efforts. In this article, we’ll help you master the art of crafting custom datasets efficiently. We’ll start with a strategy for creating datasets, and then follow it with a simple Python coding demo that shows you how to do it!


8 Strategic Steps For Planning and Creating Datasets for Online Research

If you want to create custom datasets for online research, you should start with the following 8 steps:

Step 1. Define Your Research Objectives

Clearly outline your research objectives before diving into creating datasets. Identify the specific insights you aim to gain, setting the foundation for a targeted approach.


By articulating the research goals, you not only set the direction for data collection but also ensure relevance and purpose. This clarity guides the selection of data points, sources, and methodologies, streamlining the entire research process.


Step 2. Identify Necessary Data Points

Pinpoint the essential data points needed to achieve your research goals. Categorize data types (numerical, categorical, or textual) to streamline the collection process.


Each data point should serve a specific purpose in addressing your research objectives. This keeps data gathering efficient and gives the dataset a clear overall structure.
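The categorization described above can be written down as a small schema before any collection starts. The sketch below is one possible way to do that, not a prescribed method. The `DataPoint` class and the column names (`rating`, `room_type`, `review_text`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Allowed data-type categories, matching the three named in the text.
DATA_TYPES = {"numerical", "categorical", "textual"}


@dataclass(frozen=True)
class DataPoint:
    """One planned column of the dataset and the objective it serves."""
    name: str
    data_type: str   # one of DATA_TYPES
    purpose: str     # which research question this column answers

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type!r}")


# Hypothetical plan for a study of guest satisfaction.
PLAN = [
    DataPoint("rating", "numerical", "quantify overall satisfaction"),
    DataPoint("room_type", "categorical", "compare listing categories"),
    DataPoint("review_text", "textual", "surface recurring complaints"),
]


def columns_by_type(plan):
    """Group planned column names by their data-type category."""
    grouped = {t: [] for t in sorted(DATA_TYPES)}
    for point in plan:
        grouped[point.data_type].append(point.name)
    return grouped
```

Calling `columns_by_type(PLAN)` returns the planned columns grouped by category, which makes it easy to spot a data point that has no stated purpose or an unexpected type.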


Step 3. Leverage Diverse Data Sources

To ensure a comprehensive dataset, utilize diverse sources. Combine manual collection, web scraping, and existing datasets, fostering a holistic perspective.

Web Scraping Techniques

Web scraping means programmatically and responsibly extracting relevant information from websites, using tools like BeautifulSoup or Scrapy.


BeautifulSoup is a Python library for parsing HTML, and Scrapy is a Python framework for crawling and scraping websites. Neither tool enforces compliance for you: it remains your responsibility to follow each website’s terms of use. Ethical extraction involves respecting website policies, avoiding excessive requests, and prioritizing user privacy.

For example, in gathering customer opinions from product reviews, web scraping enables the extraction of fine-grained insights, contributing diverse perspectives to the dataset. It’s essential to balance the power of web scraping with ethical practices, ensuring accurate, legal, and respectful acquisition of data for comprehensive analysis.
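Two of the practices mentioned above, respecting site policies and avoiding excessive requests, can be partly automated. The sketch below uses Python's standard `urllib.robotparser` module to check a robots.txt file and adds a pause between requests. The user-agent name, the sample rules, and the helper function names are assumptions made for illustration. Note that robots.txt is only one signal: a site's terms of use still need to be read separately.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-dataset-bot"  # hypothetical user-agent name


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def polite_fetch(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # spread requests out over time
        results.append(fetch(url))
    return results


# Example rules; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

In use, `fetch` would be whatever function downloads a page (for example, a thin wrapper around `requests.get`), and the delay would be tuned to what the site can reasonably tolerate.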


Manual Data Collection

Implement surveys, interviews, or observations for data not readily available online. Develop structured questionnaires to gather accurate and meaningful insights.


Step 4. Data Cleaning and Validation

Maintain data quality through rigorous cleaning and validation processes. This involves identifying and rectifying errors, missing values, and outliers that can compromise the accuracy of the dataset. The use of tools like Pandas in Python streamlines this process, providing functionalities to identify inconsistencies and handle data anomalies effectively. 

Cleaning ensures uniformity and reliability, preparing the dataset for accurate analysis. Validation, in turn, confirms that the data meets specific criteria, enhancing the overall integrity of the dataset.
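As a rough illustration of the cleaning and validation described above, the sketch below uses Pandas on a tiny made-up table. The column names, the sample rows, and the rule that ratings must fall between 1 and 5 are all assumptions; a real project would substitute its own criteria.

```python
import pandas as pd

# Hypothetical raw reviews with a duplicate, a missing value, and an outlier.
raw = pd.DataFrame({
    "reviewer": ["Ana", "Ana", "Ben", "Cy", "Dee"],
    "rating": [5.0, 5.0, None, 4.0, 50.0],
    "text": [" Great stay ", " Great stay ", "Fine", "Clean", "Nice"],
})


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, tidy text, and drop rows with unusable ratings."""
    cleaned = df.drop_duplicates().copy()
    cleaned["text"] = cleaned["text"].str.strip()
    cleaned = cleaned.dropna(subset=["rating"])
    # Ratings outside the assumed 1-5 scale are treated as entry errors.
    cleaned = cleaned[cleaned["rating"].between(1, 5)]
    return cleaned.reset_index(drop=True)


def validate_reviews(df: pd.DataFrame) -> None:
    """Raise ValueError if the cleaned data breaks a basic rule."""
    if df["rating"].isna().any():
        raise ValueError("missing ratings remain")
    if not df["rating"].between(1, 5).all():
        raise ValueError("rating outside the 1-5 scale")
    if df.duplicated().any():
        raise ValueError("duplicate rows remain")


clean = clean_reviews(raw)
validate_reviews(clean)  # passes silently when every rule holds
```

Keeping cleaning and validation in separate functions mirrors the distinction in the text: the first changes the data, the second only checks it.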


Step 5. Ensure Data Privacy and Compliance

Adhere to data privacy regulations and ethical standards. Anonymize sensitive information and comply with legal requirements, such as GDPR, when dealing with personal or proprietary data.


Adhering to these regulations protects the privacy rights of the people in your data and fosters ethical data practices. De-identification techniques, like masking, pseudonymization, or aggregation, safeguard identities while still allowing meaningful analysis. Compliance with legal requirements mitigates risks, ensuring organizations operate within the law.
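To make the idea of protecting identities more concrete, the sketch below replaces a direct identifier with a keyed hash and reports only aggregated ratings. It uses only the Python standard library. The records, field names, and key are hypothetical. Keep in mind that keyed hashing is pseudonymization rather than full anonymization, and pseudonymized data is generally still treated as personal data under GDPR.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: the key is kept secret and stored separately from the dataset.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a shortened keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def average_rating_by_city(rows):
    """Aggregate ratings per city so no individual row is reported."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["city"]].append(row["rating"])
    return {city: sum(vals) / len(vals) for city, vals in ratings.items()}


# Hypothetical records containing a direct identifier (email).
records = [
    {"email": "ana@example.com", "city": "Lisbon", "rating": 5},
    {"email": "ben@example.com", "city": "Lisbon", "rating": 3},
    {"email": "ana@example.com", "city": "Porto", "rating": 4},
]

# The email is dropped; the same person still maps to the same user_id.
safe_records = [
    {"user_id": pseudonymize(r["email"]), "city": r["city"], "rating": r["rating"]}
    for r in records
]
```

Because the hash is keyed, the same person can still be linked across rows for analysis, while someone without the key cannot simply hash a known email to find them in the dataset.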


Step 6. Optimal Dataset Size

Consider the size of your dataset based on your research objectives. Strike a balance between comprehensiveness and manageability. For instance, a study of climate change impact needs data that covers an extended timeframe, which calls for a larger dataset than a short-term study would.


Step 7. Adopt an Iterative Approach

View creating datasets as an iterative process. Refine your dataset as research progresses: actively seek and incorporate feedback, address limitations, and keep the dataset aligned with your evolving research objectives. Update information regularly so that your insights stay current.

Automating parts of this cycle helps streamline it, and transparent documentation of each revision facilitates collaboration and builds trust. Aim for enough complexity to give the dataset depth while keeping it usable. This continuous refinement keeps both the dataset and your research practices adaptable.


Step 8. Document Your Process

Thoroughly document the dataset creation process, including sources, cleaning procedures, and any transformations applied. By detailing each step, you provide a roadmap for reproducibility, enabling the replication of the study by peers or future researchers. This transparency also aids in troubleshooting potential issues and ensures the credibility of the dataset.
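One lightweight way to keep this documentation is a small machine-readable log saved next to the dataset. The sketch below is a minimal example using only the Python standard library. The function names, the field names, and the sample entries are hypothetical, and a real project may prefer a README or a more formal template.

```python
import json
from datetime import date


def build_dataset_log(name, sources, cleaning_steps, transformations):
    """Collect provenance details for a dataset into one dictionary."""
    return {
        "dataset": name,
        "created": date.today().isoformat(),
        "sources": list(sources),
        "cleaning_steps": list(cleaning_steps),
        "transformations": list(transformations),
    }


def save_dataset_log(log, path):
    """Write the log next to the dataset as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(log, handle, indent=2)


# Hypothetical entries; replace them with what was actually done.
log = build_dataset_log(
    name="airbnb_reviews_dataset.csv",
    sources=["public listing page (record the URL and access date here)"],
    cleaning_steps=["dropped duplicate reviews", "stripped whitespace"],
    transformations=["converted rating text to a number"],
)
```

Calling `save_dataset_log(log, "airbnb_reviews_dataset.log.json")` would then store the record alongside the CSV, so anyone who receives the data also receives the account of how it was produced.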

Creating Datasets Coding Demo: How to Create a Dataset of Airbnb Reviews with Python and BeautifulSoup

Now, let’s put these steps into practice with a real-life example: creating a custom dataset of Airbnb reviews using Python and BeautifulSoup. This guide offers a concise, step-by-step approach to gathering and organizing Airbnb reviews for insightful analysis.

Step 1: Install Required Libraries

Ensure Python is installed and install the necessary libraries.

pip install requests beautifulsoup4 pandas


Step 2: Import Libraries

In your Python script, import the required libraries.

import requests
from bs4 import BeautifulSoup
import pandas as pd


Step 3: Choose an Airbnb Listing

Select an Airbnb listing and copy its URL for review extraction.


Step 4: Send HTTP Request

Fetch the HTML content of the Airbnb listing using requests.

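url = 'paste-your-Airbnb-listing-URL-here'
# A timeout keeps the script from hanging on a slow connection
response = requests.get(url, timeout=10)
# Stop early if the server returned an error status
response.raise_for_status()
html = response.text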


Step 5: Parse HTML with BeautifulSoup

Parse the HTML content for easy navigation.

soup = BeautifulSoup(html, 'html.parser')


Step 6: Locate Review Elements

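Identify the HTML elements containing reviews by inspecting the page source. Typically, reviews sit within <div> tags with specific classes, and those class names differ from site to site and change over time. If a page loads its reviews with JavaScript, they may not appear in the HTML that requests returns.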
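Besides inspecting the page in a browser, one way to narrow down candidate review containers is to count how often each class appears on <div> tags, since review blocks tend to repeat. The sketch below shows the idea on a made-up HTML fragment. The markup and the class names in it are hypothetical, and the approach is only a heuristic.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical page fragment; real class names will differ.
SAMPLE_PAGE = """
<div class="page">
  <div class="review-card"><span class="name">Ana</span></div>
  <div class="review-card"><span class="name">Ben</span></div>
  <div class="footer">Contact</div>
</div>
"""


def count_div_classes(html: str) -> Counter:
    """Count how often each CSS class appears on <div> tags."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for div in soup.find_all("div"):
        # A tag without a class attribute contributes nothing.
        counts.update(div.get("class", []))
    return counts


class_counts = count_div_classes(SAMPLE_PAGE)
print(class_counts.most_common(3))
```

A class that repeats once per review is a good candidate to pass to `find_all` in the next step, though it is worth confirming by looking at one matching element by hand.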


Step 7: Extract Review Details

Loop through review elements, extracting pertinent information like reviewer name, rating, date, and text.

reviews = []
# Replace the placeholder class names with the ones you found in Step 6
for review in soup.find_all('div', class_='your-review-class'):
    # Note: find() returns None when an element is missing, which raises AttributeError here
    reviewer = review.find('span', class_='reviewer-class').get_text(strip=True)
    rating = review.find('span', class_='rating-class').get_text(strip=True)
    date = review.find('span', class_='date-class').get_text(strip=True)
    text = review.find('div', class_='text-class').get_text(strip=True)
    reviews.append({'Reviewer': reviewer, 'Rating': rating, 'Date': date, 'Text': text})
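Real pages are rarely uniform: a review may lack a rating or a date, and `find()` returns `None` in that case, so calling `.get_text()` on the result fails. The sketch below is one defensive variant of the loop above, run against a made-up HTML fragment. The helper names and the sample markup are hypothetical, and the class names are the same placeholders used in this step.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second review has no rating element.
REVIEW_HTML = """
<div class="your-review-class">
  <span class="reviewer-class"> Ana </span>
  <span class="rating-class">5</span>
  <div class="text-class">Great stay</div>
</div>
<div class="your-review-class">
  <span class="reviewer-class">Ben</span>
  <div class="text-class">Fine</div>
</div>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first match, or None if absent."""
    element = parent.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


def extract_reviews(html):
    """Build one dictionary per review block, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="your-review-class"):
        rows.append({
            "Reviewer": text_or_none(block, "span", "reviewer-class"),
            "Rating": text_or_none(block, "span", "rating-class"),
            "Text": text_or_none(block, "div", "text-class"),
        })
    return rows


sample_reviews = extract_reviews(REVIEW_HTML)
```

Missing fields come through as `None`, which Pandas later treats as missing values, so incomplete reviews can be handled during cleaning instead of crashing the scrape.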


Step 8: Create a DataFrame with Pandas

Transform extracted data into a Pandas DataFrame for easy manipulation.

df = pd.DataFrame(reviews)


Step 9: Save the Dataset

Save your dataset to a CSV file for future analysis.

df.to_csv('airbnb_reviews_dataset.csv', index=False)



This marks the end of our creating datasets tutorial. You’ve now created a dataset of Airbnb reviews using Python and BeautifulSoup. This structured dataset is ready for in-depth analysis, which can provide valuable insights into customer sentiment. Expand your knowledge by applying these steps to different Airbnb listings and looking for patterns across their reviews.

Pro-tip: If you liked this post, be sure to check out our 3 Showstopping Data Analytics Use Cases To Uplevel Your Startup Profit-Margins.



© Data-Mania, 2012 - 2024+, All Rights Reserved - Terms & Conditions - Privacy Policy | PRODUCTS PROTECTED BY COPYSCAPE
