When working with data sets, there may be instances where the same text string appears multiple times. This can happen for a variety of reasons, such as data entry errors or simply because the same information is needed in multiple places. While duplicate text strings may not always be a problem, they can sometimes make data analysis more difficult.
Baserow is an open-source no-code database tool that empowers you to create database-driven applications without having significant technical expertise.
In this tutorial, we’ll learn how to find duplicate text strings using a database filled with names as an example. We’ll assume that the database has a single text field called Name that contains the full name of each person. We’ll cover the following steps:
This can be applied to various scenarios where you need to find—and possibly remove—duplicates. For instance, you can use it to identify duplicate customer records in a customer relationship management (CRM) system, duplicate product entries in an e-commerce database, or duplicate employee records in a human resource management (HRM) system.
To complete this tutorial, you’ll need the following:
We’ll explore two main methods for identifying duplicates in no-code databases:
Let’s dive in!
Log in to your existing Baserow account or create a new account if you don’t have one already.
Once you log in, you can access the dashboard to add workspaces and databases. You can work with an existing workspace, database, or table, or set up a new workspace from scratch.
In this tutorial, we will set up a new workspace and create a database from scratch.
To create a new workspace, click on the + Create workspace button. Then, click on Add workspace.
If you’re just starting with Baserow, we recommend you read our support article on how to create a workspace and how to add collaborators to a workspace.
Now that the workspace is set up, you can either create a database from scratch or choose a template from our template library. Let’s add a database to the workspace.
Click on the + Create new button to start creating a new database.
Select the “Database” option to create a database from scratch.
Next, add a new table or click on a table within the existing database to open it. A default table will be added to the newly created database.
That’s it! We’ve created a new database. Now we can populate, manage, and organize data in Baserow.
Let’s say we have a dataset of people’s names with the following data:
Name |
---|
Alice Baker |
Bob Jones |
Charlie Brown |
Alice Baker |
Bob Jones |
We want to identify and get rid of duplicate rows, specifically those that have the same Name entries.
There are a few different ways to identify duplicate text strings.
If you only have a small number of rows, you can manually sort your data and remove the duplicates. Baserow’s interface is intuitive, allowing you to design databases, set up relations, and manage data effortlessly.
We can calculate the footer aggregate value for all the rows present in a grid view. In this view, click on the footer bar located beneath the Name field at the bottom of the table and choose Unique as the aggregation function.
This resulting value where the Name field is not unique is then displayed in the footer cells. However, referencing these aggregated values in the table is not possible.
Note: The Unique footer aggregation treats each value as a strict string comparison. This means that any differences—like trailing or extra spaces—will be considered different entries. For example,
Alice_Baker
is different fromAlice__Baker
orAlice Baker_
.
That’s where manual sorting steps in.
To manually sort and remove duplicates:
While this is a simple and straightforward method, it can be time-consuming if you have a large dataset.
If you have a large number of rows, or if you want to find and remove duplicates regularly, you can use an automation tool. There are several automation tools available, such as Zapier, Make, and n8n.
Once you have removed the duplicates from your database, you can take steps to prevent duplicates from being added in the future. In this case, we will use Zapier to check for duplicates before a new row is added or updated.
If you are using Make, use Webhooks to track row activities or trigger the workflow when a row is created.
Zapier acts as a bridge between different apps, allowing you to create automated workflows called “Zaps.” By integrating Zapier with Baserow, you can streamline your data management and automate tasks across various platforms.
In this case, let’s build a workflow that runs when a row is created or updated in the table and checks if a record already exists:
Create a new boolean field in the table called IsDuplicated?
Create a Zap in Zapier.
Select Baserow as the trigger app and set the event as “Row Created or Updated” to trigger when a new row is created or an existing one is updated.
Set up the trigger, authenticate with Baserow database token, and enter the table ID.
If you run into any issues integration Baserow and Zapier, refer to our support article on how to sync your Baserow database with Zapier.
Add a filter to check the newly created or updated row against the data in the table, to only proceed when a similar record does not exist. Zapier will test these conditions against the data received.
Select Baserow as the action app and set the event as “Update Row in Baserow” to update the matching existing row in the table. The row ID to be updated will be the existing record.
If the record exists, it updates a row by adding “Yes” to the IsDuplicated? field.
Using these steps, we can determine whether the row already exists and update the data accordingly.
For this scenario, we will use Make to create new rows, without the duplicates.
Make is a workflow automation tool that enables you to automate tasks by connecting various applications, services, and systems through a visual interface.
Using Make and Baserow can streamline your workflow and enhance data management. We can use Make to automate workflows triggered by specific events in Baserow or integrate data seamlessly.
Let’s build a similar workflow:
We will work with two tables:
Make a copy of the Source Table to create the Destination Table. Then, delete all the data in the Destination Table, leaving behind an empty table with a similar structure to the Source Table.
Create a new scenario in Make. This will serve as the foundation for our automation process.
Add a Baserow action to get all rows in the Source Table using the operation “List Rows”.
If you run into any issues integration Baserow and Make, refer to our support article on how to sync your Baserow database with Make.
Configure the Iterator module to iterate over the rows in the Source Table by passing the array from the Baserow module.
Add another Baserow action to make an API call to check if the record already exists in the Destination Table.
Select the destination table and map the fields from the Source Table to the corresponding fields in the Destination Table.
Add the Router module to create a new row to add the record to the Destination Table if the record exists, otherwise, go to the next record.
Filtering out duplicate rows from a dataset can be useful for a variety of tasks, such as cleaning up data, identifying potential errors, and ensuring data integrity.
Whether with automation tools or by manually sorting and removing duplicates, you can ensure that your data remains clean and accurate. Once your data is clean, you can adjust the automation to prevent duplicate records from being inserted again.
The following articles may also be helpful:
If you have any questions regarding this tutorial, feel free to reach out to ask for help in the Baserow community.