Functionalities in BI Leaders, chapter 1: REGEX

  • Articles
  • Automation & BI

¿What is REGEX?

Regex stands for Regular Expressions, and in practice, it is a sequence of characters and symbols that become search patterns for other text. It is a technique developed in theoretical computer science and formal language theory. Generally, these patterns are used as functionalities by algorithms within programming languages, operating systems, ETL tools, Business Intelligence, Databases, Office Automation and many more.

¿When do we need REGEX?

Particularly in Business Intelligence, it is often necessary to perform ETL (Extract Transform and Load) operations, to clean, format and validate the data provided to us. This is when we have the possibility of using Regex to solve problems in a fast, efficient and even elegant way.

Examples of use:

  • Email format validation
  • RUT format validation
  • Validate phone number formatting
  • Unify formatting in a date column
  • Searching and replacing content in a specific position in a text
  • Selecting a subset of files in a folder
  • Extracting content from HTML, JSON, XML and other formats

Illustration 1: Example of Regular Expression to validate a date format of the type dd-MM-yyyy

Practical case – Problem

Many times data does not come ready and comfortable to use with a BI visualization tool, a recurrent case is when the output data of a survey software delivers coded answers and questions. Within this context, Regex becomes very useful to search for a particular value (coded answer) within another text dictionary of questions with their possible output values (the alternatives).

Illustration 2: Example of coded survey responses

Practical case – Solution

The solution is to find a Regex pattern that allows to identify and extract a value searched by a key, within a dictionary in JSON format (it will work for any question in the survey!).

In this case, a pattern that works for us is 2:”([^”]+)”, where the number 2 in the example, will subsequently be the variable of the coded answers of each question, the characters in black are literal and that format is searched in the dictionary. The parentheses () in Regex indicate a group, which is a portion of the text that matches the pattern. Finally, the text [^”]+ tells us that it accepts any character 1 or more times that is different from quotes (“).

Illustration 3: Regular Expression Testing

There are sites like https://regex101.com/ that help you write a Regex, understand its syntax and test it online!

The expected result would be something like this:

Power BI Implementation

Power BI does not have a native implementation of Regex, so you need to insert Python or R scripts at the Power Query level, in addition to installing the environment of the chosen language previously.

  1. Go to Power Query Editor, then in the Transform tab, click on Run Python Script.
  2. Apply the Python RE library function, re.search(pattern,string).group(g), which searches for the pattern in the string and then returns the group g section of the Regex.

Illustration 4: Python script in Power Query

Tableau Implementation

  1. Generate a calculated field which we will call it, final_response
  2. Apply the function REGEX_EXTRACT_NTH(string,pattern,index), which returns the group n° (index) of the string that matches the pattern.

Illustration 5: Field calculated in Tableau

Final thoughts

Using the leading tools in the Business Intelligence market, such as Power BI and Tableau, we have been able to solve a recurring problem with the help of Regex, in addition to knowing its power and the various uses in which it can help when we face ETL problems, at the beginning of a visualization project.