Experience has taught me that regular expressions are the Swiss Army knife of the developer’s toolbox, and there's almost always a better regular expression for the job at hand. Developing a good regular expression tends to be iterative, and the quality and reliability increase the more you feed it new, interesting data that includes edge cases.
A regular expression that works is often good enough. If your data is highly predictable, optimizing a regex may be unnecessary. However, the more you use a regex as part of a wider system, at scale, or across unreliable data sets, the more you should ensure it is reliable, resilient, and performant.
Regex can seem complicated at first, but the system is logical and predictable once you can understand it. However, reverse-engineering a complex regular expression isn’t much fun.
In this blog post, you'll learn how to put together a regex for an important use case: extracting name-value pairs from a log line, which is often an important part of managing your logs. Logs are a good example of when you need to have strong regular expressions because typically, logs are part of a wider system (ideally, you have logs for your entire stack), need to scale with your application, and are often inconsistent. So let’s take a look at some regexes—on the way, you’ll hopefully learn to strengthen other regexes you work with.
Parsing log lines with regex
This use case is based on a real-world requirement: helping a customer parse their logs in New Relic. New Relic has a powerful data parsing mechanism that lets you ingest raw log data and parse it into individual, semantically meaningful columns.
Here are the requirements for the real-world use case:
- The log data contains multiple name-value pairs as well as other data.
- The pairs appear in the format attr=value.
- The values can contain white space.
- Not all name-value pairs need to be collected.
- Some pairs might be present in all log lines, but some might not.
- The pairs may appear in any order.
Here's an example log line:
my favourite pizza=ham and pineapple drink=lime and lemonade venue=london name=james buchanan
For this example data, let’s say you want to extract the pizza, drink, and name fields. However, you don’t want to extract the venue data or any other data in the log line. To make things more complicated, what if you want to collect this data from many log lines, and the data isn’t always presented consistently? What regular expression will capture those values for you?
TL;DR, here's the regex
Maybe you arrived here via Google and just want to copy and paste the rule to see if it works for you. Here it is: a regular expression for extracting name-value pairs separated by the = sign:
(?:^|\s+)(?=.*?attrname=(?<attrname>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))?
And here’s the Grok version:
(?:^|\s+)(?=%{DATA}attrname=(?<attrname>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?
For these rules:
- Not all of the key-value pairs have to be present. The rule still functions on key-value pairs that are present but won't break if some of the key-value pairs aren’t present in a line.
- The order of the key-value pairs does not matter.
- White space is allowed within the value.
To learn more about how the rule works, read on.
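If you want to try the plain-regex version outside New Relic, here is a minimal Python sketch. It makes a couple of assumptions: Python's re module writes named groups as (?P<name>...) rather than (?<name>...), so the rule is adapted accordingly, and build_rule is a hypothetical helper name used only for illustration.

```python
import re

def build_rule(attr):
    """Build the TL;DR rule for one attribute (hypothetical helper)."""
    # Python's re module needs (?P<name>...) for named groups.
    # attr is assumed to be a plain word such as "pizza".
    literal = re.escape(attr)
    return re.compile(
        rf"(?:^|\s+)(?=.*?{literal}=(?P<{attr}>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))?"
    )

line = ("my favourite pizza=ham and pineapple drink=lime and lemonade "
        "venue=london name=james buchanan")

pizza = build_rule("pizza").search(line).group("pizza")
name = build_rule("name").search(line).group("name")
# "topping" is not in the line, so the optional lookahead is skipped.
missing = build_rule("topping").search(line).group("topping")

print(pizza)
print(name)
print(missing)
```

The rule still matches when the attribute is absent; the named group is simply empty (None in Python).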
Parsing with Grok
This discussion will focus on the Grok version of the rule because it's a little cleaner. Also, parsing rules in New Relic are written in Grok, which allows you to use existing named Grok patterns. Because Grok is based on regular expressions, any valid regular expression is also a valid Grok expression. If you’re not using Grok, just use the standard regular expression version provided in the previous section.
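To see how the two versions relate, the sketch below expands the four Grok patterns used in this post into plain regex. The definitions are the ones from the widely used Logstash pattern set, and it is an assumption that New Relic's definitions match them. The grok_to_regex helper is hypothetical, and it also rewrites named groups into the (?P<name>...) form that Python's re module expects.

```python
import re

# Definitions of the four Grok patterns used in this post
# (taken from the common Logstash pattern set; assumed to apply here).
GROK = {
    "DATA": r".*?",
    "GREEDYDATA": r".*",
    "WORD": r"\b\w+\b",
    "SPACE": r"\s*",
}

def grok_to_regex(rule):
    """Expand %{NAME} references and convert named groups to Python syntax."""
    expanded = re.sub(r"%\{(\w+)\}", lambda m: GROK[m.group(1)], rule)
    # (?<name>...) becomes (?P<name>...); lookbehinds are left untouched.
    return re.sub(r"\(\?<(\w+)>", r"(?P<\1>", expanded)

grok_rule = r"(?:^|\s+)(?=%{DATA}attrname=(?<attrname>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?"
python_rule = grok_to_regex(grok_rule)
print(python_rule)
```

Apart from the (?P<...>) spelling, the expanded result is the same as the plain regular expression in the TL;DR section.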
Starting with a fragile parsing rule
Let’s start with some data to test the regex. I love both beer and pizza, and even have my own wood-fired oven, so here’s a pizza-themed data set:
1: my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan
2: my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple
3: my favourite name=james buchanan pizza=ham and pineapple drink=lime and lemonade
4: my favourite pizza=ham and pineapple drink=lime and lemonade
5: my favourite name=james buchanan pizza=ham and pineapple foo=bar drink=lime and lemonade
6: my favourite drink=lime and lemonade
You’ll see that this data set has the key-value pairs in different orders, various amounts of whitespace, and even different numbers of key-value pairs.
In this example data, the key and value in each pair are joined by an equals sign (=), such as drink=coke. Let's say you want to extract three values: pizza, drink, and name.
If the data always appears like line one, you could write a Grok parsing rule like this that extracts each of the values:
pizza=(?<pizza>%{DATA})drink=(?<drink>%{DATA})name=(?<name>%{GREEDYDATA})
This works, but the rule is fragile. It requires the values to always be in the same order. If any values are missing or there is any additional data, the entire rule fails. This is bad. You don’t want data to go missing because it doesn’t quite match. And even if you're pretty sure your data is consistent, can you ever be 100% sure?
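The fragility is easy to reproduce. The following sketch translates the rule into Python's re syntax (%{DATA} becomes .*?, %{GREEDYDATA} becomes .*, and named groups use (?P<name>...)), then runs it against lines one and two of the sample data. Treat it as an approximation of what the Grok engine does, not as its actual implementation.

```python
import re

# Python translation of the fragile Grok rule.
fragile = re.compile(r"pizza=(?P<pizza>.*?)drink=(?P<drink>.*?)name=(?P<name>.*)")

line1 = "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan"
line2 = "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple"

in_order = fragile.search(line1)
reordered = fragile.search(line2)

print(in_order.groupdict())
print(reordered)  # same pairs in a different order: the rule does not match
```

In this sketch the pizza and drink captures also keep the space that precedes the next key, which is another sign that the rule needs tighter boundaries.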
If you want to try this out yourself with the built-in logs parsing test tool in New Relic, go to Logs > Parsing > Create parsing rule. You can paste in an example log line along with the rule to see the output. Alternatively, you can try the Grok rule out using this Grok tool.
Using a lookahead rule
So how can you make this parsing rule more robust? Using a lookahead comes to the rescue here. In order to target a single key-value pair, you need to know two things: when to start the match and when to end it. Let's work through this step-by-step.
Find the value pair
Take the pizza value pair as an example. It always starts with pizza= and, since that prefix is consistent, you can look ahead and capture the text like this:
(?=%{DATA}pizza=(?<pizza>.*))
This will return the following:
pizza: ham and pineapple drink=lime and lemonade name=james buchanan
The Grok pattern DATA is equivalent to the expression .*? (you can find a useful list of Grok patterns here). This lookahead rule finds anything after the string pizza= and captures it into a field called pizza. While this works, the drink and name values are captured, too. So the rule needs to be restricted to capture characters and whitespace up to the next name-value pair only.
Capture just the attribute you need
To capture just the pizza value, you can use another lookahead. The following rule captures any character that is not an equals sign, using the pattern [^=]+ with ? appended to make it non-greedy. A nested lookahead then requires the capture to be followed by whitespace character(s), a word, and another equals sign. Here’s the rule:
(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=))))
This returns the following for #1: pizza:ham and pineapple ✅
However, it returns the following against #2: no match! ❌
Much better...but wait! Line two failed to match the pizza. Can you see why?
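If you'd like to reproduce both results before reading the explanation, this Python sketch runs a translation of the rule (%{WORD} is assumed to expand to \b\w+\b) against lines one and two.

```python
import re

line1 = "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan"
line2 = "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple"

# The capture must be followed by whitespace, a word, and "=".
bounded = re.compile(r"(?=.*?pizza=(?P<pizza>[^=]+?(?=(?:\s+\b\w+\b=))))")

first = bounded.search(line1)
second = bounded.search(line2)

print(first.group("pizza"))
print(second)  # no match for line two
```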
The pattern matches data followed by another name-value pair, but in line two the pizza pair is the last item on the line, so no further name-value pair follows it. The capture needs to be followed by either another name-value pair or the end of the line, which is signified by $. It’s also important to consider trailing white space, which the non-greedy %{SPACE}? pattern discards.
Here’s the updated pattern:
(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))
Returns against #1: pizza:ham and pineapple ✅
Returns against #2: pizza:ham and pineapple ✅
This is much better and more reliable. If you just want to capture one field, you’re finished. However, with logs, you’ll often need to capture multiple fields.
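As a quick check of the end-of-line alternative, this sketch runs a Python translation of the updated pattern against line two, plus a copy of line two with trailing spaces added. The padded line is an invented test input, not part of the sample data.

```python
import re

# %{SPACE}? is assumed to expand to the non-greedy \s*?
pizza_rule = re.compile(r"(?=.*?pizza=(?P<pizza>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))")

line2 = "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple"
padded = line2 + "   "  # hypothetical input with trailing white space

at_end = pizza_rule.search(line2).group("pizza")
trimmed = pizza_rule.search(padded).group("pizza")

print(repr(at_end))
print(repr(trimmed))  # trailing spaces are not part of the capture
```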
Capture multiple fields in logs
You can chain multiple expressions together to capture other values by repeating the same expression and changing the value names as needed:
(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))(?=%{DATA}drink=(?<drink>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))(?=%{DATA}name=(?<name>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))
This returns the following:
Line #1: pizza:ham and pineapple, name:james buchanan and drink:lime and lemonade ✅
Line #2: same as #1 ✅
Line #3: same as #1 ✅
Line #4: no match! ❌
This works for lines one through three of the sample data. The rule now returns matches regardless of the order of key-value pairs. Unfortunately, it fails for line four of the input:
4: my favourite pizza=ham and pineapple drink=lime and lemonade
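The line-four failure can be reproduced with a short Python sketch. The lookahead helper is a hypothetical name, and the pattern is a translation into Python's re syntax. It assumes each attribute name is a plain word.

```python
import re

def lookahead(attr):
    # One required lookahead per attribute (hypothetical helper).
    return rf"(?=.*?{attr}=(?P<{attr}>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))"

chained = re.compile("".join(lookahead(a) for a in ("pizza", "drink", "name")))

line3 = "my favourite name=james buchanan pizza=ham and pineapple drink=lime and lemonade"
line4 = "my favourite pizza=ham and pineapple drink=lime and lemonade"

three_fields = chained.search(line3)
two_fields = chained.search(line4)

print(three_fields.groupdict())
print(two_fields)  # nothing is returned, not even pizza and drink
```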
You may have noticed that line four is missing the name key. The regex rule requires name to be present or the whole pattern fails. This is a common failure that often goes unnoticed when using regexes with data sets. As you can imagine, these kinds of problems can be very tricky to deal with because it looks like the rule is working correctly, but it isn't gathering critical information. You can fix this by making each pattern optional. To do so, add ? to the end of each lookahead expression.
This is the generalized pattern for each key-value pair:
(?=%{DATA}attrname=(?<attrname>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?
Let’s try this regex against the data. The following expression includes the pattern three times, one for each attribute that needs to be captured (name, pizza, and drink):
(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}drink=(?<drink>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}name=(?<name>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?
This returns:
Line #1: pizza:ham and pineapple, name:james buchanan and drink:lime and lemonade ✅
Line #2: same as #1 ✅
Line #3: same as #1 ✅
Line #4: pizza:ham and pineapple, drink:lime and lemonade ✅
Line #5: same as #1 ✅
Line #6: drink:lime and lemonade ✅
The rule correctly matches all test input data in any order and continues to work for missing fields.
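To check all six lines in one go, here is a Python sketch of the optional version. It assumes the regex engine keeps captures made inside an optional lookahead. Python's re module does, but some engines (JavaScript, for example) discard them, so test in your own environment. The optional_lookahead helper is a hypothetical name.

```python
import re

def optional_lookahead(attr):
    # Same lookahead as before, with a trailing ? to make it optional.
    return rf"(?=.*?{attr}=(?P<{attr}>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))?"

rule = re.compile("".join(optional_lookahead(a) for a in ("pizza", "drink", "name")))

lines = [
    "my favourite pizza=ham and pineapple drink=lime and lemonade name=james buchanan",
    "my favourite drink=lime and lemonade name=james buchanan pizza=ham and pineapple",
    "my favourite name=james buchanan pizza=ham and pineapple drink=lime and lemonade",
    "my favourite pizza=ham and pineapple drink=lime and lemonade",
    "my favourite name=james buchanan pizza=ham and pineapple foo=bar drink=lime and lemonade",
    "my favourite drink=lime and lemonade",
]

# Fields that are absent from a line come back as None.
results = [rule.search(line).groupdict() for line in lines]
for number, fields in enumerate(results, start=1):
    print(number, fields)
```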
Regex lookaheads performance
Lookaheads do have additional performance overhead, so if your data is reliably consistent, you may be able to use a simpler, more performant rule that doesn’t have lookaheads. You can also make this rule much more performant by adding the prefix (?:^|\s+) at the beginning of your rule:
(?:^|\s+)(?=%{DATA}pizza=(?<pizza>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}drink=(?<drink>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?(?=%{DATA}name=(?<name>[^=]+?(?=(?:\s+%{WORD}=|%{SPACE}?$))))?
This small change ensures that lookaheads happen only at the beginning of a line or when there is a space, not with every character. This stops the rule from using lookaheads where they aren’t needed.
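If you want to measure the effect yourself, the sketch below is one possible starting point in Python. It checks that the prefixed and unprefixed rules capture the same fields and then times both. It makes no claim about which is faster: timings depend on the regex engine, how the rule is applied, and your data, and Python's re module may not behave like the Grok engine in New Relic.

```python
import re
import timeit

body = "".join(
    rf"(?=.*?{a}=(?P<{a}>[^=]+?(?=(?:\s+\b\w+\b=|\s*?$))))?"
    for a in ("pizza", "drink", "name")
)
plain = re.compile(body)
prefixed = re.compile(r"(?:^|\s+)" + body)

line = "my favourite name=james buchanan pizza=ham and pineapple foo=bar drink=lime and lemonade"

# The prefix should not change what is captured.
same = plain.search(line).groupdict() == prefixed.search(line).groupdict()
print(same)

# Timings vary by engine and input, so measure on your own data.
for label, rule in (("plain", plain), ("prefixed", prefixed)):
    print(label, timeit.timeit(lambda: rule.search(line), number=10_000))
```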
Conclusion
Hopefully, you have a better understanding of how this rule works and have a good sense of how you can iteratively improve a rule to make it more reliable. There is always a better regular expression out there if you put enough thought into it. Good luck finding one that’s even more effective for your use case!
Next steps
Interested in learning more about managing your logs in New Relic? Check out the logs documentation.
Just getting started with logs? Learn more about log management.
You can start accessing your logs in just a few minutes with a free New Relic account. Your account includes 100 GB/month of free data ingest, one free full-access user, and unlimited free basic users.