Regular expression help

Dance Commander · January 2010

I'm trying to make a regular expression that will identify subsequent lines in a data file that have the same date. The columns are tab delimited. Here's an example of the data, followed by my regular expression:

1955	05/08/2004	129	0	75.1	68.6	71.8	90.5	227.	7.81	ENE	23.8	73.8	74.0	74.4	0.3	
1957	05/09/2004	130	0	79.9	68.4	74.5	87.6	484.	9.55	E	23.8	75.8	75.1	74.6	0

([0-9]{2}/[0-9]{2}/[0-9]{4}).*$\n[^\t]+\t\1

It matches the date, saves it via the capturing parentheses, matches everything else on the line, matches the newline marker, matches whatever the first column of the subsequent line is, and then looks for the backreferenced date: \1. This regexp works perfectly fine without the \1 in it, although obviously it just matches every line except the last. As soon as I put the \1 in there, though, Textpad, which uses the POSIX engine, complains that the regexp isn't valid.
Anyone have some idea why in the heck this would be?
I'm also open to any comments on the quality/brevity of the regexp here--I'm still learning.
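For reference, here is a minimal sketch of how the pattern behaves in an engine that does allow a backreference after a newline. It uses Python's `re` module as a stand-in for TextPad (an assumption, since TextPad's engine is the one under question), and the rows are made-up, shortened versions of the sample data.

```python
import re

# Pattern from the post, unchanged. re.MULTILINE lets `$` match at the end
# of each line instead of only at the end of the whole string.
PATTERN = re.compile(
    r"([0-9]{2}/[0-9]{2}/[0-9]{4}).*$\n[^\t]+\t\1",
    re.MULTILINE,
)

# Hypothetical, shortened rows: the first two share a date, the third does not.
data = (
    "1955\t05/08/2004\t129\t0\t75.1\n"
    "1956\t05/08/2004\t129\t0\t79.9\n"
    "1957\t05/09/2004\t130\t0\t80.2\n"
)

# Collect the captured date for every pair of adjacent lines that matched.
repeated_dates = [m.group(1) for m in PATTERN.finditer(data)]
print(repeated_dates)
```

Only the first pair of lines matches here, so the pattern itself appears sound; the failure looks specific to the engine.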

Vrtra Theory · January 2010

Have you tried testing a very simple regex with a backreference (the "\1")? I ask because POSIX only specifies backreferences for basic regular expressions, not for extended ones, so an engine can be POSIX-compliant in extended mode without them, and many extended RE implementations leave them out, AFAIK.

If a simple expression like "(\d)\1" works fine, then I'm not sure what the problem is; your regex looks OK to me.
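A quick way to run that sanity check outside the editor, sketched with Python's `re` module (an assumption; any engine with backreference support would do). The test strings are just years borrowed from the sample data.

```python
import re

# `(\d)\1` matches any digit immediately followed by the same digit.
doubled = re.compile(r"(\d)\1")

hit = doubled.search("2004")   # contains the doubled digit "00"
miss = doubled.search("1957")  # no digit repeats back to back

print(hit.group(0) if hit else None)
print(miss)
```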

localh77 · January 2010

Hmm, interesting. We should be able to figure this out. I'm not exactly a regexp expert, but what is the dollar sign in there for? When I take it out, it matches fine for me. And when I leave it in, it doesn't match, although I don't get an error.

Anyway, if you're still having problems, I would just re-work it to not use a backreference. Assuming that you're looping through a bunch of lines, something like this:

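# Assumes $lines already holds the whole file as one string.
my $previous_date = '';
foreach my $line (split(/\n/, $lines))
{
	# Capture the date from the second tab-delimited column.
	if ($line =~ /^[^\t]+\t([0-9]{2}\/[0-9]{2}\/[0-9]{4})/)
	{
		if ($previous_date eq $1)
		{
			print "match: $1\n";
		}
		$previous_date = $1;
	}
}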
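The same line-by-line idea can be sketched in Python for anyone without Perl handy. The function name `repeated_dates` and the sample rows are made up for illustration; the logic mirrors the loop above, remembering the previous line's date instead of using a backreference.

```python
import re

# First column, a tab, then the date captured from the second column.
DATE_IN_SECOND_COLUMN = re.compile(r"^[^\t]+\t([0-9]{2}/[0-9]{2}/[0-9]{4})")


def repeated_dates(text):
    """Return the date of every line whose date equals the previous line's."""
    matches = []
    previous_date = None
    for line in text.split("\n"):
        found = DATE_IN_SECOND_COLUMN.match(line)
        if found is None:
            continue  # blank or malformed line; previous_date is left alone
        if found.group(1) == previous_date:
            matches.append(found.group(1))
        previous_date = found.group(1)
    return matches


# Hypothetical, shortened rows based on the sample data.
sample = (
    "1955\t05/08/2004\t129\n"
    "1956\t05/08/2004\t129\n"
    "1957\t05/09/2004\t130\n"
)
print(repeated_dates(sample))
```

Because each line is compared with the one before it, a run of three or more lines with the same date reports every line after the first.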

ronya · January 2010

A wise sensei linked me this once.

Baron Dirigible · January 2010

Unless I'm missing something, your regex never actually matches the first column of the subsequent line. I don't have a copy of TextPad, but I tested the following regex using TextWrangler, and it worked fine:

^[\d]+\t([\d\/]+)\t.+\n.+?\t\1.+

The one problem I can see with this implementation is it will only match two successive lines — which I guess could be enough for what you need, but it's a slow Sunday at work, so I'm going to see how feasible it is to match an arbitrary number of lines.

[edit:

behold!

^([\d]+\t([\d\/]+)\t.+\n)(.+?\t\2.+\r)+
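A rough sketch of the multi-line version in Python's `re` module, which is not the engine TextWrangler uses. It assumes Unix line endings, so both line breaks are written as `\n` rather than the `\n` and `\r` mix above, and the rows are made-up, shortened data.

```python
import re

# Group 1: the first line of a run. Group 2: its date.
# Group 3, repeated: each following line that contains the same date.
RUN = re.compile(r"^(\d+\t([\d/]+)\t.+\n)(.+?\t\2.+\n)+", re.MULTILINE)

# Hypothetical rows: three lines share a date, the fourth does not.
data = (
    "1955\t05/08/2004\t129\n"
    "1956\t05/08/2004\t129\n"
    "1958\t05/08/2004\t131\n"
    "1957\t05/09/2004\t130\n"
)

# For each run, record the shared date and how many lines the run spans.
runs = [(m.group(2), m.group(0).count("\n")) for m in RUN.finditer(data)]
print(runs)
```

One caveat: `.+?\t\2` accepts the date in any column after the first, not just the second, so data that repeats a date-like value in a later column could produce false matches.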

TheGreat2nd · January 2010

ronya wrote: »

A wise sensei linked me this once.

oh. my. god.

:^: :^:

Dance Commander · January 2010

I will have to take a look through this again when I'm at work on Tuesday. I think that V-Theory probably has it--I remember now trying to use other regular expressions with backreferences in the search string and having them fail similarly. It does support backreferences in the replacement string, oddly.
So assuming that is the problem--and I will try some of these other expressions as well--can someone recommend a free PC text editor with better, hopefully more standard regexp support?

Dance Commander · January 2010

So, simple backreferences work ok, but as soon as you go past a newline the whole thing shits the bed. Can someone recommend a text editor with a better RE engine? I do an awful lot of find-and-replace on data files that is greatly sped up by regular expressions, so something fully featured would be a tremendous help.

Penny Arcade
