Powershell/Scripting question - how can I make this faster?

BlazeFire · February 2019

I have a list (A) of ~10 000 timestamps (format: yyyy-MM-dd-HH-mm-ss). I have another list (B) of ~60 000 time stamps (same format). Both lists are in chronological order.

I want to make a third list (C) made up of timestamps from A that are within 30 seconds of any timestamp in B. The way I'm doing this right now (psuedocode) is:

foreach [i]time1[/i] in A {
     foreach [i]time2[/i] in B {
          [i]diff[/i] = [i]time1[/i] - [i]time2[/i]
          if (abs([i]diff[/i]) < 30) {
                 record [i]time1[/i]
                 break
          }
          elseif  ([i]diff[/i] < -120) {
                 break             
          }
     }
}

The elseif catches situations where the current value from list A is already earlier in time than B (ie I already know it won't show up in

.

Any suggestions on how could I make this faster?

Fakeedit: As I was typing this up, it occurred to me that I may not need to go through the entirety of list B. If I kept a pointer to where I found the last value, I think I can start there again on the next iteration of the outer loop. Hmm.

Real edit:
Some sample date. The marked item from List A would "match" any of the marked items in List B but it would only take the first one to break out of the loop.

List A:

2019-02-07-11-49-23
2019-02-07-11-49-54
2019-02-07-11-50-25
2019-02-07-11-53-31
2019-02-07-11-54-02
2019-02-07-11-56-06
2019-02-11-15-26-28 ********

List B:

2019-02-07-23-07-55
2019-02-07-23-07-56
2019-02-07-23-18-35
2019-02-07-23-18-36
2019-02-07-23-18-37
2019-02-07-23-18-38
2019-02-07-23-18-39
2019-02-08-00-46-46
2019-02-08-00-46-47
2019-02-09-03-24-37
2019-02-11-15-26-33 ********
2019-02-11-15-26-34 ********
2019-02-11-15-26-35 ********
2019-02-11-15-26-41 ********
2019-02-11-15-26-42 ********
2019-02-11-15-26-43 ********

Smasher · February 2019

The algorithm you're using is O(n^2); since the lists are already sorted you can get it down to O(n). If you're not familiar with Big-O notation, that means you can go from doing the inner loop ~600 million times to ~70 thousand. I don't know Powershell syntax, but in pseudocode what you want is something like this:

a_index = 0
b_index = 0
while(a_index < a_length and b_index < b_length) {
	if(abs(A[a_index] - B[b_index]) < 30) {
		record(A[a_index])
		a_index++
	}
	else if(A[a_index] > B[b_index]) {  
		b_index++	
	}
	else {
		a_index++
	}
}

ArbitraryDescriptor · February 2019

What Smasher said: You don't need to keep going back to the beginning of the second list.

Side note: I also don't powershell, so you broke me for a minute there with this apparent double sided array accessor

[i]diff[/i]

...Until I typed it in here and realized that was just an escaped italics tag.

ArbitraryDescriptor · February 2019

Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

Like

n=100
while(stuff):

  if aList[a] > bList[b]:
    if aList[a-int(n/2)] < bList[b]: 
      /*subloop for half skipped records*/
    elif aList[a-n+1] < bList[b]: 
      /*subloop for all skipped records*/
    else:
      a=a+n //instead of a++

But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

Edit: That... Wasn't an edit. This is!

If you really want to over-engineer the damned thing: You could try to make n self-balanced.
You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

BlazeFire · February 2019

Perfect, thank you both for your help. Time went from an hour to 30 seconds. Big-O notation was a long time ago for me but I remember enough to recognize I was doing this in a silly way and to make sense of Smasher's answer. Glad to see that my first thought while typing up the original post was the right way to do it.

BlazeFire · February 2019

ArbitraryDescriptor wrote: »
Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

Like
n=100
while(stuff):

  if aList[a] > bList[b]:
    if aList[a-int(n/2)] < bList[b]: 
      /*subloop for half skipped records*/
    elif aList[a-n+1] < bList[b]: 
      /*subloop for all skipped records*/
    else:
      a=a+n //instead of a++  
But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

Edit: That... Wasn't an edit. This is!

If you really want to over-engineer the damned thing: You could try to make n self-balanced.
You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

This is an interesting idea, but something I won't end up implementing. This is something I'll be manually running once every 1-2 weeks so the current speed is adequate.

ArbitraryDescriptor · February 2019

But it sounds like it could save you dozens of seconds a year!

:rotate:

BlazeFire · February 2019

Don't tempt me!

Penny Arcade

Quick Links

Powershell/Scripting question - how can I make this faster?

Posts