
Powershell/Scripting question - how can I make this faster?

BlazeFire Registered User regular
edited February 2019 in Help / Advice Forum
I have a list (A) of ~10,000 timestamps (format: yyyy-MM-dd-HH-mm-ss). I have another list (B) of ~60,000 timestamps (same format). Both lists are in chronological order.

I want to make a third list (C) made up of timestamps from A that are within 30 seconds of any timestamp in B. The way I'm doing this right now (pseudocode) is:
foreach [i]time1[/i] in A {
     foreach [i]time2[/i] in B {
          [i]diff[/i] = [i]time1[/i] - [i]time2[/i]
          if (abs([i]diff[/i]) < 30) {
                 record [i]time1[/i]
                 break
          }
          elseif  ([i]diff[/i] < -120) {
                 break             
          }
     }
}

The elseif catches situations where the current value from list A is already well earlier in time than the current value from list B (ie since B is sorted, I already know it won't show up later in B).
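For the subtraction to give seconds, the yyyy-MM-dd-HH-mm-ss strings have to be parsed into real time values first. A quick illustration in Python (PowerShell's `[datetime]::ParseExact` with the same format string does the equivalent):

```python
from datetime import datetime

# strptime format matching the yyyy-MM-dd-HH-mm-ss strings above
FMT = "%Y-%m-%d-%H-%M-%S"

t1 = datetime.strptime("2019-02-11-15-26-28", FMT)
t2 = datetime.strptime("2019-02-11-15-26-33", FMT)

# timedelta.total_seconds() gives a signed difference in seconds
diff = (t1 - t2).total_seconds()
print(diff)  # -> -5.0
```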

Any suggestions on how I could make this faster?


Fakeedit: As I was typing this up, it occurred to me that I may not need to go through the entirety of list B. If I kept a pointer to where I found the last value, I think I can start there again on the next iteration of the outer loop. Hmm.

Real edit:
Some sample data. The marked item from List A would "match" any of the marked items in List B, but it would only take the first one to break out of the loop.

List A:
2019-02-07-11-49-23
2019-02-07-11-49-54
2019-02-07-11-50-25
2019-02-07-11-53-31
2019-02-07-11-54-02
2019-02-07-11-56-06
2019-02-11-15-26-28 ********


List B:
2019-02-07-23-07-55
2019-02-07-23-07-56
2019-02-07-23-18-35
2019-02-07-23-18-36
2019-02-07-23-18-37
2019-02-07-23-18-38
2019-02-07-23-18-39
2019-02-08-00-46-46
2019-02-08-00-46-47
2019-02-09-03-24-37
2019-02-11-15-26-33 ********
2019-02-11-15-26-34 ********
2019-02-11-15-26-35 ********
2019-02-11-15-26-41 ********
2019-02-11-15-26-42 ********
2019-02-11-15-26-43 ********

BlazeFire on

Posts

  • Smasher Starting to get dizzy Registered User regular
    edited February 2019
    The algorithm you're using is O(n^2); since the lists are already sorted you can get it down to O(n). If you're not familiar with Big-O notation, that means you can go from doing the inner loop ~600 million times to ~70 thousand. I don't know Powershell syntax, but in pseudocode what you want is something like this:
    a_index = 0
    b_index = 0
    while(a_index < a_length and b_index < b_length) {
    	if(abs(A[a_index] - B[b_index]) < 30) {
    		record(A[a_index])
    		a_index++
    	}
    	else if(A[a_index] > B[b_index]) {  
    		b_index++	
    	}
    	else {
    		a_index++
    	}
    }	
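
    Since neither of us writes Powershell, here's the same two-pointer merge as runnable Python for illustration, fed with trimmed-down versions of the sample lists from the post above:

```python
from datetime import datetime

FMT = "%Y-%m-%d-%H-%M-%S"

def parse(ts):
    return datetime.strptime(ts, FMT)

def within_30s(list_a, list_b):
    """Return timestamps from list_a within 30 s of any timestamp in list_b.

    Both lists must be sorted chronologically. Every iteration advances
    either a or b, so the merge is O(len(a) + len(b))."""
    result = []
    a = b = 0
    while a < len(list_a) and b < len(list_b):
        diff = (parse(list_a[a]) - parse(list_b[b])).total_seconds()
        if abs(diff) < 30:
            result.append(list_a[a])  # match; keep b in place for the next A
            a += 1
        elif diff > 0:
            b += 1                    # A is ahead of B: advance B
        else:
            a += 1                    # A is too far behind B: no match possible
    return result

# Trimmed versions of the thread's sample lists
list_a = ["2019-02-07-11-49-23", "2019-02-07-11-50-25", "2019-02-11-15-26-28"]
list_b = ["2019-02-07-23-07-55", "2019-02-09-03-24-37", "2019-02-11-15-26-33"]
print(within_30s(list_a, list_b))  # -> ['2019-02-11-15-26-28']
```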
    

    Smasher on
  • ArbitraryDescriptor Registered User regular
    What Smasher said: You don't need to keep going back to the beginning of the second list.

    Side note: I also don't powershell, so you broke me for a minute there with this apparent double sided array accessor
    [i]diff[/i]
    

    ...Until I typed it in here and realized that was just an escaped italics tag.

  • ArbitraryDescriptor Registered User regular
    edited February 2019
    Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

    Like
    n=100
    while(stuff):
    
      if aList[a] > bList[b]:
        if aList[a-int(n/2)] < bList[b]: 
          /*subloop for half skipped records*/
        elif aList[a-n+1] < bList[b]: 
          /*subloop for all skipped records*/
        else:
          a=a+n //instead of a++  
    

    But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

    Edit: That... Wasn't an edit. This is!

    If you really want to over-engineer the damned thing: You could try to make n self-balanced.
    You could track the frequency of hits to the elif block, and if it's greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.
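    Taken to its limit, the skip-n idea becomes a binary search: for each A timestamp, jump straight to its insertion point in the sorted B list and only check the neighbours. A hedged Python sketch (O(n log m), using plain epoch seconds to keep it short):

```python
from bisect import bisect_left

def matches(a_secs, b_secs, window=30):
    """a_secs, b_secs: sorted lists of epoch seconds.

    Returns entries of a_secs within `window` seconds of some b_secs entry."""
    out = []
    for t in a_secs:
        i = bisect_left(b_secs, t)  # index of first B entry >= t
        # a match, if any exists, must be adjacent to the insertion point
        if (i < len(b_secs) and b_secs[i] - t < window) or \
           (i > 0 and t - b_secs[i - 1] < window):
            out.append(t)
    return out

print(matches([100, 500, 1000], [125, 980]))  # -> [100, 1000]
```

    Whether this beats the straight merge depends on the data; for lists this size either is plenty fast.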

    ArbitraryDescriptor on
  • BlazeFire Registered User regular
    Perfect, thank you both for your help. Time went from an hour to 30 seconds. Big-O notation was a long time ago for me but I remember enough to recognize I was doing this in a silly way and to make sense of Smasher's answer. Glad to see that my first thought while typing up the original post was the right way to do it.

  • BlazeFire Registered User regular
    ArbitraryDescriptor wrote: »
    Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

    Like
    n=100
    while(stuff):
    
      if aList[a] > bList[b]:
        if aList[a-int(n/2)] < bList[b]: 
          /*subloop for half skipped records*/
        elif aList[a-n+1] < bList[b]: 
          /*subloop for all skipped records*/
        else:
          a=a+n //instead of a++  
    

    But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

    Edit: That... Wasn't an edit. This is!

    If you really want to over-engineer the damned thing: You could try to make n self-balanced.
    You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

    This is an interesting idea, but something I won't end up implementing. This is something I'll be manually running once every 1-2 weeks so the current speed is adequate.

  • ArbitraryDescriptor Registered User regular
    edited February 2019
    But it sounds like it could save you dozens of seconds a year!

    :rotate:

    ArbitraryDescriptor on
  • BlazeFire Registered User regular
    Don't tempt me!

    [image: automation.png]
