As was foretold, we've added advertisements to the forums! If you have questions, or if you encounter any bugs, please visit this thread: https://forums.penny-arcade.com/discussion/240191/forum-advertisement-faq-and-reports-thread/

Powershell/Scripting question - how can I make this faster?

BlazeFireBlazeFire Registered User regular
edited February 2019 in Help / Advice Forum
I have a list (A) of ~10 000 timestamps (format: yyyy-MM-dd-HH-mm-ss). I have another list (B) of ~60 000 time stamps (same format). Both lists are in chronological order.

I want to make a third list (C) made up of timestamps from A that are within 30 seconds of any timestamp in B. The way I'm doing this right now (psuedocode) is:
foreach [i]time1[/i] in A {
     foreach [i]time2[/i] in B {
          [i]diff[/i] = [i]time1[/i] - [i]time2[/i]
          if (abs([i]diff[/i]) < 30) {
                 record [i]time1[/i]
                 break
          }
          elseif  ([i]diff[/i] < -120) {
                 break             
          }
     }
}

The elseif catches situations where the current value from list A is already earlier in time than B (ie I already know it won't show up in B).

Any suggestions on how could I make this faster?


Fakeedit: As I was typing this up, it occurred to me that I may not need to go through the entirety of list B. If I kept a pointer to where I found the last value, I think I can start there again on the next iteration of the outer loop. Hmm.

Real edit:
Some sample date. The marked item from List A would "match" any of the marked items in List B but it would only take the first one to break out of the loop.

List A:
2019-02-07-11-49-23
2019-02-07-11-49-54
2019-02-07-11-50-25
2019-02-07-11-53-31
2019-02-07-11-54-02
2019-02-07-11-56-06
2019-02-11-15-26-28 ********


List B:
2019-02-07-23-07-55
2019-02-07-23-07-56
2019-02-07-23-18-35
2019-02-07-23-18-36
2019-02-07-23-18-37
2019-02-07-23-18-38
2019-02-07-23-18-39
2019-02-08-00-46-46
2019-02-08-00-46-47
2019-02-09-03-24-37
2019-02-11-15-26-33 ********
2019-02-11-15-26-34 ********
2019-02-11-15-26-35 ********
2019-02-11-15-26-41 ********
2019-02-11-15-26-42 ********
2019-02-11-15-26-43 ********

BlazeFire on

Posts

  • SmasherSmasher Starting to get dizzy Registered User regular
    edited February 2019
    The algorithm you're using is O(n^2); since the lists are already sorted you can get it down to O(n). If you're not familiar with Big-O notation, that means you can go from doing the inner loop ~600 million times to ~70 thousand. I don't know Powershell syntax, but in pseudocode what you want is something like this:
    a_index = 0
    b_index = 0
    while(a_index < a_length and b_index < b_length) {
    	if(abs(A[a_index] - B[b_index]) < 30) {
    		record(A[a_index])
    		a_index++
    	}
    	else if(A[a_index] > B[b_index]) {  
    		b_index++	
    	}
    	else {
    		a_index++
    	}
    }	
    

    Smasher on
  • ArbitraryDescriptorArbitraryDescriptor changed Registered User regular
    What Smasher said: You don't need to keep going back to the beginning of the second list.

    Side note: I also don't powershell, so you broke me for a minute there with this apparent double sided array accessor
    [i]diff[/i]
    

    ...Until I typed it in here and realized that was just an escaped italics tag.

  • ArbitraryDescriptorArbitraryDescriptor changed Registered User regular
    edited February 2019
    Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

    Like
    n=100
    while(stuff):
    
      if aList[a] > bList[b]:
        if aList[a-int(n/2)] < bList[b]: 
          /*subloop for half skipped records*/
        elif aList[a-n+1] < bList[b]: 
          /*subloop for all skipped records*/
        else:
          a=a+n //instead of a++  
    

    But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

    Edit: That... Wasn't an edit. This is!

    If you really want to over-engineer the damned thing: You could try to make n self-balanced.
    You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

    ArbitraryDescriptor on
  • BlazeFireBlazeFire Registered User regular
    Perfect, thank you both for your help. Time went from an hour to 30 seconds. Big-O notation was a long time ago for me but I remember enough to recognize I was doing this in a silly way and to make sense of Smasher's answer. Glad to see that my first thought while typing up the original post was the right way to do it.

  • BlazeFireBlazeFire Registered User regular
    Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

    Like
    n=100
    while(stuff):
    
      if aList[a] > bList[b]:
        if aList[a-int(n/2)] < bList[b]: 
          /*subloop for half skipped records*/
        elif aList[a-n+1] < bList[b]: 
          /*subloop for all skipped records*/
        else:
          a=a+n //instead of a++  
    

    But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

    Edit: That... Wasn't an edit. This is!

    If you really want to over-engineer the damned thing: You could try to make n self-balanced.
    You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

    This is an interesting idea, but something I won't end up implementing. This is something I'll be manually running once every 1-2 weeks so the current speed is adequate.

  • ArbitraryDescriptorArbitraryDescriptor changed Registered User regular
    edited February 2019
    But it sounds like it could save you dozens of seconds a year!

    :rotate:

    ArbitraryDescriptor on
  • BlazeFireBlazeFire Registered User regular
    Don't tempt me!

    automation.png

Sign In or Register to comment.