Club PA 2.0 has arrived! If you'd like to access some extra PA content and help support the forums, check it out at patreon.com/ClubPA
The image size limit has been raised to 1mb! Anything larger than that should be linked to. This is a HARD limit, please do not abuse it.
Our new Indie Games subforum is now open for business in G&T. Go and check it out, you might land a code for a free game. If you're developing an indie game and want to post about it, follow these directions. If you don't, he'll break your legs! Hahaha! Seriously though.
Our rules have been updated and given their own forum. Go and look at them! They are nice, and there may be new ones that you didn't know about! Hooray for rules! Hooray for The System! Hooray for Conforming!

Powershell/Scripting question - how can I make this faster?

BlazeFireBlazeFire Registered User regular
edited February 13 in Help / Advice Forum
I have a list (A) of ~10 000 timestamps (format: yyyy-MM-dd-HH-mm-ss). I have another list (B) of ~60 000 time stamps (same format). Both lists are in chronological order.

I want to make a third list (C) made up of timestamps from A that are within 30 seconds of any timestamp in B. The way I'm doing this right now (psuedocode) is:
foreach [i]time1[/i] in A {
     foreach [i]time2[/i] in B {
          [i]diff[/i] = [i]time1[/i] - [i]time2[/i]
          if (abs([i]diff[/i]) < 30) {
                 record [i]time1[/i]
                 break
          }
          elseif  ([i]diff[/i] < -120) {
                 break             
          }
     }
}

The elseif catches situations where the current value from list A is already earlier in time than B (ie I already know it won't show up in B).

Any suggestions on how could I make this faster?


Fakeedit: As I was typing this up, it occurred to me that I may not need to go through the entirety of list B. If I kept a pointer to where I found the last value, I think I can start there again on the next iteration of the outer loop. Hmm.

Real edit:
Some sample date. The marked item from List A would "match" any of the marked items in List B but it would only take the first one to break out of the loop.

List A:
2019-02-07-11-49-23
2019-02-07-11-49-54
2019-02-07-11-50-25
2019-02-07-11-53-31
2019-02-07-11-54-02
2019-02-07-11-56-06
2019-02-11-15-26-28 ********


List B:
2019-02-07-23-07-55
2019-02-07-23-07-56
2019-02-07-23-18-35
2019-02-07-23-18-36
2019-02-07-23-18-37
2019-02-07-23-18-38
2019-02-07-23-18-39
2019-02-08-00-46-46
2019-02-08-00-46-47
2019-02-09-03-24-37
2019-02-11-15-26-33 ********
2019-02-11-15-26-34 ********
2019-02-11-15-26-35 ********
2019-02-11-15-26-41 ********
2019-02-11-15-26-42 ********
2019-02-11-15-26-43 ********

BlazeFire on

Posts

  • SmasherSmasher Starting to get dizzy Registered User regular
    edited February 14
    The algorithm you're using is O(n^2); since the lists are already sorted you can get it down to O(n). If you're not familiar with Big-O notation, that means you can go from doing the inner loop ~600 million times to ~70 thousand. I don't know Powershell syntax, but in pseudocode what you want is something like this:
    a_index = 0
    b_index = 0
    while(a_index < a_length and b_index < b_length) {
    	if(abs(A[a_index] - B[b_index]) < 30) {
    		record(A[a_index])
    		a_index++
    	}
    	else if(A[a_index] > B[b_index]) {  
    		b_index++	
    	}
    	else {
    		a_index++
    	}
    }	
    

    Smasher on
    ArbitraryDescriptorkimeOrca
  • ArbitraryDescriptorArbitraryDescriptor Registered User regular
    What Smasher said: You don't need to keep going back to the beginning of the second list.

    Side note: I also don't powershell, so you broke me for a minute there with this apparent double sided array accessor
    [i]diff[/i]
    

    ...Until I typed it in here and realized that was just an escaped italics tag.

  • ArbitraryDescriptorArbitraryDescriptor Registered User regular
    edited February 14
    Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

    Like
    n=100
    while(stuff):
    
      if aList[a] > bList[b]:
        if aList[a-int(n/2)] < bList[b]: 
          /*subloop for half skipped records*/
        elif aList[a-n+1] < bList[b]: 
          /*subloop for all skipped records*/
        else:
          a=a+n //instead of a++  
    

    But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

    Edit: That... Wasn't an edit. This is!

    If you really want to over-engineer the damned thing: You could try to make n self-balanced.
    You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

    ArbitraryDescriptor on
  • BlazeFireBlazeFire Registered User regular
    Perfect, thank you both for your help. Time went from an hour to 30 seconds. Big-O notation was a long time ago for me but I remember enough to recognize I was doing this in a silly way and to make sense of Smasher's answer. Glad to see that my first thought while typing up the original post was the right way to do it.

    Zilla360
  • BlazeFireBlazeFire Registered User regular
    Edit: Depending on the actual spread of your data, you might be able to squeeze a few more ms out of it if you skip every n A records on the main loop, then walk backwards if you jumped past the next B record.

    Like
    n=100
    while(stuff):
    
      if aList[a] > bList[b]:
        if aList[a-int(n/2)] < bList[b]: 
          /*subloop for half skipped records*/
        elif aList[a-n+1] < bList[b]: 
          /*subloop for all skipped records*/
        else:
          a=a+n //instead of a++  
    

    But if n is too large you'll wind up in the elif 99% of the time, and there's probably a lower limit on what percentage of records need to be skipped to offset the additional overhead of the subloop.

    Edit: That... Wasn't an edit. This is!

    If you really want to over-engineer the damned thing: You could try to make n self-balanced.
    You could track the frequency of hits to the elif block, and if its greater than some desirable threshold, say, 50%, adjust n down. Do the same in the else block to gradually increase n so it keeps making larger jumps until it's going too far.

    This is an interesting idea, but something I won't end up implementing. This is something I'll be manually running once every 1-2 weeks so the current speed is adequate.

  • ArbitraryDescriptorArbitraryDescriptor Registered User regular
    edited February 14
    But it sounds like it could save you dozens of seconds a year!

    :rotate:

    ArbitraryDescriptor on
    dispatch.o
  • BlazeFireBlazeFire Registered User regular
    Don't tempt me!

    automation.png

    ArbitraryDescriptorElvenshaeDaenrisAngelHedgieZilla360DrovekBliss 101SmrtnikStabbity Style
Sign In or Register to comment.