
[Programming] Kafkaesque rabbits in the queue at the pub


  • Options
    templewulftemplewulf The Team Chump USARegistered User regular
    Alright so I've been at this job for a bit over a year, and basically my boss uses STATA and I use R

    We both kind of came out of academics, hence academic-ish stats programs, that's just what we're used to.

    But like... it is becoming very clear that what we're doing is not the right way to do what we do long term. We wind up with poorly built rickety scripts that are really hard to keep track of and really difficult to show other people or keep up to date effectively. Just lots and lots of individual scripts that rely on specific prepping to run.

    So what I'm trying to look for is a change in language and/or approach for the both of us to leap into and try to focus on doing this stuff properly.

    A task we often do:
    - Take information in the form of .csv and .xls files, usually tables with some mixture of data types, saved locally
    - Alter that information, table operations and reshaping and all that, making new variables out of old ones, correcting errors
    - Report data from these tables/reshaped things
    - Most of the data reporting winds up being done in Tableau or like just some raw numbers like "20 people blah, 40 people blahed".

    But like basically I feel like the sort of thing we were doing in academia isn't coming up too often here. We're not doing complex stats all that often so much as just some straightforward math and fussing about with graphing in a separate much prettier program because we can actually buy a thing to make nice graphs. We're doing data manipulation more often than we're doing stats, because often the .csv and .xls files MUST come in exactly that layout, and we've got to fuss with them a lot before they work right for our purposes.

    And I've definitely run into some issues where we were doing operations on such giant data sets that R would chug for like 20-40 minutes to complete operations. There was one moment with literally millions of rows by thousands of columns where it just was not possible to perform certain operations, even after making the code as efficient as I could manage.

    Is there another language that people seem to enjoy for this sort of work? I've had a bit of experience with SQL, a little C#, a bit of Python. I'm open to learning whatever, though, and my boss is also eager to pick up something other than STATA scripting. It would also help to have a nice strong style guide/tutorial so we don't get into bad habits again!

    I like the Python recommendations, especially if you're the kind of person who wants to run utilities from the command line.

    On the other hand, my partner is a data science graduate who did a lot of R in grad school. Since R scripts tend to be very siloed, like you mentioned, data manipulation can be difficult to keep track of. She migrated those to Rails and put them up on Heroku so that her department wouldn't have to requisition server resources from the IT department.

    The big upsides are that it can be run by non-technical peers in her department, and there is always one canonical version. The big downside is that learning How2WebApp is an undertaking in and of itself. If you don't already know how and/or you don't want to do it recreationally, sticking with command line Ruby or Python is a better bet.

    Twitch.tv/FiercePunchStudios | PSN | Steam | Discord | SFV CFN: templewulf
  • Options
    durandal4532durandal4532 Registered User regular
    Hm. That's a very good alternate.

    I think the command-line will make more sense due to the potentially confidential nature of the information we're getting. Some of it is protected health info, and I don't want to risk opening any sort of security hole, even though it'd be unlikely. I know for a fact at least one of my boss's scripts used identifiable information to do some of the work. If I can't absolutely guarantee the security I wouldn't want to risk it, and absolutely guaranteeing that would be a months-long process with the way things go here.

    This is really helpful, though!

    Take a moment to donate what you can to Critical Resistance and Black Lives Matter.
  • Options
    NogsNogs Crap, crap, mega crap. Crap, crap, mega crap.Registered User regular
    so ive been interviewing, and i just finished 3 rounds of interviews with a company

    recruiter/hr guy called me up just now

    'you're really impressive with your React knowledge. But we really need someone that knows javascript, css and html'



    ....right okay buddy, so you have no idea what you are talking about, that's cool.

    PARKER, YOU'RE FIRED! <-- My comic book podcast! Satan look here!
  • Options
    EtheaEthea Registered User regular
    Nogs wrote: »
    so ive been interviewing, and i just finished 3 rounds of interviews with a company

    recruiter/hr guy called me up just now

    'you're really impressive with your React knowledge. But we really need someone that knows javascript, css and html'



    ....right okay buddy, so you have no idea what you are talking about, that's cool.

    Well you dodged a bullet.

  • Options
    bowenbowen How you doin'? Registered User regular
    that's strange though

    not a doctor, not a lawyer, examples I use may not be fully researched so don't take out of context plz, don't @ me
  • Options
    templewulftemplewulf The Team Chump USARegistered User regular
    Hm. That's a very good alternate.

    I think the command-line will make more sense due to the potentially confidential nature of the information we're getting. Some of it is protected health info, and I don't want to risk opening any sort of security hole, even though it'd be unlikely. I know for a fact at least one of my boss's scripts used identifiable information to do some of the work. If I can't absolutely guarantee the security I wouldn't want to risk it, and absolutely guaranteeing that would be a months-long process with the way things go here.

    This is really helpful, though!

    Oh wow yes, if you are in HIPAA land, securing information is a whole other topic. Good call!

    If nothing else, putting your scripts in a VCS (e.g. Git) will help you keep track of them, and you can add a little ReadMe to the repo that explains the setup process.

    Twitch.tv/FiercePunchStudios | PSN | Steam | Discord | SFV CFN: templewulf
  • Options
    zeenyzeeny Registered User regular
    Notebooks. Guys, cmon. Cut it with the "command line". Notebooks.

  • Options
    jaziekjaziek Bad at everything And mad about it.Registered User regular
    edited July 2016
    bowen wrote: »
    sounds like you want a RESTful service

    Check out any of the sinatra-based frameworks like sinatra itself, nancyfx, node.js with express/loopback/etc, servicestack (not really sinatra based but it's a neat concept)
    bowen wrote: »
    Pick a language and I can probably point you in the right direction!

    I dunno,

    we have a million and one ways of directly accessing the database as it stands, we do raw SQL queries from PHP and C# applications, there's some stuff using entity framework, some stuff using T-SQL through a custom ODBC driver, a shitload of SSIS packages, loads of stored procedures, the list goes on...

    Really I'm not sure what would be easier, and/or better in the long run.

    Do I advocate going for some kind of data access layer that sits directly on top of the databases, that everything goes through, and that operates completely independently of any of our applications, or do we go for implementing some standardised library that we can plug into all the apps individually and that lets us switch databases relatively easily with some kind of plugin architecture?

    To me, it seems like having the data access be a service in and of itself would simplify this kind of migration in the future, and let us be provider agnostic going forward with less effort, but surely it kills performance to have all that extra cruft in the way.

    Language-wise, whatever we do will likely end up being C# or C++.

    jaziek on
    Steam ||| SC2 - Jaziek.377 on EU & NA. ||| Twitch Stream
  • Options
    bowenbowen How you doin'? Registered User regular
    You could do both.

    I'd recommend the DAL through a RESTful service. That's what I'm doing, and I'm finding it much easier to deal with small helper utilities that might need access to data compared to the library system I had before.

    With C# you can use dynamic objects and JSON to convert back and forth between applications and the server without even needing classes set up beforehand. Whether that's good or not is up to you.

    NancyFx is a nice package; you should look into it. It's C#-based, and it's the one I'm most familiar with anyway.

    Basically everything would end up going through something like http://locallansystem:9999/customer/1245033 and you could pull a customer object down based on the ID you sent it. This way PHP, C#, C++, Python, JavaScript, etc. can all do it.
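
    For example (just a sketch, not production code): a tiny C++ helper hitting that made-up endpoint, assuming libcurl is available. Any other language's HTTP client would look about the same.
    #include <curl/curl.h>
    #include <iostream>
    #include <string>
    
    // Append each chunk of the HTTP response body to a std::string.
    static size_t WriteBody(char* data, size_t size, size_t nmemb, void* userdata)
    {
        static_cast<std::string*>(userdata)->append(data, size * nmemb);
        return size * nmemb;
    }
    
    int main()
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        if (!curl) return 1;
    
        std::string body;
        // Made-up endpoint from above; the service is the only thing that talks to the database.
        curl_easy_setopt(curl, CURLOPT_URL, "http://locallansystem:9999/customer/1245033");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteBody);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    
        if (curl_easy_perform(curl) == CURLE_OK)
            std::cout << body << "\n";   // JSON for the requested customer
    
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }
    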

    not a doctor, not a lawyer, examples I use may not be fully researched so don't take out of context plz, don't @ me
  • Options
    ecco the dolphinecco the dolphin Registered User regular
    edited July 2016
    Looking into implementing jwt as a sort of stateless authentication mechanism.

    Doing some due diligence by reading up on best practises.

    Farrrrrr out now I have to protect against XSS and CSRF attacks for my authentication to have any meaning.

    Edit: I mean, yes, they should be done as a matter of course, but man. Security and doing things properly.

    I was just going to throw stuff together as a proof of concept, but now I think I need to do it properly.

    Edit 2: Woah, I think that NancyFx's simple view engine can generate anti forgery tokens for me in simple HTML files. Yessssss

    ecco the dolphin on
    Penny Arcade Developers at PADev.net.
  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    edited July 2016
    Ok, this is killing me now

    I have a list of Match objects that have an index and a length field
    Match
    {
      index = 5,
      length = 10
    }
    

    Where index is where the match starts in an arbitrary string and length is the length of the match. So a Match(3,5) means that starting at character 3 in the string there is a 5 character long match.

    So given the list
    [Match(0,100),Match(5,3),Match(6,1),Match(50,10), Match(200,10)]
    
    I want a function that returns
    [
      [Match(0,100), Match(200,10)],
      [Match(5,3), Match(50,10), Match(200,10)].
      [Match(6,1),Match(50,10), Match(200,10)]
    ]
    

    So a list of all Match groups where every individual Match does not overlap with another Match. Crucially though I don't want to see
    [Match(50,10), Match(200,10)]
    or
    [Match(200,10)]
    

    as one of the returned groups. The groups have to be the largest set of non-overlapping Matches possible.

    The solution to this is swirling around on the edge of my ability to comprehend it. It seems like it should be a really simple classical algorithmic problem but every time I think I am close I discover an edge case that ruins my approach.
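
    For concreteness, here's roughly what I mean by Match and "overlap", treating each match as the half-open interval [index, index + length) - just a sketch, and it assumes a match that starts exactly where another one ends does not count as overlapping:
    struct Match
    {
        int index;   // where the match starts in the string
        int length;  // how many characters it covers
    
        // Treat a match as the half-open interval [index, index + length).
        // Under this reading, a match starting exactly where another ends does NOT overlap it.
        bool overlaps(const Match& other) const
        {
            return index < other.index + other.length &&
                   other.index < index + length;
        }
    };
    
    // e.g. Match{5, 3} covers characters 5, 6, 7; it overlaps Match{6, 1} but not Match{8, 2}.
    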

    Alistair Hutton on
    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    edited July 2016
    A naive solution would be O(n!): trying all permutations in order and returning the non-overlapping portion of each run.

    A quick google for disjoint set (which is what you want to turn your data into) turns up the appropriately named disjoint-set data structure, which runs in a much more reasonable n log(n).

    edit: disjoint-set would only return *one* of the disjoint sets...problematic. I'll have some coffee and think about it, but getting reasonable time/space complexity might be interesting...

    Orca on
  • Options
    InfidelInfidel Heretic Registered User regular
    Ok, this is killing me now

    I have a list of Match objects that have an index and a length field
    Match
    {
      index = 5,
      length = 10
    }
    

    Where index is where the match starts in an arbitrary string and length is the length of the match. So a Match(3,5) means that starting at character 3 in the string there is a 5 character long match.

    So given the list
    [Match(0,100),Match(5,3),Match(6,1),Match(50,10), Match(200,10)]
    
    I want a function that returns
    [
      [Match(0,100), Match(200,10)],
      [Match(5,3), Match(50,10), Match(200,10)].
      [Match(6,1),Match(50,10), Match(200,10)]
    ]
    

    So a list of all Match groups where every individual Match does not overlap with another Match. Crucially though I don't want to see
    [Match(6,1),Match(50,10), Match(200,10)]
    or
    [Match(50,10), Match(200,10)]
    or
    [Match(200,10)]
    

    as one of the returned groups. The groups have to be the largest set of non-overlapping Matches possible.

    The solution to this is swirling around on the edge of my ability to comprehend it. It seems like it should be a really simple classical algorithmic problem but every time I think I am close I discover an edge case that ruins my approach.

    Generating the possible lists is not too bad. Assuming you're good there and just worried about the largest-sets part, you can generate the full list first and then test, for each element of that list (each set of matches), that it is not a proper subset of any other; if it is, remove it.

    Performance is a thing I like to look at after I have a functioning and correct algorithm, because usually then I can see the potential optimizations (the code patterns tend to be common even if the problem was novel).
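
    Something like this for the subset-pruning step (only a sketch: it assumes the candidate groups are already generated and kept sorted, and uses plain (index, length) pairs instead of a real Match class):
    #include <algorithm>
    #include <utility>
    #include <vector>
    
    using Match = std::pair<int, int>;   // (index, length)
    using Group = std::vector<Match>;
    
    // Keep only the groups that are not a proper subset of some other group.
    // Each group must be sorted so std::includes can test containment.
    std::vector<Group> DropProperSubsets(const std::vector<Group>& groups)
    {
        std::vector<Group> kept;
        for (const Group& g : groups)
        {
            bool subsumed = std::any_of(groups.begin(), groups.end(),
                [&](const Group& other) {
                    return other.size() > g.size() &&
                           std::includes(other.begin(), other.end(),
                                         g.begin(), g.end());
                });
            if (!subsumed)
                kept.push_back(g);
        }
        return kept;
    }
    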

  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    Orca wrote: »
    A naive solution would be O(n!): trying all permutations in order and returning the non-overlapping portion of each run.

    A quick google for disjoint set (which is what you want to turn your data into) turns up the appropriately named disjoint-set data structure, which runs in a much more reasonable n log(n).

    edit: disjoint-set would only return *one* of the disjoint sets...problematic. I'll have some coffee and think about it, but getting reasonable time/space complexity might be interesting...

    Perhaps start with a disjoint set. Any elements of the set with more than one entry need to have one element removed and disjoint-set run on those elements. I'm going to use starting and ending characters, because it's easier for me to think of it that way.

    e.g., let's say there's a set consisting of [(0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]. We run disjoint set on the input and get:
    [[(0,3)],
    [(5, 10), (5, 7), (8, 10)],
    [(20, 30)]]

    The first and last sets consist of only one element; they'll be part of every single reported list. The middle set is more complicated, consisting of 3 elements. What if we could make that set disjoint? We could recursively remove one element and then run disjoint-set on the remainder. The problem is that the order in which we remove elements from consideration is important, which means we're permuting the order in which we remove elements, putting us squarely in O(n!) territory again--but with the right data sets you're at least lowering the size of your n.

    This is a surprisingly interesting problem for a Saturday morning.

  • Options
    InfidelInfidel Heretic Registered User regular
    Yeah, the disjoint set would let you remove the trivial cases, like the omnipresent Match(200, 10) in the example, since the real solutions are the same as the answer for the first four elements only, with the last tacked onto each solution.

  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    edited July 2016
    Orca wrote: »
    This is a surprisingly interesting problem for a Saturday morning.

    Yeah I initially thought this was a really easy trivial-to-solve-with-recursion problem but it is proving meatier than I thought.

    To be honest I'm tempted to say that in my real world application of this the total size of n will never be that large, so I could just go for the "generate all combinations of all sizes, filter the combinations to remove overlaps, and discard all sequences that are strict subsets" approach.

    I just can't help but think there is a perfect solution for this that I am missing. Also, if I ever hit some pathological input my solution will blow up fast.

    Alistair Hutton on
    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    Orca wrote: »
    This is a surprisingly interesting problem for a Saturday morning.

    Yeah I initially thought this was a really easy trivial-to-solve-with-recursion problem but it is proving meatier than I thought.

    To be honest I'm tempted to say the total size of n will never be that large so I could just go for the generate all combinations of all sizes, filter the combinations to remove overlaps and discard all sequences that are strict subsets.

    I just can't help but think there is a perfect solution for this that I am missing. Also, if I ever hit some pathological input my solution will blow up fast.

    It's the requirement for all disjoint sets that makes the problem explode. One partition isn't hard. :)

  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    Orca wrote: »
    Orca wrote: »
    This is a surprisingly interesting problem for a Saturday morning.

    Yeah I initially thought this was a really easy trivial-to-solve-with-recursion problem but it is proving meatier than I thought.

    To be honest I'm tempted to say the total size of n will never be that large so I could just go for the generate all combinations of all sizes, filter the combinations to remove overlaps and discard all sequences that are strict subsets.

    I just can't help but think there is a perfect solution for this that I am missing. Also, if I ever hit some pathological input my solution will blow up fast.

    It's the requirement for all disjoint sets that makes the problem explode. One partition isn't hard. :)

    Wait, wait, wait.

    I think this might be a time for dynamic programming.

    Maybe.

    I can smell an n² solution.

    Oh no, wait that's just dinner I'm cooking.

    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    Orca wrote: »
    Orca wrote: »
    This is a surprisingly interesting problem for a Saturday morning.

    Yeah I initially thought this was a really easy trivial-to-solve-with-recursion problem but it is proving meatier than I thought.

    To be honest I'm tempted to say the total size of n will never be that large so I could just go for the generate all combinations of all sizes, filter the combinations to remove overlaps and discard all sequences that are strict subsets.

    I just can't help but think there is a perfect solution for this that I am missing. Also, if I ever hit some pathological input my solution will blow up fast.

    It's the requirement for all disjoint sets that makes the problem explode. One partition isn't hard. :)

    Wait, wait, wait.

    I think this might be a time for dynamic programming.

    Maybe.

    I can smell an n² solution.

    Oh no, wait that's just dinner I'm cooking.

    I was thinking of that, but the ordering dependency seems to me like it makes the memoization have the same complexity. I'd love to be proven wrong though!

  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    edited July 2016
    Other thought, can I formulate this as a graph traversal/spanning/search problem?

    POST DOODLING EDIT: I am smelling gold when it comes to thinking about this as graph traversal - this might just end up as a roundabout way of doing disjoint subsets, but I'm feeling it; hopefully I'll be able to find some time to code this tonight.

    POST POST DOODLING EDIT: No, I am smelling anti gold. Guh, horrible. The key thing here is to get a comprehensive set of test cases so I don't keep going down blind alleys.

    Alistair Hutton on
    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    edited July 2016
    Other thought, can I formulate this as a graph traversal/spanning/search problem?

    POST DOODLING EDIT: I am smelling gold when it comes to thinking about this as graph traversal - this might just end up as a roundabout way of doing disjoint subsets, but I'm feeling it; hopefully I'll be able to find some time to code this tonight.

    POST POST DOODLING EDIT: No, I am smelling anti gold. Guh, horrible. The key thing here is to get a comprehensive set of test cases so I don't keep going down blind alleys.

    Yeah, my attempts to construct a suitable graph have so far either ended up having O(n!) space complexity (on top of the implied O(n!) to construct it), which is not helpful, or reduced to the same O(n!) runtime complexity as the naive solution.

    Orca on
  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    edited July 2016
    I think I will implement my naive solution to prove that I can make something work then I'll take a stab at a dynamic programming thingy.

    I'm convinced that dynamic programming must be a winner because this looks like a Project Euler style problem and the answer to those is always dynamic programming.

    Alistair Hutton on
    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    I thought the answer was always to use a hashmap.

  • Options
    LD50LD50 Registered User regular
    Orca wrote: »
    I thought the answer was always to use a hashmap.

    Only for the first half of the problems.

  • Options
    TraceTrace GNU Terry Pratchett; GNU Gus; GNU Carrie Fisher; GNU Adam We Registered User regular
    https://groups.google.com/forum/#!topic/open-nars/3XE_XSE1RXY
    Gorilla is a Notebook-based REPL and now became a notebook interface for OpenNARS 2.0.0. Since it is a REPL, it allows full Clojure code and allows one to work with examples in a Notebook-like manner and to document them (also has Latex support) and to nicely present them while the system is running (saving worksheets is possible of course also), making experiments of any kind easy and reproducible. It can be used while Lense is running, and can be used together with the Cursive/IntelliJ IDE.

    [screenshots: Gorilla notebook worksheets, gorilla0.png through gorilla3.png]

  • Options
    PhyphorPhyphor Building Planet Busters Tasting FruitRegistered User regular
    Orca wrote: »
    Orca wrote: »
    A naive solution would be O(n!): trying all permutations in order and returning the non-overlapping portion of each run.

    A quick google for disjoint set (which is what you want to turn your data into) turns up the appropriately named disjoint-set data structure, which runs in a much more reasonable n log(n).

    edit: disjoint-set would only return *one* of the disjoint sets...problematic. I'll have some coffee and think about it, but getting reasonable time/space complexity might be interesting...

    Perhaps start with a disjoint set. Any elements of the set with more than one entry need to have one element removed and disjoint-set run on those elements. I'm going to use starting and ending characters, because it's easier for me to think of it that way.

    e.g., let's say there's a set consisting of [(0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]. We run disjoint set on the input and get:
    [[(0,3)],
    [(5, 10), (5, 7), (8, 10)],
    [(20, 30)]]

    The first and last sets consist of only one element; they'll be part of every single reported list. The middle set is more complicated, consisting of 3 elements. What if we could make that set disjoint? We could recursively remove one element and then run disjoint-set on the remainder. The problem is that the order in which we remove elements from consideration is important, which means we're permuting the order in which we remove elements, putting us squarely in O(n!) territory again--but with the right data sets you're at least lowering the size of your n.

    This is a surprisingly interesting problem for a Saturday morning.

    The ordering in which you remove entries is constrained though: if you sort the list by starting index, you can do something like:

    Pick first element, add to list
    If overlaps, recurse with a copy with this element removed to produce other sets
    Skip any overlapping elements and continue on

    Because you know that you don't have to consider any elements with a lower starting index than the current cursor, this should limit it.

  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    edited July 2016
    Phyphor wrote: »
    Orca wrote: »
    Orca wrote: »
    A naive solution would be O(n!): trying all permutations in order and returning the non-overlapping portion of each run.

    A quick google for disjoint set (which is what you want to turn your data into) turns up the appropriately named disjoint-set data structure, which runs in a much more reasonable n log(n).

    edit: disjoint-set would only return *one* of the disjoint sets...problematic. I'll have some coffee and think about it, but getting reasonable time/space complexity might be interesting...

    Perhaps start with a disjoint set. Any elements of the set with more than one entry need to have one element removed and disjoint-set run on those elements. I'm going to use starting and ending characters, because it's easier for me to think of it that way.

    e.g., let's say there's a set consisting of [(0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]. We run disjoint set on the input and get:
    [[(0,3)],
    [(5, 10), (5, 7), (8, 10)],
    [(20, 30)]]

    The first and last sets consist of only one element; they'll be part of every single reported list. The middle set is more complicated, consisting of 3 elements. What if we could make that set disjoint? We could recursively remove one element and then run disjoint-set on the remainder. The problem is that the order in which we remove elements from consideration is important, which means we're permuting the order in which we remove elements, putting us squarely in O(n!) territory again--but with the right data sets you're at least lowering the size of your n.

    This is a surprisingly interesting problem for a Saturday morning.

    The ordering in which you remove entries is constrained though: if you sort the list by starting index, you can do something like:

    Pick first element, add to list
    If overlaps, recurse with a copy with this element removed to produce other sets
    Skip any overlapping elements and continue on

    Because you know that you don't have to consider any elements with a lower starting index than the current cursor, this should limit it.

    Hm. Sort by starting index, and items with the same starting index reverse sort by the ending index? Then it's fully constrained.

    So:

    [(0,100), (0, 3), (5, 10), (5, 7), (8, 10), (20, 30)] ->

    [(0, 100), (0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]

    disjoint set:
    [
    [(0, 100), (0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]
    ] (well, that wasn't helpful)

    Remove first element of disjoint set, recursively run disjoint set on remainder:

    [
    [(0, 3)],
    [(5, 10), (5, 7), (8, 10)]
    [(20, 30)]
    ]

    More than one element returned, and the size of at least one set is greater than one; run on the set of size > 1:

    [(5, 10), (5, 7), (8, 10)]

    Remove first element and run disjoint set:
    [
    [(5, 7)],
    [(8, 10)]
    ]

    sets are size 1, return...

    This looks extremely promising!

    Orca on
  • Options
    OrcaOrca Also known as Espressosaurus WrexRegistered User regular
    edited July 2016
    So let's see. O(n log(n)) for the sort (the two-stage sort could be combined by a sort function that converts the two variables into a single value, e.g. start_index * 1e7 - end_index). O(n log(n)) for each disjoint set operation. In the worst case, that might be done n times. So O(n log(n) + n^2 log(n)) -> O(n^2 log(n)). Not a bad reduction from O(n!), assuming this covers all inputs!
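
    e.g. a single comparator that does the two-stage sort in one pass (just a sketch; it gives the same ordering as the composite start_index * 1e7 - end_index key, without having to worry about indices fitting under the multiplier):
    #include <algorithm>
    #include <utility>
    #include <vector>
    
    // Sort (start, end) intervals by start ascending, ties broken by end descending.
    void SortIntervals(std::vector<std::pair<int, int>>& v)
    {
        std::sort(v.begin(), v.end(),
            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                return a.first < b.first ||
                       (a.first == b.first && a.second > b.second);
            });
    }
    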

    Orca on
  • Options
    PhyphorPhyphor Building Planet Busters Tasting FruitRegistered User regular
    edited July 2016
    Given the sets of: {0, 10}, {2, 5}, {3, 6}, {5, 8}, {7, 11}, {10, 15} (with start/end, not index/length)

    The results should be:
    {0,10} {10,15}
    {2,5} {5,8} {10,15}
    {2,5} {7,11}
    {3,6} {7,11}
    {3,6} {10, 15}


    The following produces:
    (0, 10) (10, 15)
    (2, 5) (5, 8) (10, 15)
    (2, 5) (7, 11)
    (3, 6) (7, 11)
    (3, 6) (10, 15)

    edit: Okay I fixed it I think
    edit2: added explicit sorting
    #include <cstdio>
    #include <vector>
    #include <algorithm>
    
    struct Foo
    {
        int start, end;
    
        // Intervals are half-open [start, end). For non-empty intervals this test is
        // equivalent to (start < f.end && f.start < end).
        bool overlaps(const Foo& f) const
        {
            return (start < f.end) == (f.start < end);
        }
        // Sort by start ascending; ties broken by end descending.
        bool operator < (const Foo& r) const
        {
            return start < r.start || (start == r.start && end > r.end);
        }
    };
    
    void PrintSet(const std::vector<Foo>& list)
    {
        for(auto& e: list)
            printf("(%d, %d) ", e.start, e.end);
        printf("\n");
    }
    
    // Appends to ll every maximal non-overlapping set reachable from position i,
    // given the partial set 'out' chosen so far. 'in' must already be sorted.
    void DisjointSets(std::vector<std::vector<Foo>>& ll, std::vector<Foo>& out, const std::vector<Foo>& in, size_t i)
    {
        // Skip anything that conflicts with the element we most recently committed to.
        while(!out.empty() && i < in.size() && out.back().overlaps(in[i])) i++;
        if(i == in.size()) {
            ll.emplace_back(out);
            return;
        }
    
        size_t j = i;
        int smallest_end = in[i].end;
        for(; j < in.size() && in[i].overlaps(in[j]); j++) {
            // Prune: in[j] doesn't conflict with the earlier branch that ended at
            // smallest_end, so picking it alone here couldn't give a maximal set.
            if(in[j].start >= smallest_end)
                continue;
            if(in[j].end < smallest_end)
                smallest_end = in[j].end;
            std::vector<Foo> temp = out;
            temp.push_back(in[j]);
            DisjointSets(ll, temp, in, j+1);
        }
    }
    
    std::vector<std::vector<Foo>> Get(const std::vector<Foo>& in)
    {
        std::vector<std::vector<Foo>> ret;
        std::vector<Foo> l;
        DisjointSets(ret, l, in, 0);
        return ret;
    }
    
    int main()
    {
        std::vector<Foo> list = {{0, 10}, {2, 5}, {3, 6}, {5, 8}, {7, 11}, {10, 15}};
        std::sort(list.begin(), list.end());
        for(auto& e: Get(list))
            PrintSet(e);
        return 0;
    }
    

    Phyphor on
  • Options
    PhyphorPhyphor Building Planet Busters Tasting FruitRegistered User regular
    edited July 2016
    Okay yeah complexity can get really bad with lots of overlapping members

    Phyphor on
  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    edited July 2016
    Phyphor wrote: »
    Given the sets of: {0, 10}, {2, 5}, {3, 6}, {5, 8}, {7, 11}, {10, 15} (with start/end, not index/length)

    The results should be:
    {0,10} {10,15}
    {2,5} {5,8} {10,15}
    {2,5} {7,11}
    {3,6} {7,11}
    {3,6} {10, 15}


    The following produces:
    (0, 10) (10, 15)
    (2, 5) (5, 8) (10, 15)
    (2, 5) (7, 11)
    (3, 6) (7, 11)
    (3, 6) (10, 15)


    As a minor thing, if 5 is the end element of one match and the start element of another, surely that counts as an overlapping match?

    So {2,5} {5,8} {10,15} shouldn't be there?

    Alistair Hutton on
    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    InfidelInfidel Heretic Registered User regular
    Usually you denote start/end that way with start being inclusive and end being exclusive.

    [start, end)

  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    Combined with a discussion with my wife and your description, Phyphor, I think I have got a way to think about this as a tree construction problem that should be reasonably elegant. The difficulty I had was working out where in the tree new nodes should be added, but I think I have a way.

    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    PhyphorPhyphor Building Planet Busters Tasting FruitRegistered User regular
    edited July 2016
    Oh, yeah that's the C/C++ convention where the "end" of a list is always one past the end. It has a couple of nice properties: the end index is always equal to the size for 0-based indices, the size of a subset is always the difference between two indices, and a zero-length subset doesn't require any special handling.

    It's simple to convert it to be the other way around, though.
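
    e.g. (just an illustration; Span is a throwaway name):
    // Half-open interval [start, end) -- the "one past the end" convention.
    struct Span
    {
        int start, end;
    
        int size() const { return end - start; }      // size is just the difference
        bool empty() const { return start == end; }   // zero-length needs no special case
    
        // Touching spans such as [2,5) and [5,8) do not overlap under this convention.
        bool overlaps(const Span& s) const
        {
            return start < s.end && s.start < end;
        }
    };
    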

    Phyphor on
  • Options
    Alistair HuttonAlistair Hutton Dr EdinburghRegistered User regular
    edited July 2016
    I am hubristically going to announce I think I have an algorithm that's n²ish+n

    I am too tired to implement it today but I will take a stab at it tomorrow. It's based on constructing a directed acyclic tree from the matches. It might just be a fancier way of doing Phyphor's solution but we'll see.

    Edit: I think my worst case is n³+n and my best case is n+n but I think there is scope for some serious optimisations so that my general case is solidly n² or maybe better.

    Alistair Hutton on
    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • Options
    PhyphorPhyphor Building Planet Busters Tasting FruitRegistered User regular
    edited July 2016
    Yeah that's a fairly straightforward optimization to what I wrote: instead of recalculating the common sequences like I'm doing with the recursion, a DAG effectively memoizes them.

    You can effectively split the problem into multiple pieces too: if you can find one element that is in every set, everything below it can be calculated separately from everything above.

    Phyphor on
  • Options
    InfidelInfidel Heretic Registered User regular
    Phyphor wrote: »
    Yeah that's a fairly straightforward optimization to what I wrote: instead of recalculating the common sequences like I'm doing with the recursion, a DAG effectively memoizes them.

    You can effectively split the problem into multiple pieces too: if you can find one element that is in every set, everything below it can be calculated separately from everything above.

    Yeah, the disjoint set (however you do it) partitions the problem well. If you can find any, they are completely independent pieces that can be put through your algorithm and stitched at the end.

  • Options
    ecco the dolphinecco the dolphin Registered User regular
    Hey, is there any interest in having a "Programming Thread Problem of the Month" ?

    Where I suppose we focus on the algorithm to solve the problem, and then the implementation (I'm sure some languages would be nicer than others haha).

    Penny Arcade Developers at PADev.net.
  • Options
    PhyphorPhyphor Building Planet Busters Tasting FruitRegistered User regular
    Sure, that sounds interesting

This discussion has been closed.