Alright so I've been at this job for a bit over a year, and basically my boss uses STATA and I use R
We both kind of came out of academics, hence academic-ish stats programs, that's just what we're used to.
But like... it is becoming very clear that what we're doing is not the right way to do what we do long term. We wind up with poorly built rickety scripts that are really hard to keep track of and really difficult to show other people or keep up to date effectively. Just lots and lots of individual scripts that rely on specific prepping to run.
So what I'm trying to look for is a change in language and/or approach for the both of us to leap into and try to focus on doing this stuff properly.
A task we often do:
- Take information in the form of .csv and .xls files, usually tables with some mixture of data types, saved locally
- Alter that information, table operations and reshaping and all that, making new variables out of old ones, correcting errors
- Report data from these tables/reshaped things
- Most of the data reporting winds up being done in Tableau or like just some raw numbers like "20 people blah, 40 people blahed".
But like basically I feel like the sort of thing we were doing in academia isn't coming up too often here. We're not doing complex stats all that often so much as just some straightforward math and fussing about with graphing in a separate much prettier program because we can actually buy a thing to make nice graphs. We're doing data manipulation more often than we're doing stats, because often the .csv and .xls files MUST come in exactly that layout, and we've got to fuss with them a lot before they work right for our purposes.
And I've definitely run into some issues where we were doing operations on such giant data sets that R would chug for 20-40 minutes to complete operations. There was one moment with literally millions of rows by thousands of columns where it just was not possible to perform certain operations, even after making the code as efficient as I could manage.
Is there another language that people seem to enjoy for this sort of work? I've had a bit of experience with SQL, a little C#, a bit of Python. I'm open to learning whatever, though, my boss is also eager to pick up something other than STATA scripting. Also would help to have a nice strong style guide/tutorial so we don't get into bad habits again!
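To make the Python recommendation below concrete: the csv-in, derive-a-variable, report-some-counts loop described above can be sketched with nothing but the standard library. The column names and data here are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical extract standing in for one of the local .csv files.
RAW = """id,age,visits
1,34,2
2,41,0
3,29,5
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# "Making new variables out of old ones": derive a flag per row.
for row in rows:
    row["active"] = int(row["visits"]) > 0

# "20 people blah"-style reporting: count rows by the derived flag.
report = Counter("active" if r["active"] else "inactive" for r in rows)
print(report["active"], "people were active")
```

For anything heavier (reshaping, joins, millions of rows), the usual move is pandas, which reads .csv and .xls directly and handles the reshaping in bulk rather than row by row.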
I like the Python recommendations; especially if you're the kind of person who wants to run utilities from the command line.
On the other hand, my partner is a data science graduate who did a lot of R in grad school. Since R scripts tend to be very siloed, like you mentioned, data manipulation can be difficult to keep track of. She migrated those to Rails and put them up on Heroku so that her department wouldn't have to requisition server resources from the IT department.
The big upsides are that it can be run by non-technical peers in her department, and there is always one canonical version. The big downside is that learning How2WebApp is an undertaking in and of itself. If you don't already know how and/or you don't want to do it recreationally, sticking with command line Ruby or Python is a better bet.
I think the command-line will make more sense due to the potentially confidential nature of the information we're getting. Some of it is protected health info, and I don't want to risk any sort of opening of a security hole, even though it'd be unlikely. I know for a fact at least one of my boss's scripts used identifiable information to do some of the work. If I can't absolutely guarantee the security I wouldn't want to risk it, and absolutely guaranteeing that would be a like months-long process with the way things go here.
This is really helpful, though!
Oh wow yes, if you are in HIPAA land, securing information is a whole other topic. Good call!
If nothing else, tracking your scripts with a VCS (e.g. Git) will help with tracking scripts, and you can add a little ReadMe to the repo that will explain the setup process.
Check out any of the sinatra-based frameworks like sinatra itself, nancyfx, node.js with express/loopback/etc, servicestack (not really sinatra based but it's a neat concept)
Pick a language and I can probably point you in the right direction!
I dunno,
we have a million and one ways of directly accessing the database as it stands, we do raw SQL queries from PHP and C# applications, there's some stuff using entity framework, some stuff using T-SQL through a custom ODBC driver, a shitload of SSIS packages, loads of stored procedures, the list goes on...
Really I'm not sure what would be easier, and/or better in the long run.
Do I advocate going for some kind of data access layer that sits directly on top of the databases, that everything goes through, and that operates completely independently of any of our applications? Or do we implement some standardised library that we can plug into all the apps individually, with some kind of plugin architecture that lets us switch databases relatively easily?
To me, it seems like having the data access be a service in and of itself would simplify this kind of migration in the future, and let us be provider agnostic going forward with less effort, but surely it kills performance to have all that extra cruft in the way.
Language-wise, whatever we do will likely end up being C# or C++.
I'd recommend the DAL through a RESTful service. That's what I'm doing, and I'm finding it much easier to deal with small helper utilities that might need access to data compared to the library system I had before.
With C# you can use dynamic objects and JSON to convert back and forth from applications to the server and not even need classes set up before hand. Whether that's good or not is up to you.
Nancyfx is a nice package, you should look into it, it's C# based, and the one I'm most familiar with anyways.
Basically everything would end up going through something like http://locallansystem:9999/customer/1245033 and you could pull a customer object down based on the ID you sent it. This way PHP, C#, C++, Python, JavaScript, etc. can all do it.
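To illustrate how language-agnostic that endpoint style is, here's a minimal Python consumer. The field names in the payload are made up for illustration, and a canned JSON string stands in for the live server (a real client would fetch the URL with `urllib.request.urlopen`).

```python
import json

BASE = "http://locallansystem:9999"  # the hypothetical LAN service above

def customer_url(customer_id):
    return f"{BASE}/customer/{customer_id}"

# In a real client: body = urllib.request.urlopen(customer_url(1245033)).read()
# Here a canned response stands in for the server (field names invented):
body = '{"id": 1245033, "name": "A. Customer"}'

customer = json.loads(body)
print(customer["id"], customer["name"])
```

Any language with an HTTP client and a JSON parser gets the same object, which is the whole appeal of putting the DAL behind a RESTful service.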
I have a list of Match objects that have an index and a length field
Match
{
index = 5,
length = 10
}
Where index is where they start the match in an arbitrary string and length is the length of the match. So a Match(3,5) means starting at character 3 in the string there is a 5 character long match.
So, given the list, I want a function that returns a list of all Match groups where every individual Match does not overlap with another Match. Crucially, though, I don't want to see
[Match(6,1), Match(50,10), Match(200,10)]
or
[Match(50,10), Match(200,10)]
or
[Match(200,10)]
as one of the returned groups. The groups have to be the largest set of non-overlapping Matches possible.
The solution to this is swirling around on the edge of my ability to comprehend it. It seems like it should be a really simple classical algorithmic problem, but every time I think I am close I discover an edge case that ruins my approach.
A naive solution would be O(n!): trying all permutations in order and returning the non-overlapping portion of each run.
A quick google for disjoint set (which is what you want to turn your data into) turns up the appropriately named disjoint-set data structure, which runs in a much more reasonable O(n log(n)).
edit: disjoint-set would only return *one* of the disjoint sets...problematic. I'll have some coffee and think about it, but getting reasonable time/space complexity might be interesting...
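The disjoint-set (union-find) structure mentioned there is small enough to sketch. A minimal Python version with path compression, purely for illustration; as the edit above notes, on its own it only yields the overlap partition, not all the answer sets:

```python
def find(parent, x):
    # Walk to the root, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    # Merge the sets containing a and b.
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

# Five elements, each initially its own set.
parent = list(range(5))
union(parent, 0, 1)
union(parent, 3, 4)
assert find(parent, 1) == find(parent, 0)
assert find(parent, 2) != find(parent, 4)
```

With union-by-rank added, the amortized cost per operation is nearly constant, which is where the n log(n)-ish overall bound comes from.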
Generating the possible lists is not too bad. Assuming you're good there and just worried about the largest sets part, you can generate the full list first and then test for each element in that list (the set of matches) that it is not the proper subset of any other, if so remove it.
Performance is a thing I like to look at after I have a functioning and correct algorithm, because usually then I can see the potential optimizations (the code patterns tend to be common even if the problem was novel).
perhaps start with a disjoint set. Any elements of the set with more than one entry need to have one element removed and disjoint-set run on those elements. I'm going to use starting and ending characters, because it's easier for me to think of it that way.
e.g., let's say there's a set consisting of [(0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]. We run disjoint set on the input and get:
[[(0,3)],
[(5, 10), (5, 7), (8, 10)],
[(20, 30)]]
The first and last sets consist of only one element; they'll be part of every single reported list. The middle set is more complicated, consisting of 3 elements. What if we could make that set disjoint? We could recursively remove one element and then run disjoint-set on the remainder. The problem is that the order in which we remove elements from consideration is important, which means we're permuting the order in which we remove elements, putting us squarely in O(n!) territory again--but with the right data sets you're at least lowering the size of your n.
This is a surprisingly interesting problem for a Saturday morning.
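Because the matches are intervals on a line, that first disjoint-set pass can be done with a simple sort-and-sweep instead of a general union-find. A sketch using the (start, end) pairs with inclusive ends, as in the example above:

```python
def overlap_components(intervals):
    # Group intervals into connected overlap components:
    # after sorting by start, an interval joins the current component
    # if it begins at or before the furthest end seen so far.
    components, current, current_end = [], [], None
    for iv in sorted(intervals):
        if current and iv[0] <= current_end:
            current.append(iv)
            current_end = max(current_end, iv[1])
        else:
            if current:
                components.append(current)
            current, current_end = [iv], iv[1]
    if current:
        components.append(current)
    return components

print(overlap_components([(0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]))
# The middle component groups (5, 7), (5, 10) and (8, 10) together,
# matching the hand-worked partition above.
```

Note that (5, 7) and (8, 10) don't overlap each other directly; they land in one component because both overlap (5, 10), which is exactly why the component still needs recursive treatment.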
Yeah, the disjoint set would let you remove the trivial cases, like the omni-present Match(200, 10) in the example, since the real solutions are the same as the answer for the first four elements only, with the last tacked onto each solution.
Yeah I initially thought this was a really easy trivial-to-solve-with-recursion problem but it is proving meatier than I thought.
To be honest I'm tempted to say that in my real world application of this the total size of n will never be that large so I could just go for the - generate all combinations of all sizes, filter the combinations to remove overlaps and discard all sequences that are strict subsets approach.
I just can't help but think there is a perfect solution for this that I am missing. Also, if I ever hit some pathological input my solution will blow up fast.
It's the requirement for all disjoint sets that makes the problem explode. One partition isn't hard.
Wait, wait, wait.
I think this might be a time for dynamic programming.
Maybe.
I can smell an n² solution.
Oh no, wait that's just dinner I'm cooking.
I was thinking of that, but the ordering dependency seems to me like it makes the memoization have the same complexity. I'd love to be proven wrong though!
Other thought, can I formulate this as a graph traversal/spanning/search problem?
POST DOODLING EDIT: I am smelling gold when it comes to thinking about this as graph traversal - this might just end up as a round about way of doing disjoint subsets but I'm feeling it, hopefully I'll be able to find some time to code this tonight.
POST POST DOODLING EDIT: No, I am smelling anti gold. Guh, horrible. The key thing here is to get a comprehensive set of test cases so I don't keep going down blind alleys.
Yeah, my attempts to construct a suitable graph have so far either ended up having O(n!) space complexity (on top of the implied O(n!) to construct it), which is not helpful, or reduced to the same O(n!) runtime complexity as the naive solution.
I think I will implement my naive solution to prove that I can make something work then I'll take a stab at a dynamic programming thingy.
I'm convinced that dynamic programming must be a winner because this looks like a project Euler style problem and the answer to those is always dynamic programming.
The ordering in which you remove entries is constrained though, if you sort the list by starting index, you can do something like:
Pick first element, add to list
If overlaps, recurse with a copy with this element removed to produce other sets
Skip any overlapping elements and continue on
Because you know that you don't have to consider any elements with a lower starting index than the current cursor this should limit it
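A rough Python rendering of that recursion, under the assumption that a Match is just an (index, length) pair; the subset filter at the end mirrors the "discard all sequences that are strict subsets" step discussed earlier. A sketch, not a tuned implementation:

```python
def maximal_sets(matches):
    # matches are (index, length) pairs; end = index + length (exclusive).
    results = set()

    def rec(remaining, chosen):
        if not remaining:
            results.add(tuple(chosen))
            return
        first = remaining[0]
        end = first[0] + first[1]
        # Take the first element, skipping everything it overlaps...
        rec([m for m in remaining[1:] if m[0] >= end], chosen + [first])
        # ...and if it overlapped anything, also recurse without it.
        if any(m[0] < end for m in remaining[1:]):
            rec(remaining[1:], chosen)

    rec(sorted(matches), [])
    # Safety net: drop any result that is a strict subset of another.
    return sorted(r for r in results
                  if not any(set(r) < set(o) for o in results))

print(maximal_sets([(50, 10), (55, 3), (200, 10)]))
```

On that input the two maximal groups each keep (200, 10) and pick one of the two overlapping matches, which is exactly the behaviour the original question asks for.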
Hm. Sort by starting index, and items with the same starting index reverse sort by the ending index? Then it's fully constrained.
More than one set came back, and at least one of them has size greater than one; run again on the set of size > 1:
[(5, 10), (5, 7), (8, 10)]
Remove first element and run disjoint set:
[
[(5, 7)],
[(8, 10)]
]
sets are size 1, return...
This looks extremely promising!
So let's see. O(n log(n)) for the sort (the two-stage sort could be combined by a sort function that converts the two variables into a single value, e.g. start_index * 1e7 - end_index). O(n log(n)) for each disjoint-set operation. In the worst case, that might be done n times. So O(n log(n) + n² log(n)) → O(n² log(n)). Not a bad reduction from O(n!), assuming this covers all inputs!
Combined with a discussion with my wife and your description, Phyphor, I think I have a way to think about this as a tree construction problem that should be reasonably elegant. The difficulty I had was working out where in the tree new nodes should be added, but I think I have a way.
Oh, yeah that's the C/C++ convention where the "end" of a list is always one-past-the-end. It has a couple nice properties, end index is always equal to the size for 0-based indices, size of a subset is always the difference between two indices and a zero-length subset doesn't require any special handling
It's simple to convert it be the other way around though
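Python slices follow the same one-past-the-end convention, so the nice properties mentioned above are easy to demonstrate:

```python
s = "abcdefghij"
start, length = 3, 5
end = start + length            # one-past-the-end

assert s[start:end] == "defgh"  # the 5-character match starting at index 3
assert end - start == length    # size is always the difference of indices
assert s[7:7] == ""             # zero-length subset needs no special handling
```

Converting to inclusive ends is just `end - 1` on the way out, which is why it's cheap to support either convention at the boundary.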
I am hubristically going to announce I think I have an algorithm that's n²ish+n
I am too tired to implement it today but I will take a stab at it tomorrow. It's based on constructing a directed acyclic graph (DAG) from the matches. It might just be a fancier way of doing Phyphor's solution but we'll see.
Edit: I think my worst case is n³+n and my best case is n+n but I think there is scope for some serious optimisations so that my general case is solidly n² or maybe better.
Yeah that's a fairly straightforward optimization to what I wrote, instead of recalculating the common sequences like I'm doing with the recursion a DAG effectively memoizes them
You can effectively split the problem into multiple pieces too if you can find one element that is in every set, everything below can be calculated separately from everything above
Yeah, the disjoint set (however you do it) partitions the problem well. If you can find any, they are completely independent pieces that can be put through your algorithm and stitched at the end.
recruiter/hr guy called me up just now
'you're really impressive with your React knowledge. But we really need someone that knows javascript, css and html'
....right okay buddy, so you have no idea what you are talking about, that's cool.
Well you dodged a bullet.
Doing some due diligence by reading up on best practises.
Farrrrrr out now I have to protect against XSS and CSRF attacks for my authentication to have any meaning.
Edit: I mean, yes, they should be done as a matter of course, but man. Security and doing things properly.
I was just going to throw stuff together as a proof of concept, but now I think I need to do it properly.
Edit 2: Woah, I think that NancyFx's simple view engine can generate anti forgery tokens for me in simple HTML files. Yessssss
I have a list of Match objects that have a index and length field
Where index is where they start the match in an arbitrary string and length is the length of the match. So a Match(3,5) means starting at character 3 in the string there is a 5 character long match.
So given the list I want a function that returns
So a list of all Match groups where every indivdual Match does not overlap with another Match. Crucially though I don't wat to see
as one of the returned groups. The groups have to be the largest set of non-overlapping Matches possible.
The solution to this is swirling around on the edge of my ability to comprehend it. It seems like it should be a really simple classical algorithmic problem but every time I think I am close I discover and edge case that ruins my approach.
I made a game, it has penguins in it. It's pay what you like on Gumroad.
Currently Ebaying Nothing at all but I might do in the future.
A quick google for disjoint set (which is what you want to turn your data into) turns up the appropriately named disjoint-set data structure, runs in a much more reasonable nlog(n).
edit: disjoint-set would only return *one* of the disjoint sets...problematic. I'll have some coffee and think about it, but getting reasonable time/space complexity might be interesting...
Generating the possible lists is not too bad. Assuming you're good there and just worried about the largest sets part, you can generate the full list first and then test for each element in that list (the set of matches) that it is not the proper subset of any other, if so remove it.
Performance is a thing I like to look at after I have a functioning and correct algorithm, because usually then I can see the potential optimizations (the code patterns tend to be common even if the problem was novel).
perhaps start with a disjoint set. Any elements of the set with more than one entry need to have one element removed and disjoint-set run on those elements. I'm going to use starting and ending characters, because it's easier for me to think of it that way.
e.x., let's say there's a a set consisting of [(0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]. We run disjoint set on the input and get:
[[(0,3)],
[(5, 10), (5, 7), (8, 10)],
[(20, 30)]]
The first and last sets consist of only one element; they'll be part of every single reported list. The middle set is more complicated, consisting of 3 elements. What if we could make that set disjoint? We could recursively remove one element and then run disjoint-set on the remainder. The problem is that the order in which we remove elements from consideration is important, which means we're permuting the order in which we remove elements, putting us squarely in O(n!) territory again--but with the right data sets you're at least lowering the size of your n.
This is a surprisingly interesting problem for a Saturday morning.
Yeah I initially thought this was a really easy trivial-to-solve-with-recursion problem but it is proving meatier than I thought.
To be honest I'm tempted to say that in my real world application of this the total size of n will never be that large so I could just go for the - generate all combinations of all sizes, filter the combinations to remove overlaps and discard all sequences that are strict subsets approach.
I just can't help but think there is a perfect solution for this that I am missing. Also, if I very hit some pathological input my solution will blow up fast.
I made a game, it has penguins in it. It's pay what you like on Gumroad.
Currently Ebaying Nothing at all but I might do in the future.
It's the requirement for all disjoint sets that makes the problem explode. One partition isn't hard.
Wait, wait, wait.
I think this might be a time for dynamic programming.
Maybe.
I can smell an n² solution.
Oh no, wait that's just dinner I'm cooking.
I was thinking of that, but the ordering dependency seems to me like it makes the memoization have the same complexity. I'd love to be proven wrong though!
POST DOODLING EDIT: I am smelling gold when it comes to thinking about this as graph traversal - this might just end up as a roundabout way of doing disjoint subsets, but I'm feeling it; hopefully I'll be able to find some time to code this tonight.
POST POST DOODLING EDIT: No, I am smelling anti gold. Guh, horrible. The key thing here is to get a comprehensive set of test cases so I don't keep going down blind alleys.
Yeah, my attempts to construct a suitable graph have so far either ended up having O(n!) space complexity (on top of the implied O(n!) to construct it), which is not helpful, or reduced to the same O(n!) runtime complexity as the naive solution.
I'm convinced that dynamic programming must be a winner, because this looks like a Project Euler-style problem and the answer to those is always dynamic programming.
Only for the first half of the problems.
The ordering in which you remove entries is constrained, though. If you sort the list by starting index, you can do something like:
Pick the first element, add it to the list
If it overlaps, recurse with a copy that has this element removed, to produce the other sets
Skip any overlapping elements and continue on
Because you know you don't have to consider any elements with a lower starting index than the current cursor, this should limit it
Hm. Sort by starting index, and items with the same starting index reverse sort by the ending index? Then it's fully constrained.
So:
[(0,100), (0, 3), (5, 10), (5, 7), (8, 10), (20, 30)] ->
[(0, 100), (0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]
disjoint set:
[
[(0, 100), (0, 3), (5, 10), (5, 7), (8, 10), (20, 30)]
] (well, that wasn't helpful)
Remove first element of disjoint set, recursively run disjoint set on remainder:
[
[(0, 3)],
[(5, 10), (5, 7), (8, 10)],
[(20, 30)]
]
More than one element returned, and the size of at least one set is greater than one; run on the set of size > 1:
[(5, 10), (5, 7), (8, 10)]
Remove first element and run disjoint set:
[
[(5, 7)],
[(8, 10)]
]
sets are size 1, return...
This looks extremely promising!
The results should be:
{0,10} {10,15}
{2,5} {5,8} {10,15}
{2,5} {7,11}
{3,6} {7,11}
{3,6} {10, 15}
The following produces:
(0, 10) (10, 15)
(2, 5) (5, 8) (10, 15)
(2, 5) (7, 11)
(3, 6) (7, 11)
(3, 6) (10, 15)
edit: Okay I fixed it I think
edit2: added explicit sorting
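The implementation itself isn't quoted above, but a sketch along the lines Phyphor described (sort, take-or-skip the first element, prune, then drop strict subsets; all names mine, assuming half-open intervals) reproduces the listed output:

```python
def overlaps(a, b):
    # half-open [start, end) interval overlap test
    return a[0] < b[1] and b[0] < a[1]

def all_maximal(intervals):
    # sort by start; ties reverse-sorted by end, as suggested above
    ivs = sorted(set(intervals), key=lambda iv: (iv[0], -iv[1]))
    found = set()

    def rec(remaining, chosen):
        if not remaining:
            found.add(frozenset(chosen))
            return
        head, rest = remaining[0], remaining[1:]
        # take head: drop everything overlapping it (sorted by start, so
        # the overlappers are exactly those starting before head ends)
        rec([iv for iv in rest if iv[0] >= head[1]], chosen + [head])
        # skip head: only worth exploring if something later overlaps it,
        # otherwise any set without head could be extended by head
        if any(overlaps(head, iv) for iv in rest):
            rec(rest, chosen)

    rec(ivs, [])
    # the skip branch can still produce non-maximal sets; drop strict subsets
    return [sorted(s) for s in found if not any(s < t for t in found)]
```

The final subset filter is doing real work here: the skip branch over-generates, and I don't see an obvious way to prune it fully during the recursion.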
As a minor thing, if 5 is both an end and a start element, surely that counts as an overlapping match?
So {2,5} {5,8} {10,15} shouldn't be there?
[start, end)
It's simple to convert it to be the other way around, though.
I am too tired to implement it today but I will take a stab at it tomorrow. It's based on constructing a directed acyclic graph from the matches. It might just be a fancier way of doing Phyphor's solution, but we'll see.
Edit: I think my worst case is n³+n and my best case is n+n but I think there is scope for some serious optimisations so that my general case is solidly n² or maybe better.
You can effectively split the problem into multiple pieces too: if you can find one element that is in every set, everything below it can be calculated separately from everything above it.
Yeah, the disjoint set (however you do it) partitions the problem well. If you can find any, they are completely independent pieces that can be put through your algorithm and stitched at the end.
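The stitching itself is just a cartesian product over the per-component answers. A sketch using the [(0, 3), (5, 10), ...] example from earlier (the per-component answers are hand-written here; variable names are mine):

```python
from itertools import product

# maximal non-overlapping subsets, computed independently per component
per_component = [
    [[(0, 3)]],                      # single-element component: one answer
    [[(5, 7), (8, 10)], [(5, 10)]],  # the size-3 component has two answers
    [[(20, 30)]],
]

# one full answer per way of picking one partial answer from each component
stitched = [sum(choice, []) for choice in product(*per_component)]
```

Since the components are independent, the total count is just the product of the per-component counts.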
Where I suppose we focus on the algorithm to solve the problem, and then the implementation (I'm sure some languages would be nicer than others haha).