So I have a system that was set up poorly. The guy who ran it basically did whatever he wanted to it, piecemeal, then took off without leaving any real information about anything he used to do there. There is virtually no documentation to speak of about the system, its setup, what anything on it does, or why certain things are set up or done the way they are.
Most of that has nothing to do with me, but I do have a process I need to run (right now I'm still finding out what exactly it is they want).
So what I need to do is search a given number of files generated multiple times daily for various information.
I would use a recursive search, but it seems to only work for file types like *.log. Unfortunately, the way they have it set up, they append the date to the end... after the .log, as opposed to prefixing the file with the date/time the way you normally would. Grep simply refuses to do anything when I try --include with *fileIdentifiersIneed* on a recursive search.
So essentially here is what I need to do, and I'd love to figure out a better way to do this:
Search a subset of files within the current directory whose names match a string, e.g. *2015-08-14-*.
Within those files, search for the data between two markers, e.g. <Date> as the start value and </location> as the end value. I need everything in between.
Dump everything found to a single file in my directory.
I can then search that much smaller single file for whatever subsets of information I need.
Right now I can get around it by copying the files to a different directory (and since the cp command has the same limitation, I have to copy the files manually) and then doing a normal recursive search there. But I'd like to cut down the steps and make this as painless as possible.
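Spelled out, the current workaround is roughly this (directory and file names illustrative):

$ mkdir ~/worklogs
$ cp name.log2015-08-14-0600 ~/worklogs/   # repeated by hand for every file
$ cp name.log2015-08-14-1200 ~/worklogs/
$ grep -ri "whatever" ~/worklogs/ > ~/matches.txt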
And also, as an addendum: this is a RHEL6 system, I cannot install anything on it, and I need to be as minimally intrusive as possible in anything I do. So the files themselves can't be touched or changed by me. I can copy them to manipulate them if need be, but I can't change anything about them or how they are created. That's one thing I'll be discussing with the people using it whenever they get back to me.
Your stuff sounds complicated enough that I would do it in Python, but if you can't install Python, a bash script would be a pain.
I've never really played around with shell scripting, but you should be able to get something working with either find or grep -r to find the files, cp to copy them, grep -n (I think -n is what gives you line numbers) to find the right lines, head and tail to get rid of the lines you don't care about, and then maybe sed/awk/cut to strip out the data you don't care about in the lines you do care about?
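Something like this, maybe (untested; the file name and line numbers are just placeholders):

$ find . -name "*2015-08-14-*"                                # find the day's files
$ cp name.log2015-08-14-0600 ~/worklogs/                      # copy one of them over
$ grep -n "<Date>" ~/worklogs/name.log2015-08-14-0600         # -n gives the line number of each match
$ head -n 11 ~/worklogs/name.log2015-08-14-0600 | tail -n 7   # then slice out lines 5-11, say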
I already copy the files over, but as I mentioned, the grep and cp commands just don't work on multiple specified files here due to the way they have (poorly) named them. I have to manually specify each and every file or it simply refuses to do anything.
Kind of a bummer of a limitation in grep, at least in the version on this system. Not sure if there's an updated version that would let me do what I need.
Of course, if they had simply named their files in a saner way it wouldn't be an issue either. I could just point --include at the specific files needed and call it a day.
I can work around it for now at least. It's just a pretty abysmally annoying process to go through.
$ find . -name "*2015-08-14-*" -exec grep -Hi "<Date>.*</location>" \{\} \;
sounds like what you want. Find should be installed by default.
Edit: Are you expecting <Date> and </location> (in your example) to be on the same line, or on different lines? If they're on different lines, it gets more complicated. Maybe something like
$ find . -name "*2015-08-14-*" -exec sed -n '/<Date>/,/<\/location>/ p' \{\} \;
although my sed's a little rusty, so I'm not sure if that will work.
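If you also need to know which file each block came from (sed won't print file names on its own), one way is to wrap it in a small shell call:

$ find . -name "*2015-08-14-*" -exec sh -c 'echo "== $1 =="; sed -n "/<Date>/,/<\/location>/p" "$1"' sh {} \;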
Can you give an example of what you mean by "the grep and cp commands just don't work on multiple specified files here due to the way they have (poorly) set them up"? Which version of grep are you using (i.e. what does grep -V report)? RHEL ships GNU grep, which is pretty fully featured.
As for the questions:
They are on different lines. I need everything between them, ideally. Right now I'm just looking for a match and taking it plus the 6 following lines. A bit sloppy, but it works. The files are all fairly uniform in structure, so there's only ever 1-2 lines of variation I need to account for (hence the 6 lines as opposed to something like 4).
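In other words, per file it's just grep's trailing-context option (file name illustrative):

$ grep -A 6 "<Date>" name.log2015-08-14-0600   # the matching line plus the 6 lines after it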
As for the commands:
Grep only seems to want to function if you can do an --include with files like date.txt, something.txt, etc., where you can then specify that you want to search all *.txt files, or recursively search within *something.txt files.
But these files are not .anything files. They tack the date/time onto the end, so they wind up as name.logdatetime files, and the file "type" is different for every single file in the directory. So to cut out the ones I need, the string I match on has to come from right in the middle of the file name.
Grep just doesn't seem to do a thing when I try that.
An example would be -r --include with *-year-month-day-*, where that year-month-day has to match the middle of the file name.
The cp command does the same thing. When I specify it with the ~/directory and *year-month-time* it simply tells me the file doesn't exist.
In both cases, if I narrow it down to a single specified file instead of multiples, it works fine. But that's far too annoying to deal with potentially every single week, or possibly even every single day.
E.g. it should be grep -R "<Date>" --include "*-year-month-day-*" * and not grep -R <Date> --include *-year-month-day-* *, since in the latter case the shell expands the unquoted glob (and treats the < and > as redirections) before grep ever sees them, which isn't what you want.
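A quick way to see the difference is to have the shell show what grep actually receives:

$ set -x
$ grep -R "<Date>" --include "*-year-month-day-*" .   # quoted: grep gets the glob intact and applies it itself
$ grep -R "<Date>" --include *-year-month-day-* .     # unquoted: the shell expands the glob first if anything matches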
Specifically, in a regular expression, '.' represents the wildcard character and '*' is a quantifier (0 or more of the preceding element).
So '*2015-08-01-*' as a regular expression is invalid - the first asterisk quantifies nothing, and the second asterisk quantifies the '-'. To match something like service.log.15-08-01 for the month of August you would use a pattern like '.*log\.15-08-[0-9][0-9]' (grep's default POSIX regexes don't understand \d; you'd need grep -P for that).
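For example, testing the pattern against the directory listing:

$ ls | grep '.*log\.15-08-[0-9][0-9]'   # every August log in the directory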
I think you're definitely going to want to do it in stages - as you are doing, first, grab all of the logs you want, then do the searching.
Are the logs themselves XML? If so, you really don't want to be using regular expressions; they aren't a great tool for that job. Even with a known namespace and regular data - which means you can probably make the regular expressions grab what you want - you aren't going to enjoy the experience. Can you copy the logs off to a machine that you can have more control over?
If you can't do that then you're probably going to want to use awk to do the actual file processing/contents searching.
Edit:
I see what I missed. My apologies.
So what I would suggest is just doing an ls, grepping the output to grab what you want, copying those files to the right place, and then using awk to find the stuff you want - that way you don't need to worry about --include at all. You could also wrap it in a bash for loop if you wanted to do it that way:
for a in $(ls | grep '<pattern>') ; do awk '<something>' "$a" >> outfile.txt ; done
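Filled in with the example values from this thread, and using the shell's own glob rather than parsing ls (which chokes on unusual file names), a sketch might look like:

# pull every <Date> ... </location> block out of the day's files
for a in *2015-08-14-* ; do
    awk '/<Date>/,/<\/location>/' "$a" >> ~/matches-2015-08-14.txt
done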
Can you have more than one <Date> ... </location> pair in each file?
Or will the content within each file always be the same?
The files appear to be plaintext log files, not XML; it's just pages and pages of data all dumped in plaintext with various fields attached.
The structure is more like a poorly coded plain HTML page than anything else, but at least the data I'm interested in is always located between the same sections. So grabbing around, above, or below that data is good enough for what I need.
At some point we're looking to move to a Splunk deployment, and this will all be moot, but I have no idea how long it will be before then, or how long it will take to get everyone to actually configure their systems to log what they need once we do.
But within each file, yes there will likely be several matches, all of them need to be returned.
Otherwise it will match the first Date tag with the last location tag and everything in between?
I don't know whether grep does multiline matching, or if I'm thinking of a different tool.
Non-greedy. I'm not sure you can do non-greedy matches with grep though.
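That said, if the grep on that box was built with PCRE support, -P does give you non-greedy quantifiers, though only within a single line:

$ grep -Po '<Date>.*?</location>' somefile   # shortest match per line; -o prints only the matched part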
I'm pretty sure grep doesn't do multiline matching, which is why I suggested sed as an alternative.
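For what it's worth, sed's /start/,/end/ range closes at each end marker and reopens at the next start marker, so multiple pairs per file come out fine. A quick check:

$ printf '%s\n' junk '<Date>' a '</location>' junk '<Date>' b '</location>' \
    | sed -n '/<Date>/,/<\/location>/p'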
For that matter, one could sidestep much of the issue and write a small Perl or Python program to do both the matching and the file searching.
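For instance, perl (which should already be on a RHEL6 box) can do both in one line with its range operator - a sketch using the filename pattern from earlier:

$ perl -ne 'print if /<Date>/ .. /<\/location>/' *2015-08-14-* > ~/matches.txt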
Yum was removed along with plenty of other items, and other parts were just carved away or disabled in a haphazard fashion.
Updates were done manually, package by package, as they were told to update them. So this system, and others like it, are some beastly mix of things that haven't been updated in 3-6 years and stuff that was updated just earlier this month.
I can't do anything about any of that. Ultimately I may just say screw it and dump the logs out to my local system and work from there.