
fun with grep

EclecticGroove Registered User regular
edited August 2015 in Help / Advice Forum
So I have a system that is set up poorly. The guy who ran it basically did whatever on it piecemeal, took off without leaving any real information about what he used to do there, and there's virtually no documentation to speak of about the system, its setup, what anything on it does, or why certain things are set up or done the way they are.

Most of that has nothing to do with me, but I do have a process that I need to do (currently, I'm seeing what it is they actually want).

So what I need to do is search a given number of files generated multiple times daily for various information.
I would use a recursive search, but it seems to only work for file types like *.log. And unfortunately, the way they have it set up they append the date to the end... after the .log, rather than prefixing the file with the date/time as is normally done. grep simply refuses to do anything when I try to use --include with a *fileIdentifiersIneed* pattern in the recursive search.

So essentially here is what I need to do, and I'd love to figure out a better way to do this:

Search a subset of files within the current directory whose names match a string, e.g. *2015-08-14-*.
Within those files, search for data between two markers, e.g. <Date> as the start value and </location> as the end value. I need everything in between those.

It then dumps everything found by that search to a single file in my directory.

I can then search that much smaller single file for any subsets of information I need.


Right now I can get around it by copying the files to a different directory (the cp command has the same limitation, so I have to copy the files manually) and then doing a normal recursive search there. But I'd like to cut down the steps and make this as painless as possible.
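
Roughly, the workaround looks like this (the paths and file names here are just made up for illustration):

    mkdir ~/logtmp
    cp /data/logs/name.log2015-08-14-0600 ~/logtmp/
    cp /data/logs/name.log2015-08-14-1200 ~/logtmp/
    grep -ri '<Date>' ~/logtmp/ > ~/results.txt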


Also, as an addendum: this is a RHEL6 system, I cannot install anything on it, and I need to be as minimally intrusive as possible in anything I do. The files themselves can't be touched or changed by me. I could copy them to manipulate them if need be, but I can't change anything about them or how they are created. That's one thing I'll be discussing with the people using it whenever they get back to me.

EclecticGroove on

Posts

  • Powerpuppies drinking coffee in the mountain cabin Registered User regular
    tl;dr, but it sounds like you need a find | xargs cp or a find | xargs grep
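
    something like this, say (the date pattern and destination directory are just placeholders):

    find . -name '*2015-08-14*' | xargs grep -i '<Date>'
    find . -name '*2015-08-14*' | xargs -I{} cp {} ~/worklogs/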

    your stuff sounds complicated enough that i would do it in python, but if you can't install python, a bash script would be a pain

    i've never really played around in shell scripting, but you should be able to get something with either find or grep -r to find the files, cp to copy them, grep -n (I think -n is what gives you line numbers) to find the right lines, head and tail to get rid of the lines you don't care about, and then maybe sed/awk/cut to get rid of the data you don't care about in the lines you do care about?
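
    e.g., once the files are copied somewhere, maybe something like this (the tag, line counts, and file names are placeholders):

    grep -n '</location>' ~/worklogs/* > ~/matches.txt   # -n prefixes each hit with file:line:
    head -20 ~/matches.txt | tail -5                     # trim to just the lines you care about
    cut -d: -f3- ~/matches.txt                           # strip the file:line: prefix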

  • EclecticGroove Registered User regular
    Yeah, I am very limited in what I can do on the system as I don't own it, and they are super touchy about anything being installed on it, no matter how low- or no-impact it would be.

    I already copy the files over, but as I mentioned, the grep and cp commands just don't work on multiple specified files here due to the way they have (poorly) set them up. I have to manually specify each and every file or it simply refuses to function or do anything.
    Kind of a bummer limitation of grep, at least the version on this system anyway. Not sure if there are any newer versions that would let me do what I need.

    Of course if they simply named their files in a better way it wouldn't be an issue either. I could just do an --include to the specific files needed and call it a day.
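
    Something like this, say, if the names actually ended in .log (output file name made up):

    grep -ri '<Date>' --include '*.log' . > ~/results.txt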

    I can work around it for now at least. It's just a pretty abysmally annoying process to go through.

  • Papillon Registered User regular
    edited August 2015
    Something like

    $ find . -name "*2015-08-14-*" -exec grep -Hi "<Date>.*</location>" \{\} \;

    sounds like what you want. Find should be installed by default.

    Edit: Are you expecting <Date> and </location> (in your example) to be on the same line, or on different lines? If they're on different lines, it gets more complicated. Maybe something like

    $ find . -name "*2015-08-14-*" -exec sed -n '/<Date>/,/<\/location>/ p' \{\} \;

    although my sed's a little rusty, so I'm not sure if that will work.

    Can you give an example of what you mean by "the grep and cp commands just don't work on multiple specified files here due to the way they have (poorly) set them up"? Which version of grep are you using (i.e. grep -V)? RHEL should have GNU grep, which is pretty fully featured.

    Papillon on
  • EclecticGroove Registered User regular
    I'd have to be back at work to check the exact version.

    As for the questions:
    They are on different lines. I need everything between them, ideally. Right now I'm just looking for a match and taking it plus the 6 preceding lines. A bit sloppy, but it works. The files are all fairly uniform in terms of structure, so there's only ever 1-2 lines more than any other file that I need to account for (hence the 6 lines as opposed to something like 4).
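
    Roughly like so, assuming the match is on the closing tag so the lines I want come before it (the file name here is made up):

    grep -B 6 '</location>' name.log2015-08-14-0600 >> ~/results.txt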

    As for the commands.
    grep only seems to want to work if you can do an --include with names like date.txt, something.txt, etc.
    That way you can specify that you want to search all *.txt files, or recursively search within *something.txt files.

    But the way these files are named, they are not .anything files. They tack the date/time onto the end, so they wind up being name.logdatetime files. So the file "type" is different for every single file in the directory, and in order to cut out the ones I need, the string I have to match sits right in the middle of the file name.
    grep just doesn't seem to do a thing when I try it.
    An example would be -r --include with *-year-month-day-*, where I need that year-month-day to match the middle of the file name.

    The cp command does the same thing. When I give it the ~/directory and a *year-month-time* pattern, it simply tells me the file doesn't exist.

    In both cases, if I narrow it down to a single specified file instead of multiple files it does fine. But that's far too annoying to want to deal with potentially every single week, or possibly even every single day.

  • Powerpuppies drinking coffee in the mountain cabin Registered User regular
    Use the find command to get around your --include problem.
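
    e.g. something along these lines (the pattern and destination directory are placeholders):

    find . -name '*2015-08-14*' -exec cp {} ~/worklogs/ \;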

  • Papillon Registered User regular
    Also remember to quote arguments if you don't want the shell to do file globbing.

    E.g. it should be grep -R "<Date>" --include "*-year-month-day-*" *, not grep -R "<Date>" --include *-year-month-day-* *, as in the latter case the shell will expand the *-year-month-day-* before grep ever sees it, which isn't what you want.

  • Apothe0sis Have you ever questioned the nature of your reality? Registered User regular
    edited August 2015
    I am late to this party, but every time people have mentioned grep, the pattern they have used has omitted at least one wildcard, and the OP uses the pattern *fileidentifiersineed*

    Specifically, in a regular expression, '.' represents the wildcard character and '*' is a quantifier (0 or more)

    So '*2015-08-01-*' as a regular expression is invalid - the first asterisk is quantifying nothing, and the second asterisk is quantifying '-'. To match something like service.log.15-08-01 for the month of August you would use a pattern like '.*log\.15-08-[0-9][0-9]' (\d isn't supported by plain grep without -P)

    I think you're definitely going to want to do it in stages - as you are doing, first, grab all of the logs you want, then do the searching.

    Are the logs themselves XML? If so, you really don't want to be using regular expressions, as they aren't a great tool for the job (even with a known namespace and regular data, where you can probably make the regexes grab what you want, you aren't going to enjoy the experience). Can you copy the logs off to a machine that you have more control over?

    If you can't do that then you're probably going to want to use awk to do the actual file processing/contents searching.
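
    awk's range patterns will print everything between the two markers, for every occurrence in the file, e.g. something like (file name made up):

    awk '/<Date>/,/<\/location>/' name.log2015-08-14-0600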

    Edit:

    I see what I missed. My apologies.

    So what i would suggest is just doing an ls, grepping the output to grab what you want, copying those files to the right place and then using awk to find the stuff you want - make it so you don't need to worry about --include. You could also wrap that into a for loop in bash if you wanted to do it that way

    for a in $(ls | grep '<pattern>') ; do awk '<something>' "$a" >> outfile.txt ; done

    Can you have more than one <Date>...</location> pair in each file?

    Apothe0sis on
  • Apothe0sis Have you ever questioned the nature of your reality? Registered User regular
    Also, does grep support the -- argument to ignore all subsequent -'s? Shouldn't matter, I guess, if you don't rely on shell expansion to choose the files.

  • EclecticGroove Registered User regular
    The date pattern for file names will always be the same for any given search. It would either be for the current day, or the current month. So it would always be 2015-8- or 2015-8-20 as examples.
    Then the content within each file will be the same.
    The files appear to be plaintext log files, no XML; it's just pages and pages of data dumped as plaintext with various fields attached.

    The structure is more like a poorly coded plain HTML page than anything, but the data I'm interested in is always located between the same sections of data at least. So grabbing around, above, or below that data is good enough for what I need.

    At some point we're looking to move to a Splunk deployment and this will all be moot, but I have no idea how long it will be before then, and how long it will take to get everyone to actually configure their systems to log what they need to it when we do.

  • BlazeFire Registered User regular
    I'm sure you've considered it but is it possible you're going to have multiple data sections in a single file? As in, multiple <Date>...</location> sections.

  • EclecticGroove Registered User regular
    There will only be a limited number of files created each day.
    But within each file, yes, there will likely be several matches, and all of them need to be returned.

  • Apothe0sis Have you ever questioned the nature of your reality? Registered User regular
    If you are definitely going to use grep then you will need to specify greedy (or non-greedy) matching, I don't remember which.

    Otherwise it will match the first Date tag with the last location tag and everything in between?

    I don't know whether grep does multiline matching or if I'm remembering a different tool that doesn't.

  • Papillon Registered User regular
    edited August 2015
    Apothe0sis wrote: »
    If you are definitely going to use grep then you will need to specify greedy (or non-greedy) matching, I don't remember which.

    Non-greedy. I'm not sure you can do non-greedy matches with grep, though.

    Apothe0sis wrote: »
    I don't know whether grep does multiline matching or if I'm remembering a different tool that doesn't.

    I'm pretty sure grep doesn't do multiline matching, which is why I suggested sed as an alternative.
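
    The sed range form prints every <Date>...</location> block it finds in a file, not just the first, so collecting everything into one file should just be the earlier find command plus a redirect (the output file name is arbitrary):

    $ find . -name "*2015-08-14-*" -exec sed -n '/<Date>/,/<\/location>/ p' \{\} \; > results.txt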

    For that matter, one could sidestep much of the issue and write a small Perl or Python program to do both the matching and the file searching.

    Papillon on
  • Apothe0sis Have you ever questioned the nature of your reality? Registered User regular
    It would be very surprising if perl wasn't available on a RHEL system

  • mightyjongyo Sour Crrm East Bay, California Registered User regular
    python ought to be on a RHEL system too, even if it's an older version like 2.6
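
    quick way to check what's actually there without installing anything:

    which perl python
    python -V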

  • EclecticGroove Registered User regular
    It's hard to say. These systems were "hardened" by someone who had no real idea what they were doing.

    Yum was removed along with plenty of other items, and other parts were just carved away or disabled in a haphazard fashion.

    Updates were done manually in a package-by-package fashion as they were told to update them. So this system, and others like it, is some beastly mix of things that haven't been updated in ~3-6 years and stuff that was updated just earlier this month.

    I can do nothing about any of that part of it. Ultimately I may just say screw it and just dump the logs out to my local system and work from there.
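
    If it comes to that, something like this would pull a day's worth down to my own box (the host and paths are made up):

    scp 'me@thatbox:/data/logs/*2015-08-14*' .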
