regular expression help

  • camperjohn64
    Confirmed User
    • Feb 2005
    • 1531

    #1

    regular expression help

    I want to clean a database of words that have bad letters in them. I know how to remove bad characters, but how can I remove the word along with it?

    "The qu&&ick brown fox ju&&&mped over the lazy dog."

    assuming & is a bad character, how can I end up with

    "The brown fox over the lazy dog."

    Basically, any word that doesn't have an alphanumeric or [.,<>?~@#$%^&*()], I want to remove the word.

    Can this be done with a single preg_replace expression?
    www.gimmiegirlproductions.com
  • fris
    Too lazy to set a custom title
    • Aug 2002
    • 55679

    #2
    Is it WordPress, or just another site DB? If it's WordPress, you could use the search and replace plugin.
    Since 1999: 69 Adult Industry awards for Best Hosting Company and professional excellence.


    • fris
      Too lazy to set a custom title
      • Aug 2002
      • 55679

      #3
      Basically you want to execute

      Code:
      update [table_name] set [field_name] = replace([field_name],'[string_to_find]','[string_to_replace]');
      or here is a tool

      http://sewmyheadon.com/2009/mysql-search-replace-tool/

      • camperjohn64
        Confirmed User
        • Feb 2005
        • 1531

        #4
        Originally posted by fris
        is it wordpress? or just another site db, if it was wordpress, you could use the search and replace plugin.
        The problem isn't removing bad characters, the problem is deleting words that contain bad characters.

        • raymor
          Confirmed User
          • Oct 2002
          • 3745

          #5
          I want to clean a database of words that have bad letters in them. I know how to remove bad characters, but how can I remove the word along with it?

          "The qu&&ick brown fox ju&&&mped over the lazy dog."

          assuming & is a bad character, how can I end up with

          "The brown fox over the lazy dog."

          Basically, any word that doesn't have an alphanumeric or [.,<>?~@#$%^&*()], I want to remove the word.

          You've asked for two very different things. Removing words that DO have bad characters is different from removing words that do NOT have "good" characters. What if it has both?

          To remove words that have the "bad" character:

          \w is the class of word characters. You're looking for a string containing at least one "bad character" and optionally some word characters.
          "Words", as you define them, are strings of word characters and &, which is represented as \w|& .
          So, assuming & is the bad character, the regular expression is:

          (\w|&)*&(\w|&)*

          preg_replace('/(\w|&)*&(\w|&)*/', "", $subject);
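Applied to the sample sentence, the pattern above removes the two words but leaves the spaces on both sides of each one, so the result has doubled spaces. Here is a minimal sketch of one way around that, assuming & is the only bad character and words are separated by whitespace. The function name strip_bad_words is made up for illustration, and [\w&] is used as a shorter equivalent of (\w|&).

```php
<?php
// Hypothetical helper, not from the thread: removes every "word" (a run of
// word characters and &) that contains at least one &, together with the
// whitespace that follows it.
function strip_bad_words(string $subject): string
{
    // [\w&]*&[\w&]*  a run of word characters and & with at least one &
    // \s*            trailing whitespace, so no doubled spaces are left behind
    return preg_replace('/[\w&]*&[\w&]*\s*/', '', $subject);
}

// Prints: The brown fox over the lazy dog.
echo strip_bad_words('The qu&&ick brown fox ju&&&mped over the lazy dog.'), "\n";
```

One caveat with this sketch: a bad word sitting right before punctuation (for example at the end of a sentence) still leaves the space in front of it behind.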


          Basically, any word that doesn't have an alphanumeric or [.,<>?~@#$%^&*()], I want to remove the word.

          Removing words based on what they do NOT have is different from removing them based on what they DO have, as above. In this case, you're looking for strings of [^.,<>?~@#$%^&*()], bracketed by space characters, I suppose, since you have .,? and other non-word characters as part of your class.
          So you're looking for:
          \s[^.,<>?~@#$%^&*()]+\s

          and replacing it with a single space delimiter like this:

          preg_replace('/\s[^.,<>?~@#$%^&*()]+\s/', " ", $subject);
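Worth noting: the class [^.,<>?~@#$%^&*()] also matches letters, digits and spaces, so the pattern above can swallow a run of several ordinary words at once. If the intent is the other reading (drop any word that contains a character outside the alphanumerics and the listed punctuation), a rough sketch could look like the following. It assumes words are whitespace-delimited and that "alphanumeric" means ASCII letters and digits; the function name strip_disallowed_words and the sample sentence are made up for illustration, since & itself is on the allowed list.

```php
<?php
// Hypothetical helper, not from the thread: removes every whitespace-delimited
// word that contains at least one character outside the allowed set
// (ASCII alphanumerics plus . , < > ? ~ @ # $ % ^ & * ( ) ).
function strip_disallowed_words(string $subject): string
{
    // \S*      any non-space characters before the offending one
    // [^...]   one character that is neither whitespace nor allowed
    // \S*\s*   the rest of the word and the whitespace after it
    $pattern = '/\S*[^\sA-Za-z0-9.,<>?~@#$%^&*()]\S*\s*/';
    return preg_replace($pattern, '', $subject);
}

// Prints: The brown fox over the lazy dog.
echo strip_disallowed_words('The qu|ick brown fox ju{}mped over the lazy dog.'), "\n";
```

With this allowed list, a word like "don't" would be removed too, because the apostrophe is not one of the listed characters.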
          For historical display only. This information is not current:
          support@bettercgi.com ICQ 7208627
          Strongbox - The next generation in site security
          Throttlebox - The next generation in bandwidth control
          Clonebox - Backup and disaster recovery on steroids


          • Bladewire
            StraightBro
            • Aug 2003
            • 56228

            #6
            DAMN that was quick, well done!


            Skype: CallTomNow


            • camperjohn64
              Confirmed User
              • Feb 2005
              • 1531

              #7
              Yes, thanks - you are correct about my typo. And thanks for the answer - coding now :-)

              • raymor
                Confirmed User
                • Oct 2002
                • 3745

                #8
                Originally posted by Squirtit
                DAMN that was quick, well done!



                This stuff was hard in 1997 when we were trying to get referer-based .htaccess right.
                We had to watch out for things like goodguy.com.hacker.com

                I've had a bit of practice since then.

                Camperjohn, what I posted is only known to be correct, not tested.

                • woj
                  <&(©¿©)&>
                  • Jul 2002
                  • 47882

                  #9
                  I would just write a quick script to do that, fetch text from db, split into words, check each word, unsplit, save it...

                  a bit slower and less efficient, but pretty hard to fuck up... on the other hand with one regexp command, one wrong character and your whole db could get fucked up...
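The split / check / unsplit part of that idea could be sketched roughly like this. It is only a sketch: fetching from and saving to the db is left out, the allowed character list is taken from the first post, "alphanumeric" is assumed to mean ASCII letters and digits, and the function name filter_words is made up for illustration.

```php
<?php
// Hypothetical helper, not from the thread: split on whitespace, keep only the
// words made entirely of allowed characters, then join with single spaces.
function filter_words(string $text): string
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading/trailing whitespace.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $kept = array_filter($words, function (string $word): bool {
        // Keep a word only if every character is alphanumeric or listed punctuation.
        return preg_match('/^[A-Za-z0-9.,<>?~@#$%^&*()]+$/', $word) === 1;
    });

    return implode(' ', $kept);
}

// Prints: The brown fox over the lazy dog.
echo filter_words('The qu|ick brown fox ju{}mped over the lazy dog.'), "\n";
```

Each step can be checked on its own (dump $words, dump $kept) before anything is written back, which is the main attraction of doing it this way.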
                  Custom Software Development, email: woj#at#wojfun#.#com to discuss details or skype: wojl2000 or gchat: wojfun or telegram: wojl2000
                  Affiliate program tools: Hosted Galleries Manager Banner Manager Video Manager
                  Wordpress Affiliate Plugin Pic/Movie of the Day Fansign Generator Zip Manager


                  • camperjohn64
                    Confirmed User
                    • Feb 2005
                    • 1531

                    #10
                    Originally posted by raymor


                    This stuff was hard in 1997 when we were trying to get referer-based .htaccess right.
                    We had to watch out for things like goodguy.com.hacker.com

                    I've had a bit of practice since then.

                    Camperjohn, what I posted is only known to be correct, not tested.
                    I tested first of course

                    • raymor
                      Confirmed User
                      • Oct 2002
                      • 3745

                      #11
                      Originally posted by woj
                      I would just write a quick script to do that, fetch text from db, split into words, check each word, unsplit, save it...

                      a bit slower and less efficient, but pretty hard to fuck up... on the other hand with one regexp command, one wrong character and your whole db could get fucked up...

                      Yes, definitely. Either way, one would first do a database dump or CREATE TABLE backup SELECT * FROM thetable.

                      Messing up is okay, it happens. Breaking things is not. You'd want to back up either way because, for example, even if you get the word deletion perfect, join is not the inverse of split, so data could be lost by splitting and joining.
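A small illustration of the split/join point, assuming the split is done on runs of whitespace and the join uses a single space (the sample string and variable names are made up):

```php
<?php
// Round trip: split on runs of whitespace, then join with single spaces.
$original  = "line one\nline  two\tend";
$roundTrip = implode(' ', preg_split('/\s+/', $original));

// The newline, the double space and the tab have all become single spaces,
// so the round trip does not give the original text back.
var_dump($roundTrip === $original); // bool(false)
echo $roundTrip, "\n";              // line one line two end
```

So even with no words deleted at all, the stored text changes, which is one more reason to take the backup first.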
                      Last edited by raymor; 10-11-2011, 05:14 PM.