GoFuckYourself.com - Adult Webmaster Forum


camperjohn64 10-11-2011 01:24 PM

regular expression help
 
I want to clean a database of words that have bad letters in them. I know how to remove bad characters, but how can I remove the word along with it?

"The qu&&ick brown fox ju&&&mped over the lazy dog."

assuming & is a bad character, how can I end up with

"The brown fox over the lazy dog."

Basically, any word that doesn't have an alpha-numeric or [.,<>?~@#$%^&*()] I want to remove the word.

Single preg_replace expression??

fris 10-11-2011 01:33 PM

Is it WordPress, or just another site db? If it is WordPress, you could use the search and replace plugin.

fris 10-11-2011 01:39 PM

Basically, you want to execute

Code:

update [table_name] set [field_name] = replace([field_name],'[string_to_find]','[string_to_replace]');

or here is a tool

http://sewmyheadon.com/2009/mysql-search-replace-tool/

camperjohn64 10-11-2011 01:39 PM

Quote:

Originally Posted by fris (Post 18484408)
Is it WordPress, or just another site db? If it is WordPress, you could use the search and replace plugin.

The problem isn't removing bad characters; the problem is deleting words that contain bad characters.

raymor 10-11-2011 02:05 PM

Quote:

I want to clean a database of words that have bad letters in them. I know how to remove bad characters, but how can I remove the word along with it?

"The qu&&ick brown fox ju&&&mped over the lazy dog."

assuming & is a bad character, how can I end up with

"The brown fox over the lazy dog."

Basically, any word that doesn't have an alpha-numeric or [.,<>?~@#$%^&*()] I want to remove the word.

You've asked for two very different things. Removing words that DO have bad characters is different from removing words that do NOT have "good" characters. What if a word has both?

To remove words that have the "bad" character:

\w is the class of word characters (letters, digits and underscore). You're looking for a string that contains at least one "bad" character, optionally surrounded by word characters.
A "word", as you define it, is a string of word characters and &, where each character is matched by \w|& .
So, assuming & is the bad character, the regular expression is:

(\w|&)*&(\w|&)*

$subject = preg_replace('/(\w|&)*&(\w|&)*/', "", $subject);
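As a quick check, here is a minimal PHP sketch that runs the posted pattern against the sample sentence from the question. It suggests the pattern does remove the offending words but leaves the space on each side behind, so the result contains double spaces. The second pattern is my own variant, not something posted in the thread: it assumes it is acceptable to also swallow one whitespace character after the removed word.

```php
<?php
// Sample sentence from the original question.
$subject = 'The qu&&ick brown fox ju&&&mped over the lazy dog.';

// Pattern as posted: the word goes, the spaces around it stay.
$asPosted = preg_replace('/(\w|&)*&(\w|&)*/', '', $subject);

// Variant (an assumption, not from the thread): a character class instead
// of alternation, plus one optional trailing whitespace character.
$variant = preg_replace('/[\w&]*&[\w&]*\s?/', '', $subject);

echo $asPosted, "\n"; // The  brown fox  over the lazy dog.
echo $variant, "\n";  // The brown fox over the lazy dog.
```

If the removed word is followed by punctuation rather than whitespace, the variant leaves the space in front of it in place, so it is worth checking against real rows before running it on a whole table.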


Quote:

Basically, any word that doesn't have an alpha-numeric or [.,<>?~@#$%^&*()] I want to remove the word.

Removing words based on what they do NOT have is different from removing them based on what they DO have, as above. In this case, you're looking for strings made up only of characters that are neither alphanumeric nor in your list, i.e. [^\sA-Za-z0-9.,<>?~@#$%^&*()], bracketed by whitespace, I suppose, since .,? and other non-word characters are part of your class. Note the \s inside the negated class: without it the class also matches spaces, and the pattern would swallow whole runs of ordinary words.
So you're looking for:
\s[^\sA-Za-z0-9.,<>?~@#$%^&*()]+\s

and replacing it with a single space delimiter, like this:

$subject = preg_replace('/\s[^\sA-Za-z0-9.,<>?~@#$%^&*()]+\s/', " ", $subject);
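For the other reading of the question (remove a word if it contains any character that is not alphanumeric and not in the listed punctuation), a sketch might look like the following. The pattern and the sample strings are my own assumptions, not something posted in the thread, and a "word" is taken to mean a run of non-whitespace characters.

```php
<?php
// Hypothetical inputs: | { } are outside the allowed set, & . , are inside it.
$first  = 'The qu|ick brown fox ju{}mped over the lazy dog.';
$second = 'rock&roll is fine, caf|e is not';

// \S*  any non-space characters before the bad one
// [^\sA-Za-z0-9...]  one character that is neither whitespace nor allowed
// \S*  the rest of the word, \s? one optional trailing whitespace character
$pattern = '/\S*[^\sA-Za-z0-9.,<>?~@#$%^&*()]\S*\s?/';

$cleanFirst  = preg_replace($pattern, '', $first);
$cleanSecond = preg_replace($pattern, '', $second);

echo $cleanFirst, "\n";  // The brown fox over the lazy dog.
echo $cleanSecond, "\n"; // rock&roll is fine, is not
```

Note that & is in the allowed list here, so under this rule "rock&roll" survives, which is the opposite of the first example in the question.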

Bladewire 10-11-2011 02:13 PM

Quote:

Originally Posted by raymor (Post 18484484)
You've asked for two very different things. [...]

DAMN that was quick, well done! :thumbsup

camperjohn64 10-11-2011 02:26 PM

Yes, thanks - you are correct in my typo. And thanks for the answer - coding now :-)

raymor 10-11-2011 03:11 PM

Quote:

Originally Posted by Squirtit (Post 18484510)
DAMN that was quick, well done! :thumbsup


:)

This stuff was hard in 1997 when we were trying to get referer based .htaccess right.
We had to watch out for things like goodguy.com.hacker.com

I've had a bit of practice since then.

Camperjohn, what I posted is only known to be correct, not tested.
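The goodguy.com.hacker.com problem comes down to anchoring. Here is a small PHP sketch of the idea, not the actual .htaccess rules from back then; both patterns and both referer strings are made up for illustration.

```php
<?php
// Hypothetical referer values; goodguy.com stands in for an allowed site.
$good = 'http://goodguy.com/page.html';
$evil = 'http://goodguy.com.hacker.com/page.html';

// Unanchored: matches anywhere in the string, so the hostile host passes.
$loose = '~goodguy\.com~';

// Anchored: scheme, optional www., the exact host, then "/" or end of string.
$strict = '~^https?://(www\.)?goodguy\.com(/|$)~';

var_dump(preg_match($loose, $evil));  // int(1)
var_dump(preg_match($strict, $evil)); // int(0)
var_dump(preg_match($strict, $good)); // int(1)
```

The anchored pattern is still only a sketch: it does not allow for a port number after the host, for example.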

woj 10-11-2011 03:41 PM

I would just write a quick script to do that, fetch text from db, split into words, check each word, unsplit, save it...

a bit slower and less efficient, but pretty hard to fuck up... on the other hand with one regexp command, one wrong character and your whole db could get fucked up...
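A rough PHP sketch of that split, check, rejoin approach is below. The function names are hypothetical, the allowed character set is taken from the original question, and the database fetch and save steps are left out because the thread never describes the table.

```php
<?php
// Hypothetical helper: a word is "clean" if every character is alphanumeric
// or in the punctuation list from the original question.
function isCleanWord(string $word): bool
{
    return preg_match('/^[A-Za-z0-9.,<>?~@#$%^&*()]+$/D', $word) === 1;
}

// Hypothetical helper: drop every word that fails isCleanWord().
function removeBadWords(string $text): string
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $kept  = array_filter($words, 'isCleanWord');

    // Rejoin with single spaces (the original spacing is not preserved).
    return implode(' ', $kept);
}

echo removeBadWords('The qu|ick brown fox ju{}mped over the lazy dog.'), "\n";
// The brown fox over the lazy dog.
```

Keeping the per-word check in its own function makes it easy to test the rule on a handful of words before touching any rows.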

camperjohn64 10-11-2011 05:23 PM

Quote:

Originally Posted by raymor (Post 18484635)
:)

This stuff was hard in 1997 when we were trying to get referer based .htaccess right.
We had to watch out for things like goodguy.com.hacker.com

I've had a bit of practice since then.

Camperjohn, what I posted is only known to be correct, not tested.

I tested first of course

raymor 10-11-2011 06:08 PM

Quote:

Originally Posted by woj (Post 18484693)
I would just write a quick script to do that, fetch text from db, split into words, check each word, unsplit, save it...

a bit slower and less efficient, but pretty hard to fuck up... on the other hand with one regexp command, one wrong character and your whole db could get fucked up...


Yes, definitely. Either way, one would first do a database dump or CREATE TABLE backup SELECT * FROM thetable.

Messing up is okay, it happens. Breaking things is not. You'd want to back up either way because, for example, even if you get the word deletion perfect, join is not the inverse of split, so data could be lost by splitting and joining.
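To illustrate the point that join is not the inverse of split, here is a short PHP sketch with a made-up input string. It assumes the split is done on runs of whitespace and the join uses a single space, which is one common way to do it but not the only one.

```php
<?php
// Hypothetical input with a double space, a tab and a newline.
$original = "one  two\tthree\nfour";

// Split on runs of whitespace, then rejoin with single spaces.
$words    = preg_split('/\s+/', $original, -1, PREG_SPLIT_NO_EMPTY);
$rejoined = implode(' ', $words);

var_dump($rejoined === $original); // bool(false)
echo $rejoined, "\n";              // one two three four
```

No word was deleted, yet the stored text changed: the double space, the tab and the newline are all gone.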

