Old 05-23-2008, 06:57 AM   #1
Mr Pheer
Retired
 
Join Date: Dec 2002
Posts: 21,246
need to de-dupe keyword list... solution?

I have a list of keywords, one phrase or keyword per line.

The list has a lot of duplicates... what's the best way to strip them out?

help please
__________________
2 lifeguards for Jessica
Old 05-23-2008, 07:40 AM   #2
gornyhuy
Chafed.
 
 
Join Date: May 2002
Location: Face Down in Pussy
Posts: 18,041
One approach:
- Import to Excel
- Sort alphabetically
- Run a formula comparing each entry to the ones above and below, and mark it as a dupe (or delete it)
- For example: =IF(OR(A3=A4,A3=A2),"Duplicate","")
- Then sort by the duplicate status and delete the marked rows

ish.
__________________

icq:159548293
Old 05-23-2008, 07:46 AM   #3
gornyhuy
Chafed.
 
 
Join Date: May 2002
Location: Face Down in Pussy
Posts: 18,041
Here is a less manual Excel approach that I haven't tested, but it looks damn sexy:
http://www.rondebruin.nl/easyfilter.htm
__________________

icq:159548293
Old 05-23-2008, 07:46 AM   #4
Mr Pheer
Retired
 
Join Date: Dec 2002
Posts: 21,246
What about a solution for people that don't have Excel?

I don't have any office applications.
__________________
2 lifeguards for Jessica
Old 05-23-2008, 07:48 AM   #5
mrkris
Confirmed User
 
Join Date: May 2005
Posts: 2,737
If you have access to *nix, try:

$ cat list.txt | uniq > newlist.txt
__________________

PHP-MySQL-Rails | ICQ: 342500546
Old 05-23-2008, 07:55 AM   #6
severe
Confirmed User
 
Join Date: Dec 2007
Posts: 331
In Excel you don't need a formula to remove dupes; there's a built-in feature for it. In older versions it's something like 'unique records only' under the advanced filter; in 2007 it's just called Remove Duplicates, under the Data tab.
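
If you'd rather script it, that same Remove Duplicates feature is exposed to VBA in 2007+. Untested sketch, assuming the keywords sit in column A of the active sheet:

Code:
Sub DedupeColumnA()
    ' Same Data-tab feature, scripted (Excel 2007+ only).
    ' CurrentRegion grabs the contiguous block of data starting at A1.
    ActiveSheet.Range("A1").CurrentRegion.RemoveDuplicates Columns:=1, Header:=xlNo
End Sub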
Old 05-23-2008, 07:58 AM   #7
Mr Pheer
Retired
 
Join Date: Dec 2002
Posts: 21,246
Quote:
Originally Posted by mrkris
If you have access to *nix, try:

$ cat list.txt | uniq > newlist.txt
I tried that on FreeBSD and it just made a copy of the same file with a new name.
__________________
2 lifeguards for Jessica
Old 05-23-2008, 07:59 AM   #8
Mr Pheer
Retired
 
Join Date: Dec 2002
Posts: 21,246
Can someone help me out with the syntax error on line 18, please?

Code:
#!/usr/bin/perl
use strict;

my $FileName = 'file.txt'; # Modify file name as needed.

my(@List,%List,@NewList) = ();

# Print any messages and quit.
sub Abandon
{
    print join "\n", @_;
    exit;
} # sub Abandon

print "Content-type: text/plain\n\n";

Abandon("Unable to read file $FileName") unless open R, "<$FileName";
@List = <R>; # Line 18: the <R> was missing (likely eaten as an HTML tag).
close R;

# Write an untouched backup before modifying anything.
Abandon("Unable to create temporary file ${FileName}.tmp.txt") unless open W, ">${FileName}.tmp.txt";
for (@List) { print W $_; }
close W;

# Keep only the first occurrence of each line.
for (@List)
{
    next if $List{$_};
    $List{$_}++;
    push @NewList, $_;
}

# Overwrite the original with the de-duped list.
Abandon('Something wrong.', "Backup file is ${FileName}.tmp.txt") unless open W, ">$FileName";
for (@NewList) { print W $_; }
close W;

unlink "${FileName}.tmp.txt";

print 'D O N E';
__________________
2 lifeguards for Jessica
Old 05-23-2008, 08:07 AM   #9
react
Confirmed User
 
Join Date: Sep 2003
Location: NZ
Posts: 673
You must sort before you can uniq:

cat infile | sort | uniq > outputfile
__________________
--
react
Old 05-23-2008, 08:11 AM   #10
Mr Pheer
Retired
 
Join Date: Dec 2002
Posts: 21,246
Quote:
Originally Posted by react
You must sort before you can uniq:

cat infile | sort | uniq > outputfile
w00t!!!

thanks man
__________________
2 lifeguards for Jessica
Old 05-23-2008, 08:30 AM   #11
gornyhuy
Chafed.
 
 
Join Date: May 2002
Location: Face Down in Pussy
Posts: 18,041
While we are on the subject, does anybody have a good query for deduping MySQL tables across multiple fields?
__________________

icq:159548293
Old 05-23-2008, 09:56 AM   #12
react
Confirmed User
 
Join Date: Sep 2003
Location: NZ
Posts: 673
That multiple fields bit isn't super clear... but if you want to combine data from several columns of one table into a single unique column, create a new table with one column that has a unique index on it. Then, for each of the columns in the old table:

insert ignore into newtable (newcolumn) select oldcolumn1 from oldtable;
insert ignore into newtable (newcolumn) select oldcolumn2 from oldtable;

If you just want to keep all unique rows, then create a new table with the same column structure, create a unique index across all columns, and then:

insert ignore into newtable select * from oldtable;
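
For example, something like this should do the whole-row version (untested sketch; keywords / keywords_dedup and the column names are made up, swap in your own):

Code:
-- New table with the same structure as the old one.
create table keywords_dedup like keywords;

-- Unique index across every column, so whole-row dupes collide.
alter table keywords_dedup add unique key dedup_all (kw, category);

-- insert ignore silently drops the colliding (duplicate) rows.
insert ignore into keywords_dedup select * from keywords;

-- Once it looks right, swap the tables.
rename table keywords to keywords_old, keywords_dedup to keywords;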
__________________
--
react
Old 05-23-2008, 01:39 PM   #13
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by react
You must sort before you can uniq:

cat infile | sort | uniq > outputfile
No need for uniq in that case... or cat

sort -u infile > outfile
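
And if the original line order matters (sort -u will reorder everything), the usual awk idiom dedupes without sorting. Untested here, but something like:

Code:
awk '!seen[$0]++' infile > outfile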