Discussion:
Algorithm Used in Similar() function.
Rob
2007-04-23 16:27:16 UTC
Does anyone know the algorithm used to compare strings in the Similar()
function? I have to create a similar function, excuse the pun, in SQL
Server.

TIA

P.S. I hate SQL Server!
John Smirnios
2007-04-23 17:23:28 UTC
Yes. I'll send you a sample via email.

-john.
--
John Smirnios
Senior Software Developer
iAnywhere Solutions Engineering

Whitepapers, TechDocs, bug fixes are all available through the iAnywhere
Developer Community at http://www.ianywhere.com/developer
Volker Barth
2007-04-24 12:49:06 UTC
John,

would it be possible to send the sample to me, too?
I have used SIMILAR for about 10 years to do "automated customer data
clearing" stuff, and it works quite well - in conjunction with LIKE and the
like...
I once ran tests with user-defined and external functions (e.g. with
the "Levenshtein" algorithm) but they were far too slow compared with
SIMILAR.

Fortunately, the algorithm seems to have been the same since V5.5...so I
would like to have a chance to look at it more closely.

TIA
Volker
(Feeling fine not to have to port that to MS SQL / ASE)
John Smirnios
2007-04-24 14:58:50 UTC
You may need to send me an email first or give me another email address.
My email to you bounced with the following message: "Mail rejected for
policy reasons."

-john.
Volker Barth
2007-04-27 08:55:48 UTC
John,

thanks for sending the source code! - I still have to study it in detail...

I have just read about UCA and collation tailoring in the 10.0.1 docs.

As this seems to be one of your favourite topics:

In the application I talked of, we typically have to compare German names
(of persons and places).
As you may know, these may contain "umlauts" like 'ä' or special characters
like 'ß' (the "sharp s").
However, in older, restricted charsets (or in internationalized uses like
mail addresses), these umlauts have often been expanded to two characters,
e.g. 'ä' to 'ae' or 'ß' to 'ss'.
So one task we face is to have 'ä' and 'ae' compare as equal.

AFAIK, single-byte collations can only compare characters one by one and
therefore cannot treat 'ä' and 'ae' as equal.
Is this the same for Unicode collations, or could I establish some rule to
make 'ä' and 'ae' the same?

(So far, we have solved this problem by storing both the original names and
a "normalized" form, where umlauts are expanded, everything is uppercased,
and some phonetic simplifications are applied (e.g. 'ph' sounds like 'f' and
is therefore normalized to 'f'). The normalized form is stored as a computed
field and is automatically calculated by a user-defined function.
Comparisons are then done on the normalized forms.
This works well with the typical German '1252LATIN1' single-byte collation.)
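
A minimal sketch of the kind of normalization described above, written in
Python rather than as the original database user-defined function. The
replacement table, the rule list, and the name normalize_name are
assumptions for illustration only; the real function's rules are not shown
in this thread.

```python
# Hypothetical expansion table: German umlauts and the sharp s are
# replaced by their usual two-letter transliterations.
UMLAUT_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ß": "ss",
}

# Hypothetical phonetic simplifications, applied after uppercasing.
# Only the 'ph' -> 'f' rule is mentioned in the post.
PHONETIC_RULES = [
    ("PH", "F"),
]


def normalize_name(name: str) -> str:
    """Return a normalized form of a name for comparison purposes."""
    # Step 1: expand umlauts and sharp s character by character.
    expanded = "".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in name)
    # Step 2: uppercase everything.
    result = expanded.upper()
    # Step 3: apply the phonetic simplifications in order.
    for pattern, replacement in PHONETIC_RULES:
        result = result.replace(pattern, replacement)
    return result


# Names that differ only by umlaut expansion or 'ph'/'f' normalize alike.
print(normalize_name("Jäger") == normalize_name("Jaeger"))    # True
print(normalize_name("Stephan") == normalize_name("Stefan"))  # True
```

Comparing the normalized forms instead of the raw names is what makes the
approach work with a collation that compares one character at a time.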

Any hint as to whether UCA offers better facilities would be highly appreciated...

Volker
John Smirnios
2007-04-27 15:39:05 UTC
In queries, UCA collations definitely use UCA to perform the comparison
in a linguistically correct way so that 'SS' = 'ß' (not sure offhand if
you need to specify the right locale/tailoring for that though).

However, the "similar" function in SQL Anywhere still performs a
character-by-character match, as seen in the code I sent you. When
scanning two strings such as 'SS' and 'ß', it will try to match the
first 'S' with 'ß' and not find a match (since 'S' != 'ß').

If it's any consolation, the UPPER function will convert 'ß' to 'SS' if
you are using an ICU collation (again, the correct locale/tailoring may
be needed).
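
To illustrate the point, here is a small Python sketch. It is not the
"similar" algorithm (that source cannot be posted) and it does not use
SQL Anywhere's UPPER; it only assumes a plain position-by-position
comparison, with the hypothetical helper positional_match, and uses
Python's own str.upper(), which happens to map 'ß' to 'SS' as well.

```python
def positional_match(a: str, b: str) -> bool:
    """Compare two strings one character at a time, position by position."""
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b))


# A one-to-one character comparison cannot equate 'SS' with 'ß':
# the strings differ in length, and 'S' != 'ß' at the first position.
print(positional_match("SS", "ß"))          # False

# Uppercasing first expands 'ß' to 'SS', so the comparison succeeds.
print(positional_match("SS", "ß".upper()))  # True
```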

-john.
Volker Barth
2007-04-27 15:48:51 UTC
John,

thanks for the explanation.

So I guess I'm going to do some tests with UCA in the (not so near)
future...
...though the particular solution we are using now may still be more
appropriate for treating names like 'Stefan' and 'Stephan' (which are
misspelled or mixed up quite often) as equal.

Thanks again!

Volker
Steven J. Serenska
2007-04-26 16:33:03 UTC
Post by John Smirnios
Yes. I'll send you a sample via email.
I would like a sample as well.

Would it be possible to post this on a website somewhere (or just
include it in a response to this post)? I'm sure others might be
interested too.

Thanks.

SJS
John Smirnios
2007-04-27 15:42:29 UTC
I would love to; however, when I asked for permission a long time ago to
send out the source for "similar" I was told to include a statement to
the effect of "you are free to use it and modify it however you like but
you cannot redistribute the source without Sybase's permission". That
pretty much precludes posting the source. C'est la vie. I don't mind
emailing it to whoever asks for it.

-john.
s***@gmail.com
2012-09-28 11:54:12 UTC
Hi John,
I would like a sample as well.

Can you please send it to me?

Thank you in advance.
s***@gmail.com
2013-01-16 22:02:20 UTC
John,

I need to implement the Sybase Similar() function on Teradata. It would be great if you could share the algorithm with me.

Thanks in advance for your help!
s***@gmail.com
2016-06-22 10:46:56 UTC
Hi Volker,

Could you please send the sample code to my email address, subi.kumar_at_gmail.com?


Thanks & Regards
Subi
