Discussion: 32 vs 64 bits and large integer hash
Eliot Miranda
2018-11-21 18:45:54 UTC
Hi All,

right now we have the following definition of
Large(Positive)Integer>>hash:

hash
^ByteArray hashBytes: self startingWith: self species hash

which means that for all integers outside the 32-bit SmallInteger range
(-2 ^ 30 to 2 ^ 30 - 1) but still within the 64-bit SmallInteger range,
the 32-bit system and the 64-bit system answer different values for hash.

e.g. in 64 bits: (2 raisedTo: 30) hash => 1073741824
but in 32 bits: (2 raisedTo: 30) hash => 230045764

This is unsatisfactory. I propose changing Large(Positive)Integer>>hash to

hash
^self digitLength <= 8
ifTrue: [self]
ifFalse: [ByteArray hashBytes: self startingWith: self species hash]
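
To make the intent concrete, here is a hedged sketch of the equalities that should hold after the change (assuming that on 64-bit images such values are SmallIntegers, which already answer themselves as their hash):

(2 raisedTo: 30) hash = (2 raisedTo: 30). "true on both word sizes: digitLength is 4, so the 32-bit LargePositiveInteger answers itself"
(2 raisedTo: 40) hash = (2 raisedTo: 40). "true as well: digitLength is 6, matching the 64-bit SmallInteger behaviour"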


P.S. Note that this will not break Float hash, which is defined as

Float>>hash
"Hash is reimplemented because = is implemented. Both words of the float
are used. (The bitShift:'s ensure that the intermediate results do not
become a large integer.) Care is taken to answer same hash as an equal
Integer."

(self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
^ ((self basicAt: 1) bitShift: -4) +
  ((self basicAt: 2) bitShift: -4)

P.P.S. I *think* that "(self isFinite and: [self fractionPart = 0.0])" is
equivalent to "self - self = self fractionPart" ;-)
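
A quick check of the finite cases (a sketch only; the non-finite cases lean on NaN never being equal to anything, and Float nan fractionPart may not even answer in all dialects):

{2.0. -3.0. 2.5. 0.1} collect: [:f | f - f = f fractionPart].
"=> #(true true false false), matching (f isFinite and: [f fractionPart = 0.0]) elementwise"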

_,,,^..^,,,_
best, Eliot
Nicolas Cellier
2018-11-21 20:23:40 UTC
+1
And a brilliant hack! Though the former code explains itself, the latter
not so much without a comment.
Juan Vuletich
2018-11-21 21:37:33 UTC
Hi Eliot,

In Cuis I have recently been working on hash consistency for numeric
types, making the same value (or different values that compare as equal
due to rounding) answer the same hash regardless of class. But I had
missed the SmallInteger / LargeInteger consistency across 32- and 64-bit
images. Thanks! Fixes for Cuis are now in the GitHub repo. In particular,
it is necessary for the following to answer the same values in 32 and 64
bits (it wasn't the case!):


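"Build an un-normalized LargePositiveInteger whose value, 16r04030201, fits
in the SmallInteger range on both word sizes, then derive the equal
SmallInteger, Float and boxed Float; all four should answer the same hash."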
largeInteger := (LargePositiveInteger new: 4)
	digitAt: 1 put: 1;
	digitAt: 2 put: 2;
	digitAt: 3 put: 3;
	digitAt: 4 put: 4;
	yourself.
smallInteger := largeInteger normalize.
float := smallInteger asFloat.
boxedFloat := BoxedFloat64 new
	basicAt: 1 put: (float basicAt: 1);
	basicAt: 2 put: (float basicAt: 2);
	yourself.
{largeInteger class. smallInteger class. float class. boxedFloat class.
 largeInteger hash. smallInteger hash. float hash. boxedFloat hash} print.



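"Same construction, but with high byte 80, giving 16r50030201: above
2 raisedTo: 30, so it normalizes to a SmallInteger on 64-bit images but
remains a LargePositiveInteger on 32-bit ones."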
largeInteger := (LargePositiveInteger new: 4)
	digitAt: 1 put: 1;
	digitAt: 2 put: 2;
	digitAt: 3 put: 3;
	digitAt: 4 put: 80;
	yourself.
smallIntIn64ButLargeIntIn32Bits := largeInteger normalize.
float := smallIntIn64ButLargeIntIn32Bits asFloat.
boxedFloat := BoxedFloat64 new
	basicAt: 1 put: (float basicAt: 1);
	basicAt: 2 put: (float basicAt: 2);
	yourself.
{largeInteger class. smallIntIn64ButLargeIntIn32Bits class. float class.
 boxedFloat class.
 largeInteger hash. smallIntIn64ButLargeIntIn32Bits hash. float hash.
 boxedFloat hash} print.


Note that I also included consistency for un-normalized LargeIntegers in
the SmallInteger range (just in case).
Cheers,
--
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
https://github.com/jvuletich
https://www.linkedin.com/in/juan-vuletich-75611b3
@JuanVuletich
Levente Uzonyi
2018-11-22 00:15:22 UTC
Why not do it the other way around and implement SmallInteger >> #hash as
a primitive with the following fallback code?

SmallInteger >> #hash

	<primitive: XXX>
	| remainder hash |
	self < 0
		ifTrue: [
			remainder := 0 - self.
			hash := LargeNegativeInteger hash ]
		ifFalse: [
			remainder := self.
			hash := LargePositiveInteger hash ].
	[ remainder > 0 ] whileTrue: [
		hash := (hash + (remainder bitAnd: 16rFF)) hashMultiply.
		remainder := remainder bitShift: -8 ].
	^hash

The only problem to solve is the calculation of the initial hash value.
The VM has to know the initial hash value (e.g. it could be a constant
based on the sign of the receiver) or it has to know how to calculate it
(but that's currently done differently among forks) or the value has to be
an argument of the primitive.
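
For what it's worth, the fallback can be checked against the existing LargeInteger byte hashing directly. A hedged sketch (it inlines the fallback as a block, assumes the primitive is absent, and borrows the un-normalized-instance trick from elsewhere in this thread):

| fallbackHash large |
fallbackHash := [:n | | remainder hash |
	remainder := n.
	hash := LargePositiveInteger hash.
	[remainder > 0] whileTrue: [
		hash := (hash + (remainder bitAnd: 16rFF)) hashMultiply.
		remainder := remainder bitShift: -8].
	hash].
large := (LargePositiveInteger new: 4)
	digitAt: 1 put: 16r01; digitAt: 2 put: 16r02;
	digitAt: 3 put: 16r03; digitAt: 4 put: 16r04;
	yourself.
(fallbackHash value: 16r04030201)
	= (ByteArray hashBytes: large startingWith: LargePositiveInteger hash).
"expected: true -- both walk the same bytes, least significant first, from the same initial value"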

Levente

P.S.: This is another case where Behavior >> #hash bites Squeak and causes
additional slowdown.
P.P.S.: Float >> #hash should use #bitXor: and #hashMultiply instead of
#+ and #bitShift:.
Tobias Pape
2018-11-22 07:48:10 UTC
Post by Levente Uzonyi
Why not do it the other way around and implement SmallInteger >> #hash as a primitive with the following fallback code?
SmallInteger >> #hash
	<primitive: XXX>
	| remainder hash |
	self < 0
		ifTrue: [
			remainder := 0 - self.
			hash := LargeNegativeInteger hash ]
		ifFalse: [
			remainder := self.
			hash := LargePositiveInteger hash ].
	[ remainder > 0 ] whileTrue: [
		hash := (hash + (remainder bitAnd: 16rFF)) hashMultiply.
		remainder := remainder bitShift: -8 ].
	^hash
The only problem to solve is the calculation of the initial hash value.
The VM has to know the initial hash value (e.g. it could be a constant based on the sign of the receiver) or it has to know how to calculate it (but that's currently done differently among forks) or the value has to be an argument of the primitive.
Levente
I'm a bit puzzled. I thought (small)Integers being their own hash is a good thing?

Best regards
-Tobias
Levente Uzonyi
2018-11-22 09:59:23 UTC
Post by Tobias Pape
I'm a bit puzzled. I thought (small)Integers being their own hash is a good thing?
I would call it simple but not necessarily good.
The problem with it is that consecutive numbers generate long chains in
HashedCollections:

a := (1 to: 1000) asArray.
s := Set withAll: a.
[ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
"==> 7014"

The solution in Squeak is to use PluggableSet instead of Set,
because it applies #hashMultiply to the hash value:

ps := PluggableSet integerSet.
ps addAll: a.
[ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
"==> 95"

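(For reference, PluggableSet class >> #integerSet configures its instance
roughly like this -- a sketch assuming the hashBlock: accessor:)

ps2 := PluggableSet new.
ps2 hashBlock: [:integer | integer hash hashMultiply].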
IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid the
long chains. That was probably the main reason for the push to make
#hashMultiply a numbered primitive.

Levente
Post by Tobias Pape
Best regards
-Tobias
P.S.: This is another case where Behavior >> #hash bites Squeak and causes additional slowdown.
Post by Eliot Miranda
Hi All,
hash
^ByteArray hashBytes: self startingWith: self species hash
which means that for all integers outside of the 32-bit SmallInteger range (-2 ^ 30 to 2 ^ 30 - 1), the 32-bit system and the 64-bit system answer different values for hash.
e.g. in 64 bits: (2 raisedTo: 30) hash 1073741824
but in 32 bits: (2 raisedTo: 30) hash 230045764
This is unsatisfactory. I propose changing Large(Positive)Integer>>hash to
hash
^self digitLength <= 8
ifTrue: [self]
ifFalse: [ByteArray hashBytes: self startingWith: self species hash]
P.S. Note that this will not break Float hash, which is defined as
Float>>hash
"Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same
hash as an equal Integer."
(self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
^ ((self basicAt: 1) bitShift: -4) +
((self basicAt: 2) bitShift: -4)
P.P.S. I *think* that "(self isFinite and: [self fractionPart = 0.0])" is equivalent to "self - self = self fractionPart" ;-)
_,,,^..^,,,_
best, Eliot
Tobias Pape
2018-11-22 10:12:56 UTC
Post by Levente Uzonyi
Post by Tobias Pape
I'm a bit puzzled. I thought (small)Integers being their own hash is a good thing?
I would call it simple but not necessarily good.
a := (1 to: 1000) asArray.
s := Set withAll: a.
[ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
"==> 7014"
ps := PluggableSet integerSet.
ps addAll: a.
[ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
"==> 95"
IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid the long chains. That was probably the main reason for the push to make #hashMultiply a numbered primitive.
Interesting!
Chris Muller
2018-11-23 00:35:37 UTC
Post by Tobias Pape
Post by Levente Uzonyi
Post by Tobias Pape
I'm a bit puzzled. I thought (small)Integers being their own hash is a good thing?
I was wondering exactly the same thing!
Post by Tobias Pape
Post by Levente Uzonyi
I would call it simple but not necessarily good.
a := (1 to: 1000) asArray.
s := Set withAll: a.
[ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
"==> 7014"
ps := PluggableSet integerSet.
ps addAll: a.
[ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
"==> 95"
IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid the long chains. That was probably the main reason for the push to make #hashMultiply a numbered primitive.
Interesting!
Indeed! When making a #hash method, one always focuses on hash
distribution and finding the elements, but it's easy to forget about the
performance of NOT finding an element.
Luciano Notarfrancesco
2018-11-24 15:41:22 UTC
Post by Tobias Pape
Post by Tobias Pape
Post by Levente Uzonyi
Post by Tobias Pape
I'm a bit puzzled. I thought (small)Integers being their own hash is
a good thing?
I was wondering exactly the same thing!
Post by Tobias Pape
Post by Levente Uzonyi
I would call it simple but not necessarily good.
The problem with it is that consecutive numbers generate long chains
a := (1 to: 1000) asArray.
s := Set withAll: a.
[ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
"==> 7014"
The solution in Squeak is to use PluggableSet instead of Set, because
ps := PluggableSet integerSet.
ps addAll: a.
[ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
"==> 95"
IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid
the long chains. That was probably the main reason for the push to make
#hashMultiply a numbered primitive.
Post by Tobias Pape
Interesting!
Indeed! When making a #hash methods, one always focuses on hash
distribution and finding the elements, but its easy to forget about
performance of NOT finding an element.
Yes! I had this problem blow up in my face a couple of years ago, and Juan
agreed to include the change in Cuis to make SmallIntegers NOT their own
hash.

I think it is a good idea in general when programming a hash method in
Smalltalk to make it somewhat random. It will never be perfect for every
use case, and it doesn't need to be cryptographically secure, but there
shouldn't be simple use cases (e.g., consecutive integers) that produce
hashes that are not uniformly distributed. This can be achieved with
minimal performance impact (and potentially big performance gains in hashed
collections) quite simply by sending some hashMultiply message in your hash
method.
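
For instance, for a hypothetical value class with instance variables x and y (a sketch, not code from any library), the pattern is just:

hash
	^(x hash bitXor: y hash) hashMultiply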

Regards,
Luciano
Eliot Miranda
2018-11-24 21:18:14 UTC
Post by Luciano Notarfrancesco
Yes! I had this problem blow up in my face a couple of years ago, and Juan
agreed to include the change in Cuis to make SmallIntegers NOT their own
hash.
This seems to be a basic error. The idea of a hash function is to produce
a well-distributed set of integers for some set of values. Since the
SmallIntegers are themselves perfectly distributed (each unique
SmallInteger is a unique value), it is impossible to produce a better
distributed hash function than the integers themselves. For some
application it may indeed be possible to produce a better distributed set
of hashes for the integers; for example, an application which considers
only powers of two could use the log base 2 to produce a smaller and
better distributed set of values modulo N than the integers themselves.
But in general the integers are definitionally well-distributed. In fact,
unless one has a perfect hash function one is in danger of producing a
less well-distributed set of values from a hash function than the
SmallIntegers themselves.

This argument doesn't apply as integers grow beyond the SmallInteger
range, but not because we want better distribution of values than the
large integers themselves; rather because we want to avoid large integer
arithmetic.
Post by Luciano Notarfrancesco
I think it is a good idea in general when programming a hash method in
Smalltalk to make it somewhat random.
As I've indicated, I think this is impossible in general. It is only in
specific applications, using specific subsets of the SmallIntegers, that
a better hash function could be derived; but this would be specific to
that application. For purposes such as these we have
PluggableDictionary et al, which can exploit an application-specific hash.
But in general the SmallIntegers are ideal values for their own hashes.
_,,,^..^,,,_
best, Eliot
Luciano Notarfrancesco
2018-11-25 05:06:50 UTC
Hi Eliot!
Post by Eliot Miranda
Post by Luciano Notarfrancesco
Yes! I had this problem blow up in my face a couple of years ago, and Juan
agreed to include the change in Cuis to make SmallIntegers NOT their own
hash.
This seems to be a basic error. The idea of a hash function is to produce
a well-distributed set of integers for some set of values. Since the
SmallIntegers are themselves perfectly distributed (each unique
SmallInteger is a unique value), it is impossible to produce a better
distributed hash function than the integers themselves. For some
application it may indeed be possible to produce a better distributed set
of hashes for the integers; for example an application which considers only
powers of two could use the log base 2 to produce a smaller and better
distributed set of values modulo N than the integers themselves. But in
general the integers are definitionally well-distributed. In fact, unless
one has a perfect hash function one is in danger of producing a less
well-distributed set of values from a hash function than the SmallIntegers
themselves.
The problem I see is that the sequences of SmallIntegers that we humans
tend to add to hashed collections are not usually uniformly distributed.
Presumably they have some meaning, and thus some patterns to them; most
of the time they don't look random, and so if they are their own hash the
performance of hashed collections will most likely suffer.

I found this problem when profiling my code. I guess some people might
assume that Sets or Dictionaries are fast and use them with SmallIntegers
without much care, and never discover that their code could be orders of
magnitude faster. So, another option might be to modify scanFor: to
randomize the hash of the argument (for example by sending an additional
hashMultiply). This is because scanFor: is assuming the arguments of
successive calls will have uniformly distributed hashes:
start := anObject hash \\ array size + 1.
The algorithm needs 'start' to be uniformly distributed between 1 and
array size; with SmallIntegers' current hash this is very unlikely, and
lookups degenerate into linear searches.

Also, any cryptographic hash will do great in pretty much ALL use cases.
Finding a sequence of integers that produces non-uniform hashes is very
hard, and is equivalent to breaking the cryptographic hash, making it
unusable for cryptography. This is a sort of silver bullet; I seem to
remember that Java does this now, but I think it is overkill and something
like hashMultiply is enough (plus, it is cheaper).

My 2 maos,
Luciano
Eliot Miranda
2018-11-25 21:21:29 UTC
Hi Luciano,
Post by Luciano Notarfrancesco
Hi Eliot!
Yes! I had this problem blow up in my face a couple of years ago, and Juan agreed to include the change in Cuis to make SmallIntegers NOT their own hash.
This seems to be a basic error. The idea of a hash function is to produce a well-distributed set of integers for some set of values. Since the SmallIntegers are themselves perfectly distributed (each unique SmallInteger is a unique value), it is impossible to produce a better distributed hash function than the integers themselves. For some application it may indeed be possible to produce a better distributed set of hashes for the integers; for example an application which considers only powers of two could use the log base 2 to produce a smaller and better distributed set of values modulo N than the integers themselves. But in general the integers are definitionally well-distributed. In fact, unless one has a perfect hash function one is in danger of producing a less well-distributed set of values from a hash function than the SmallIntegers themselves.
The problem I see is that sequences of SmallIntegers that we humans tend to add to hashed collections are not usually uniformly distributed. Hopefully they have some meaning, and thus some patterns to them, most of the time they don't look random, and thus if they are their own hash the performance of hashed collections will most likely suffer.
As I said, for specific cases where the distribution of integers used fits some pattern, there are PluggableDictionary and PluggableSet. In cases where integers are randomly or evenly distributed one cannot improve on integers being their own hashes, so KISS, and use the available alternatives.
Post by Luciano Notarfrancesco
start := anObject hash \\ array size + 1.
the algorithm needs 'start' to be uniformly distributed between 1 and array size, and in the case of SmallIntegers current hash this is very unlikely and lookups degenerate in linear searches.
Whether it is likely or unlikely depends on the specific set of integers. Any hash function can suffer from the same issue; and if a hash function produces fewer values than its input set then that hash function may make things worse. Hence it makes sense to keep things simple and be aware of alternatives for specific situations.
Post by Luciano Notarfrancesco
Also, any cryptographic hash will do great in pretty much ALL use cases. Finding a sequence of integers that produces non-uniform hashes is very hard,
Since one takes the result of the hash modulo N I think this statement is false. It depends on the hash table size and the specific set of integers one is hashing.
Post by Luciano Notarfrancesco
and is equivalent to breaking the cryptographic hash, making it unusable for cryptography. This is a sort of silver bullet; I seem to remember that Java does this now, but I think it is overkill and something like hashMultiply is enough (plus, it is cheaper).
My 2 maos,
Luciano
Nicolas Cellier
2018-11-25 23:22:13 UTC
Post by Eliot Miranda
Hi Luciano,
Hi Eliot!
Post by Luciano Notarfrancesco
Yes! I had this problem blow up in my face a couple of years ago, and Juan
agreed to include the change in Cuis to make SmallIntegers NOT their own
hash.
This seems to be a basic error. The idea of a hash function is to
produce a well-distributed set of integers for some set of values. Since
the SmallIntegers are themselves perfectly distributed (each unique
SmallInteger is a unique value), it is impossible to produce a better
distributed hash function than the integers themselves. For some
application it may indeed be possible to produce a better distributed set
of hashes for the integers; for example an application which considers only
powers of two could use the log base 2 to produce a smaller and better
distributed set of values modulo N than the integers themselves. But in
general the integers are definitionally well-distributed. In fact, unless
one has a perfect hash function one is in danger of producing a less
well-distributed set of values from a hash function than the SmallIntegers
themselves.
The problem I see is that sequences of SmallIntegers that we humans tend
to add to hashed collections are not usually uniformly distributed. Hopefully
they have some meaning, and thus some patterns to them, most of the time
they don't look random, and thus if they are their own hash the performance
of hashed collections will most likely suffer.
As I said, for specific cases where the distribution of integers used
fits some pattern, there are PluggableDictionary and PluggableSet. In
cases where integers are randomly or evenly distributed one cannot improve
on integers being their own hashes, so KISS, and use the available
alternatives.
Hi Eliot,
my bet is the same as Luciano's: a set of integers with uniform
distribution is the least probable in my experience; it's just noise.
Every other data of interest will show different patterns.
So I'm not sure that forcing usage of Pluggable* is that simple, just to
preserve a theoretical property that we don't encounter...
Post by Eliot Miranda
I found this problem when profiling my code. I guess some people might
assume that Sets or Dictionaries are fast and use them with SmallIntegers
without much care, and never discover that their code could be orders of
magnitude faster. So, another option might be to modify scanFor: to
randomize the hash of the argument (for example sending an additional
hashMultiply). This is because scanFor: is assuming the arguments of
successive calls will have uniformly distributed hashes:
start := anObject hash \\ array size + 1.
The algorithm needs 'start' to be uniformly distributed between 1 and
array size; with SmallIntegers' current hash this is very unlikely, and
lookups degenerate into linear searches.
Whether it is likely or unlikely depends on the specific set of integers.
Any hash function can suffer from the same issue; and if a hash function
produces fewer values than its input set then that hash function may make
things worse. Hence it makes sense to keep things simple and be aware of
alternatives for specific situations.
But that is the case for any other hash. What is important is the length
of the hash in bits compared to the size of the Set.
If hashMultiply results in too short hashes, then we get a problem for
Sets of any other objects...
Post by Eliot Miranda
Also, any cryptographic hash will do great in pretty much ALL use cases.
Finding a sequence of integers that produces non-uniform hashes is very
hard,
Since one takes the result of the hash modulo N I think this statement is
false. It depends on the hash table size and the specific set of integers
one is hashing.
Hmm number theory, let's see... Hmm, that's really a question for Luciano
or Andres ;)
Luciano Notarfrancesco
2018-11-26 04:05:55 UTC
Post by Eliot Miranda
Hi Luciano,
Also, any cryptographic hash will do great in pretty much ALL use cases.
Finding a sequence of integers that produces non-uniform hashes is very
hard,
Since one takes the result of the hash modulo N I think this statement is
false. It depends on the hash table size and the specific set of integers
one is hashing.
Well, in fact there might be small biases in the residues modulo N. For
example, if the hash function produces uniformly distributed outputs of 8
bits (0 to 255 with probability 1/256), and you take the residues modulo
255, you'll find a small bias for 0: 1 to 254 have probability 1/254 while
0 has probability 2/254 (because both '0 \\ 255' and '255 \\ 255' are 0).
Small biases like this can make all the difference for the security of a
cryptographic system, but for a hashed collection it's fine.
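
The tally is easy to reproduce (a sketch assuming Squeak's Bag>>sortedCounts):

((0 to: 255) collect: [:h | h \\ 255]) asBag sortedCounts first.
"=> 2->0: the residue 0 occurs twice, every other residue once"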

Cheers,
Luciano
Luciano Notarfrancesco
2018-11-26 04:14:02 UTC
Post by Luciano Notarfrancesco
1 to 254 have probability 1/254 while 0 has probability 2/254
Oops, that should be 1/256 and 2/256, sorry.
When probabilities don't sum to 1, unexpected things can happen ;)
Luciano Notarfrancesco
2018-11-26 10:34:48 UTC
Perhaps it is best to change Set>>scanFor:, as I suggested in another
mail. I think it might be best because: 1) it adheres better to the KISS
principle, keeping Integer>>hash unchanged and simple; 2) it might address
similar performance problems with other types of objects, not only
SmallIntegers; 3) it leaves the responsibility of 'mixing' the hash bits
to the hashed collections, where it seems to belong.

If we do this, then we can have simpler requirements for hash: it has to
be a SmallInteger, equal objects must have equal hashes, and ideally it
has to be well distributed. But it doesn't need to look random, so similar
objects can have similar hashes (as is currently the case with
SmallInteger, and maybe with Strings too? It might even be desirable for
similar Strings to have similar hashes). Then the only required change is
in Dictionary and Set scanFor:
start _ (anObject hash \\ array size) + 1.
should be changed to:
start _ (anObject hash hashMultiply \\ array size) + 1.
(or something similar; anything that 'mixes' the bits of the hash, perhaps
with a primitive that does this and the modulus all at once). This would
ensure that lookup in hashed collections is done in constant time and
doesn't degenerate into linear searches, at a very small cost (the lookup
time is increased slightly by the extra hashMultiply).
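
Concretely, a hedged sketch of the changed method (the stock Set>>scanFor: varies across versions; the only intended change is the hashMultiply before the modulus):

scanFor: anObject
	"Scan the key array for the first slot containing either nil or an element
	equal to anObject, and answer its index. The hashMultiply mixes the hash
	bits so that consecutive hashes don't map to consecutive slots."
	| index start |
	index := start := anObject hash hashMultiply \\ array size + 1.
	[ | element |
	((element := array at: index) == nil or: [element = anObject])
		ifTrue: [^index].
	(index := index \\ array size + 1) = start] whileFalse.
	self error: 'There is no free space in this collection!'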

And there's always the alternative of PluggableSet / PluggableDictionary
for the cases when a custom hash is desired.

My other 2 cents,
Luciano
Eliot Miranda
2018-11-26 16:02:33 UTC
Hi Luciano,
Perhaps it is best to change Set>>scanFor:, as I suggested in another mail. I think it might be best because: 1) it adheres better to the KISS principle, keeping Integer>>hash unchanged and simple; 2) it might address similar performance problems with other types of objects, not only SmallIntegers; 3) it leaves the responsibility of 'mixing' the hash bits to the hashed collections, where it seems to belong.
start _ (anObject hash \\ array size) + 1.
should be changed to:
start _ (anObject hash hashMultiply \\ array size) + 1.
(or something similar, anything that 'mixes' the bits of the hash, perhaps with a primitive that does this and the modulus all at once). This would ensure that lookup for hashed collections is done in constant time and doesn't degenerate into linear searches, at a very small cost (the lookup time is increased slightly due to the extra hashMultiply).
I find this much more compelling. I have no objection to this and it suggests an obvious extension to the hashMultiply primitive, which is to support large Integer receivers.

(p.s. _ in place of := is deprecated).
Eliot Miranda
2018-11-22 18:32:54 UTC
Post by Tobias Pape
I'm a bit puzzled. I thought (small)Integers being their own hash is a good thing?
+1
Marcus Denker
2018-11-22 09:47:26 UTC
Post by Eliot Miranda
Hi All,
hash
^ByteArray hashBytes: self startingWith: self species hash
which means that for all integers outside of the 32-bit SmallInteger range (-2 ^ 30 to 2 ^ 30 - 1), the 32-bit system and the 64-bit system answer different values for hash.
e.g. in 64 bits: (2 raisedTo: 30) hash => 1073741824
but in 32 bits: (2 raisedTo: 30) hash => 230045764
This is unsatisfactory. I propose changing Large(Positive)Integer>>hash to
hash
^self digitLength <= 8
ifTrue: [self]
ifFalse: [ByteArray hashBytes: self startingWith: self species hash]
Thanks, I added an issue tracker entry:

https://pharo.fogbugz.com/f/cases/22690/32-vs-64-bits-and-large-integer-hash

I will do a pull request later (unless someone else is faster to do it).

Marcus


--
Marcus Denker
Marcus Denker
2018-11-22 10:18:01 UTC
Post by Marcus Denker
Post by Eliot Miranda
Hi All,
hash
^ByteArray hashBytes: self startingWith: self species hash
which means that for all integers outside of the 32-bit SmallInteger range (-2 ^ 30 to 2 ^ 30 - 1), the 32-bit system and the 64-bit system answer different values for hash.
e.g. in 64 bits: (2 raisedTo: 30) hash => 1073741824
but in 32 bits: (2 raisedTo: 30) hash => 230045764
This is unsatisfactory. I propose changing Large(Positive)Integer>>hash to
hash
^self digitLength <= 8
ifTrue: [self]
ifFalse: [ByteArray hashBytes: self startingWith: self species hash]
https://pharo.fogbugz.com/f/cases/22690/32-vs-64-bits-and-large-integer-hash
I will do a pull request later (unless someone else is faster to do it).
https://github.com/pharo-project/pharo/pull/2009
--
http://www.zweidenker.de