Hash Function For Sequence of Unique Ids (UUID)

Hash Function For Sequence of Unique Ids (UUID)

I am storing message sequences in the database each sequence can have up to N number of messages. I want to create a hash function which will represent the message sequence and enable to check faster if message sequence exists.

N

Each message has a case-sensitive alphanumeric universal unique id (UUID).
Consider following messages (M1, M2, M3) with ids-

(M1, M2, M3)

M1 - a3RA0000000e0taBB M2 - a3RA00033000e0taC M3 - a3RA0787600e0taBB

Message sequences can be

Sequence-1 : (M1,M2,M3) Sequence-2 : (M1,M3,M2) Sequence-3 : (M2,M1,M3) Sequence-4 : (M1,M2) Sequence-5 : (M2,M3)
...etc...

Sequence-1 : (M1,M2,M3) Sequence-2 : (M1,M3,M2) Sequence-3 : (M2,M1,M3) Sequence-4 : (M1,M2) Sequence-5 : (M2,M3)

Following is the database structure example for storing message sequence

enter image description here

Given the message sequence, we need to check whether that message sequence exists in the database. For example, check if message sequence M1 -> M2 -> M3 i.e. with UIDs (a3RA0000000e0taBB -> a3RA00033000e0taC -> a3RA0787600e0taBB) exists in the database.

M1 -> M2 -> M3

(a3RA0000000e0taBB -> a3RA00033000e0taC -> a3RA0787600e0taBB)

Instead of scanning the rows in the table, I want to create a hash function which represents the message sequence with a hash value. Using the hash value lookup in the table supposedly faster.

My simple hash function is-
enter image description here

I am wondering what would be an optimal hash function for storing the message sequence hash for faster is exists check.

This question has not received enough attention.

I am looking for Hash function suitable for specified scenario (i.e., hashing ordered sets of UIDs). Please refrain from generic and vague answers.

3 Answers
3

You don't need a full-blown cryptographic hash, just a fast one, so how about having a look at FastHash: https://github.com/ZilongTan/Coding/tree/master/fast-hash. If you believe 32 or 64 bit hashes are not enough (i.e. produce too many collisions) then you could use the longer MurmurHash: https://en.wikipedia.org/wiki/MurmurHash (actually, the author of FastHash recommends this approach)

There's a list of more algorithms on Wikipedia: https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions

In any case, hashes using bit operations (SHIFT, XOR ...) should be faster than the multiplication in your approach, even on modern machines.

He didn't ask for the fastest hash function -- he asked for a 'faster is exists check'. I'd be surprised if the built-in hash function in whatever language was slower than the database lookup.
– Mohan
yesterday

How about using MD5 algorithm to generate the hash for a concatenated string of messageUIDs.

For instance- consider messages

M1 - a3RA0000000e0taBB M2 - a3RA00033000e0taC M3 - a3RA0787600e0taBB

For message sequence M1->M2->M3 string would be a3RA0000000e0taBB;a3RA00033000e0taC;a3RA0787600e0taBB which will have MD5 hash as 176B1CDE75EDFE1554888DAA863671C4.

M1->M2->M3

a3RA0000000e0taBB;a3RA00033000e0taC;a3RA0787600e0taBB

176B1CDE75EDFE1554888DAA863671C4

According to this answer MD5 is robust against collisions. In the given scenario there is no need for security so MD5 may suffice.

MD5 will be a lot slower than the hashes in @memo's answer, and will probably not do be any better at avoiding collisions than any other 128-bit hash function, and probably not much or any better than your simple hash function.
– James K Polk
Aug 23 at 21:53

@James Okay. I see. Is there any specific reason that you think- "MD5 will be a lot slower than the hashes in memo's answer (Fast-Hash and MurMurHash3)". Thanks!
– EngineeredBrain
Aug 23 at 22:21

Just my intuition based on my understanding of the number of steps in each hash functions.
– James K Polk
Aug 24 at 2:05

Premature optimisation is the root of all evil. Start with the hashing function that is built into your language of choice, and then hash the lists (M1, M2), etc.. Then profile it and see if that's the bottleneck before you start using third-party hash libraries.

(M1, M2)

My guess is that database lookup will be slower than the hash computation, so it won't matter which hash you use.

In Python you can just call
hash([m1, m2, m3])

hash([m1, m2, m3])

In Java call the hashCode method on your ArrayList.

hashCode

ArrayList

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

CPyk9De2GwP,cxRUyl,n,ykx BVwzbkfWaQGqFl3ioGYo

搜尋此網誌

Cyufjy