View Issue Details

IDProjectCategoryView StatusLast Update
0000518LDMud 3.5Implementationpublic2011-02-13 21:38
Reporterzesstra Assigned Tozesstra  
PrioritynormalSeverityfeatureReproducibilityN/A
Status resolvedResolutionfixed 
Target Version3.5.0Fixed in Version3.5.0 
Summary0000518: Support for longer (double-word) string hashes and larger string tables
DescriptionThe string hashes (whash_t) are currently an unsigned short (16 bits, a word). Thus the string table is limited to about 65k hash chains.
I wondered if it would beneficial for muds with much more tabled string than that (MorgenGrauen has now about 463916, but the uptime is still short), to increase the hash length of string hashes to 32 bits (double-word, p_uint). Of course, the actual hash table length will be smaller than 2^32 entries. ;-)
The disadvantage is an increase of used memory per string (16 bits) and some for the string table (calculation of the hash isn't an issue here).
The hash function D.J. Bernstein uses in his CDB is reportedly very fast and efficient. Therefore I just decided to give it a try this afternoon.

The attached patch introduces a new compile-time switch USE_DWHASH. Only if set, the larger hashes and the DJB hash function will be used.
* new typedef: p_uint dwhash_t
* 2 new hash functions: dwhashmem2() and dwashstr(). dwhashmem2() is aliased to dwhasmem() by a define.

If USE_DWHASH is defined:
* changes the hash in string_s to dwhash_t
* changes the hash calculation and returning function to dwhash_t and to use dwhashmem(), dwashstr() in mstrings.c|h
* likewise changes to otable.h|c, lex.c (lex.c is not strictly necessary, as the identifer table is anyway much shorter than 2^16, but is hasn't any impact on the hash calculation speed.)

I haven't done more than initial tests yet.
I compared the pure calculation speed of the old and new hash functions and found, that the new one is actually a tiny bit faster. (all done with the same string)
1000000 dwhashstr():
real 0m0.499s
user 0m0.486s
sys 0m0.006s

1000000 whashstr():
real 0m0.570s
user 0m0.556s
sys 0m0.006s

1000000 chashstr():
real 0m0.578s
user 0m0.571s
sys 0m0.005s

In a freshly started homemud (with 131072 entries in the string table) the average search length by content decreased from about 1.5 to 1.2, the amount of used hash chains obviously increases (about 40%). Memory consumption increased by a few hundred kb, but this number is crude.

This patch a just an experiment and I may well have missed some points in the code, obviously it shouldn't be used in a real mud right now. But if anybody would like to do some tests with it concerning performance and stability or has some comments and ideas, I would be happy.
Additional Informationhttp://cr.yp.to/cdb/cdb.txt
http://en.wikipedia.org/wiki/Constant_Data_Base
http://www.mathematik.uni-ulm.de/sai/ws02/oodb/slides/physindex-03.html
TagsNo tags attached.
Attached Files
dwhash.diff (11,022 bytes)   
diff -ur ldmud-3.3.714.orig/src/autoconf/configure.in ldmud-3.3.714.mod/src/autoconf/configure.in
--- ldmud-3.3.714.orig/src/autoconf/configure.in	2006-07-10 04:44:05.000000000 +0200
+++ ldmud-3.3.714.mod/src/autoconf/configure.in	2007-09-23 17:26:08.000000000 +0200
@@ -371,6 +371,7 @@
 AC_CDEF_FROM_ENABLE(malloc_sbrk_trace)
 AC_CDEF_FROM_ENABLE(dynamic_costs)
 AC_CDEF_FROM_ENABLE(trace_code)
+AC_CDEF_FROM_ENABLE(use_dwhash)
 
 AC_CDEF_FROM_ENABLE(rxcache_table)
 AC_CDEF_FROM_ENABLE(synchronous_heart_beat)
@@ -2657,6 +2658,7 @@
 AC_SUBST(cdef_malloc_lpc_trace)
 AC_SUBST(cdef_malloc_sbrk_trace)
 AC_SUBST(cdef_dynamic_costs)
+AC_SUBST(cdef_use_dwhash)
 
 AC_SUBST(cdef_rxcache_table)
 AC_SUBST(cdef_wizlist_file)
diff -ur ldmud-3.3.714.orig/src/config.h.in ldmud-3.3.714.mod/src/config.h.in
--- ldmud-3.3.714.orig/src/config.h.in	2006-07-10 04:44:06.000000000 +0200
+++ ldmud-3.3.714.mod/src/config.h.in	2007-09-23 19:45:58.000000000 +0200
@@ -486,6 +486,7 @@
  */
 @cdef_rxcache_table@ RXCACHE_TABLE            @val_rxcache_table@
 
+
 /* --- Current Developments ---
  * These options can be used to disable developments-in-progress if their
  * code turns out to be interrrupting.
@@ -500,6 +501,11 @@
  */
 @cdef_use_new_inlines@ USE_NEW_INLINES
 
+/* Use larger string hashes (double-words, 32 bit) and D.J. Bernsteins
+ * Hash function?
+ */
+@cdef_use_dwhash@ USE_DWHASH
+
 
 /* --- Profiling ---
  * For profiling of the VM instruction implementations, refer to the Makefile
diff -ur ldmud-3.3.714.orig/src/hash.c ldmud-3.3.714.mod/src/hash.c
--- ldmud-3.3.714.orig/src/hash.c	2006-07-10 04:42:06.000000000 +0200
+++ ldmud-3.3.714.mod/src/hash.c	2007-09-23 19:33:54.000000000 +0200
@@ -161,5 +161,52 @@
     return h;
 } /* chashstr() */
 
-/***************************************************************************/
+/*-------------------------------------------------------------------------*/
+
+/* The following functions basically implement Dan J. Bernsteins Hash 
+ * function, also used in his CDB */
+dwhash_t
+dwhashmem2 (const char *s, size_t len, int maxn, dwhash_t initial)
+
+/* Hash the first min(<len>,<maxn>) characters of string <s> into an 
+ * unsigned long integer (p_uint, double word) and return this hashed value.
+ * The hash value is initialized with <initial>, so that this function
+ * can be used to hash a series of strings into one result.
+ * Aliased to dwhashmem()
+ */
+{
+    register dwhash_t hash = initial;
+    register size_t i = maxn;
+    
+    if (i > len)
+        i = len;
+    
+    while(i--) {
+        hash += (hash << 5);
+        hash = hash ^ *(s++);
+    }
+    return (hash);
+}
+
+/*-------------------------------------------------------------------------*/
 
+dwhash_t
+dwhashstr (const char *s, int maxn)
+
+/* Hash the first min(<len>,<maxn>) characters of string <s> (or until \0,
+ * whatever comes first) into an unsigned long integer (p_uint, double word) 
+ * and return this hashed value.
+ */
+{
+    register dwhash_t hash = INITIAL_DWHASH;
+    register size_t i = maxn;
+    register unsigned char c;
+    
+    while(i-- && (c = *s++) != '\0') {
+        hash += (hash << 5);
+        hash = hash ^ c;
+    }
+    return (hash);
+}
+
+/***************************************************************************/
diff -ur ldmud-3.3.714.orig/src/hash.h ldmud-3.3.714.mod/src/hash.h
--- ldmud-3.3.714.orig/src/hash.h	2006-07-10 04:42:06.000000000 +0200
+++ ldmud-3.3.714.mod/src/hash.h	2007-09-23 19:29:21.000000000 +0200
@@ -3,16 +3,25 @@
 
 #include <stdlib.h>
 #include <limits.h>
+#include "driver.h"
 
 typedef unsigned short whash_t;
 typedef unsigned char  chash_t;
+typedef p_uint         dwhash_t;
 
-#define MAX_WHASH (USHRT_MAX)
-#define MAX_CHASH (UCHAR_MAX)
+#define MAX_WHASH  (USHRT_MAX)
+#define MAX_CHASH  (UCHAR_MAX)
+#define MAX_DWHASH (PUINT_MAX)
+
+#define INITIAL_DWHASH 5381
 
 extern whash_t whashmem (const char *s, size_t len, int maxn);
 extern whash_t whashmem2 (const char *s, size_t len, int maxn, whash_t initial);
 extern whash_t whashstr (const char *s, int maxn);
 extern chash_t chashstr (const char *s, int maxn);
+extern dwhash_t dwhashmem2 (const char *s, size_t len, int maxn, dwhash_t initial);
+extern dwhash_t dwhashstr (const char *s, int maxn);
+
+#define dwhashmem(pTxt,len,maxn)   dwhashmem2(pTxt, len, maxn, INITIAL_DWHASH)
 
 #endif /* HASH_H__ */
diff -ur ldmud-3.3.714.orig/src/lex.c ldmud-3.3.714.mod/src/lex.c
--- ldmud-3.3.714.orig/src/lex.c	2006-07-10 04:43:48.000000000 +0200
+++ ldmud-3.3.714.mod/src/lex.c	2007-09-23 18:05:54.000000000 +0200
@@ -367,7 +367,11 @@
 #if ITABLE_SIZE == 256
 #    define identhash(s) chashstr((s), 12)
 #else
-#    define identhash(s) (whashstr((s), 12) % ITABLE_SIZE)
+#    ifndef USE_DWHASH
+#        define identhash(s) (whashstr((s), 12) % ITABLE_SIZE)
+#    else
+#        define identhash(s) (dwhashstr((s), 12) % ITABLE_SIZE)
+#    endif
 #endif
   /* Hash an identifier name (c-string) into a table index.
    */
diff -ur ldmud-3.3.714.orig/src/mstrings.c ldmud-3.3.714.mod/src/mstrings.c
--- ldmud-3.3.714.orig/src/mstrings.c	2006-07-10 04:43:25.000000000 +0200
+++ ldmud-3.3.714.mod/src/mstrings.c	2007-09-23 17:02:47.000000000 +0200
@@ -179,7 +179,11 @@
 #endif /* EXT_STRING_STATS */
 
 /*-------------------------------------------------------------------------*/
+#ifndef USE_DWHASH
 static INLINE whash_t
+#else
+static INLINE dwhash_t
+#endif
 hash_string (const char * const s, size_t size)
 
 /* Compute the hash for string <s> of length <size> and return it.
@@ -187,16 +191,22 @@
  */
 
 {
-    whash_t hash;
-
-    hash = whashmem(s, size, MSTRING_HASH_LENGTH);
+#ifndef USE_DWHASH
+    whash_t hash = whashmem(s, size, MSTRING_HASH_LENGTH);
+#else
+    dwhash_t hash = dwhashmem(s, size, MSTRING_HASH_LENGTH);
+#endif
     if (!hash)
         hash = 1 << (sizeof (hash) * CHAR_BIT - 1);
     return hash;
 } /* hash_string() */
 
 /*-------------------------------------------------------------------------*/
+#ifndef USE_DWHASH
 static INLINE whash_t
+#else
+static INLINE dwhash_t
+#endif
 get_hash (string_t * pStr)
 
 /* Return the hash of string <pStr>, computing it if necessary.
@@ -210,7 +220,11 @@
 } /* get_hash() */
 
 /*-------------------------------------------------------------------------*/
+#ifndef USE_DWHASH
 whash_t
+#else
+dwhash_t
+#endif
 mstring_get_hash (string_t * pStr)
 
 /* Aliased to: mstr_get_hash()
@@ -224,8 +238,11 @@
 
 /*-------------------------------------------------------------------------*/
 static INLINE string_t *
+#ifndef USE_DWHASH
 find_and_move (const char * const s, size_t size, whash_t hash)
-
+#else
+find_and_move (const char * const s, size_t size, dwhash_t hash)
+#endif
 /* If <s> is a tabled string of length <size> and <hash> in the related
  * stringtable chain: find it, move it to the head of the chain and return its
  * string_t*.
@@ -314,8 +331,11 @@
 
 /*-------------------------------------------------------------------------*/
 static INLINE string_t *
+#ifndef USE_DWHASH
 make_new_tabled (const char * const pTxt, size_t size, whash_t hash MTRACE_DECL)
-
+#else
+make_new_tabled (const char * const pTxt, size_t size, dwhash_t hash MTRACE_DECL)
+#endif
 /* Helper function for mstring_new_tabled() and mstring_new_n_tabled().
  *
  * Create a new tabled string by copying the data string <pTxt> of length
@@ -481,7 +501,11 @@
  */
 
 {
+#ifndef USE_DWHASH
     whash_t    hash;
+#else
+    dwhash_t   hash;
+#endif
     size_t     size;
     string_t * string;
 
@@ -513,7 +537,11 @@
  */
 
 {
+#ifndef USE_DWHASH
     whash_t    hash;
+#else
+    dwhash_t   hash;
+#endif
     string_t * string;
 
     hash = hash_string(pTxt, size);
@@ -546,7 +574,11 @@
 
 {
     string_t *string;
+#ifndef USE_DWHASH
     whash_t    hash;
+#else
+    dwhash_t   hash;
+#endif
     int        idx;
     size_t     size;
     size_t     msize;
@@ -749,7 +781,11 @@
  */
 
 {
-    whash_t hash;
+#ifndef USE_DWHASH
+    whash_t    hash;
+#else
+    dwhash_t   hash;
+#endif
     size_t  size;
 
 #ifdef EXT_STRING_STATS
@@ -790,7 +826,11 @@
  */
 
 {
-    whash_t hash;
+#ifndef USE_DWHASH
+    whash_t    hash;
+#else
+    dwhash_t   hash;
+#endif
 
     hash = hash_string(pTxt, size);
 
@@ -1019,8 +1059,13 @@
 
     /* Strings with a smaller hash also count as 'less'. */
     {
+#ifndef USE_DWHASH
         whash_t hash1 = get_hash(pStr1);
         whash_t hash2 = get_hash(pStr2);
+#else
+        dwhash_t hash1 = get_hash(pStr1);
+        dwhash_t hash2 = get_hash(pStr2);
+#endif
         if (hash1 != hash2)
             return  hash1 < hash2 ? -1 : 1;
     }
diff -ur ldmud-3.3.714.orig/src/mstrings.h ldmud-3.3.714.mod/src/mstrings.h
--- ldmud-3.3.714.orig/src/mstrings.h	2006-07-10 04:43:26.000000000 +0200
+++ ldmud-3.3.714.mod/src/mstrings.h	2007-09-23 16:54:35.000000000 +0200
@@ -47,7 +47,11 @@
     } info;
     string_t * next;    /* Linkpointer in string table. */
     size_t     size;    /* Length of the string */
+#ifndef USE_DWHASH
     whash_t    hash;    /* 0, or the hash of the string */
+#else
+    dwhash_t   hash;    /* 0, or the hash of the string */
+#endif
     char       txt[1];  /* In fact .size characters plus one '\0' */
       /* The string text follows here */
 };
@@ -65,7 +69,11 @@
 /* --- Prototypes --- */
 
 extern void mstring_init (void);
+#ifndef USE_DWHASH
 extern whash_t    mstring_get_hash (string_t * pStr);
+#else
+extern dwhash_t   mstring_get_hash (string_t * pStr);
+#endif
 extern string_t * mstring_alloc_string (size_t iSize MTRACE_DECL);
 extern string_t * mstring_new_string (const char * const pTxt MTRACE_DECL);
 extern string_t * mstring_new_n_string (const char * const pTxt, size_t len MTRACE_DECL);
diff -ur ldmud-3.3.714.orig/src/otable.c ldmud-3.3.714.mod/src/otable.c
--- ldmud-3.3.714.orig/src/otable.c	2006-07-10 04:42:56.000000000 +0200
+++ ldmud-3.3.714.mod/src/otable.c	2007-09-23 17:47:34.000000000 +0200
@@ -48,13 +48,20 @@
 /*=========================================================================*/
 /*                           OBJECT TABLE                                  */
 /*-------------------------------------------------------------------------*/
-
 #if !( (OTABLE_SIZE) & (OTABLE_SIZE)-1 )
-#    define ObjHash(s)        (mstr_get_hash(s) & ((OTABLE_SIZE)-1) )
+#  define ObjHash(s)        (mstr_get_hash(s) & ((OTABLE_SIZE)-1) )
+#  ifndef USE_DWHASH
 #    define ObjHashStr(s,len) (whashmem(s, len, MSTRING_HASH_LENGTH) & ((OTABLE_SIZE)-1) )
+#  else
+#    define ObjHashStr(s,len) (dwhashmem(s, len, MSTRING_HASH_LENGTH) & ((OTABLE_SIZE)-1) )
+#  endif
 #else
-#    define ObjHash(s)        (mstr_get_hash(s) % OTABLE_SIZE)
+#  define ObjHash(s)        (mstr_get_hash(s) % OTABLE_SIZE)
+#  ifndef USE_DWHASH
 #    define ObjHashStr(s,len) (whashmem(s, len, MSTRING_HASH_LENGTH) % OTABLE_SIZE)
+#  else
+#    define ObjHashStr(s,len) (dwhashmem(s, len, MSTRING_HASH_LENGTH) % OTABLE_SIZE)
+#  endif
 #endif
 /* Hash the string <s> and compute the appropriate table index
  */
dwhash.diff (11,022 bytes)   
0014-Dump-tables-strings-upon-shutdown.patch (2,113 bytes)   
From 69c93c001578184b238699f15b0170dee22ef0d9 Mon Sep 17 00:00:00 2001
From: zesstra <zesstra@zesstra.de>
Date: Thu, 4 Jun 2009 01:29:57 +0200
Subject: [PATCH 14/14] Dump tables strings upon shutdown.

---
 src/main.c     |    5 +++++
 src/mstrings.c |   21 +++++++++++++++++++++
 src/mstrings.h |    4 +++-
 3 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/src/main.c b/src/main.c
index 861f387..362979d 100644
--- a/src/main.c
+++ b/src/main.c
@@ -698,6 +698,11 @@ main (int argc, char **argv)
     tls_global_deinit();
 #endif
 
+#ifdef DUMP_STRINGS
+    // dump strings
+    mstrings_dump_all_strings();
+#endif
+    
     return rc; /* TODO: There are constants for this */
 } /* main() */
 
diff --git a/src/mstrings.c b/src/mstrings.c
index 6458425..95eb2ab 100644
--- a/src/mstrings.c
+++ b/src/mstrings.c
@@ -1862,4 +1862,25 @@ string_dinfo_status (svalue_t *svp, int value)
 #undef ST_NUMBER
 } /* string_dinfo_status() */
 
+void mstrings_dump_all_strings()
+{
+#ifdef DUMP_STRINGS
+    FILE *f = fopen(DUMP_STRINGS, "a");
+    if (!f)
+        return;
+    
+    int i;
+    for(i=0; i < HTABLE_SIZE; i++) {
+        string_t *s = stringtable[i];
+        while (s) {
+            fwrite(get_txt(s), 1, mstrsize(s), f);
+            if (*(get_txt(s)+mstrsize(s)-1) != '\n')
+                fwrite("\n", 1, 1, f);
+            s = s->next;
+        }
+    }
+    
+    fclose(f);
+#endif
+}
 /***************************************************************************/
diff --git a/src/mstrings.h b/src/mstrings.h
index bc95d1f..9acc0be 100644
--- a/src/mstrings.h
+++ b/src/mstrings.h
@@ -6,6 +6,8 @@
 
 #include "hash.h"
 
+#define DUMP_STRINGS "strings.dump"
+
 /* --- Types --- */
 
 /* --- struct string_s : String structure ---
@@ -113,7 +115,7 @@ extern void mstring_gc_table (void);
 
 extern mp_int add_string_status (strbuf_t *sbuf, Bool verbose);
 extern void   string_dinfo_status(svalue_t *svp, int value);
-
+extern void   mstrings_dump_all_strings();
 
 /* --- Inline functions and macros --- */
 static INLINE size_t mstr_mem_size(const string_t * const s) 
-- 
1.6.1

Relationships

child of 0000564 closedzesstra LDMud 3.6 Consolidate hash functions 

Activities

Gnomi

2008-07-01 03:09

manager   ~0000640

Just to add a number: In UNItopia we currently have 713974 tabled and 165125 untabled strings (after an uptime of 106 days). So we have 100% collisions and a bigger table would be appropriate. It would cost 2MB, so it would not hurt, but also not be taken lightly.

Is that a candidate for 3.5?

zesstra

2008-07-01 05:37

administrator   ~0000645

It is similar in Morgengrauen: 681094 tabled strings, 130178 untabled, uptime: 137 days.
I would like to try this in 3.5. I will also have to keep a 32-vs-64-bit issue here in mind, because my initial patch uses p_uint for storing the hash value, this should be changed.

zesstra

2009-06-01 02:48

administrator   ~0001174

I started working on updating my old patch and did some more research about the CDB hash function. According to http://burtleburtle.net/bob/hash/doobs.html it works especially well for short keys (5 characters mixed case english words have no collisions), but on other cases, it has its problems.
From the same page I conclude, that there are 2-3 alternatives:
* Paul Hsieh's hash (also on http://www.azillionmonkeys.com/qed/hash.html)
* Bob Jenkins hash (lookup3.c)
* Murmurhash (also on http://murmurhash.googlepages.com/)

From these, lookup3.c seems to be the strongest one, Paul Hsiehs' SuperfastHast has a small performance gain, but has funnels of 3 bits into 2. MurmurHash is the fastest. Bob Jenkins mentions, it is weaker than lookup3, but he doesn't eloborate on details.
I would probably favor Bob Jenkins hash function - the downside is a more complicated code.

zesstra

2009-06-02 14:57

administrator   ~0001186

Ok, I experimentally used Bob Jenkins lookup3 hash function.

Some speed comparison:
(Test case: Hashing the string "Also ich oute mich mal jetzt als Wurstseher, der von einer Quest oefters nur 90% (oder so) selber rauskriegt." 10000000 times.)

Old hash, non-optimized
real 0m7.806s
user 0m7.745s
sys 0m0.028s

New hash, non-optimized
real 0m4.066s
user 0m4.041s
sys 0m0.014s

Old hash, -O3
real 0m2.880s
user 0m2.857s
sys 0m0.010s

New hash, -O3
real 0m1.181s
user 0m1.156s
sys 0m0.008s

So far, so good. Now remains to find out, if the strength of Jenkins hash concerning collisions is really as high as promised... ;-)

zesstra

2009-06-02 16:01

administrator   ~0001187

Ok, one more: MurmurHash2.
Murmur, -O0
real 0m3.418s
user 0m3.394s
sys 0m0.013s

Murmur, -O3
real 0m0.602s
user 0m0.594s
sys 0m0.005s

Looks quite impressive. That is a factor of 5 compared to our current hash. ;-)

I found some additional information about the strength of the hashes:
http://murmurhash.googlepages.com/statistics
http://murmurhash.googlepages.com/avalanche
http://home.comcast.net/~bretm/hash/

zesstra

2009-06-03 15:45

administrator   ~0001191

Ok, I did some tests concerning the hash quality of the three hashes: Pearsons (the old one), Jenkins, Murmur. I used a dictionary containning 51248 distinct english words.
Explanation of the output below: I used a hash table for storing the strings, but used only 20 of 32 bits of the hash for the table. Therefore, the number of table collisions (hash chains with a length > 1) is significantly larger than the number of real hash collisions, but nonetheless interesting for the application in the driver I think.

1) Jenkins
51248 Strings added, 48887 hash chains created.
2361 table collisions occurred
1 real hash collisions
Average hash chain length: 1.0483
Collisions:
6bd5a61c - torsion; 6bd5a61c - Heineken


2) Murmur
51248 Strings added, 48781 hash chains created.
2467 table collisions occurred
1 real hash collisions
Average hash chain length: 1.0506
Collisions:
fc95820d - ope; fc95820d - dodger


There is not much difference concerning the number of real collisions, but Jenkins hash seems to be slightly better distributed, the are less table collisions, more hash chains used and the average hash chain length is smaller.

For comparison the old hash:
51248 Strings added, 25888 hash chains created.
25360 table collisions occurred
15276 real hash collisions
Average hash chain length: 1.7175
Collisions:

I spare the output of 15276 collision. ;-)

zesstra

2009-06-04 01:52

administrator   ~0001193

I would like to do some real world tests with strings used in several muds. The idea is to dump all tabled strings upon shutdown into a file. I attached a patch to do this.
Obviously, it is probably a bad idea to apply that patch to live muds, but if you have a test mud, I would be happy if would try it.

Sending me the string dump may be a potential danger, because the strings may contain private/confidential information (decide yourself). For this case, I will later attach a collection of small test programs for testing a few hashes, so you can test locally on you string collection.

_xtian_

2009-06-10 03:42

reporter   ~0001197

Do you need this patch/test applied against a specific version? (also: there ist no tarball available for the 3,3.719 release, yet)

zesstra

2009-06-11 03:21

administrator   ~0001198

The patch was build for the head of the trunk, but it should apply cleanly on the head of 3.3 and 3.3.719 as well.
Yes, the release was tagged, but not yet published to the homepage.

zesstra

2009-06-22 02:24

administrator   ~0001227

I did some tests with a collection of 738875 unique strings from MorgenGrauen yesterday. The results are shown below for different hash table sizes. Pearson is our old hash function.
Seems like the difference between Jenkins and Murmur is not significant, but Murmur ist faster (see 0000518:0001186 and 0000518:0001187).

Columns of the tables:
Chain -> no. of chains created.
Table Collisions -> no. of collision in the table
Coll -> no. of real hash collisions
S -> length of shortest hash chain
LC -> length of longest hash chain
Avg. len -> average hash chain length.

16bit
  Hash |Chain|T-Coll|Coll |S|LC|Avg. len
"Pearson" |65530|673345|673343|1|30|11,2754
"Murmur" |65531|673344|71 |1|28|11,2752
"Jenkins3"|65534|673341|72 |1|28|11,2747

17bit
  Hash |Chain|T-Coll|Coll |S|LC|Avg. len
"Pearson" |65532 |673343|673343|1|30|11,2750
"Murmur" |130597|608278|71 |1|18|5,6577
"Jenkins3"|130593|608282|72 |1|28|5,6578

18bit
  Hash |Chain|T-Coll|Coll |S|LC|Avg. len
"Pearson" |65532 |673343|673343|1|30|11,2750
"Murmur" |246585|492290|71 |1|14|2,9964
"Jenkins3"|246383|492492|72 |1|14|2,9989

19bit
  Hash |Chain|T-Coll|Coll |S|LC|Avg. len
"Pearson" |65532 |673343|673343|1|30|11,2750
"Murmur" |396070|342805|71 |1|10|1,8655
"Jenkins3"|396078|342797|72 |1|10|1,8655

20bit
  Hash |Chain|T-Coll|Coll |S|LC|Avg. len
"Pearson" |65532 |673343|673343|1|30|11,2750
"Murmur" |529885|208990|71 |1|7|1,3944
"Jenkins3"|530800|208075|72 |1|8|1,3920

zesstra

2009-09-13 08:50

administrator   ~0001264

Ok, some tests with strings from Avalon were similar to the ones with the string from MG. I replaced now the Jenkins Hash by Appleby's Murmurhash2. This concludes the issue for now.
One thing left: the hash function in its current from doesn't care about alignment. Or rather: it assumes, it can read a uint32_t from any pointer it gets. If that causes problems, we have to introduce a second variant and check for alignment problems with autoconf.

Issue History

Date Modified Username Field Change
2007-09-23 12:16 zesstra New Issue
2007-09-23 12:17 zesstra File Added: dwhash.diff
2008-06-30 02:56 zesstra Status new => assigned
2008-06-30 02:56 zesstra Assigned To => zesstra
2008-07-01 03:09 Gnomi Note Added: 0000640
2008-07-01 05:37 zesstra Note Added: 0000645
2008-07-02 01:13 Gnomi Project LDMud 3.3 => LDMud 3.5
2008-09-09 07:45 zesstra Relationship added child of 0000564
2009-06-01 02:48 zesstra Note Added: 0001174
2009-06-02 14:57 zesstra Note Added: 0001186
2009-06-02 16:01 zesstra Note Added: 0001187
2009-06-03 15:45 zesstra Note Added: 0001191
2009-06-04 01:47 zesstra File Added: 0014-Dump-tables-strings-upon-shutdown.patch
2009-06-04 01:52 zesstra Note Added: 0001193
2009-06-10 03:42 _xtian_ Note Added: 0001197
2009-06-11 03:21 zesstra Note Added: 0001198
2009-06-22 02:24 zesstra Note Added: 0001227
2009-09-13 08:50 zesstra Note Added: 0001264
2009-09-13 08:50 zesstra Status assigned => resolved
2009-09-13 08:50 zesstra Fixed in Version => 3.5.0
2009-09-13 08:50 zesstra Resolution open => fixed
2011-02-13 21:38 zesstra Target Version => 3.5.0