conclist/2:
A useful String-list Concatenation predicate for Visual Prolog

implemented in hand-optimised 32-bit X86 Assembly Language

or:
"Proving that Prolog can become a competitive industrial tool, through a specific example!"

       by George A. Stathis ©2005


Linked lists in Visual Prolog are implemented in a specific way, which is also compatible with 'C': - A "Visual Prolog list" consists of 12-byte elements, where each element's first 4 bytes contain an indication of the element's type (1 for "end of list" or 2 for a "normal list-element"), where the next 4 bytes contain the list-element itself (which could be a pointer to another list or structure), and where the last 4 bytes of each element contain a pointer to the next list-element.

During recent years of intensive Visual Prolog programming, the need for a fast and efficient string-list concatenation predicate arose in my own work quite often. 
Such a predicate could also have been implemented in 'C', but -since long ago- I have been implementing such predicates in Assembly Language, which is even better.

The need for such a predicate arose repeatedly in projects of encoding and decoding grammatical data of human dictionaries, in Natural Language programming. Certain types of work with large text-corpora, entailed vast numbers of repeated calls of "conclist/2" within loops, and eventually necessitated this upgrade.

Of course, such a predicate (for concatenating string-lists) can be written very quickly in (any) Prolog, and it consists of less than ten lines of code, using recursion.

Soon enough, however, it becomes evident that any Prolog implementation of this predicate is bound to suffer from very serious drawbacks, especially intense memory problems, due to recursion:

- If each string inside the list is concatenated to an "accumulator argument" (as usually happens in Prolog programming) then the memory allocated internally for this accumulator increases at each step, until it reaches a point where it becomes impossible to process "large strings" (or "long string lists") without crashing the stack! Now, this is typically a very good reason why the industrial software community usually avoids Prolog programming, while the academic community, usuallly interested in theoretical issues (rather than getting real results from real data in the real world)... adores Prolog, in all its (often exotic) abstract facets and specialised "higher order" implementations....

Well, here is a typical Prolog implementation of string-list concatenation, which uses recursion:

/** a "traditional recursive" Prolog implementation: */
PREDICATES
   prolog_conclist(SLIST,CHAR,STRING,STRING)

CLAUSES
   prolog_conclist([S],_,Old,Sx):-
           concat(Old,S,Sx), !.
   prolog_conclist([S|SL],'\0',Old,Sx):-
           concat(Old,S,New),  !,
           prolog_conclist(SL,'\0',New,Sx).
   prolog_conclist([S|SL],Char,Old,Sx):-
           str_char(Sch,Char),
           concat(S,Sch,S2),
           concat(Old,S2,New),  !,
           prolog_conclist(SL,Char,New,Sx).


This does produce the correct result, but at a huge memory cost: - Each string in the list (argument 1) is concatenated with the previous "accumulator argument" 3, causing memory allocations to increase, at every stage. Eventually, this predicate crashes, if the strings are large or if the string-list has too many elements!

To avoid such crashes due to memory, a two-phase strategy may be implemented:


In Visual Prolog (but not in too many other Prologs!) it is possible to implement such a less memory-wasteful solution, by using some internal predicates which are not easily noticed by novice programmers: E.g. the 'C' function 'movmem', which is handy for "destructive" copying of memory blocks from any memory location (such as a string) to any new location (such as a freshly allocated string):

GLOBAL PREDICATES
   PROCEDURE movmem(STRING Source,STRING Dest,UNSIGNED Len) -(i,i,i) language c as "_MEM_MovMem"

Using this predicate, it becomes possible to solve the Stack overflow problem, by copying (instead of concatenating) the string-elements to successive positions inside a large string-butfer, which has been allocated only once. However, the resulting solution is still far from optimal:
Ah well, there is nothing more optimal than hand-optimised Assembly Language. However, traditional X86 Assembly Language programming relies on certain old techniques which have been made obsolete by modern Pentium processors. For this reason, the very first version of this predicate, implemented a few years ago in "naive Assembly Language", was re-written recently, after learning some of the new Pentium optimisation techniques.

What are these Pentium optimisations? - First of all, the "classic" X86 string-instructions 'lodsb', 'stosb', and 'loop' were optimal in 386 and 486 processors, but are no longer optimal for the Pentium. To optimise for the Pentium, these instructions must be avoided as much as possible: E.g. instead of 'lodsb', which reads a byte from register ESI and decrements ECX, combinations of 'mov' and 'inc' are preferrable; e.g. "mov al,[byte ptr ECX]" and "inc ECX" (where ECX plays the role of the string-buffer, faster than ESI).

The results of such simple considerations (which by no means exhaust a vast subject: Pentium optimisation) are shown below:

Nevertheless, and in defense of -even "naive"- Assembly Language programming, even the "non-optimal" Assembly language code (Firgure 2, below) is more than 15 times faster than the corresponding Prolog code!

Figure 2 is the source code for the first 32-bit implementation of the predicate "concatlist/2" (written in mid-2002), and Figure 1 is the source-code of the second (more optimised) new implementation, of a (functionally identical) predicate "conclist/2", written in April 2005. Figure 3 shows some benchmark test-results, which demonstrate that the new "conclist" predicate, compared to "concatlist") is approximately 42% faster. If the string-list concatenation is repeated many times inside a calling program loop, such a speed increase can be extremely important.

Of course, I do not preclude the possibility of further optimisations being still possible. Readers who are more familiar with "MMX" and "SSE" Pentium instructions than the author, may propose new speedups and come up with new insights. Pentium optimisation is a vast new subject, and I don't pretend to be a Pentium optimisation expert; only a "fairly experienced Assembly Language programmer", using ASM to speed up high-level languages (like Prolog).

Another area of optimisation is the efficient use of 32-bit memory transfers. I've already written optimised routines that copy areas of memory 4 bytes at a time. These techniques can also be used to enhance "conclist/2", but only if the strings inside the list are "reasonably big". When the strings are small  (as are English words, for example), then such techniques contain overheads for handling the "odd bytes" which may result in speed reductions, rather than increases.

The source code follows:

Fig. 1)  Source code for the optimised predicate "conclist/2" (written in April 2005):

; ======================= source-file: _conclist.asm ==========================
;
; Purpose: Concatenates a list of strings into a single string, with or without
; a
separator-character (arg 2). If the separator is 0 (ascii), then there will
; be no separator-chars in the output string (arg 3) (concatenation).

IDEAL
P586
MODEL    FLAT

DATASEG

ALIGN 4
vdstr dd voidstr
voidstr db 0,0,0,0

CODESEG
ALIGN 4

    public    _conclist

    include "macros.inc"

PROC _conclist near
    push   ebp      ;
    mov    ebp,esp  ;   
    push   edi      ;
    push   ebx      ;
    cld             ; forwards (just in case!)
    mov ecx,[dword ptr ebp+8]    ;input-list........ARG-1(i)
    mov eax,[ecx]   ; load flag list-element
    dec al          ; is list-element = end?
    jnz @@vx1       ; if so, special processing to give empty_string
; -------------------
    mov     ebx,8   ;
    xor    edx,edx  ;
    mov al,[byte ptr ebp+12]    ; separator........ARG-2(i)
    or    al,al     ; is the separator = 0?
    jz    @@vd1     ; if so, special processing (to NOT write it!)
; -------------------
    jmp short @@ok2 ;
; -------------------
@@LP0:   
    mov eax,[ecx]   ; load flag list-element
    dec al          ; is list-element NORMAL? (=1)?
    jnz @@XX        ; if not so, exit
; -------------------
@@ok2:   
    push    ebx     ;
    push    ecx     ;
    mov ecx,[ecx+4] ; 1st string_pointer in ECX
    xor    eax,eax  ;
@@p1:
    mov bl,[ecx+eax];
    inc    eax      ;
    or    bl,bl     ;
    jnz short @@p1  ;
; -------------------   
    dec    eax      ;
    add    edx,eax  ; add the specific str_len to the total_len
    inc    edx      ; also, add 1 (for separator)
    pop    ecx      ;
    pop    ebx      ;
    add ecx,ebx     ; advance list_ptr by 8
    mov ecx,[ecx]   ; address of next element...
    jmp @@LP0       ; repeat loop for all elements
; ===================
@@XX:
    inc    edx       ;
    push    edx      ;
     call _MEM_AllocGStack      ; allocate memory for this size of string
     mov    edi,eax  ;
    pop    edx       ;
    mov ecx,[dword ptr ebp+8]   ; input-list........ARG-1(i)
    push    edi      ;
    mov ah,[byte ptr ebp+12]    ; separator........ARG-2(i)
    mov    ebx,8     ;
; --------------------
@@Lq0:
    mov al,[byte ptr ecx]    ; load flag list-element
    dec al           ; is list-element NORMAL? (=1)?
    jnz @@ZZ         ; if not so, exit
; ====================
    push    ecx      ;
     mov ecx,[ecx+4] ; 1st string_pointer in ECX
     xor    edx,edx  ;
@@q1:
     mov al,[ecx+edx];
     stosb           ; write the byte
     inc    edx      ;
     or    al,al     ;
     jnz short @@q1  ;
; --------------------
     mov [byte ptr edi-1],ah ; write the separator
    pop    ecx       ;
    add ecx,ebx      ; advance list_ptr by 8
    mov ecx,[ecx]    ; address of next element...
    jmp @@Lq0        ; repeat loop for all elements
; ====================   
@@ZZ:
    xor    al,al     ;
    mov [byte ptr edi-1],al ;
    pop    eax    ; return the new_string in EAX
    pop ebx       ;
    pop edi       ;
    pop ebp       ;
    ret           ;
; ====================
@@vx1:
    mov eax,[vdstr]  ;
    pop ebx          ;
    pop edi          ;
    pop ebp          ;
    ret              ;
; ======================================================================
; ==================== CASE of zero-ascii Separator: ===================
; ======================================================================
@@M1:
    mov eax,[ecx]     ; load flag list-element
    dec al            ; is list-element NORMAL? (=1)?
    jnz @@h2          ; if not so, end this phaser
; ---------------------
@@vd1:
    push ebx          ;
    push ecx          ;
     mov ecx,[ecx+4]  ; 1st string_pointer in ECX
     xor    eax,eax   ;
@@w1:
     mov bl,[ecx+eax] ;
     inc    eax       ;
     or    bl,bl      ;
     jnz   short @@w1 ;
; ---------------------   
     dec    eax       ;
     add    edx,eax   ; add the specific str_len to the total_len
    pop    ecx        ;
    pop    ebx        ;
    add ecx,ebx       ; advance list_ptr by 8
    mov ecx,[ecx]     ; address of next element...
    jmp     @@M1      ; repeat loop for all elements
; =====================
@@h2:
    inc    edx        ;
    push    edx       ;
     call _MEM_AllocGStack    ; allocate memory for this size of string
     mov    edi,eax   ;
    pop    edx        ;
    mov ecx,[dword ptr ebp+8] ; input-list........ARG-1(i)
    push    edi       ;
; ---------------------
@@j0:
    mov al,[byte ptr ecx]  ; load flag list-element
    dec al                 ; is list-element NORMAL? (=1)?
    jnz @@Z3               ; if not so, exit
; ==========================
    push    ecx            ;
     mov    ecx,[ecx+4]    ; 1st string_pointer in ECX
     xor    edx,edx        ;
@@f1:
     mov    al,[ecx+edx]   ;
     stosb                 ; write the byte
     inc    edx            ;
     or    al,al           ;
     jnz    short @@f1     ;
; --------------------------
     dec    edi    ;
    pop    ecx     ;
    add ecx,ebx    ; advance list_ptr by 8
    mov ecx,[ecx]  ; address of next element...
    jmp     @@j0   ; repeat loop for all elements
; ==================   
@@Z3:
    xor   al,al    ;
    mov [byte ptr edi],al ;
    pop   eax      ; return the new_string in EAX
    pop   ebx      ;
    pop   edi      ;
    pop   ebp      ;
    ret
ENDP  _conclist


    end




Fig. 2)  Source code for the ("non-optimised") predicate "concatlist/2" (written in 2002):

; ====================== source-file: _concatlist.asm =========================
;
; Purpose: Concatenates a list of strings into a single string, with or without
; a
separator-character (arg 2). If the separator is 0 (ascii), then there will
; be no separator-chars in the output string (arg 3) (concatenation).

IDEAL
P586
MODEL    FLAT

DATASEG

ALIGN 4
vdstr dd voidstr
voidstr db 0,0,0,0

CODESEG
ALIGN 4

    public    _concatlist

    include "macros.inc"

PROC _concatlist near
ARG slist:dword, sepch:byte
    push   ebp      ;
    mov    ebp,esp  ;
    push   esi      ;
    push   edi      ;
    push   ebx      ;
    cld             ; forwards (just in case!)

    mov esi,[slist] ; make ESI = first argument (list)
    mov al,[byte ptr esi]  ; load flag
    cmp    al,2     ; is it two?
    jnz    @@ok1    ; if not so, proceed as usual...
; -------------------   
    mov eax,[vdstr] ; else, enter with void_string
    pop3            ;
; ===================
@@ok1:
    mov dl,[sepch]  ; make DL = second argument (char)

    mov bl,2        ; make bl = end-element flag
    or    dl,dl     ; zero sepchar?
    jz    @@n1      ; if so, skip next...
@@n0:
     push    esi    ;

     xor    edx,edx ; use EDX as size_counter, initialize it to 0
@@L0:
     lodsd          ; load list-flag

     cmp    al,bl   ; is it 2 (end of list)?
     jz    @@Lx     ; if so, exit this loop
     lodsd          ; string_pointer in EAX
     mov    edi,eax ;
     xor    eax,eax ;
     mov    ecx,eax ;
     not    ecx     ;
     repne scasb    ; find length of string_element (including zero at the end)
     not    ecx     ; now ECX = length of string element + 1
     add    edx,ecx ; add str_length to size_counter
     inc    edx     ; increment it by one (to account for separator)
     lodsd          ; load next list-element pointer into EAX
     mov    esi,eax ; use next list-element pointer, as current list pointer
     jmp    @@L0    ; repeat big loop
; ===================
@@n1:
     push    esi    ;

     xor    edx,edx ; use EDX as size_counter, initialize it to 0
@@L1:
     lodsd          ; load list-flag

     cmp    al,bl   ; is it 2 (end of list)?
     jz    @@Lx     ; if so, exit this loop
     lodsd          ; string_pointer in EAX
     mov    edi,eax ;
     xor    eax,eax ;
     mov    ecx,eax ;
     not    ecx     ;
     repne scasb    ; find length of string_element (including zero at the end)
     not    ecx     ; now ECX = length of string element + 1
     add    edx,ecx ; add str_length to size_counter
     lodsd          ; load next list-element pointer into EAX
     mov    esi,eax ; use next list-element pointer, as current list pointer
     jmp    @@L1    ; repeat big loop
; ===================
@@Lx:   
    mov    ecx,edx ; return ECX = list_size

    pop    esi     ;
    push    esi    ;
     push    ecx   ;
      call _MEM_AllocGStack    ; allocate memory for the result-string
     pop    ecx    ;
    pop    esi     ;
    mov    edi,eax ; edi is the new string pointer
    push    edi    ; store it in the stack
     mov dl,[sepch]; make DL = second argument (char)
     mov dh,2      ; make DH = end-element flag
     jmp    @@L3   ;
; ==================
@@L3a:    
     lodsd         ; load next-element-pointer

     mov  esi,eax  ; turn it into the current list-pointer
@@L3:
     lodsd         ; load flag-element

     cmp    al,dh  ; is AL = end-element flag
     jz    @@XXX   ; if so, end this loop
     lodsd         ; load string-pointer
     push esi      ;
      mov esi,eax  ; make ESI = string
@@M0: lodsb        ; load a byte from the string
      or    al,al  ; is it a zero?
      jz    @@M1   ; if so, stop copying bytes...
      stosb        ; else, write it to the target-string
      jmp    @@M0  ; and repeat this loop
; ==================
@@M1: pop    esi   ; recover list pointer from the stack
      or    dl,dl  ; zero sepchar?
      jz    @@L3a  ; if so, skip next
      mov    al,dl ; else, make AL = sepchar
      stosb        ; and write the sepchar
      jmp    @@L3a ;
@@XXX:    
      dec    al    ;

      dec    al    ; now AL = 0
      or    dl,dl  ; zero sepchar?
      jz    @@Z3   ; if so, skip next
      dec    edi   ;
@@Z3: stosb        ; end the string with a zero
      pop    eax   ; restore and return start of new string in EAX
      pop    ebx   ;
      pop    edi   ;
      pop    esi   ;
      pop    ebp   ;
      ret
ENDP  _concatlist

    end




Fig. 3)  Visual Prolog Test program (compiled as an "EasyWin" 32-bit  executable):

GLOBAL DOMAINS
 SLIST = STRING*

GLOBAL PREDICATES
 STRING conclist(SLIST,CHAR) -(i,i) language c        %CONCLIST.asm (2002)
 STRING concatlist(SLIST,CHAR) -(i,i) language c    %CONCATLIST.asm (2005)

CONSTANTS
%
test_concatlist = 1  % <-- activate this constant for the first test
 test_concatlist_benchmark = 2  %


/** a "traditional recursive" Prolog implementation: */
PREDICATES
   prolog_conclist(SLIST,CHAR,STRING,STRING)

CLAUSES
   prolog_conclist([S],_,Old,Sx):-
           concat(Old,S,Sx), !.
   prolog_conclist([S|SL],'\0',Old,Sx):-
           concat(Old,S,New),  !,
           prolog_conclist(SL,'\0',New,Sx).
   prolog_conclist([S|SL],Char,Old,Sx):-
           str_char(Sch,Char),
           concat(S,Sch,S2),
           concat(Old,S2,New),  !,
           prolog_conclist(SL,Char,New,Sx).
         

ifdef test_concatlist
GOAL
 repeat, write("give a sentence with commas:\n> "), readln(S),
 str2slist(S,',',SL), % another ASM predicate, converting a string-with-separators to a string-list
%NOTE: 
% Use the call  "term_str(slist,SL,S)", if you wish to avoid "str2slist/3",
% but...  then your sentence must have brackets and double-quotes!
%
 write("Give an output_separator (ASCII-number): "),
 readint(ASC),
 char_int(ChSep,ASC),

 prolog_conclist(SL,ChSep,"",X), write("\n(prolog_conclist/3) result:\n\"",X,"\"\n\n"),
 Y = concatlist(SL,ChSep), write("(concatlist/2) result:\n\"",Y,"\"\n\n"),
 Z = conclist(SL,ChSep), write("(conclist/2) result:\n\"",Z,"\"\n\n"),
 readchar(Cx), Cx='\27', !.
enddef

ifdef test_concatlist_benchmark
GOAL
 FileTypeFilters = ["Text files","*.1_x",
                    "Special large Test-files","*.1_x",
            "All Prolog files","*.pro;*.pre;*.con;*.dom;*.inc;*.rc",
            "All files","*.*"],
 F = dlg_GetFileName("*.txt",FileTypeFilters,"Open a file to test",[],"",_,_),
 file_str(F,Txt), str2slist(Txt,'\n',SL), repeat,
 T1a = readclock(), X1 = conclist(SL,'\n'), T1b = readclock(),
% NOTE: "readclock" is another ASM predicate giving the CPU's clock ticks
 T2a = readclock(), X2 = concatlist(SL,'\n'), T2b = readclock(),
% NOTE: "readclock" is another ASM predicate giving the CPU's clock ticks
 T3a = readclock(), prolog_conclist(SL,'\n',"",X3), T3b = readclock(),
% NOTE: "readclock" is another ASM predicate giving the CPU's clock ticks
 X1=X2, write("X1 = X2,\n"), X1=X3, write("X1 = X3,\n"),
 Tdif1 = T1b-T1a, Tdif2 = T2b-T2a, Tdif3 = T3b-T3a, nl,
 writef("Timings:\n(1)%16  <-- conclist/2\n(2)%16 <-- concatlist/2\n(3)%16 <-- prolog_conclist/4\n",
         Tdif1,Tdif2,Tdif3),
 Rdif21 = (Tdif2 / Tdif1), Rdif31 = (Tdif3 / Tdif1),
 writef("Execution times' ratio, Naive ASM compared to 'Optimal ASM':    %12.2f\n",Rdif21),
 writef("Execution times' ratio, Visual Prolog compared to Optimal ASM: %12.2f\n",Rdif31),
 readchar(Cx), Cx='\27', !.
enddef


/* Benchmark results (1): (by cutting and pasting from the Visual Prolog executable)
- Small file case ():


Timings:
(1)           45252  <-- conclist/2
(2)           71456 <-- concatlist/2
(3)         1034448 <-- prolog_conclist/4
Execution times' ratio, 'Naive' ASM(2)  compared to Optimal ASM(1):           1.58
Execution times' ratio, Visual Prolog(3) compared to Optimal ASM(1):        22.86
X1 = X2,
X1 = X3,

Timings:
(1)           44524  <-- conclist/2
(2)           71344 <-- concatlist/2
(3)         1031984 <-- prolog_conclist/4
Execution times' ratio, 'Naive' ASM(2)  compared to Optimal ASM(1):           1.60
Execution times' ratio, Visual Prolog(3) compared to Optimal ASM(1):        23.18
*/


/* Benchmark results (1): (by cutting and pasting from the Visual Prolog executable)
- Large file case (14.3Mb):

Any call to the predicate "prolog_conclist/4" produces the following... ugly message:

PROGRAM ERROR. Module:DICT_FUN.PRO Pos:18647
 Message:1001 Gstack overflow. Not enough memory or an endless loop

Press any key ...
*/



However, by commenting out the prolog implementation, we obtained the following results (for the same 14.3 Megabyte file, converted to a list and then re-assembled as a single string, using ASM):

X1 = X2,
Timings:
(1)        78069876  <-- conclist/2
(2)       135285196 <-- concatlist/2
(3)
Execution times' ratio, Naive ASM compared to 'Optimal ASM':            1.73
X1 = X2,
Timings:
(1)        70840152  <-- conclist/2
(2)       118534664 <-- concatlist/2
(3)
Execution times' ratio, Naive ASM compared to 'Optimal ASM':            1.67
X1 = X2,
Timings:
(1)        70717452  <-- conclist/2
(2)       121604524 <-- concatlist/2
(3)
Execution times' ratio, Naive ASM compared to 'Optimal ASM':            1.72
X1 = X2,
Timings:
(1)        69697228  <-- conclist/2
(2)       118369980 <-- concatlist/2
(3)
Execution times' ratio, Naive ASM compared to 'Optimal ASM':            1.70
*/