Reading EXCEL CSV-files as Prolog Clauses (SWI-Prolog source-code)

stylized depiction of a csv text file
Image via Wikipedia

If you need to convert into Prolog terms “raw data” supplied in EXCEL csv-files, read on! The source code in this posting will read any CSV file, converting each semicolon-delimited line (or record) of the CSV file into a Prolog clause, asserted in RAM. It is also possible to use the same code to read data deliberately provided (e.g. by another application) as a CSV-file, but which is specifically intended for use as a set of Prolog clauses.

This code also uses a couple of specification predicates: time_field_type/1, field1_as_functor/1, and conv_csvhead/2. These predicates control the behaviour of the conversion process, as follows:

time_field_type/1 :

  • time_field_type(0). In this case, time-fields in the CSV file (of the form “HH:MM” or “HH:MM:SS…”) are translated into minutes, ignoring seconds or hundredths of a second.
  • time_field_type(1). In this case, time-fields in the CSV file (of the form “HH:MM” or “HH:MM:SS…”) are translated into seconds, ignoring hundredths of a second.
  • time_field_type(2). In this case, time-fields in the CSV file are kept as they are, as atoms (e.g. ‘03:35′, ‘12:45:20′, etc).

field1_as_functor/1:

  • field1_as_functor(0): Each line in the CSV-file is interpreted as a prolog clause, where the functor of the clause is the first field of the record, and the other fields are arguments.
  • field1_as_functor(foo) (where ‘foo’ can be any atom): Each line in the CSV file is interpreted as a prolog clause, where the functor of the clause is foo (or any atom supplied as 1st argument to field1_as_functor/1) and all the fields are arguments.

conv_csvhead/2:

  • This predicate is used to convert the contents of the first field (of the CSV-file) into a (user-defined) internal Prolog representation. It is used only if “time_field_type(0)” exists. For example, to convert records where the first field is a Prolog functor ‘job’ but the actual contents of this field are ‘j’ (for brevvity), using a definition “conv_csvhead(j,job)” will convert each ‘j’ into a functor ‘job’. (Use of conv_csvhead/2 is optional; in the default case, it does nothing!)

Finally, some notes:

  • The main predicate to call is “loaddb(CSVfile)“, where CSVfile can be e.g. “test.csv”.
  • Provision has been taken for special fields which contain Lists of items, comma-delimited. In EXCEL these fields will appear as longish strings, but this code was written to parse them as Prolog atom-lists. (Comment-out this section if you don’t need it).
  • The only type of field that is currently not converted into any meaningful internal representation is DATE. Dates are converted to atoms, just as they appear, without parsing their actual contents. (As an exercise, you can re-use parts of the same code to parse date-fields!) The honest reason for this omission is that… I didn’t need dates (in an application I am developing, for which this code was also written).

The source-code follows. There are useful comments inside this code. You can just copy and paste what follows from this point onwards, into a text file saved for compilation by SWI-Prolog, ending in “.pl”:

% ReadingG CSV-files (EXCEL); converting to Prolog clauses
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Open-source SWI-Prolog code by OMADEON (c)2007/2009.
%% Blogs:
%%%%%%%% http://prologsource.wordpress.com
%%%%%%%% http://omadeon.wordpress.com
% time_field_type/1 defines the type of the TIME-fields:
% Cases:
% (0): minute-accuracy,
% (1): seconds-accuracy,
% (N): time preserved as is (atom, e.g. '03:30'); N > 1 
time_field_type(0).
%time_field_type(1).
%time_field_type(2).

% Reading CSV-file, converting to prolog-clauses:
loaddb(File):-
     open(File, read, Stream),
     read_line_to_codes(Stream, T0),
     parse_line(T0,Stream),
     close(Stream).

% user-specific conversion of 1st field to functor:
%conv_csvhead('j','job'):- !. % activate if really needed!
% default-case, when the 1st field is an internal functor
conv_csvhead(X,X):- !.

% specifies if first field of each record is Prolog functor
field1_is_functor(1). % it is a functor
%field1_is_functor(foo). % NOT a functor; arg1 IS the functor!

% end of file
parse_line(end_of_file,_):- !.
% end of input-stream
parse_line([],_):- !.

% current line is parsed; next line follows
parse_line(CODES,Stream):-
     field1_is_functor(1), !,
     csvparse(CODES,XL),
     XL = [First|XXL],
     conv_csvhead(First,Head),
     TERM =.. [Head|XXL],
     assert(TERM),  % assert prolog-clause
     read_line_to_codes(Stream,T2), 
     parse_line(T2,Stream).

% current line is parsed; next line follows
parse_line(CODES,Stream):-
     field1_is_functor(Head), !,
     csvparse(CODES,XL),
     TERM =.. [Head|XL],
     assert(TERM), %%% assert prolog-clause
     read_line_to_codes(Stream,T2),
     parse_line(T2,Stream).

% error message (rarely invoked; included "just in case")
parse_line(Term,Stream):- 
     format(user_error, 'Bad term: ~p~n', [Term]),
     read_line_to_codes(Stream,T2),
     parse_line(T2,Stream).
csvparse2([],_,[[]]):-!.
csvparse2(Inp,Sep,X):-
     csvparse(Inp,Sep,X), !.

csvparse2a([],[[]]):-!.
csvparse2a(Inp,X):- csvparse(Inp,X), !.

% parsing a line delimited by a semicolon(;)
csvparse([],[]):-!.
csvparse(CODES,[FirstOut|More]):-
     append(First,[59|RestCODES],CODES),
     !,
     conv_codes(First,FirstOut),
     csvparse2a(RestCODES,More).
csvparse(Last,[OutX]):- conv_codes(Last,OutX), !.

% parsing line delimited by given separator-char:
csvparse([],_,[]):-!.
csvparse(CODES,SEPc,[FirstOut|More]):-
     append(First,[SEPc|RestCODES],CODES), !,
     conv_codes(First,FirstOut),
     csvparse2(RestCODES,SEPc,More).
csvparse(Last,_,[OutX]):-
     conv_codes(Last,OutX), !.

% check that input-list is only digits or dot(.)
are_digits([C]):- C >= 46, C =< 57, C =\= 47, !.
are_digits([C|CL]):-
      C >= 46, C =< 57, C =\= 47,
      !, are_digits(CL).

% read string enclosed in double_quotes, e.g. "the"
conv_codes([34|Rest],AtomX):-
     append(MAINx,[34],Rest),
     +\memberchk(44,MAINx),
     atom_codes(AtomX,MAINx), !.

% reading an integer or decimal number
conv_codes(CL,Nx):-
     are_digits(CL),
     atom_codes(Ax,CL),
     atom_number(Ax,Nx), !.

% read comma-delimited list in dup-quotes, e.g. "a,b,cd"
conv_codes([34|Rest],LISTx):-
     append(MAINx,[34],Rest),
     memberchk(44,MAINx),
     csvparse(MAINx,44,LISTx), !.

% time_field of the form "HH:MM"
conv_codes([C1,C2,58,C3,C4],OUTx):-
    time_field_type(TYP),
    atom_codes(A1,[C1,C2]),
    atom_codes(A2,[C3,C4]),
    atom_number(A1,N1x),
    atom_number(A2,N2x),
    (TYP = 0 ->       
      OUTx is N2x + 60*N1x % only minutes-accuracy
      ;
      ( TYP = 1 ->
        ( OUTx is 60*(N2x + 60*N1x) ) % seconds-accuracy
        ;
        atom_codes(OUTx,[C1,C2,58,C3,C4]) % time=string
      )
    ), !.

% time_field of the form "HH:MM:SS..."
conv_codes([C1,C2,58,C3,C4,58,C5,C6|CL],OUTx):-
    time_field_type(TYP),
    atom_codes(A1,[C1,C2]),
    atom_codes(A2,[C3,C4]),
    atom_number(A1,N1x),
    atom_number(A2,N2x),
    MINSx is N2x + 60*N1x,
    ( TYP = 0
      -> OUTx = MINSx % if only minutes-accuracy required
        ;
        ( TYP = 1
          -> % use this if seconds-accuracy is required
          ( atom_codes(A3,[C5,C6]),
            atom_number(A3,SECS),
            OUTx is SECS + 60*MINSx
          )
          ;
          atom_codes(OUTx,[C1,C2,58,C3,C4,58,C5,C6|CL])
          % time=string
        )
    ), !.

% a comma-delimited list
conv_codes(MAIN,LISTx):-
     memberchk(44,MAIN),
     csvparse(MAIN,44,LISTx), !.

% remaining case: just an atom
% (including case of time=string):
conv_codes(MAIN,ATOMx):-
     atom_codes(ATOMx,MAIN), !.
Reblog this post [with Zemanta]

3 Responses

  1. [...] Το νέο ποστ είναι κώδικας για ανάγνωση αρχείων EXCEL από SWI-Prolog. [...]

  2. ADO.NET free tutorial at Source Code

Leave a Reply