Sino - Yet another search engine for the Web
Andrew Mowbray
Australasian Legal Information Institute
- Overview
- Building a Sino Index
- Invoking Sino
- The Sino Search Language
- Sino Search Basics
- Words and Phrases
- Boolean and Proximity Operators
- Boolean AND
- Boolean OR
- Boolean NOT
- Proximity Operators
- Named Sections (Segments)
- Keyed Fields (Dates etc)
- Precedence
- Search Language Emulations
- Info-One
- Lexis
- Status
- C and agrep
Sino is a free text retrieval engine intended for use with httpd
and other embedded applications. Why yet another text retrieval engine for the
Web, you might ask? Well, to be honest, it all started as a bit of a joke.
Geoff (that is, Geoff King - the AustLII manager) was all impressed
about Glimpse's small concordance sizes. Peter (that is, Peter van Dijk -
AustLII's principal consultant) was busy destroying CPU time with
some beautiful (his words, not mine) hypertext mark-up scripts. More to
annoy them than anything else, I thought I'd write something which went totally
against the grain (mind you, Peter has not always been exactly renowned for his
"green" disk space conservation policies). Enter Sino - short
for Size is no object - a free text retrieval system built for retrieval
times over index size. It worked like a charm - the first search we threw at
it on the AustLII production machine (something like "a*") managed to
produce a temporary spill file of almost a quarter of a gigabyte. This was bad, but
at least it was fast ...
The main things I have tried to achieve in building Sino are as
follows:
- annoy Peter (and to a lesser, but still significant extent - Geoff). They
still feel that I could be doing something more productive
- write something that anyone could use for free-to-air services like
AustLII
- provide a much more respectable search language and interface than was
available on any of the existing public domain products (particularly from an
Australian lawyers' perspective)
- produce something that is fast (no real magic needed here, just a
conventional inverted file approach with a few smarts borrowed from my
old free text system - Airs)
- don't get too hung up about index sizes (the AustLII indexes are
running at 30% of text size, which to my mind is more than acceptable)
- try to keep indexing times within sensible limits (AustLII's Sparc 20
is taking about an hour to index 60,000+ files containing 250+M)
- keep it portable so that it will at least run under Windows and on the Mac as
well as under Unix
- try not to produce 1/4G spill files again!
What started as a fairly light-hearted project has developed into a serious
system. Sino is now quite stable and is running as
the production search engine on AustLII.
To create a Sino index just run sinomake from the root of
the directory you wish to index. You can specify include and exclude patterns
via the files .sino_include and .sino_exclude respectively.
Each file should contain a list of regular expressions matching the files you
do or don't want indexed. If you don't specify anything, the default is
to include all .htm, .html and .txt files and to exclude
anything starting with a dot. You will
probably also want to create a .sino_common file listing
all common (non-indexed) words. If you're feeling like tweaking
this, you can display frequently occurring words with sinoshow
-fsize where size is the minimum number of word occurrences to
display.
Sinomake will produce the files .sino_words, .sino_hits
and .sino_docs in the current directory.
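To make the include/exclude behaviour concrete, here is a minimal Python sketch of the kind of filtering described above. It is not sinomake's code: the helper name should_index is hypothetical, and it assumes that patterns are tested against the file name only and that an exclude match always wins.

```python
import re
from pathlib import PurePosixPath

# Assumed defaults, following the description above: index .htm, .html and
# .txt files, and skip anything whose name starts with a dot.
DEFAULT_INCLUDE = [r"\.html?$", r"\.txt$"]
DEFAULT_EXCLUDE = [r"^\."]


def should_index(path, include=DEFAULT_INCLUDE, exclude=DEFAULT_EXCLUDE):
    """Return True if `path` passes the include list and avoids the exclude list.

    Hypothetical helper: patterns are tested against the file name only,
    which is an assumption about how sinomake applies them.
    """
    name = PurePosixPath(path).name
    if any(re.search(pattern, name) for pattern in exclude):
        return False
    return any(re.search(pattern, name) for pattern in include)


if __name__ == "__main__":
    for candidate in ["index.html", "notes.txt", ".sino_words", "logo.gif"]:
        print(candidate, should_index(candidate))
```

Passing your own lists in place of the defaults mimics what a pair of .sino_include and .sino_exclude files would express.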
The current usage for sino (the search engine) is:
sino [ -n ] [ directory-name ]
[ directory-mask ]
If you just want to test the results of sinomake, then set the
default directory to the one containing the .sino files and type
sino. This will give you a (ridiculously simple) search interface.
If you are calling Sino from something else, you will probably want to
call it in non-interactive mode:
sino -n directory-name [directory-mask]
This will look for the .sino files in directory-name, read a
search as one line on standard input and spit out the search results as
file-name SPACE title NEWLINE on standard output
(where file-name is the file name to be displayed, rooted at
directory-name, and title is the HTML title of the document) -
simple but elegant.
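If you are calling Sino from a script, something like the following Python sketch could wrap the non-interactive mode. The function names are hypothetical, and the parser assumes that file names contain no spaces, so the first space on each line separates the file name from the title.

```python
import subprocess


def parse_results(output):
    """Split `sino -n` output into (file_name, title) pairs.

    Each line is `file-name SPACE title`. Only the first space separates
    the two fields, so titles may contain spaces. Assumes file names
    themselves contain no spaces.
    """
    results = []
    for line in output.splitlines():
        if not line.strip():
            continue
        file_name, _, title = line.partition(" ")
        results.append((file_name, title))
    return results


def run_search(query, directory, mask=None):
    """Run `sino -n directory [mask]`, feeding the query on standard input."""
    command = ["sino", "-n", directory]
    if mask is not None:
        command.append(mask)
    completed = subprocess.run(
        command, input=query + "\n", capture_output=True, text=True, check=True
    )
    return parse_results(completed.stdout)
```

A call such as run_search("smith near brown", "/data/austlii") would then return a list of pairs ready to be turned into links (the directory name here is made up).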
The directory-mask argument may be used to restrict search results to
particular directories. When sino is displaying results, this is matched
against the start of found file names. Only matching documents will be
displayed.
The Sino search language is rather cosmopolitan. If you have used one
of the popular on-line legal database systems (or even if you haven't) you
probably do not need to learn anything new. Most Lexis, Status,
Info-One, (and for the non-lawyers, even C and agrep)
style searches are recognised. This section is intended for people who need to
understand exactly what Sino can parse. If you know "zot" about free
text searching, see the next section. Otherwise, if you do not have a deep-seated
interest in Sino, you might want to quickly browse the relevant part of the
Search Language Emulations section.
When you do a Sino search, you are fundamentally searching for documents
which contain some words or phrases. If you can come up with a phrase which
you think is distinctive enough, just type it in and hit the return key! If
you need to find documents containing more than a single word or phrase, things
get a little (but not a lot) more complicated.
If you want more than one phrase or word to appear in the retrieved documents,
put an and between them. For example, to find documents
containing the phrase "moral rights" as well as the word
"copyright", you would type: "moral rights and
copyright" (less the quotes of course).
If, on the other hand, you want to find one term and/or another one, put an
or between them. For example, to find stuff which contains the
words "treaty", "convention" or "international agreement"
you would search for: "treaty or convention or international agreement".
If you wanted to, you could even put these two searches together - as in:
"treaty or convention or international agreement and moral rights and
copyright".
If you want to find two words or phrases which appear close to each other (for
example, the parties to a case), you can use the near connector.
If you wanted to find cases where Smith sued (or was sued by)
Brown, you might type: "smith near brown".
The rest of this document is a fairly detailed description of how Sino
searches documents. If you're new to free text searching, you might want to go
away and have a play at this point, and come back when you have some
questions.
Now, let's get technical ... The basic unit of a Sino search is the
word. A word is any continuous sequence of alphanumeric characters.
Words are case insensitive. All words are searchable other than a relatively
small list of common words which is specified for each database. The
list of non-searchable words is typically quite small (less than 100 words) and
is generally limited to words of little informational content (such as
"the", "is", "but" and so forth).
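The definition of a word can be illustrated with a short Python sketch. This is only an approximation under stated assumptions: "alphanumeric" is taken to mean ASCII letters and digits, and the common-word list below is a made-up sample rather than any real database's list.

```python
import re

# A tiny illustrative common-word list; real lists are set per database.
COMMON_WORDS = {"the", "is", "but", "of", "a"}


def searchable_words(text, common=COMMON_WORDS):
    """Return the searchable words in `text`, in order.

    Words are runs of ASCII letters and digits, folded to lower case.
    Anything in the common-word list is dropped.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in common]


if __name__ == "__main__":
    print(searchable_words("The Sale of Goods Act 1923"))
```

Note that punctuation acts purely as a separator here, so "Pty. Ltd." yields the two words "pty" and "ltd".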
Words may be combined into phrases without the need for any special connectors
(eg. "pervert the course of justice").
Sino automatically expands searches to match regular English plurals
(that is, a search for "treaty" will also match
"treaties" and a search for "contract" will match
"contracts"). The search parser allows for Unix shell
style pattern matching, including the ability to forward truncate
(particularly handy for Norwegian!). The following wild cards are
recognised:
- *
- matches any string (including null)
- ?
- matches any single character
- [ ... ]
- matches any one of the enclosed characters. A pair of characters
separated by a '-' matches a range of characters (eg [a-c]
will match 'a', 'b', or 'c').
If the first character is a '^' or a '!', characters not enclosed
are matched (eg [^a-c] will match anything except
'a', 'b' or 'c').
The pattern must match an entire word. To search for words containing
substrings, use "*substring*". The left square bracket
symbol is also used for boolean grouping. Where you wish to start a word with
a [ ... ], you need to put the whole word in quotes
(eg "[ab]*ing").
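For readers who want to experiment with these wild cards outside Sino, Python's standard fnmatch module implements very similar shell-style matching. The sketch below is an approximation, not Sino's matcher: word_matches is a hypothetical helper, and because fnmatch only accepts '!' for negated classes, a leading '^' inside brackets is rewritten first.

```python
from fnmatch import fnmatchcase


def word_matches(pattern, word):
    """Case-insensitive, whole-word match of a shell-style pattern.

    fnmatch only understands '!' for negated classes, so a leading '^'
    inside brackets is rewritten to '!' before matching.
    """
    normalised = pattern.lower().replace("[^", "[!")
    return fnmatchcase(word.lower(), normalised)


if __name__ == "__main__":
    print(word_matches("treat*", "Treaty"))      # pattern covers the whole word
    print(word_matches("treat*", "retreat"))     # no match: must match entire word
    print(word_matches("*treat*", "retreating")) # substring search
```

As in Sino, the pattern has to account for the entire word, which is why the substring form needs a star at both ends.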
As far as is consistent, Sino also supports regular expressions.
It will, for example, treat the sequence ".*" as "*", ignore
'^' and '$' characters and will even deal with
agrep's '#' character. The main limitation is that
sequences such as "[0-9]*" will not work.
Care should be taken when applying pattern matching to ensure that patterns are
not ridiculously wild. The Sino search engine has to combine all of the
occurrence information for each matched word with a boolean OR. Patterns such
as "*" or even "a*" will lead to rather slow search
times!
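The cost of a wild pattern is easy to see in a toy model. The Python sketch below assumes a made-up inverted index that maps each word to a set of document numbers; it expands a pattern against the word list and then ORs the matching sets together, which is the general idea described above rather than Sino's actual data structures.

```python
from fnmatch import fnmatchcase

# Toy inverted index: word -> set of document ids (made-up data).
INDEX = {
    "treaty": {1, 4},
    "treaties": {2},
    "treatment": {3, 4},
    "trust": {5},
}


def expand(pattern, index=INDEX):
    """Return the indexed words matched by a shell-style pattern."""
    return sorted(word for word in index if fnmatchcase(word, pattern))


def search_pattern(pattern, index=INDEX):
    """OR together the document sets of every word the pattern matches."""
    docs = set()
    for word in expand(pattern, index):
        docs |= index[word]
    return docs


if __name__ == "__main__":
    print(expand("treat*"), search_pattern("treat*"))
    print(len(expand("*")), "words would have to be combined for '*'")
```

Every extra word a pattern matches adds another set to the union, so a pattern like "*" touches the whole vocabulary.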
Words and phrases may be connected together with boolean and
proximity operators to form more complex searches. The operators are
borrowed from a number of existing free text retrieval systems. They may be
used in any combination and regardless of their heritage.
The boolean AND operator allows you to identify documents which contain
two (or more) words or phrases. It may be written as: "and",
'+', '&', "&&" or ';'. Some typical
searches are:
- copyright and material form
- 18 and crimes act 1900
- defamation and journalist and newspaper
Where the keyword "and" is used to indicate a
boolean AND it has low precedence (like on Lexis) - it is only
evaluated after both of its arguments have been fully evaluated. Where it is
written in any of the other forms, it has a (more traditionally) higher
precedence than a boolean OR. The rationale for this is that OR is usually
used for synonyms which ought to group tightly and so giving AND a lower
precedence is usually more convenient for free text searching and is less
likely to lead neophyte searchers into difficulties.
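The effect of the low-precedence "and" keyword can be shown with ordinary sets. In this Python sketch the document numbers are made up, '|' stands for OR and '&' for AND on sets; the brackets are written out explicitly because Python's own operator precedence differs from Sino's.

```python
# Made-up sets of document ids for three search terms.
docs = {
    "treaty": {1, 2},
    "convention": {2, 3},
    "copyright": {2, 3, 4},
}

# "treaty or convention and copyright": the keyword "and" is applied last,
# so the search groups as (treaty or convention) and copyright.
keyword_and = (docs["treaty"] | docs["convention"]) & docs["copyright"]

# For comparison, the grouping you would have to ask for with brackets:
# treaty or (convention and copyright).
bracketed = docs["treaty"] | (docs["convention"] & docs["copyright"])

if __name__ == "__main__":
    print(sorted(keyword_and))  # → [2, 3]
    print(sorted(bracketed))    # → [1, 2, 3]
```

Document 1 mentions only "treaty", so it drops out under the default grouping but survives under the bracketed one.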
The boolean OR operator is used to find documents containing either or
both of two terms and is typically used to find synonymous words and phrases.
It is written in Sino as: "or", ',', '|' or
"||". Examples include:
- section or s
- husband or wife or spouse
- proprietary limited or p l or pty ltd
The NOT operator allows you to find documents which contain one thing
but not another. It may be written as: "not", '-', or
'^'. In practice, this operator is seldom used, but to illustrate:
- trust not family
- trade practice act not 51
Proximity operators are used to find documents where two or more terms
appear near each other. Sino indexes documents in terms of where
words appear. Consequently, all proximity operators are in terms of word
positions. The simplest form of this class of operators is "near" (as
used on Info One). This operator requires that words or phrases appear
within 50 words of each other. For example:
- smith near brown
- 31 near bail act 1900
Although convenient, this operator is obviously a little on the
restrictive side. For more flexible proximity searching, you have the choice
of Lexis or Status style operators. These take the following
forms:
- /n/
- words and phrases must appear within n words of each other
(STATUS)
- /m, n/
- words must appear within m to n words of
each other (STATUS)
- w/n
- words or phrases must occur within n words of each
other (Lexis)
- pre/n
- first word must precede second by less than n words
(Lexis)
For example:
- smith w/10 brown
- smith /10/ brown
- smith /-10,10/ brown [ All find the word 'smith' within 10 words of
'brown']
- smith pre/10 brown
- smith /1,10/ brown [ Both find 'smith' followed by 'brown' up to 10 words
later ]
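Since all of these operators work on word positions, they can be modelled with one range test. The Python sketch below is an illustration only: the function names are hypothetical, each argument is a list of positions at which a word occurs in one document, and pre/n is read as /1,n/ because that is how the examples above pair them up.

```python
def within_range(positions_a, positions_b, low, high):
    """True if some b occurs between `low` and `high` words after some a.

    Offsets are computed as b - a, so negative values mean b comes before a.
    """
    return any(low <= b - a <= high for a in positions_a for b in positions_b)


def within(positions_a, positions_b, n):
    """Sketch of `a w/n b` or `a /n/ b`: either order, at most n words apart."""
    return within_range(positions_a, positions_b, -n, n)


def precedes(positions_a, positions_b, n):
    """Sketch of `a pre/n b`, read here as `a /1,n/ b`."""
    return within_range(positions_a, positions_b, 1, n)


def near(positions_a, positions_b):
    """Sketch of `a near b`: within 50 words of each other."""
    return within(positions_a, positions_b, 50)


if __name__ == "__main__":
    smith, brown = [3], [9]
    print(within(smith, brown, 10), precedes(smith, brown, 10))
    print(precedes(brown, smith, 10))
```

With 'smith' at position 3 and 'brown' at position 9, both within and precedes succeed for n = 10, while swapping the arguments to precedes fails because 'brown' does not come first.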
Named section (segment) searching takes one of the following forms:
- section(searchterms)
- phrase @ section
Standard named sections are title (the HTML title of a document) and
text (everything).
Date searches take the following forms:
- [#]date = date
- [#]date < date
- [#]date > date
- [#]date >< date
Any sensible (English style) date is OK.
Normally searches are evaluated from left to right. This is subject to the
following order of operator precedence (highest to lowest):
- word
- ( terms) phrase
- w/n pre/n w/seg /n/ /m,
n/ @ name ( terms )
- or & &&
- and not ^ || | , ;
You can use parentheses to alter this. Round, square and curly brackets
are all recognised. If you need to make any special symbols literal, these
should be enclosed in quotes (double, single or back quotes).
The following tables list available elements from the emulated search
languages:
Info One is a commercial Australian provider of CD-ROM based and on-line
services covering (primarily) State case law. Their CD and on-line products
both use the same search language. Sino supports the following Info
One style operators:
- and
- boolean AND (words/phrases must appear in same document)
- or
- boolean OR (either or both words/phrases must appear)
- not
- boolean NOT (the first word/phrase must appear but the second word
must not)
- near
- words and phrases must appear within 50 words of each other
- @
- word or phrase must appear in specified section
- [ ]
- square brackets may be used to group operators
- "term"
- double quotes may be used to escape the special
meaning of and, not etc
- #key
- Info-One style date searches are supported
In general, the implementation is fairly faithful to the original. The fact
that Sino indexes words rather than characters, means that the
near operator has slightly different meaning. Another slight
difference is that or has higher precedence than
and (a common error for many neophytes anyway). As some
punctuation characters have special meaning to other search languages, it is
important not to include such characters in searches.
Lexis/Nexis is the world's largest on-line legal database. The search
language has been adopted by several other commercial products, including
the Innerview software as used by the Australian CD-ROM producer
DiskROM. Sino supports the following Lexis constructs:
- and
- boolean AND (words/phrases must appear in same document)
- or
- boolean OR (either or both words/phrases must appear)
- and not
- boolean NOT (the first word must appear but the second word must
not)
- w/seg
- words and phrases must appear within the same section (segment)
- w/n
- words or phrases must occur within n words of each
other
- pre/n
- first word must precede second by less than n words
- section(terms)
- word or phrase must appear in
specified section
- ()
- round brackets may be used to group operators
- "term"
- double quotes may be used to escape the special
meaning of and, not etc
- key
- Lexis style date searches are supported
Sino counts common words (noise words) as occupying word
positions for search purposes. This will give subtly different results from
Lexis for searches such as "sale goods" (which will not match "sale of
goods"). There is currently no support for the Lexis operators not
w/n and not w/seg.
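The "sale goods" example follows directly from counting common words as positions. The Python sketch below models this under two assumptions of mine: common words are left out of the index but still use up a position, and a common word inside a search phrase simply advances the expected position. The common-word list is a made-up sample.

```python
import re

# Made-up sample of common (non-indexed) words.
COMMON_WORDS = {"the", "of", "is", "but"}


def tokenize(text):
    """Split text into lower-case runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(text):
    """Map each indexed word to its positions; common words still use a slot."""
    index = {}
    for position, word in enumerate(tokenize(text)):
        if word not in COMMON_WORDS:
            index.setdefault(word, set()).add(position)
    return index


def contains_phrase(index, phrase):
    """True if the indexed words of `phrase` occur at the right relative positions."""
    terms = [(i, w) for i, w in enumerate(tokenize(phrase)) if w not in COMMON_WORDS]
    if not terms:
        return False
    first_offset, first_word = terms[0]
    for position in index.get(first_word, ()):
        start = position - first_offset
        if all(start + offset in index.get(word, ()) for offset, word in terms):
            return True
    return False


if __name__ == "__main__":
    index = build_index("the sale of goods act")
    print(contains_phrase(index, "sale of goods"))  # → True
    print(contains_phrase(index, "sale goods"))     # → False
```

Here 'sale' sits at position 1 and 'goods' at position 3, so a phrase that expects them side by side cannot match.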
Status was one of the first free text retrieval systems to be developed
(in the early 1970s!). It was used by the short-lived Eurolex service
and is still in use in Australia by the Commonwealth Attorney General's service
Scale. Sino allows the following operators:
- +
- boolean AND (words/phrases must appear in same document)
- ,
- boolean OR (either or both words/phrases must appear)
- -
- boolean NOT (the first word/phrase must appear but the second word must
not)
- /n/
- words and phrases must appear within n words of each other
- /m, n/
- words must appear within m to n words of
each other (m and n may be negative)
- @
- word or phrase must appear in specified section
- ()
- round brackets may be used to group operators
- #key
- Status style date searches are supported
Sino does not index paragraphs and so the // (within paragraph)
operator is not available. The meaning of /n/ is more
general (but more useful) than is the case for Status. Otherwise, the
implementation is fairly close to the original.
For users who come from a computing science background, the following C-like
and agrep-like operators are also supported:
- & or &&
- C-like boolean AND (words/phrases must
appear in same document)
- | or ||
- C-like boolean OR (either or both words/phrases must
appear)
- ;
- agrep-like boolean AND (words/phrases must appear in same
document)
- ,
- agrep-like boolean OR (either or both words/phrases must
appear)
- ^
- boolean NOT (the first word/phrase must appear but the second word
must not)
- () or {} or []
- square, round or curly brackets may be used
to group operators
The implementation of C and agrep style searches is pretty half-hearted
and is only really intended for casual use.
URL: http://www.paclii.org/paclii/help/sino.html