The Full-Text Stuff That We Didn't Put In The Manual

Home arrow Database arrow MySQL arrow The Full-Text Stuff That We Didn't Put In The Manual
The Full-Text Stuff That We Didn't Put In The Manual Print E-mail
Contributed by Howell   
Monday, 12 June 2006

The MySQL Reference Manual describes "Full-Text Search Functions" with lots of exciting details.

But we leave some things out because they're ephemeral. That is, they could change at any moment. Here, for people who really want to know what's going on with full-text searching as of May 2004 in version 4.1.2, is some information to supplement the manual:

Breaking Up Words
Index File Key Structure
Boolean Search
Exact Phrase Search
Selected Articles
Stopwords

Breaking Up Words
When we see "two words" it's fairly obvious that "two" is a word" and "words" is another word -- we can tell because there's a space between. Generally, anything that's alphabetic or numeric (including accented characters) is part of a word, anything that's not alphabetic or numeric (even "-") is not part of a word and is therefore used as a separator. Two interesting exceptions are the underscore '_' and the apostrophe ', as this example shows:

mysql> INSERT INTO articles (title)
mysql> VALUES ('John O''Hara''s nightmare');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT title FROM articles
    -> WHERE MATCH (title, body) AGAINST ('O''Hara''s');
+-------------------------+
| title                   |
+-------------------------+
| John O'Hara's nightmare |
+-------------------------+
1 row in set (0.00 sec)

Index File Key Structure
The full-text index is much like other indexes: a sorted list of "keys" which point to records in the data file. Each key has:

{
  Word            -- VARCHAR. a word within the text.
  Count           -- LONG. how many times word occurs in text.
}
{
  Weight          -- FLOAT. Our evaluation of the word's importance.
  Rowid           -- a pointer to the row in the data file.
}

This is the version 4.1 format. Earlier, it was a bit simpler.

Boolean Search
MySQL's full-text creator Sergei Golubchik observes that IN BOOLEAN MODE can be "suitable for experienced users" but "formulating a proper query can be very difficult". I don't find it so, in fact I find boolean mode easier because it's closer to what other DBMSs have for full-text search, and closer to the standard (SQL/MM) specification. But let me clear up three things in the documentation.

1. The manual says:

"A leading plus sign indicates that this word must be present in every row returned."
That's not always true. Here's a complicated query:

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('(dbms -wombat) (+mysqld)' IN BOOLEAN MODE);
+----+-------------------+-------------------------------------+
| id | title             | body                                |
+----+-------------------+-------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...        |
|  4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
| 11 | mysqld            | wombat                              |
+----+-------------------+-------------------------------------+
3 rows in set (0.00 sec)

The word 'mysqld' isn't in every row, it's only in the rows that match the parenthesized expression that contains +mysqld.

2. The manual says:

"- A leading minus sign indicates that this word must not be present in any row returned."
Again, that's not always true, and here's another complex example:

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('(dbms -wombat) (mysqld)' IN BOOLEAN MODE);
+----+-------------------+-------------------------------------+
| id | title             | body                                |
+----+-------------------+-------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...        |
|  4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
| 11 | mysqld            | wombat                              |
+----+-------------------+-------------------------------------+
3 rows in set (0.00 sec)

Notice the wombat? This example makes it clear that a leading minus sign can merely mean AND NOT.

3. The manual says:

"The argument to AGAINST() must be a constant string."
This is nothing to worry about. We just haven't updated the manual yet to show that the AGAINST() argument can be a variable or parameter, if you use MySQL Version 5. For example:

mysql> CREATE PROCEDURE p4 (param1 VARCHAR(100))
    -> SELECT * FROM articles
    -> WHERE MATCH (title, body) AGAINST (param1);
Query OK, 0 rows affected (0.00 sec)

mysql> CALL p4('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

Query OK, -1 rows affected (0.00 sec)

Exact Phrase Search
Suppose you want to search for a text that contains "you want to".

 SELECT ... FROM table1
 WHERE MATCH(textcolumn) AGAINST ('"you want to"' IN BOOLEAN MODE)

This is awful, for two reasons:

"You" and "want" and "to" are all stopwords if you're using the default stopword list (see the list of stopwords at the end of this article). In fact this particular query always fails because all words are stopwords.
Even if they weren't stopwords, there is nothing in a key that tells you what the relative position of the words is within the text (see the index file key structure at the start of this article).
So MySQL can't get the answer just by looking in the index. The method has to be: first try to filter out the candidate rows by looking for whatever words aren't stopwords, then examine each candidate row in the data file and make sure the words are there in the correct order. This means exact phrase searching can be a little slower than ordinary searching. On the other hand, MySQL won't find "Sam was tall" if you search for "Sam grew tall". (If "was" and "grew" are both stopwords, there's another DBMS that would find "Sam was tall". We think that looks like a bug.)

Select Articles
As well as the MySQL Reference Manual, many of the many books about MySQL have a section about full-text indexing and querying. Plus, here are some articles available free on the web:

Ian Gilfillan, "Using Fulltext Index in MySQL"
Sergei Golubchik, "MySQL Fulltext Search" (slides)
Joe Stump, "MySQL FULLTEXT Searching"
Stopwords
Here are all the stopwords in the default English file, which (in the MySQL Source download) is myisam/ft_static.c.

a's, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain't, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren't, around, as, aside, ask, asking, associated, at, available, away, awfully, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, c'mon, c's, came, can, can't, cannot, cant, cause, causes, certain, certainly, changes, clearly, co, com, come, comes, concerning, consequently, consider, considering, contain, containing, contains, corresponding, could, couldn't, course, currently, definitely, described, despite, did, didn't, different, do, does, doesn't, doing, don't, done, down, downwards, during, each, edu, eg, eight, either, else, elsewhere, enough, entirely, especially, et, etc, even, ever, every, everybody, everyone, everything, everywhere, ex, exactly, example, except, far, few, fifth, first, five, followed, following, follows, for, former, formerly, forth, four, from, further, furthermore, get, gets, getting, given, gives, go, goes, going, gone, got, gotten, greetings, had, hadn't, happens, hardly, has, hasn't, have, haven't, having, he, he's, hello, help, hence, her, here, here's, hereafter, hereby, herein, hereupon, hers, herself, hi, him, himself, his, hither, hopefully, how, howbeit, however, i'd, i'll, i'm, i've, ie, if, ignored, immediate, in, inasmuch, inc, indeed, indicate, indicated, indicates, inner, insofar, instead, into, inward, is, isn't, it, it'd, it'll, it's, its, itself, just, keep, keeps, kept, know, knows, known, last, lately, later, latter, latterly, least, less, lest, let, let's, like, liked, likely, little, look, looking, looks, ltd, mainly, many, may, maybe, me, mean, meanwhile, merely, might, more, moreover, most, mostly, much, must, my, myself, name, namely, nd, near, nearly, necessary, need, needs, neither, never, nevertheless, new, next, nine, no, nobody, non, none, noone, nor, normally, not, nothing, novel, now, nowhere, obviously, of, off, often, oh, ok, okay, old, on, once, one, ones, only, onto, or, other, others, otherwise, ought, our, ours, ourselves, out, outside, over, overall, own, particular, particularly, per, perhaps, placed, please, plus, possible, presumably, probably, provides, que, quite, qv, rather, rd, re, really, reasonably, regarding, regardless, regards, relatively, respectively, right, said, same, saw, say, saying, says, second, secondly, see, seeing, seem, seemed, seeming, seems, seen, self, selves, sensible, sent, serious, seriously, seven, several, shall, she, should, shouldn't, since, six, so, some, somebody, somehow, someone, something, sometime, sometimes, somewhat, somewhere, soon, sorry, specified, specify, specifying, still, sub, such, sup, sure, t's, take, taken, tell, tends, th, than, thank, thanks, thanx, that, that's, thats, the, their, theirs, them, themselves, then, thence, there, there's, thereafter, thereby, therefore, therein, theres, thereupon, these, they, they'd, they'll, they're, they've, think, third, this, thorough, thoroughly, those, though, three, through, throughout, thru, thus, to, together, too, took, toward, towards, tried, tries, truly, try, trying, twice, two, un, under, unfortunately, unless, unlikely, until, unto, up, upon, us, use, used, useful, uses, using, usually, value, various, very, via, viz, vs, want, wants, was, wasn't, way, we, we'd, we'll, we're, we've, welcome, well, went, were, weren't, what, what's, whatever, when, whence, whenever, where, where's, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, who's, whoever, whole, whom, whose, why, will, willing, wish, with, within, without, won't, wonder, would, would, wouldn't, yes, yet, you, you'd, you'll, you're, you've, your, yours, yourself, yourselves, zero


  home              contact us

 

©2006-2009 DeveloperZone.biz   All rights reserved     powered by Mambo Designed by Siteground