TL;DR: TFIDF is GR8 IMHO


That is, "Too Long; Didn't Read: Term Frequency - Inverse Document Frequency is Great In My Humble/Honest Opinion" for those of you who don't speak both internet and nerd.


Not long enough; must read more:

TFIDF stands for Term Frequency - Inverse Document Frequency. It's a bit of a mouthful, but it's basically just a simple formula for determining how important certain words are in a given text.

Since I'm using it for the memorization tool, MemorEquip, which I built mostly for Biblical texts, let me explain with John 15:13 ESV as an example:

"Greater love has no one than this, that someone lay down his life for his friends."

So if you were to try to come up with keywords for this text, there are a couple of things you might try. First, you might simply find the most common word in the text. In this case, that'd be "his." That seems like a pretty weak word for a keyword. So maybe you'd try to understand the text a bit more, and perhaps you'd determine that this text is putting an emphasis on "love." While I wouldn't disagree with you, love is a pretty common word in the Bible. Furthermore, we tend to love to memorize passages about love.
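
For the curious, here's a minimal Python sketch of that first attempt, counting words to find the most common one. The tokenizer is my own assumption (lowercase everything and keep runs of letters), so it isn't necessarily how MemorEquip splits words.

```python
import re
from collections import Counter

verse = ("Greater love has no one than this, "
         "that someone lay down his life for his friends.")

def tokenize(text):
    """Lowercase the text and pull out runs of letters (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

# Term frequency: how many times each word occurs in this one passage.
term_frequency = Counter(tokenize(verse))

print(term_frequency.most_common(1))  # [('his', 2)]
```

Every other word in the verse shows up exactly once, so raw counts alone don't give us much to go on.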

So while love may be the emphasis in this passage, if you were told to guess what passage somebody was thinking of and were also told that it was a passage about love, you'd still have too many passages to choose from. It'd be about as helpful as being told the passage contained "the" (which, interestingly, this one doesn't).

To be honest, I'm not entirely sure what the most helpful keywords would be here (I'll look them up and list them below though). If I had to guess, I'd probably say, "lay," "greater," and/or "friends." That said, for the purposes of my memorization tool, I'd argue that the answer could, and should, vary per individual. For instance, let's say I'm a bit of a weirdo (completely hypothetical, of course), and I actually don't have any other passages that pertain whatsoever to love. In that case, "love" would be a pretty solid keyword for me.

Here's a table with the real numbers for me. Note that document frequency numbers are based on the passages I'm actively trying to memorize, and they're not based on Scripture as a whole.

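| Term    | Term Frequency | Document Frequency | Inverse Document Frequency | TFIDF |
|---------|----------------|--------------------|----------------------------|-------|
|         | how many times does the word occur in this text | how many other passages does this word occur in | invert DF | multiply TF and IDF |
| greater | 1 | 1  | 1.0000 | 1.0000 |
| friends | 1 | 2  | 0.5000 | 0.5000 |
| lay     | 1 | 2  | 0.5000 | 0.5000 |
| down    | 1 | 2  | 0.5000 | 0.5000 |
| someone | 1 | 2  | 0.5000 | 0.5000 |
| than    | 1 | 6  | 0.1667 | 0.1667 |
| his     | 2 | 22 | 0.0455 | 0.0909 |
| has     | 1 | 17 | 0.0588 | 0.0588 |
| this    | 1 | 18 | 0.0556 | 0.0556 |
| no      | 1 | 20 | 0.0500 | 0.0500 |
| life    | 1 | 20 | 0.0500 | 0.0500 |
| one     | 1 | 23 | 0.0435 | 0.0435 |
| love    | 1 | 24 | 0.0417 | 0.0417 |
| that    | 1 | 53 | 0.0189 | 0.0189 |
| for     | 1 | 61 | 0.0164 | 0.0164 |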

So in essence, TFIDF just tries to balance how often a word appears in the given text against how common that word is across the other texts it knows about. Ideally, a keyword would occur often in the given text and rarely in the others.
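
If you'd rather see the arithmetic than read about it, here's a rough Python sketch that reproduces the TFIDF column from the table. The document frequencies are copied straight from the table; the tokenizer and the function name `tfidf_scores` are my own assumptions for illustration, not MemorEquip's actual code. Note that the IDF here is the plain reciprocal (1 / DF) that the table uses. Many TFIDF implementations use a logarithmic IDF instead, which would give different numbers.

```python
import re
from collections import Counter

verse = ("Greater love has no one than this, "
         "that someone lay down his life for his friends.")

# Document frequencies copied from the table in this post.
document_frequency = {
    "greater": 1, "friends": 2, "lay": 2, "down": 2, "someone": 2,
    "than": 6, "his": 22, "has": 17, "this": 18, "no": 20,
    "life": 20, "one": 23, "love": 24, "that": 53, "for": 61,
}

def tfidf_scores(text, df):
    """Score each word as TF * IDF, with IDF taken as 1 / DF."""
    # Assumed tokenizer: lowercase, keep runs of letters.
    tf = Counter(re.findall(r"[a-z']+", text.lower()))
    return {term: count * (1 / df[term]) for term, count in tf.items()}

scores = tfidf_scores(verse, document_frequency)

# Print the words from strongest keyword candidate to weakest.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<8} {score:.4f}")
```

"his" is the only word that gets a boost from its term frequency (2 * 1/22, or about 0.0909), but it's so common across my other passages that it still lands well below "greater."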

This isn't exactly cutting-edge artificial intelligence, but I think it's a pretty elegant solution to a problem I actually have when trying to memorize tons of passages at once. While I do ultimately want to be able to associate a passage with its reference (like "John 15:13"), that gets pretty difficult when juggling that many passages. Requesting a keyword ought to be a good starter hint: if nothing clicks just from seeing the reference, the keyword might be enough for you to at least make an educated guess as to which passage it is. And, perhaps less likely but hopefully still possible, it might help you associate the reference with the keyword. If you can associate a reference with a keyword and a keyword with a passage, you've effectively associated a reference with a passage (by the transitive property).

And since I like to try to incorporate an xkcd in every post (and since "xkcd" looks like yet another ridiculous acronym/initialism even though it's not), here's one I like, even if it's a bit of a stretch as far as relevance to the post goes. But hey, it's kinda about love and data, right? And this is dataisbaye after all.


[Image: xkcd comic, "Convincing"]