?

Log in

No account? Create an account
 
 
17 January 2014 @ 01:45 am
Why do we need surrogate keys in relational databases?  
That's a reply to the question why we need surrogate keys in relational databases.

Let's consider practical example (simplified).
I run a job board.
My job board has jobs and resumes.
For simplicity sake let's pretend that job is a text and resume is a text.

Now I want for every resume find top 50 best matching jobs.

My idea is to split resume into words, split job into words, then calculate cosine similarity between job and resume.
Then I would order all matches by cosine similarity.

That's business model, now it's time to go into database design.

First, I need to put resumes into Resume table.
I need to introduce surrogate key id, because using ResumeText as a key is very unpractical.
The same with Job table.

So we have:
create table Resume(
ResumeId int identity(1,1) not null, -- primary key
ResumeText nvarchar(max) not null)

create table Job(
JobId int identity(1,1) not null, -- primary key
JobText nvarchar(max) not null)

All right, let's keep going, we need to keep parsed words somewhere.
My design is to introduce another surrogate key for words and put it all into Word table:
create table Word(
WordId int identity(1,1) not null, -- primary key (surrogate)
WordText nvarchar(30) not null) -- we would trim words if they are too long.
We would also need IX_WordText index on WordText column.

Then I would create JobWord and ResumeWord tables:
create table JobWord(
JobId int not null, -- IX_ JobId index
WordId int not null) -- IX_WordId index

create table ResumeWord(
ResumeId int not null, -- IX_ ResumeId index
WordId int not null) -- IX_WordId index

That covers the data structure that I need in order to find best matches.
My parser would parse jobs and resumes into words.
Then map words into WordId-s using Word table.
Then insert these WordId-s into JobWord and ResumeWord tables.

At that moment I'm ready to run SQL query that would calculate cosine similarity between any given @jobId and @resumeId.
Or between any given @resumeId and all jobs in my database.

Back to the need for surrogate keys.
Let's imagine that in our database design we decided not to use surrogate key WordId and instead link Word directly to each other.
That would allow us to reduce number of columns. That would also allow us to drop Word table.
But at what price?

The price would be 5x+ increase in JobWord and ResumeWord sizes, because instead of 4 bytes WordId they would have to contain 30 bytes WordText.
Indexes would become about 7 times wider (and ~7 times slower).


In the scenario I described above the price of solving problem (finding matching words for calculating cosine similarity) is much higher than the price of dereferencing the solution (converting JobId into JobText).
That's what I had in mind when I wrote that price of dereference is minimal and only needed when we exit the solution, but price of fat indexes is paid while we are searching for solution.
Tags: ,
 
 
 
журнал закрытjuan_gandhi on January 17th, 2014 07:44 am (UTC)
Man, this is cool, very reasonable, and makes a lot of sense, and very ingenious.

But let me tell you two things.

1. You probably don't need relational databases. Maybe lucene or mongo would do better.
OTOH if it works for you, fine!!! I like it!

2. Why do you think your space is Euclidean? It is probably not. I would suggest a pseudometrics, something like sum((abs(xi-yi))^0.95)

If you need more details, can do.
Dennis Gorelikdennisgorelik on January 17th, 2014 09:23 am (UTC)
> Why do you think your space is Euclidean? It is probably not. I would suggest a pseudometrics, something like sum((abs(xi-yi))^0.95)

What is "xi" and "yi" in that formula?
Let me guess:
"xi" is "weight of Word(i) in resume"
"yi" is "weight of Word(i) in job"
?

If yes, then why does your formula calculate distance?
It should calculate similarity, right?

Here's calculation of ScalarProduct, which is the key component in calculating cosine similarity:
select
	sum(r.WordWeight * j.WordWeight) ScalarProduct,
	r.ResumeId
from ResumeWord r
inner join JobWord j
	on j.WordId = r.WordId
where r.ResumeId = @resumeId
group by j.JobId

Does your formula fit in here?


Cosine similarity is very easy to visualize (the smaller the angle between two vectors in n-dimentional space is - the better is cosine similarity).

How do I visualize your pseudometric?
журнал закрытjuan_gandhi on February 18th, 2014 09:38 pm (UTC)
Actually, people around have convinced me that things are more complicated than I thought; so I have to ponder on all this.
Dennis Gorelikdennisgorelik on February 18th, 2014 09:49 pm (UTC)
Do you mean cosine similarity is more complicated than you thought?
журнал закрытjuan_gandhi on February 18th, 2014 09:56 pm (UTC)
I mean the whole issue is more interesting than I thought.
Dennis Gorelikdennisgorelik on February 18th, 2014 10:48 pm (UTC)
What is "the whole issue"?
Why do we need surrogate keys?
журнал закрытjuan_gandhi on February 18th, 2014 11:23 pm (UTC)
I believe it was about selecting the right metric, and somehow finding euclidean.
Dennis Gorelikdennisgorelik on January 17th, 2014 09:40 am (UTC)
> Maybe lucene or mongo would do better.

I already use ElasticSearch ("Lucene inside") for just out of box full-text search (with minimal tweaks).

But I simply don't know how to handle certain details in ElasticSearch.
For example, I need to exclude job matches that user (who owns resume) already reviewed.
In SQL Server I maintain table with such reviewed job matches, and then exclude them from consideration by appending straightforward "left join" to my SQL query (that calculates scalar product for all jobs).
Would such exclusion be as straightforward in ElasticSearch too?
Den: pic#49724155kranov on February 12th, 2014 02:24 pm (UTC)
10-15 лет назад написана масса статей типа "Естественные ключи против искуственных ключей". Там помнится единственным плюсом ествеенных ключей -- меньше джойнов таблиц, реже ходим к диску. Но сейчас оперативной памяти столько, что в моих базах основное ожидание не чтение индекса/таблицы, а запись журнала при комите. А суррогатные ключи и нормализация позволяют уменьшить объем журналов. Видел презентацию, в которой мужик утверждал что слегка нормализовал бд, уменьшил объем журналов в 3 раза, время отклика улучшилось в разы.