
Dennis Gorelik
11 January 2017 @ 09:26 am
I have two uncles: one lives in Russia, the other in Ukraine.
Today, in a conversation with my Russian uncle, I mentioned in passing for the first time that Russia grabbed Crimea.
My uncle immediately switched to that topic and gave me the Russian version of Ukrainian history. He told me that:
1) Ukraine has never been an independent state.
2) Western countries in general, and the US in particular, meddled in Ukraine and set Ukrainians against Russians.
3) Western countries have always wanted to turn Russia into a weak country.
Thatcher even wanted to leave Russia with just 15 million inhabitants.
4) There was no government in Ukraine at the moment when Crimea held its referendum on secession from Ukraine.
5) If Russian troops had not been brought into Crimea, there would certainly have been many deaths in Crimea (and in Ukraine in general).
6) Even before Crimea's secession, pro-Russian residents of Ukraine were already being killed in large numbers.

My uncle's source of information is Russian television.
Which is very open and invites accredited journalists and other guests from all countries onto its popular official TV channels.

Summing up, my uncle predicted that, no matter what, Crimea will undoubtedly remain part of Russia.
Dennis Gorelik
11 January 2017 @ 06:23 am
Today Andrey and I discovered that the NeuralCrawler we created has brain cancer: out of 844,467 pages, 99.5% are useless junk from 2 sub-domains: "boystown.giftlegacy.com" and "boystowngift.org".

So far we attribute the spread of that cancer to a couple of bugs:
1) Creating extra links with every redirect (unfortunately, the problematic domains generate links with a random sessionId and then redirect from one to another).
2) Not deleting old page links after reparsing page content.
Dennis Gorelik
03 January 2017 @ 12:57 pm
Business context
For years I have wanted to collect new jobs from all over the internet in order to send appealing job alert emails to candidates who created a profile on postjobfree.com.
So, finally, I decided to create a web crawler for that.
However, unlike Google, I do not want to crawl billions of pages (too expensive). Several million pages should be good enough for the first working prototype.
The question is: how do we automatically determine which pages to crawl and which to ignore?
That's why our web crawler is combined with a self-learning neural network.

Data structure
We represent every page as a record in the PageNeuron table (PageNeuronId int, Url varchar(500), …, PageRank real, ...).
We represent links from page to page in the LinkAxon table (..., FromPageNeuronId int, ToPageNeuronId int, …).
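For illustration, the two tables could be mirrored in memory like this (a sketch; only the column names come from the description above — the `weight` field, the example URL, and the Python shape are assumptions):

```python
from dataclasses import dataclass

@dataclass
class PageNeuron:
    page_neuron_id: int       # PageNeuronId int
    url: str                  # Url varchar(500)
    page_rank: float = 0.0    # PageRank real

@dataclass
class LinkAxon:
    from_page_neuron_id: int  # FromPageNeuronId int
    to_page_neuron_id: int    # ToPageNeuronId int
    weight: float = 1.0       # assumed link-weight column

# Illustrative records (hypothetical URL):
page = PageNeuron(1, "https://example.com/jobs", page_rank=1.0)
link = LinkAxon(from_page_neuron_id=1, to_page_neuron_id=2)
```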

PageRank calculations
PageRank is inspired by classic Google PageRank, but we calculate it differently.
Instead of calculating the probability of a visitor click, our NeuralRewardDistribution process distributes PageRank from every PageNeuron record to every connected record (in both directions).
With every "reward distribution" iteration, the NeuralRewardDistribution process distributes about 10% of a page's PageRank to other pages (that amount is split between all destination PageNeuron records proportionally to LinkAxon weights).
Then, in order to prevent self-excitation of the system, NeuralRewardDistribution applies "forgetting" by reducing the PageRank of the original page by 10%.
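One distribution iteration could be sketched like this (a minimal in-memory sketch; the real process works against the SQL tables, and the exact ordering of distribution and forgetting is an assumption):

```python
def distribute_rewards(pages, links, share=0.10, forgetting=0.10):
    """One NeuralRewardDistribution iteration (illustrative sketch).

    pages: dict of page_id -> PageRank
    links: list of (from_id, to_id, weight); rank flows in both
           directions along every LinkAxon.
    """
    # Collect weighted neighbors of every page (both directions).
    neighbors = {pid: [] for pid in pages}
    for from_id, to_id, weight in links:
        neighbors[from_id].append((to_id, weight))
        neighbors[to_id].append((from_id, weight))

    delta = {pid: 0.0 for pid in pages}
    for pid, rank in pages.items():
        targets = neighbors[pid]
        total_weight = sum(w for _, w in targets)
        if total_weight == 0:
            continue
        reward = rank * share  # ~10% flows out of this page
        for target_id, weight in targets:
            # Split proportionally to LinkAxon weights.
            delta[target_id] += reward * weight / total_weight

    # Apply incoming rewards, then 10% "forgetting" on every page.
    return {pid: (rank + delta[pid]) * (1 - forgetting)
            for pid, rank in pages.items()}
```

With a single link 1 → 2 and PageRank 1.0 on page 1, one iteration moves 0.1 to page 2 and then shrinks both pages by 10%, so the system cannot excite itself indefinitely.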

Setting goals
When NeuralPageEvaluator parses crawled pages, it tries to detect the words and patterns we need.
Every time NeuralPageEvaluator finds something useful, it adds a reward in the form of extra PageRank for the responsible PageNeuron record. For example, we reward:
- 1 PageRank point for words such as "job", "jobs", "career", "hr".
- 10 PageRank points for words such as "hrms", "taleo", "jobvite", "icims".
- 1000 PageRank points when the parser discovers a link to a new XML job feed in the content of a PageNeuron record.
- 20 PageRank points when the parser discovers a link to an XML job feed that we already discovered in the past.
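The reward rules above could be sketched as follows (the word lists and point values come from the list above; the function shape and parameter names are assumptions):

```python
COMMON_WORDS = {"job", "jobs", "career", "hr"}     # 1 point each
ATS_WORDS = {"hrms", "taleo", "jobvite", "icims"}  # 10 points each

def page_reward(words, new_feed_links=0, known_feed_links=0):
    """Total PageRank reward for one parsed page (sketch)."""
    reward = 0.0
    for word in words:
        w = word.lower()
        if w in COMMON_WORDS:
            reward += 1
        elif w in ATS_WORDS:
            reward += 10
    reward += 1000 * new_feed_links   # newly discovered XML job feeds
    reward += 20 * known_feed_links   # feeds already known from the past
    return reward
```

For example, a page containing "Jobs" and "taleo" that also links to one new XML job feed would collect 1 + 10 + 1000 = 1011 points.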

What to crawl
NeuralPageProcessor processes already crawled pages (PageNeuron) by passing them to NeuralPageEvaluator.
NeuralPageEvaluator returns a collection of outgoing links from the parsed page.
If an extracted outgoing link is new, NeuralPageProcessor creates a new PageNeuron record for it. For the initial PageRank it uses 10% of the source PageNeuron record's PageRank, multiplied by the link share of that new URL among all other URLs that the source PageNeuron record points to.
NeuralCrawler crawls the new PageNeuron records with the highest PageRank first.
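The initial-PageRank rule could be sketched as a small helper (the 10% factor and link-share idea come from the description above; the function name and signature are assumptions):

```python
def initial_page_rank(source_rank, outgoing_urls, new_url, share=0.10):
    """Initial PageRank for a newly discovered URL: 10% of the
    source page's PageRank, scaled by the new URL's share among
    all URLs the source page points to (sketch)."""
    link_share = outgoing_urls.count(new_url) / len(outgoing_urls)
    return source_rank * share * link_share
```

So a source page with PageRank 100 that points at four URLs, two of which are the new one, would seed the new PageNeuron with 100 × 0.1 × 0.5 = 5 points.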

NeuralRewardDistribution deletes PageNeuron records (and all corresponding LinkAxon records) if their PageRank is too low.
The current "delete threshold" is PageRank = 0.01, which deletes about half of the ~3 million PageNeuron records we have already created.
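In a simplified in-memory form, the crawl-priority and pruning steps might look like this (function names are assumptions; the real processes run against the PageNeuron table):

```python
DELETE_THRESHOLD = 0.01  # current threshold from the post

def next_pages_to_crawl(pages, crawled_ids, limit=100):
    """Pick uncrawled PageNeuron ids with the highest PageRank."""
    candidates = [(rank, pid) for pid, rank in pages.items()
                  if pid not in crawled_ids]
    candidates.sort(reverse=True)
    return [pid for _, pid in candidates[:limit]]

def prune(pages, threshold=DELETE_THRESHOLD):
    """Drop PageNeuron records whose PageRank fell below the delete
    threshold (their LinkAxon rows would be deleted with them)."""
    return {pid: rank for pid, rank in pages.items() if rank >= threshold}
```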
Dennis Gorelik
31 December 2016 @ 01:56 pm
Quote #1: A scientist's aim in a discussion with his colleagues is not to persuade, but to clarify.

Quote #2: The most important step in getting a job done is the recognition of the problem. Once I recognize a problem I usually can think of someone who can work it out better than I could.

Quote #3: Don't lie if you don't have to.

Quote #4: I have been asked whether I would agree that the tragedy of the scientist is that he is able to bring about great advances in our knowledge, which mankind may then proceed to use for purposes of destruction. My answer is that this is not the tragedy of the scientist; it is the tragedy of mankind. (See: Szilárd petition).

Quote #5: I'm in no hurry to get to Mars or Venus. I don't value the exploration of the solar system as much as maybe others do.

Leo Szilard ... conceived the nuclear chain reaction in 1933, patented the idea of a nuclear reactor with Enrico Fermi.
Szilard submitted patent applications for a linear accelerator in 1928, and a cyclotron in 1929. He also conceived the idea of an electron microscope.
After Adolf Hitler became chancellor of Germany in 1933, Szilard urged his family and friends to flee Europe while they still could.
After the war, Szilard switched to biology. He invented the chemostat, discovered feedback inhibition, and was involved in the first cloning of a human cell. He publicly sounded the alarm about the possible development of salted thermonuclear bombs, a new kind of nuclear weapon that might annihilate mankind. Diagnosed with bladder cancer in 1960, he underwent a cobalt-60 treatment that he had designed himself.
Dennis Gorelik
"I went through a period where I questioned things, but now I believe religion is very important."
He didn’t answer further questions about what he does believe in.
Dennis Gorelik
25 December 2016 @ 11:36 am
I'm moving my LJ to https://dennisgorelik.dreamwidth.org/

How to move:
1) Create a new Dreamwidth account:

2) Import your old LiveJournal account:

3) Set up cross-posting to LJ:

Thanks to k0m4atka

Why move:
1) LiveJournal decided to move their servers from California to Moscow.
That means Russian authorities can do whatever they want with my data whenever they want.

2) The LiveJournal technical team seems to be getting worse: more downtime.
That is probably caused by a decline in team morale (the team is pressured by the Russian government to move under Russian control).

Originally posted at: https://dennisgorelik.dreamwidth.org/120208.html
Dennis Gorelik
06 December 2016 @ 08:44 pm
AT&T brought fiber internet to my house today.
I got the "cheaper/slower" option: 100 Mbps for $70/month.

Connection speed test is impressive:
Download: 124 Mbps
Upload: 122 Mbps

The speed test to Russia looks funny: download speed (35 Mbps) is less than upload speed (85 Mbps).

Compare that with the pathetic 1 Mbps upload speed to Russia on my old Comcast connection (due to high ping).

A PingPlotter view of the AT&T fiber connection:
Dennis Gorelik
29 November 2016 @ 04:37 pm

About one week ago we noticed that uploads from my computer to remote servers had become quite slow.
Speed tests indicate that while download speed is good (80 - 90 Mbps), upload speed gets worse the more remote the server is.

My computer is in the Jacksonville area, and upload speed to Orlando servers is quite fast:
Orlando, FL
Ping: 21 ms
Download speed: 90.13 Mbps
Upload speed: 9.55 Mbps

That is close to the 10 Mbps upload speed that Comcast claims for my account.

However, the further away the server is, the longer the ping time and the worse the upload speed:
Dallas, TX
Ping: 47 ms
Download speed: 90.12 Mbps
Upload speed: 5.14 Mbps

Seattle, WA
Ping: 83 ms
Download speed: 90.13 Mbps
Upload speed: 3.12 Mbps

London, UK
Ping: 115 ms
Download speed: 87.99 Mbps
Upload speed: 2.31 Mbps

Chelyabinsk, Russia
Ping: 188 ms
Download speed: 86.41 Mbps
Upload speed: 1.52 Mbps

Why does upload speed deteriorate proportionally to the ping to the remote server, while download speed stays the same?
My hypothesis is that the network protocol waits for a response from the remote server after every chunk of data (instead of continuing to send new chunks).
But why?
A Comcast technician came today and found that there were some network issues with the upload frequency. Lost packets during upload may trigger resynchronization waits, which would drag upload speed down on connections with higher ping.
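That "wait for a response" behavior matches how TCP's send window caps throughput: a sender can have at most one window of unacknowledged data in flight, so throughput is bounded by roughly window size divided by round-trip time. A quick sketch, assuming a 64 KB effective window (an assumption, not a measured value):

```python
# Throughput ceiling from the classic TCP window limit:
# at most one window of unacknowledged data per round trip.
WINDOW_BYTES = 64 * 1024  # assumed effective window

def max_throughput_mbps(ping_ms):
    rtt_seconds = ping_ms / 1000.0
    return WINDOW_BYTES * 8 / rtt_seconds / 1_000_000

for city, ping in [("Orlando", 21), ("Dallas", 47), ("Seattle", 83),
                   ("London", 115), ("Chelyabinsk", 188)]:
    print(f"{city}: ~{max_throughput_mbps(ping):.1f} Mbps ceiling")
```

Under that assumption the ceiling falls with ping the same way the measured upload speeds do (about 25 Mbps at 21 ms down to about 2.8 Mbps at 188 ms). The measured speeds being lower still is consistent with the packet loss the technician found, since retransmissions waste part of each window.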

The Comcast technician promised that within a couple of days the Comcast network crew should fix the issues.
We'll see how that changes my upload speed.

Update (thanks to anspa's recommendation):
PingPlotter shows a lot of packet loss on my Internet connection:
Dennis Gorelik
14 November 2016 @ 09:54 am
I did not know that Donald Trump played a role in WrestleMania:

And this debate looks like preparation for the Presidential debates:

Other Donald Trump cameos:

Thanks to reytsman