How to get your IP unbanned on HN
220 points by pg on Nov 9, 2012 | 76 comments
HN has a hair trigger about banning IPs that request too fast (sorry about that; we don't have a lot of spare performance), so I wrote something people can use to get their IP unbanned once if it gets banned by accident.

http://news.ycombinator.com/unban?ip=<ip address>

Obviously you have to use it from another IP address, like your phone.




Use iptables xt_connlimit to regulate overly aggressive client requests.

Since there are so few images on HN, there is no reason to have more than a couple connections per IP on port 80.

It will radically reduce your server load and there will be no blacklists/whitelists to maintain.
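
For reference, a minimal sketch of the kind of rule being described, assuming a Linux box with iptables in front of the web server; the limit of 2 connections is purely illustrative, not a value tested against HN's traffic:

    # Once an IP already has 2 connections open to port 80, silently drop further
    # SYNs; the client's TCP stack retransmits them, so the effect is a short
    # delay for that IP rather than a ban.
    iptables -A INPUT -p tcp --syn --dport 80 -m connlimit --connlimit-above 2 -j DROP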


...except people also browse HN from behind corporate gateways/firewalls.


It won't ban them, it just throttles simultaneous connections from the same IP.

Unless there are dozens of people from the exact same IP and not an IP pool, it won't be a problem.

The worst case is that their initial page load is delayed by half a second. But it helps the server tremendously, especially since HN seems to use Apache.


In this age of NAT and IPv4 address exhaustion, it's not uncommon to have dozens of people from the exact same IP.


...and if they are all hitting HN at the exact same millisecond, then their connections should be delayed.

HN serves with connection-close, not keep-alive, so as soon as one request is done, the connection is freed for the next visitor on the same IP. This would just force them to be in single file on a very quickly moving line instead of requiring dozens of connections to be served all at the same time.

Think of a grocery store with one super-fast express lane vs. no express lane, just a dozen very slow cashiers and people with full carts ahead of you.

Don't knock connlimit until you try it. Again, it's not a ban; it just backlogs the requests.


That sounds better, but it feels like a band-aid solution to me. For example, I worry about whether it will actually fix the load problems if a bad network has lots of requests, resulting in a very long queue and lots of open connections. It sounds like it's worth trying, at least.


As it currently stands, they would simply be unable to use HN if they were loading it at the same time, as the server would just ban them; do you feel that is really a better solution to the proposed delay?


I think that the proposed solution gives preferential treatment to users who were around long enough (or have enough money) to be on a network where they are assigned their very own personal IPv4 address. If IP addresses mapped 1:1 to users or machines, then I'd be all for using xt_connlimit to throttle users who perform excess requests.

Even if you add a proposed delay, a user behind one of these NATted networks could (unintentionally, I hope) cause a DoS by sending lots of requests to make the queue unreasonably long, which, to someone behind the NAT, is just as bad as a server ban.


I'm 99% sure HN runs on FreeBSD.


Good thing you left that 1% ;-)

   Server: Apache/2.2.22 (Ubuntu)
xt_connlimit does wonders for Apache especially.

Actually, it looks like I am wrong.

Static objects are coming from Amazon, while dynamic content is coming from another server at theplanet.com.

   Apache/2.2.19 (FreeBSD) 
So you are right, it's FreeBSD, but it's still Apache, which really needs connection throttling. There might also be a reverse proxy in place. You can also IP throttle with a module in nginx.
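
If a reverse proxy were added, here is a rough sketch of what that throttling could look like in nginx. The limit_conn/limit_req modules are real, but the zone names, limits, and the "backend" upstream are placeholder assumptions, not tuned for HN:

    # in the http block: shared zones keyed by client IP
    limit_conn_zone $binary_remote_addr zone=perip:10m;
    limit_req_zone  $binary_remote_addr zone=perreq:10m rate=5r/s;

    server {
        listen 80;
        location / {
            limit_conn perip 2;              # at most 2 concurrent connections per IP
            limit_req  zone=perreq burst=10; # queue short bursts instead of rejecting them
            proxy_pass http://backend;       # placeholder upstream for the Apache box
        }
    }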


They should also upgrade Apache, as there's a known vulnerability in pre-2.2.20 versions that leaves it open to an easy DoS attack.


pg: I have a fair bit of Lisp dev experience. If, as a weekend project, I modified the HN src to use postgres and memcache, would you consider using it in production? Obviously, I don't expect carte blanche prior agreement, but I wouldn't want to invest the time unless I thought it was plausible the work could actually help.

I would expect it to solve most of your performance problems for the foreseeable future (at the very least, by letting you scale horizontally and move the DB, frontends, and memcaches to separate boxes - plus ending memory leaks/etc by moving most of the data off the MzScheme heap).

The obvious downside is that it would use your (or someone at YC's) time: first to merge the changes I make to http://ycombinator.com/arc/arc3.tar into the production code, then to buy/set up some extra boxes and do the migration. We're probably talking, roughly, a day. It also has the unfortunate side effect of costing HN's src some of its pedagogical value, since it adds external dependencies and loses 'purity'.

Been looking for an excuse to learn arc for a while now ...


I suspect there's a good reason why HN is still using this old codebase; YC, after all, is not short of the cash needed for a complete revamp.

The site is very much hacked together, but works... In a lot of ways, this reflects the hacker ethos of getting something up and running quickly at low cost while still producing value.

A revamp might have negative impact too by attracting a wider, more mainstream audience which could possibly dilute the purity of the community here.


> dilute the purity of the community here

Careful now :) It's not like there's anything stopping HN attracting a wider audience anyway; there's no restriction on who can register. Anyone can come and join in, which (in my opinion) is as it should be.


Of course. I'm not suggesting that there should be any limitations on who can join, but as the community moves more mainstream, quality will dilute. As the site is rather un-sexy right now, it seems to attract those who are genuinely interested. Remember what happened to Digg...


Anyone can come, but if he has a different opinion than PG on politics, "he" gets silently banned.


There's also the usual engineer estimation: "Oh, it will probably take a day to rewrite the code. We'll deploy it and it will probably work just fine in production."

Any engineer that has live code has made this mistake before.


Just to clarify, it will definitely take me more than a day to write/profile/test the changes (especially since I'll be learning Arc in the process).

My hope is that it will only take a day or so to deploy it, once it's ready.


Very generous offer, but I would argue that HN's slow performance is a feature, not a bug. The average drive-by person, who is attracted to sensationalist articles and titles, simply doesn't have the patience for the slow load times of every page. The user seeking intelligent conversation, however, is more than willing to accept 5+ second wait times if they know they will be getting valuable content. Couple that with load times being consistently slow rather than coming in surges, and I wouldn't put it past PG to build a delay into page loads to act as a sort of filter. Even if it's unintentional, I would still argue it's useful in driving out some of the riff-raff.


I also believe that Hacker News runs on a small stack of services developed by some past companies from Y Combinator.

I would agree that there is also little to no desire to make Hacker News "the news place" - where it supports thousands of posts a second and is extremely popular. In general Hacker News is used (and the hope is for it to stay that way) by startups and people interested in startups. It's slowly growing to include more types of people - marketing, companies, bloggers who just want a lot of hits, etc. - and not many people want to purposely support that.


This probably belongs in a private email.


The downside of this is that now there are many more moving parts to carry forward.


doesn't sanitize HTML fyi - may leave you open to XSS


Ack, what was I thinking? Fixed. Thanks!


The same thing every smart developer who ever committed or deployed a line of vulnerable code thought: "I'm just trying to get this feature done, not write a formal proof". You're in good company.


It makes me think that one non-negotiable feature of any webapp architecture is to detect situations when inbound strings are placed in any context where they can be interpreted as code, and either refuse to run or at least spit out a severe warning.

And there are no webapp architectures which do this.


Yesod (a Haskell web framework) tries pretty hard. e.g. http://www.yesodweb.com/book/shakespearean-templates


Cool, thanks for that.

(My hobby: posting "nothing like X exists" in a Hacker News thread. :)


Rails sort of does that, and has since 3.0.

http://yehudakatz.com/2010/02/01/safebuffers-and-rails-3-0/


Neat. Something like SafeBuffer is a practical way to approach the problem.

It seems like with the rise of 'zero copy' approaches we could do even better - simply designate a memory region as unsafe, and transform it into a safe version depending on the context in which it is used. These transforms would want to add a little metadata pointing to the original unsafe region in case the transformed region is ever subsequently used in a different execution context.

Alas, from the perspective of one program the input to another always just looks like a string, which means that somehow our host program (and programmer) needs to signal the appropriate transform on, say, concatenation. The only way I can think of around this requirement is to force implementors of contexts to tag their interfaces as a context, and for callers to construct arguments to those functions such that constituents derived from unsafe regions are detectable. For example, we might have a SQL context that takes an array of string pointers, where some of the pointers point to 'unsafe' regions, and we just concatenate the elements of the array to construct the context argument.
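
As a toy illustration of the SafeBuffer idea in Python - not any framework's actual API, just a sketch of carrying the safe/unsafe distinction in the type so concatenation can apply the right transform for the HTML context:

    import html

    class Safe(str):
        """A string that has already been escaped for the HTML context."""
        def __add__(self, other):
            # Anything not already marked safe gets escaped on concatenation,
            # so raw user input can never silently flow into the output.
            if isinstance(other, Safe):
                return Safe(str.__add__(self, other))
            return Safe(str.__add__(self, html.escape(str(other))))

    user_input = '<script>alert(1)</script>'   # attacker-controlled text
    page = Safe('<p>') + user_input + Safe('</p>')
    print(page)  # <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>

The same shape would extend to the SQL example above: the context function applies its own transform (escaping or parameter binding) to any piece not marked safe.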


Check out taint mode in Perl. It's been around forever, and I don't understand why all web frameworks don't have a similar concept.


The Play framework (Scala, Java) and Mojolicious (Perl) (and many other newer frameworks probably) escape output by default, so at least they make you think before allowing XSS.


same with Django (Python)


Ah, the fun part of this is "interpreted as code". Which language? HTML, XML, JS, CSS, JSON? Get that part wrong or slightly off, and what you sanitized for one isn't sanitized for the other. And sometimes there can be nested contexts.

While the idea of "taint" is useful, it is only half the battle. The other half is accounting for the context.


And this is the precise reason we exist. :)


Do you have a rough set of guidelines for how fast we should request from HN? For a side project, I was thinking of writing something that scraped the HN frontpage and all the associated comment threads every 10 minutes or so, and I'd rather not cause performance issues or get banned. I'd be happy to rate-limit requests to whatever is convenient.


May be better to use the official API.

http://www.hnsearch.com/api


That's not an official API: http://www.hnsearch.com/about

Quote: "HNSearch was built by the team at ThriftDB to give back to the community and to test the capabilities of the ThriftDB flexible datastore with search built-in."

Interesting API all the same though.


Regardless of whether it is official or not, it is pg's preferred API: http://news.ycombinator.com/item?id=4694308


If it were an official API, wouldn't it be associated with HN or Y Combinator rather than an external website?


It is by a YC company, and recommended by pg.


The robots.txt file for HN suggests a Crawl-Delay value of 30 seconds.
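
For the kind of scraper described above, honoring that hint is straightforward; a minimal sketch in Python, with the URL list left to the caller and the 30-second figure simply mirroring the robots.txt value:

    import time
    import urllib.request

    CRAWL_DELAY = 30  # seconds, per HN's robots.txt Crawl-Delay

    def fetch_politely(urls):
        """Yield (url, body) pairs, sleeping between requests."""
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                yield url, resp.read()
            time.sleep(CRAWL_DELAY)  # one request every 30s stays well clear of the ban trigger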


This might be helpful: http://api.ihackernews.com/

edit: oh, official API is above. Disregard this one :-)


I'm curious as to why HN would be walking such a performance tightrope. I could speculate, but it would be uninformed rambling, so I'd love it if someone more knowledgeable than I could explain.


It's a side project by a couple of guys with full-time jobs, written in an experimental Lisp dialect and running on a single machine.


The last bit is key. HN is served off flat files, and caches state in-memory in global variables. That -- and not cost -- makes it hard to add a second machine.


It also makes it nearly impossible to slowly read one's own comment history, as the "next" pagination links are session data dependent and are garbage collected quite frequently.

This is, quite possibly, the worst webapp I use on a regular basis.


> the worst webapp

The first and foremost reason for me to consult HN is that it is fast. I am in China, and usually spend time on the web only on my phone, with a 3G connection.

HN's speed beats all other link aggregators, blogs, news sites, and even Google search - and, most interestingly, even fast Chinese sites.

I don't know why it is so fast (except when it is down, obviously); maybe it's because of this flat-file architecture, which would make sense. (Git is very fast too, right?)

And I think it is interesting that "make it fast" is a leitmotif that has been forgotten by so many people, Google first among them, but it is still a reason for some (me, at least) to pick this site over that one.


Thank you for explaining why the 'unknown link' error happens at all. It's terrible - the same goes for when you spend time thinking about and formulating a response, only to see it disappear with the same error.


The back button usually helps in recovering the response.


Yeah, modern browsers are great at not losing form data. When I hit this error I hit back, then reload, then resubmit with full confidence that my comment will be unharmed.

So HN doesn't bother me so much. It's the #$%# smartypants webapps that try to reinvent textareas in javascript that piss me off. I'm looking at you, Quora and new Gmail compose.


> modern browsers are great at not losing form data

I wouldn't know. I had a bad experience with a lost webmail message ~7 years ago. As a result, I ALWAYS copy form data to the clipboard before hitting Next.


Maybe it's time to let it go... 7 years ago is the dark ages in browser time. You are clearly the Adrian Monk of web users :)


If the old ways work well enough, why bother to change?

I still call my Windows scripts .bat files (instead of .cmd, that's clearly for OS/2 programs).

It was only in the last three or four years that I stopped naming my files with all-uppercase names not longer than eight letters, with an extension not longer than three letters, to be sure they would be compatible with a FAT16 filesystem.

I'm rather distrustful of GUIs for doing things like moving or copying files.

I never drag-and-drop files into programs, partly because I seldom use GUI file managers, but mainly because most programs didn't support the metaphor when Windows 95 first came out, and I haven't bothered to check if things have gotten better yet.

Given these facts, you might find it surprising to learn that my age is less than 30.


:) Me too.


Me three ;). It's a hard-wired habit now :).


Still and all, you can't help but notice how awesomely rock-solid browser textareas have gotten. I've actually had my laptop run out of charge and unceremoniously die on me in the midst of a humongous comment. Reboot, login, open browser, tabs all pop up -- and there's my comment. It's utterly amazing.


> Reboot, login, open browser, tabs all pop up -- and there's my comment.

And you didn't get IP-banned here? :D.


I agree, it usually does, but not always. I've only lost two or three comments this way, but it is incredibly frustrating when it happens.


I'd also like to know this


Awesome. I've gotten my IP banned several times after the browser crashed and I reopened the tabs (I had too many HN threads open prior to the crash, enough to trigger the ban).


Yeah... if I open Chrome I am pretty much guaranteed to be banned for days. :( The mechanism should really be changed to account for this: a ton of requests per second for only a few seconds should not trigger a ban on its own; it should take a requests-per-second spike combined with some sustained usage per minute. I actually made modifications to Chrome to change how it loads tabs mainly because of Hacker News' weird IP ban system, but I still got burned recently as I accidentally hit "undo close tab" one too many times, which reopened an entire window.
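
A rough sketch of the kind of two-tier check being suggested here - the thresholds are invented and nothing below reflects HN's actual logic; it just shows "spike AND sustained volume" rather than "spike alone":

    import time
    from collections import defaultdict, deque

    BURST_WINDOW, BURST_LIMIT = 5, 40        # a burst alone (e.g. reopening tabs) is tolerated
    SUSTAIN_WINDOW, SUSTAIN_LIMIT = 60, 120  # sustained volume over a minute is what matters

    hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def should_ban(ip):
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and q[0] < now - SUSTAIN_WINDOW:   # forget requests older than a minute
            q.popleft()
        burst = sum(1 for t in q if t > now - BURST_WINDOW)
        # Ban only when a short spike coincides with sustained usage, so a
        # one-off burst of tab restores doesn't trip it by itself.
        return burst > BURST_LIMIT and len(q) > SUSTAIN_LIMIT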


On Firefox, turn on the option "Don't load tabs until selected". I don't see this option in Chrome.

It speeds up browser startup dramatically. Especially when you leave lots of tabs open as your "to read" list.


Yeah: I ended up figuring out a way to add it. I now generally like having the feature, but it was a complete necessity due to the Hacker News IP ban rules (although, as I mentioned, it still doesn't solve the underlying problem for this site, which is incredibly touchy).

http://news.ycombinator.com/item?id=4717730


It is so annoying that all the other browsers STILL have not implemented this small but very effective idea - please speak more loudly about this, as it seems most developers here haven't even noticed this feature...


Is there a similar setting in Opera?


My solution is to use a firewall with per-application rules and just turn off network access for Chrome before I launch it. On my laptop I just unplug the wired/wireless network during the launch. This was mainly because of HN, but it also has the added benefit of using fewer system resources, since a blank page is typically less resource-hungry than a real page.

Firefox has a better solution for this, but then again, I don't use Firefox.


This happened to me as well when starting Opera. I felt as if my most beloved uncle had slammed the door in my face.

I only loaded so many pages because I love HN. :)

The ban was lifted a few days later; I'm not sure if that was automatic or thanks to my (unanswered) email request.


In my experience the banning is too strict.

It is triggered very quickly and it seems to last forever (maybe 15min would be better?).

I ask pg to kindly consider making it a bit more lenient.

I doubt HN comes under deliberate/malicious attacks, etc...

I'm making an HN extension that preloads some data, such as the comments and the links on the next page (still with reasonable delays).

But at the moment it's impossible for it to function without risking the user getting banned.


I've no doubt that HN is under pretty much constant deliberate, malicious attacks.

Pretty much any site with decent traffic is under constant attack, and the high profile of HN means it'll be under far more scrutiny than others.


Repost from "Show dead" that relates to this issue:

[−]sunstone1 10 hours ago | link [dead]

Well I never had my IP banned but I did have my account hell banned after about a dozen posts as you can see. Oh, actually no, you can't see, because it's banned. No, I never bothered to get another account, now I'm just a taker not a giver.

Most of the time it's clear why a user was banned, but looking at sunstone's history I don't really see a reason. While the algorithm will never be perfect, it would be nice if there was a clearer solution for misfires.


Great news! I was banned last week (http://news.ycombinator.com/item?id=4736919); the ban was lifted in the meantime. But this will come in handy the next time I'm developing an extension for HN and refreshing it all the time :)


Oh! The benefits of dynamic IPs.. :)


Well, I might as well try striking while the code is hot..

It occurs to me that I would like to interact with noprocrast in a different manner. Currently, I leave noprocrast disabled most of the time. I like to use longish minaway times (~day), but this makes me feel as if my first visit to HN will start the clock ticking, and I'd better be sure to get my HN fill before the timer runs out (yes, this is kind of ridiculous). So I only enable noprocrast (with a short maxvisit) upon realizing I'm stuck in a web loop.

The mechanism that I envision is either a button that immediately starts a one-shot noprocrast ban, or a page-count based maxvisit. The latter might be better since it could always be left enabled.


Interesting fix. Is the trigger weighted based on request action type or user karma?


Thanks Paul! I'm reluctant to try this in conjunction with developing any HN scrapers since I'm not sure what set it off in the first place and your language suggests it will only unban the IP once (I will, however, make sure the CMU IP I was using gets unbanned). It would be helpful to know what, precisely, that hair trigger is so we can make sure to avoid it.



