
Bitmagnet: A self-hosted BitTorrent indexer, DHT crawler, and torrent search

> The DHT crawler is not quite unique to bitmagnet; another open-source project, magnetico was first (as far as I know) to implement a usable DHT crawler, and was a crucial reference point for implementing this feature.

Heh. That was one of my first projects when I was still learning to code back in 2012: https://github.com/laino/shiny-adventure

The DHT crawler/worker lived separately, and I eventually put it here to rescue it from a dying HDD: https://github.com/laino/DHT-Torrent-database-Worker

The code is abhorrent and you absolutely shouldn't use it, but it worked. At least the crawler did - the frontend was never completed.

Since the first implementation of mainline DHT appeared in 2005 and crawling that network is really quite an obvious idea, I doubt we (a friend was working on it as well) were first either.

Nothing substantial; I chuckled when I saw the commit history on your linked projects. I don't mean to belittle you (or the purpose/goal of the projects); I genuinely enjoyed the distraction and the 'results' from it:

Today saw the first commit in 11 years (since 9 Oct 2012) and 5 years (since 24 Nov 2018), respectively, on those projects. I think your repos might belong on some sort of 'oldest still-active repo' or 'oldest repo not ported elsewhere' list.

From what I found in ~10 minutes (Google/GPT), excluding Git projects that existed before spring 2008 (I couldn't get a quick consensus on February vs. April of that year), there's not a lot.

(I'll edit this part if sources are requested)

I recently committed to an old repo of mine after a 9 year gap.

It holds several one-file Python script experiments and toys that I lumped into one place to get them off my HDD and make them available from anywhere. I recently remembered it existed and added another one. While I was in there I also ran 2to3 on the ones that needed it and polished up the results.

It seems like every single one of these things cuts corners and doesn't implement a proper, spec-compliant node that provides the same services it uses. You know, the "peer" in P2P. BEP51 was designed to make it easier not to trample on the commons, and yet...

Author here. FWIW I wasn't intending this to make it onto HN, having posted about this on Lemmy looking for beta testers. The current version of the app is very much a preview. There's much further work to be done, and this will include ensuring, as far as possible, that Bitmagnet is a "good citizen". The suggestions made on the GH issue look largely feasible and I'll get round to looking at them as soon as I can.

The issue and my response on GH: https://github.com/bitmagnet-io/bitmagnet/issues/11

Hey! I don't know if the GitHub repo or here is the best way to ask, since the Discussions on the repo are not active.

I have started the docker-compose.yml file in WSL and it has been running for an hour, slowly accumulating a few megabytes of Redis data at about 5% CPU usage. Inspecting it shows magnet links. It appears to be working.

But visiting the web interface at localhost:3333 just yields "Firefox can't establish a connection to the server at localhost:3333." after a 30-second timeout.

Would you have a guess why?

Please remove copyrighted movies from the screenshot on your website. It provides evidence that this program is designed for violating copyright, which makes a DMCA takedown trivial.

Thanks - I only condone accessing the legal content available on BitTorrent, and my screenshots now embody this moral stance.

Yes, I'm grateful for this being built so I can locate and identify all the copies of Linux I could download and install...

Indexing what is available for download is useful for research into piracy even if you don’t engage in piracy yourself.

Please determine if these images fall under fair-use provisions and, if so, leave them in place.

Bad actors - whoever they may be - need to see your rights constantly reasserted.

This has nothing whatsoever to do with fair use.

It's about arguments in court about the intention of the software if sued. Images of copyrighted content indicate intent to infringe copyright. Without those, you can argue it's only meant to find and index Linux image torrents or whatever.

Fair use doesn't enter the picture at all.

The MPAA and other organizations use screenshots that show copyrighted material as "proof" that the tools are used for copyright violation, and then DMCA them.

If you want to help pay for lawyers to fight those DMCA notices with counterclaims and lawsuits, put up or shut up; the FSF, EFF and ACLU have been noticeably disinterested in doing so.

And they probably want to remove text like

"It then further enriches this metadata by attempting to classify it and associate it with known pieces of content, such as movies and TV shows. It then allows you to search everything it has indexed."

There's nothing wrong with that, since there are LOTS of free-to-share movies and TV shows, especially those whose copyrights have expired.

I use btdig for finding torrents myself.

But I am curious what you mean when you say you use it "to do security research"?

Are you just looking for security information that is available in torrents, or does btdig have some other features that I am unaware of?

Any comments on how these compare? Especially in relation to sibling comment about BEP51?

You want Torznab support. That is basically the metadata you want to export, so it can be imported into the application which holds the database of what you are after. If there's a match, it should attempt to download it via your download client (a BitTorrent client).

Torrentinim is the successor of Magnetissimo, but it lacks Radarr/Sonarr integration (there is a pull request for Torznab support for both). Spotweb has Newznab support [1], and around Black Friday (soon) there are usually tons of deals available for Newznab indexers.

I don't care about BEP51 as I don't have huge upload. That is also why I prefer Usenet over torrents. But torrents are a useful and sometimes required backup mode. Just not my preferred one.

[1] https://github.com/Spotweb/Spotweb/wiki/Spotweb-als-Newznab-...
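To make the Torznab bit concrete: this is roughly the kind of search request an app like Sonarr or Radarr fires at an indexer's Torznab endpoint, and the RSS it reads back. The URL, port and API key below are placeholders, not anything Bitmagnet exposes today; it's just a sketch of the general Newznab/Torznab query shape.

```go
// Hedged sketch of a Torznab consumer: run a free-text search against a
// Torznab endpoint and print the returned items. Endpoint and API key are
// placeholders.
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
	"net/url"
)

// rssFeed captures just the item titles and links from the Torznab RSS result.
type rssFeed struct {
	Items []struct {
		Title string `xml:"title"`
		Link  string `xml:"link"`
	} `xml:"channel>item"`
}

func main() {
	q := url.Values{}
	q.Set("t", "search")            // Newznab/Torznab search function
	q.Set("q", "ubuntu 22.04")      // free-text query
	q.Set("apikey", "YOUR_API_KEY") // placeholder

	// Placeholder indexer endpoint; substitute your own Torznab URL.
	resp, err := http.Get("http://localhost:9117/api?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var feed rssFeed
	if err := xml.NewDecoder(resp.Body).Decode(&feed); err != nil {
		panic(err)
	}
	for _, it := range feed.Items {
		fmt.Println(it.Title, "->", it.Link)
	}
}
```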

From a brief look at each it seems like they're scraping things like torrent websites, usenet or maybe RSS feeds. Not the DHT.

Have been playing around with DHT crawling for a while now, curious how you're getting around the "tiers" of the DHT?

IIUC peers favor nodes they've had longer relationships with to provide stable routes through the DHT.

This means short-lived nodes receive very little traffic: nobody routes much traffic through fresh nodes; they choose nodes they've had longer relationships with.

The longer you stay up, the more you start seeing.

At least this is what I've observed in my projects. The only way I've been able to get anything interesting out of the DHT in the last ~5 years has been to put up a node and leave it up for a long time. If I spin up something, the first day I usually only find a handful of resolvable hashes.

Not to mention it seems the BitTorrent DHT is very lax in what it will route compared to other DHTs (like IPFS), meaning many of the hashes you receive aren't for torrents at all.

After ~30-60 minutes of running, still less than 100 kB/sec combined in and out. However, as others have noted, nodes don't communicate much with nodes that haven't been up for a while (days).

It's using roughly 6% CPU time for the crawler and another 1-2% for postgres, on a second-gen i7.

As a data point to set expectations: 4000 torrents have been captured so far, and somewhat surprisingly, the results aren't necessarily very current.

For example, a certain wildly popular TV series about samurai in space swinging very hot swords around, which just had its season-ending episode last night (I think)... that episode isn't in my list so far, but the episode prior to it, and the first two episodes, are.

There's a ton of random, low-seed torrents, so it's actually kind of interesting to search by type, year, etc and see what comes up.

> Something that looks like a decentralized private tracker; by this I probably mean something that’s based partly on personal trust and manually weeding out any bad actors; I’d be wary of creating something that looks a bit like Tribler, which while an interesting project seems to have demonstrated that implementing trust, reputation and privacy at the protocol level carries too much overhead to be a compelling alternative to plain old BitTorrent, for all its imperfections

I've thought about this problem a lot. Having a federated / distributed tracker but with some form of trust based, or opt-in curation would be amazing.

Does this also share back DHT information like a "server"? I am not downloading torrents, but I always wished to have a daemon running on my servers to help the overall DHT network's health. Is there anything like this out there? A DHT server that only collects, stores and gives back information?

> Does this also share back DHT information like a "server"?

No, it ignores all incoming requests. You don't need special software to help out; just run any normal BitTorrent client in the background (no need to download or share anything) and it will help out. Just make sure you forward the right port if you are behind a NAT. Traffic will slowly increase over time, and drop quickly when offline, so leaving it up for days at a time is better when possible.

It'd be so nice to have a super simple DHT crawler CLI tool, in both implementation and interface.

These things need uptime of hours and days to do it properly and also to stay up to date. There are millions of nodes and torrents and to be non-abusive you have to issue requests at a somewhat sedate pace. And activity kind of moves with the sun due to people who run torrent clients on their home machines. And there are lots of buggy or malicious implementations out there that you have to deal with. So you'd want to run it as a daemon. The CLI would have to be a frontend to the daemon or its database. The UI could be simple. I'm skeptical whether an implementation could be both good and simple.

That's if you're imagining a single node to discover the whole DHT. What if you want to fire off a map-reduce of limited-run DHT explorations starting from different DHT ring positions, where each agent just crawls and emits what it finds on stdout as it finds it?

(In a sense, I suppose this would still be a "daemon", but that daemon would be the map-reduce infrastructure.)

I don't quite understand what you're proposing here. Generally you only control and operate ~1 node per IPv4 address or per IPv6 /64.

All other nodes are operated by someone else, so they don't cooperate on anything beyond what the protocol specifies. Which means everyone is their own little silo. If you want a list of all currently active torrents (millions) then you have to do it with 1 or a handful of nodes, depending on how many IPs you have. DHTs are not arbitrary-distributed-compute frameworks, they're a quite restrictive get/put service.

BEP51[0] does let you query other nodes for a sample of their keys (infohashes) but what they can offer is limited by their vantage point of the network so you need to go around and ask all those millions of nodes. And since it's all random you can't really "search" for anything, you can only sample. And that just gives you 20-byte keys. Afterwards you need to do a lot of additional work to turn those into human-readable metadata.

[0] http://bittorrent.org/beps/bep_0051.html
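To make that concrete, here's a minimal one-shot sample_infohashes round trip in Go (a sketch, not mldht or bitmagnet code): the destination address is a placeholder for a node you've already discovered via find_node, the node ID is random, and the response is printed raw instead of being properly bencode-decoded.

```go
// Minimal BEP51 sample_infohashes probe: send one KRPC query over UDP and dump
// the raw bencoded reply. Placeholder destination address; a real crawler
// decodes the "samples" field (concatenated 20-byte infohashes) plus "num",
// "interval" and "nodes".
package main

import (
	"crypto/rand"
	"fmt"
	"net"
	"time"
)

func main() {
	id := make([]byte, 20)     // our (throwaway) node ID
	target := make([]byte, 20) // keyspace location we pretend to be interested in
	rand.Read(id)
	rand.Read(target)

	// KRPC query, bencoded by hand:
	// {"t":"aa","y":"q","q":"sample_infohashes","a":{"id":<id>,"target":<target>}}
	msg := fmt.Sprintf("d1:ad2:id20:%s6:target20:%se1:q17:sample_infohashes1:t2:aa1:y1:qe",
		id, target)

	conn, err := net.Dial("udp", "203.0.113.1:6881") // placeholder DHT node address
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(5 * time.Second))

	if _, err := conn.Write([]byte(msg)); err != nil {
		panic(err)
	}

	buf := make([]byte, 2048)
	n, err := conn.Read(buf)
	if err != nil {
		panic(err) // many nodes simply don't implement BEP51 and won't answer
	}
	fmt.Printf("got %d bytes of bencoded response: %q\n", n, buf[:n])
}
```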

I mean, what I'm describing is the same thing that BEP51 mentions as a motivation:

> DHT indexing already is possible and done in practice by passively observing get_peers queries. But that approach is inefficient, favoring indexers with lots of unique IP addresses at their disposal. It also incentivizes bad behavior such as spoofing node IDs and attempting to pollute other nodes' routing tables.

If you have a lot of IP addresses (from e.g. AWS Lambda) then you can partition DHT keyspace across a large-N number of nodes and then very quickly discover everything in the keyspace.

The trick is that, since BEP51 exists, you don't need to have all these nodes register themselves into the hash-ring (at arbitrary spoofed positions) to listen. You can just have all these nodes independently probing the hash-ring "from the outside" — just making short-lived connections to registered nodes (without first registering themselves); handshaking that connection as a spoofed node ID; and then firing off one `sample_infohashes` request, getting a response, and disconnecting. The lack of registration shouldn't make any difference, as long as they don't want anyone to try connecting to them.

Which is why I say that these are just "crawler agents", not "nodes" per se. They don't start up P2P at all — to them, this is a one-shot client/server RPC conversation, like a regular web crawler making HTTP requests!
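Concretely, the partitioning step I have in mind is nothing fancier than carving the 160-bit keyspace into evenly spaced target IDs, one per crawler agent; each agent then aims its sample_infohashes probes near its own target. A rough sketch (the agent count is arbitrary and purely illustrative):

```go
// Toy keyspace partitioner: split the 160-bit DHT keyspace into n evenly
// spaced 20-byte target IDs, one per crawler agent. Illustrative only.
package main

import (
	"fmt"
	"math/big"
)

func targets(n int) [][]byte {
	space := new(big.Int).Lsh(big.NewInt(1), 160) // 2^160
	step := new(big.Int).Div(space, big.NewInt(int64(n)))
	out := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		t := new(big.Int).Mul(step, big.NewInt(int64(i)))
		buf := make([]byte, 20)
		t.FillBytes(buf) // left-pads to a full 20-byte node-ID-sized value
		out = append(out, buf)
	}
	return out
}

func main() {
	for i, t := range targets(8) {
		fmt.Printf("agent %d starts near %x\n", i, t)
	}
}
```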

Oh, I have already implemented something[0] like that. It doesn't need lambdas or anything "cloud scale" like that. You "just" need a few dozen to a hundred IP addresses assigned to one machine, and run a multi-homed DHT node on that to passively observe traffic from multiple points in the keyspace.

But neither of these approaches is what I'd call a "super simple DHT crawler CLI tool" that the initial comment was asking about. BEP51 is intended to make crawling simple enough that it can run on a single home internet connection, but a proper implementation still isn't trivial.

[0] https://github.com/the8472/mldht

Why isn't this built into thick desktop BT clients? Is it just a little too early for it, and next year it will be? Or is there some reason I'm missing?

People who are into torrenting enough to consider self-hosting a BitTorrent indexer largely don't use desktop BT clients. They chain different software together on a home media server or seedbox such that the whole concept of torrents, seeds, trackers, etc. is unseen plumbing, and instead you have interfaces focused on the content: descriptions, trailers, actor lists, and so on. You subscribe to a TV show, for example, and then episodes are automatically downloaded as they air according to your preset quality criteria, then they appear in your own personal Netflix-style app.

What the other person said, but also: you have to have your crawler up for several days, ideally near 100% uptime for this to be effective. Thick clients can have high uptime but they also can not.


I've been unable to get this running; I gave it a postgres user and database, granted it ownership and all permissions on said DB, and there's nothing in the database.

Edit: found the init schema and things seem to be working now: https://github.com/bitmagnet-io/bitmagnet/blob/main/migratio...

It would be really nice to be able to sort by header (size, seeders) and/or have some filters for seeds/downloaders (for example, filtering out anything with fewer than X seeds).

Very interesting. Does this approach actually work in practice?

Also what happens if illegal content gets scooped up into the index?

It works well in practice. The DHT protocol includes announce messages that broadcast when new files are shared on BitTorrent. It then includes a "geometric" way to find people who are sharing those files. It doesn't include the files themselves, just the torrents which include a file list and location hashes.

If you listen to BitTorrent's DHT network, you'll build an index of everything shared on BitTorrent (over time); this will include commercial movies and such.
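As a toy illustration of the listening side (and emphatically not Bitmagnet's actual implementation): bind a UDP port and, for each incoming packet that looks like a get_peers or announce_peer query, pull out the 20-byte info_hash. In reality you only receive such traffic after joining the DHT and staying reachable long enough for other nodes to keep you in their routing tables, and a real crawler does full bencode parsing instead of the byte-search shortcut here.

```go
// Toy passive DHT listener: log the info_hash from incoming get_peers /
// announce_peer queries. Uses a crude substring search on the bencoded packet
// rather than a real KRPC parser, and never replies (a proper node should).
package main

import (
	"bytes"
	"fmt"
	"net"
)

func main() {
	conn, err := net.ListenPacket("udp", ":6881")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	key := []byte("9:info_hash20:") // bencoded "info_hash" key followed by a 20-byte value
	buf := make([]byte, 2048)
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil {
			continue
		}
		pkt := buf[:n]
		if !bytes.Contains(pkt, []byte("get_peers")) &&
			!bytes.Contains(pkt, []byte("announce_peer")) {
			continue
		}
		if i := bytes.Index(pkt, key); i >= 0 && i+len(key)+20 <= len(pkt) {
			hash := pkt[i+len(key) : i+len(key)+20]
			fmt.Printf("%s asked about / announced %x\n", addr, hash)
		}
	}
}
```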

>It works well in practice.

Hi, I worked on Gnutella and lots of P2P systems in the early '00s. This will devolve into noise and spam as the number of users who adopt this feature passes a critical mass. With a fully decentralized system, there are no gatekeepers, and as such, there is no way to filter counterfeit items. While your client will present you with the data you are searching for, you will find out (usually hours later) that your supposed pirated download is actually just a 2-hour loop of Rick Astley (still piracy though, so you are still winning... I think?).

I don't think this project changes any of this? Torrents have been around for decades and this hasn't been a problem yet. We can't rule it out entirely, but at this point it seems unlikely to be worthwhile doing; otherwise we'd see more exploitation.

If the criticism is that a DHT crawler is going to be more subject to this than a website where people manually submit torrents, that may be the case, but I think the author of this project underestimates the amount of DHT crawling already going on. I believe the torrent ecosystem is largely automated and there's little in the way of manual submission or human review going on.

The "problem" is that most users aren't crawling the DHT to find torrents, right now. The more people start using DHT crawlers as their primary way of finding new torrents, the more incentive there is to spam the DHT with junk, malware, etc. (because there will be more eyeballs on it)

That is, the usefulness of DHT crawling is inversely proportional to how many people are doing it.

But my second point is that I really think they are crawling the DHT, albeit indirectly. There are many torrent websites and they tend to have the same content. It seems fairly clear to me that this is what most torrent sites are doing. Maybe not the major names that users might submit to, but the long tail of other torrent search indexes certainly. It also seems to be what Popcorn Time does.

While you're technically correct, the protocol is resilient to such an attack, as the number of people participating in a particular torrent is a good indicator of its validity. After all, everyone who was fooled will delete it and stop sharing such items.

New releases of something that just came out tend to suffer from this, though. Sometimes the counterfeits reach escape velocity - the rate of people joining in downloading the counterfeit exceeds the rate of people realizing and stopping, thus giving the illusion of a legit torrent.

Currently this problem is being solved by torrent sites' reputation and comment systems. If we imagine a world where only decentralized indexes like Bitmagnet exist, your prediction is 100% accurate. This only works if reputation from a reliable site is bootstrapping the initial popularity of a torrent.

(btw my comment was/is about the DHT crawler)

You are describing a pay-to-play model. The validator is whether the seeder/leech count is high. Well, does the DHT provide the aggregate bandwidth of each torrent? If not, you can easily spin up 1000+ nodes and connect to your torrent. Tada fake popularity. If bandwidth is known, then you simply raise your costs a bit by running fake clients. There are anti-piracy groups whose entire mandate is to provide noise in the piracy ecosystem. Food for thought: bandwidth costs for this would be a rounding error for e.g. MGM, Universal, or any major content creator.

The DHT does not offer any sort of reputation or comment system, so we're back to centralized torrenting, which is why I suspect DHT crawling has not been a very popular feature.

> If not, you can easily spin up 1000+ nodes and connect to your torrent. Tada fake popularity. If bandwidth is known, then you simply raise your costs a bit by running fake clients.

Sure, but like the other commenter said, this has been possible for years, and yet public trackers aren't swamped with fake torrents. I think in all my years of using BitTorrent I've only ever found a single fake torrent, where the content was inside an encrypted RAR with no key (obviously there was no way to know it was encrypted ahead of time).

You are making my point. A decentralized system will be abused with spam and fraud. A centralized system allows you to moderate the results.

It seems like it would be pretty easy to make it appear that your spam torrent is highly active.

Once you've discovered a torrent being seeded, is there no way to interrogate the seeders and/or the DHT itself, to find out the oldest active seeder registration on that torrent hash; and then use the time-of-oldest-observed-registration to rank torrents that claim to be "the same thing" in their metadata, but which have different piece-trie-hash-root?

I ask, because a similar heuristic is used in crypto wallet software, visibility-weighting the various "versions" of a crypto token with the same metadata, by (in part) which were oldest-created. (The logic being: scam clones of a thing need to first observe the real thing, before they can clone it. So the real thing will always come first.)

Of course, I'm assuming here that you're searching for an "expected to exist" release of a thing by a specific distributor, where the distributor has a known-to-you structured naming scheme to the files in their releases, and so you'll only be trying to rank "versions" of the torrent that all have identical names under this naming scheme, save for e.g. the [hash] part of the file name being different to match the content. This won't help if you're trying to find e.g. "X song by Y artist, by any distributor."

Gatekeeping is just a bad moderation method in the first place.

What you need is sorting and categorization. If you really want to involve authoritative opinions on metadata, then use a web of trust.

But you can still pick the option with the most seeders, which should get you what you're looking for most of the time.

The spam problem isn't nonexistent within the centralized services either.

Hehe, on a popular P2P client from the '03-'05 period, we said the same thing. It turns out there are groups with large amounts of funding that will provide a fake seed count: either just faking metadata to make it seem like there was a high seed count, backed by bogus nodes that would refuse connections (which was also genuine behavior from clients on bad ISPs; we saw valid cases in Asia or Eastern Europe), or actually streaming data (and some of them were on good hosts seeding multiple Mbps of bad data).

What I'm saying is that it becomes a numbers game, and those fake seeders usually have deep pockets, financed by the content creators themselves.

The way to filter out garbage is to download things with lots of seeds, and if you still happen to download garbage, to immediately stop sharing it.

Chicken/egg problem... as mentioned by someone else above...

https://news.ycombinator.com/item?id=37779341

> New releases of something that just came out tend to suffer from this, though. Sometimes the counterfeits reach escape velocity - the rate of people joining in downloading the counterfeit exceeds the rate of people realizing and stopping, thus giving the illusion of a legit torrent.

It's possible. I never follow new releases. But back in the ed2k days, I'd say about half of just about any file you cared to search for was fake, regardless of age.

You are what's called an edge case. A statistical anomaly. While that is great, you are far from the norm and not the target of this product (or even this particular thread :)

>If you listen to BitTorrent's DHT network, you'll build an index of everything shared on BitTorrent (over time),

Correct me if I'm wrong, but as far as I understand, passively listening on the DHT would only mean you build up a list of infohashes of everything shared on BitTorrent. You'd actually have to reach out to your DHT peers to know what files the infohashes actually represent.

Wrapping back to the grandparent's question of

>Also what happens if illegal content gets scooped up into the index?

I think this could get dicey if someone announces something very illegal like CP, and your crawler starts asking every peer that announced the infohash about its contents with this[0] protocol. This would put your IP into a pretty awful exclusive club of

A, other crawlers

B, actual people wanting to download said CP

[0]: https://www.bittorrent.org/beps/bep_0009.html

> Correct me if I'm wrong, but as far as I understand, passively listening on the DHT would only mean you build up a list of infohashes of everything shared on BitTorrent. You'd actually have to reach out to your DHT peers to know what files the infohashes actually represent.

Yes, you're correct! I should have stated that: you still need to resolve the metadata from the peers hosting the infohashed files. That's a separate operation from downloading the file's content.
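For illustration, that resolution step is roughly what you get from a magnet link with an off-the-shelf library. This sketch uses the third-party github.com/anacrolix/torrent package (my choice for the example, not a claim about what bitmagnet uses) and stops after the info dictionary arrives, so no content is downloaded:

```go
// Sketch: resolve a torrent's metadata (name + file list) from its infohash
// via a magnet link, without downloading any content. Uses the third-party
// github.com/anacrolix/torrent library; the infohash below is a placeholder.
package main

import (
	"fmt"
	"log"

	"github.com/anacrolix/torrent"
)

func main() {
	cfg := torrent.NewDefaultClientConfig()
	cfg.NoUpload = true // we only want the metadata, not to seed

	client, err := torrent.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholder infohash; a crawler would feed in hashes sampled from the DHT.
	t, err := client.AddMagnet("magnet:?xt=urn:btih:0000000000000000000000000000000000000000")
	if err != nil {
		log.Fatal(err)
	}

	<-t.GotInfo() // blocks until peers have supplied the info dictionary (BEP 9)

	info := t.Info()
	fmt.Println("name:", info.Name)
	for _, f := range info.Files {
		fmt.Println("  file:", f.Path, f.Length)
	}
	t.Drop() // forget the torrent; we never download pieces
}
```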

No, because private trackers enforce that all uploaded torrents have DHT, PEX and LPD disabled. This is usually done with a single tickbox in the client that says "Make torrent private".

Of course, respecting these options in the torrent file is still up to the client. This is one of the reasons why all private trackers have a client whitelist too.

Considering the screenshot on the linked page shows The Flash among other movies, I don't think the author is too concerned about that. I believe it's similar to the area that Plex and Jellyfin operate in: they just provide the framework and tools, and what the user does with them is not in their control.

How will governments police it when MaidSAFE and other systems distribute totally encrypted content?

Encryption without authentication in this case is as good as XORing all outgoing data with a fixed key. So basically useless...

yes, exactly. BitTorrent supports encryption. swapping it out for some other encryption mechanism won't change anything when it comes to government policing because that's already not where the weaknesses lie for those sharing p2p content today. so what was GP's point?

> How will governments police it when MaidSAFE and other systems distribute totally encrypted content?

so isn't the answer then "they'll continue to police it the way they already do"? i don't know what a MaidSAFE is, but the context of this discussion is the DHT, and so public (indexable) torrents, and so however you encrypt the content doesn't matter because you have to provide the decryption method to anyone who asks for any of the previous context (public torrents/indexes) to make any sense.

Encrypted connections shouldn’t decrypt for anyone who asks, otherwise they have no reason to exist.

The weak point seems to be the tracker or filename, but I've been told HTTPS hides that, so I'm not sure.

I think illegal (unlawful?) content is its raison d'être, with the odd exceptions of things like Linux distros, LLM weights etc. to take pressure off centralized servers.

Based on my understanding of how the torrent DHT works, all that's happening is that you're requesting metadata on various torrent info hashes, but that's not the same thing as actually downloading/seeding the content in the torrent itself.

>but that’s not the same thing as actually downloading/seeding the content in the torrent itself.

The question is whether Law Enforcement and "Intellectual Property" watchdogs make a meaningful distinction between the two in their monitoring tools.

Guys, please educate me: I want to use torrents, but the thought of downloading something inappropriate by clicking on a deceptive link terrifies me (e.g., a download with the title of an action movie, but it turns out to be something else).

How do you guys handle that risk?

If you're downloading "normal" stuff, then you aren't likely to run into an issue. Stick to reputable sites; reddit can help you figure out what those are.

Unless you're also afraid of clicking links on the web, you should be fine with torrents. Maybe you could not seed by default and turn it on only after verifying the data is authentic, that way you're actually only downloading and the analogy is complete.

Not the same. A link on a site has implied authority because of the domain. For example, a link on Microsoft.com vs. some shady-looking site. There is at least some point of reference to judge by.

I download plenty of files from sites I don't know and wouldn't consider trustworthy, visiting only the Microsoft.coms of the web is really restrictive. Domain-based reputation is useful, especially if I'm actively interacting with the site and giving it data, but visiting a shady domain shouldn't make someone terrified. It's not like the site will cough and give you a virus nowadays.

> Pipe dream features

> In-place seeding: identify files on your computer that are part of an indexed torrent, and allow them to be seeded in place after having moved, renamed or deleted parts of the torrent

Does anything do this already? It would be amazing to point a client at a folder of unstructured junk and have it magically find the right parts.

I would be worried about the implications in terms of security and privacy. Back in the day, other P2P networks allowed sharing arbitrary files, and it was common for clueless users to just share their entire computer.

V2 definitely makes it easier, but it's possible to identify files with V1 by iterating through metadata looking for files with the same length as a local file, then checking each piece against the local file. If they all match, then it's the same file. For boundary pieces that span multiple files, I think it's safe enough to ignore them if all of the remaining pieces match the local file, but you could do a more complex search looking for sets of files that all have the same lengths, and then compare the piece hashes with the local files.
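A rough sketch of that V1 check, for the simplest case where the candidate file starts on a piece boundary (a single-file torrent, or the first file in one); the caller is assumed to have already filtered candidates by file length, and boundary pieces spanning multiple files would need the extra handling described above:

```go
// Compare a local file against a v1 torrent's piece hashes. pieces is the
// concatenation of 20-byte SHA-1 digests from the torrent's info dictionary,
// and pieceLen is the torrent's piece length. Assumes the file begins on a
// piece boundary and that the caller already matched the file length.
package piecematch

import (
	"bytes"
	"crypto/sha1"
	"io"
	"os"
)

func fileMatchesPieces(path string, pieceLen int64, pieces []byte) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	buf := make([]byte, pieceLen)
	for off := 0; off+20 <= len(pieces); off += 20 {
		n, err := io.ReadFull(f, buf)
		if err == io.EOF {
			return false, nil // file is shorter than the torrent: not a match
		}
		if err != nil && err != io.ErrUnexpectedEOF {
			return false, err // ErrUnexpectedEOF is expected for the short final piece
		}
		sum := sha1.Sum(buf[:n])
		if !bytes.Equal(sum[:], pieces[off:off+20]) {
			return false, nil
		}
	}
	return true, nil
}
```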

I've thought about writing a FUSE filesystem that tracks rename/move/delete and updates the torrent client. Never had the time, though.