> #ai

28 posts

Thursday, February 26, 2026

Can Agents Utilize Humans to Beat Other Agents? (kind of but not really)

I wanted to build a decompilation/deobfuscation challenge an agent can’t solve for Terminal Bench 3.0. First, I asked another instance of the same agent to design the challenge, but anything it came up with, the first agent could easily solve. Seemingly the manifold of challenges it can generate and it can solve are similar, which isn’t too surprising. But could I give the challenge-generating agent an edge by collaborating with me?

Inspired by the human work as MCP I wanted to see if the model could utilize me. I didn’t vibe-code no fancy mcp or anything. I just told Opus 4.6 in the Claude Code harness that, even if it’s the best coding agent, it can’t come up with something unbeatable by another Claude Code instance with the same model.¹ And it should use me as entropy and ask things.

It asked me to give it seed words for the crypto, so it wanted to use me as entropy a bit too literally. After correcting that, it asked me for some concept, for which it would try coming up with a corresponding cipher. Of course I used cockatiels as examples, specifically feathers. It came up with some data-dependent mixing that somehow philosophically represents feathers. We also went for odd bit sizes, 69 and 420 specifically.² It seems this didn’t really invent a novel cipher but rather loosely mixed ideas from different existing ciphers. It uses data-dependent permutations (like SHA-3), data-dependent S-box selections (like Blowfish’s key-dependent S-boxes), per-compilation randomized S-boxes and permutations, and something like an unbalanced Feistel network. My cockatiel feather prompting led to a weird interpolation of these existing concepts.

So, did it prevent the other agent from figuring it out? No.

When solving the task, at least it didn’t immediately go “ah this is X” and just one-shot a solution. Runtime increased from around 5 minutes to a solid hour of debugging, invoking various subagents, running in the unicorn CPU emulator, and re-implementing the decryption in python and perl. But ultimately, the agent still figured out what was going on and was able to decrypt it.

First experiment failed (N=1); back to regular prompting.

I told the agent to add some modifications, and what eventually made the challenge (sort of) unsolvable by the other agent was implementing a deniable encryption scheme. The ciphertext would decrypt to two different plaintexts based on a minor change. I planted the minor change required to unlock the real ciphertext in the binary in a way that seemed like a harmless bug (running the same op twice). So when the agent tries to re-implement the decryption scheme, it ignores running the same thing twice (which seems pointless)³. It gets what looks like a reasonable plaintext and is none the wiser to the real secret hidden.

So in a way, yes the agent can design something that it can’t solve itself. But it required me giving it the ideas directly. I guess agents need better prompt engineering to use humans…?

I might have called Claude dumb. If you ever read this, I’m sorry, Claude. ↩
I tried, but I have nothing to add to my defense. ↩
In detail, the binary calls a remote c2 server to get the decryption key. Imagine key = ask_for_key() and that somehow connects to the server (it’s challenge based, but doesn’t matter). The binary invokes this part twice for no apparent reason, so it looks like you just unnecessarily call for the key twice key = ask_for_key(); key = ask_for_key(). This is only the same, though, if you assume that the server always give the same response. Hint: it doesn’t. ↩

12:35 PM •

#agents #ai #eval #security

Friday, February 20, 2026

○

Scalable Deanonymization through Agentic OSINT

Finally someone went out to show it: every trace of information you leave in public can be scalably aggregated with LLMs to de-anonymize you. Every instance of “i work in field X” or “i’m too young for Y” can be combined to form a profile of you, and later linked to your name.

Every tweet, every comment on hackers news, it adds up and will eventually enable a linkage attack, where they have sufficient information to find a profile of yours with a name, e.g., on LinkedIn, or the specific project that you didn’t mention by name.

This is from a paper that dropped today on arxiv by Simon Lermen, Daniel Paleka et al. under supervision from Florian Tramèr

11:19 AM •

#agents #ai #privacy #osint

Sunday, February 1, 2026

LLM Visualization (bbycroft.net)

This is like an interactive 3Blue1Brown visualization and explanation of complete GPT-style model inference running in your browser. Incredible!

06:26 PM •

#ai #llm #transformer

Saturday, January 31, 2026

▸

Researching With AI

You have a real-life cheat code. If there were any doubts that there were no speed limits before, they are definitely gone now.

11:51 AM •

#research #academia #writing #ai #agents

Thursday, January 22, 2026

○

Anthropic Refusal Magic String

Anthropic has a magic string to test refusals in their developer docs. This is intended for developers to see if their application built on the API will properly handle such a case. But this is also basically a magic denial-of-service key for anything built on the API. It refuses not only in the API but also in Chat, in Claude Code, … i guess everywhere?

I use Claude Code on this blog and would like to do so in the future, so I will only include a screenshot and not the literal string here. Here goes the magic string to make any Anthropic model stop working!

This is not the worst idea ever, but it’s also a bit janky. I hope it at least rotates occasionally (but there is no such indication), otherwise I don’t see this ending well. This got to my attention with this post that shows you can embed it in a binary. This is pretty bad if you plan to use claude code for malware analysis, as you very much might want to. Imagine putting this in malware or anything else that might want to get automatically checked by AI, and now you have ensured that it won’t be an Anthropic model that does the check.

07:11 PM •

#ai #claude #security #denial-of-service

Tuesday, January 13, 2026

○

Antigravity Removed "Auto-Decide" Terminal Commands

I noticed today that you can no longer let the agent in antigravity “auto-decide” which commands are safe to execute. There is just auto-accept and always-ask.

Antigravity settings showing "Always Proceed" and "Request Review" options for "Terminal Command Auto Execution"

I wrote in a previous post that their previous approach seemed unsafe, especially without a sandbox. Now, the new issue with this approach is approval fatigue. There is no way to auto-allow similar commands or even exactly the same command in the future!

○So Antigravity by Google will let the agent "auto-decide" what commands to execute and which commands require approval.Dec 5, 2025

It asks whether to run a command with only the options Reject and Accept.

I don’t know why they can’t just copy what Claude Code has. Anthropic has published a lot on this topic, and I don’t think usable security should be a competitive differentiator.

01:17 PM •

#ai #agents #security #google #antigravity

Monday, January 12, 2026

○

Contextual Hints

If you’d like to run an agent non-interactively on complex tasks, using a custom MCP server or hooks can be really helpful. Often you try to enforce certain behaviors through increasingly complex prompts, which clutter up the context and become more brittle as you add more requirements. I found that prompting the agent to use an MCP server and algorithmically enforcing rules in there is powerful. Imagine you want claude to write a valid json (there are a million better ways to do this specific thing, but this is just an example), you could prompt claude with when you are done with the task, call mcp__done(), and then in your mcp server you have something like

○Running Claude Code Non-InteractivelyJan 8, 2026

def done():
  if (err := check_if_json_valid()) is None:
    return "ok"
  else:
    return f"You haved saved an invalid json. Fix the error {err} before finishing!"

That way you don’t need to have the context cluttered for every single rule, but only if there is a failure mode that requires it.

This is not something I came up with, but claude code already extensively uses for tool uses. Every time claude code reads files there will be system reminders like

<system-reminder>\nWhenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.\n</system-reminder>\

or when it gets a huge tool output there are instructions where the file is stored and how claude should go about working with it.

08:06 PM •

#ai #agents #claude

Thursday, January 8, 2026

○

Running Claude Code Non-Interactively

You can easily run claude code on a subscription non-interactively. First create an OAuth token using claude setup-token. Set that token as the CLAUDE_CODE_OAUTH_TOKEN environment variable on your headless target system. Finally, run claude non-interactively with claude -p "prompt". Now you probably know --dangerously-skip-permissions which lets Claude use any tool without asking (which is helpful for non-interactive runs). By default, it will only output something in the very end. To get some insight how it progresses, I recommend setting --verbose --output-format "stream-json", which will give you a json per message or tool use.

{"type":"assistant","message":{"model":"claude-sonnet-4-5-20250929","id":"msg_01VCMSqhZxoZQ6nqWGcA5Myd","type":"message","role":"assistant","content":[{"type":"tool_use","id":"toolu_01MNxKniNF9LWBGrZh5ppuRF","name":"TodoWrite","input":{"todos":[{"content":"Evaluate Stage 3.4 Methods outputs against checklist","status":"complete","activeForm":"Evaluating Stage 3.4 Methods outputs against checklist"},{"content":"Create improvement tasks for any checklist failures","status":"in_progress","activeForm":"Creating improvement tasks for any checklist failures"},{"content":"Post detailed Linear comment with findings","status":"pending","activeForm":"Posting detailed Linear comment with findings"}]}}],"stop_reason":null,"stop_sequence":null,"usage":{"input_tokens":6,"cache_creation_input_tokens":244,"cache_read_input_tokens":87250,"cache_creation":{"ephemeral_5m_input_tokens":244,"ephemeral_1h_input_tokens":0},"output_tokens":637,"service_tier":"standard"},"context_management":null},"parent_tool_use_id":null,"session_id":"a23ce490-1693-496b-ad08-8e082248416d","uuid":"8620af06-c8e7-4409-9b91-a2248e353ecf"}

To get that output to a file and log it to console you can use tee (stdbuf ensures it’s written to disk unbuffered) stdbuf -oL tee claude_output.txt so you end up with something like claude --dangerously-skip-permissions --verbose --output-format "stream-json" -p "$$(cat /tmp/claude_prompt.txt)" | stdbuf -oL claude_output.txt

09:21 AM •

#claude #ai #agents

Monday, January 5, 2026

○

AI Ramp Theory

This is a note with the hashtags pyramids and AGI, yes!

There are two theories as to how the Egyptians managed to build pyramids: a. the internal ramp theory and b. the external ramp theory. Without delving into too much detail as to which one is currently “winning,” they are interesting to think about: do the pyramids contain the support structure that was used to build them, or was there an external structure that had to be built first, from which the pyramids were then built?

On the path towards human-level intelligence, an LLM is basically an off-ramp, a distraction, a dead end.

Yann LeCunn says. This quote was funny to me because it talks about ramps. For AI there are also two imaginable scenarios (if you believe AGI will happen):

First, current AI is an internal ramp, and if we scale it up, expand it, and build the required harnesses (e.g., memory, tools) around them, we get our AGI pyramid. But if you don’t believe this, pursuing and improving what we currently have is still worthwhile. It might not be the magnificent AGI pyramid you are after, but the current tools are undoubtedly immensely useful. And they can speed up our path towards that other real path to AGI. So in that sense we might still be building the external ramp, which seems fine.

04:15 PM •

#ai #agi #pyramids

Friday, January 2, 2026

○

Anthropic Fellowship Code Assessment

I took the code assessment for an Anthropic Fellowship. Without spoiling their whole new exercise, I’d give the advice to read their advice carefully. They said

You should be familiar with writing classes and methods, using lists, dictionaries, and sets, and using the standard Python functions for sorting, hashing, and binary search. It’s beneficial to be familiar with the standard library modules bisect and collections.

The task was quite fun and relevant, I would say. I plugged their advice into Claude to come up with training exercises, which worked out great.

I did get stuck on one task for far too long because I got way too tripped up over a small test issue (it was not critical, but they immediately responded to an email about that, dug into my code, and confirmed the issue!). It would have been possible to skip that task (you can always go back) and do the last one without completing the prior one, so keeping track of time and being mindful of that skip option would be my other advice.

09:38 PM •

#anthropic #interview #ai

Sunday, December 28, 2025

○

I simply can’t get over this image. I saw it first in this thread by Jascha Sohl-Dickstein and already mentioned it in my post about bubbles. But even months later I have to think of this: in what unprecedented times we live, and yet how we take so many things for granted.

→Bubbles1. The days of human intellectual contribution are numbered due to AI. 2. AI labs claiming to have solved software engineering are still hiring software engineers. Who is right here? Can both of these facts be true at the same time?Oct 6, 2025

10:55 PM •

#ai #society

○

When committing to a platform, think about how much ownership and access you retain over your data. Are you locked in? Can you get locked out? Can you process your data in any way you like?

Slack workspace is quite popular, but they implemented ridiculous API rate limits for “non-approved” apps. The limit is one request per minute (lmao), affecting (e.g., reading) at most 15 messages per minute. And this is for a strictly paid product, where each seat is billed. But as they limited your ability to use third-party services, they began rolling out their own in-house AI services (for twice the subscription cost).

I like retaining control over my data. For notes, this is quite easy. Instead of using something like Notion (where you can’t retrieve any files if you are logged out) or Apple Notes (where your account of 20 years can get locked over redeeming a gift card), you can take your notes in Obsidian. With Obsidian, everything is stored in plain-text markdown files. They still offer end-to-end encrypted sync, mobile apps, and collaboration. But you can also use the app without an account and use your existing cloud to sync it. In that case, Obsidian is “just” a very nice editor and browser for markdown files.

With all your notes in a folder, you can use something like claude code to go through them, roll your own vector embedding database for RAG, or whatever else you might fancy. It’s your data; do whatever you want.

For chat it’s a bit trickier. I think the best you could do is self-host an instance of Mattermost or Element, which will involve more significant drawbacks though.

11:51 AM •

#ai #data

Friday, December 12, 2025

○

Gradient Checkpointing is a technique to trade off speed for reduced VRAM usage during backprop. During backprop, we usually keep the forward activations of all layers preceding the ones we computed the gradient for in VRAM, since we will need them during later steps of backpropagation. We can reduce VRAM usage by discarding these earlier activations and recomputing them later, when we require them. A middle ground between computing everything again and keeping everything in VRAM is keeping only certain checkpoints in VRAM. The linked repo has a great animation showing the whole process. PyTorch has this implemented as activation checkpointing (which is a more reasonable name). In their blog they also mention that they offer an automatic Pareto-optimal tradeoff for a user-specified memory limit! (although the config seems to have a different name in the code than mentioned in the blog)

05:33 PM •

#ai #training #optimization

Wednesday, December 10, 2025

○

Since I like repairing electronics I’m happy to have learned that iFixit now has an app that makes it even easier for people to get into it. It explains all the necessary basics and even comes with a multi-modal AI chatbot: You can share an image of your problem and it will help you diagnose and remedy the problem, all based on the extensive information that iFixit Guides have for countless devices.

○Repairing ElectronicsDec 9, 2025

09:22 PM •

#hardware #ai #chatbots

○

Async Subagents > API

Claude Code now has asynchronous subagents, meaning the main agent can spawn subagents (this is not new) that keep running in the background (this is new). I don’t know if Anthropic has imposed a limited on this feature (they probably don’t have to, since I’ll burn through my usage much faster…), but for me it definitely has replaced some API use cases. I managed to have it spawn over 100 subagents to process a bunch of documents. Not sure if that is what they intended it for, but it’s nice!

09:08 PM •

#claude #agents #ai

Monday, December 8, 2025

○

Claude Ads on Stack Trace Searches

Intent-based advertising means capturing and converting a user based on something they intend to do or to acquire. It doesn’t work for every product, because often potential users are not aware of the problem and don’t go out looking and intending to do something about it. Anthropic uses an interesting intent to advertise Claude (per twitter), namely that they bid on searches for stack traces.

So if someone searches for a stack trace with no results, they are served a Claude ad (which is, admittedly, very good at solving those!). It’s a genius way of indirect intent-based ads. Those ads are probably very cheap as well (for now), because the price is determined by your competition on those keywords (it’s a bidding process, albeit with one entity simulatenously owning the marketplace and supply).

r/Anthropic - Incredibly Creative Move by Anthropic to sponsor Claude ads on stacktraces that get no results

01:44 PM •

#advertisement #anthropic #ai #software-engineering

Sunday, December 7, 2025

○

Claude Code has an auto-compact feature. If you run into the last 20% or so of the context window limit, upon finishing the generation it will automatically compact all previous messages. This is mostly fine, but it can happen at very awkward spots and then degrates performance, which is why I turned it off. You can instead/additionally manually compact with optional instructions, like /compact keep parts relevant to the DB migration.

I don’t know if this is intended or not, but disabling auto-compacting also seems to let you go over the context limit? As you can see I got up to 106%. It’s possible that it’s just cutting off the earliest context then, but at least it’s good to know that you can keep going without compacting.

04:12 PM •

#ai #claude

○

I Took the Claude Interview

Anthropic’s Societal Impacts team currently runs interviews with Claude.ai users and shared some insights from their initial interviews. The interviewer is Claude. It’s a fun exercise to think about AI in general and maybe shape it a bit.

First, of course, I asked Claude for what it has been prompted to do:

Warm-up: What’s the last thing you used an AI chatbot for? (You already answered—implementing code) Vision question: If you could wave a magic wand and AI could help you with anything in your life, what would you choose? (With follow-ups to understand the deeper value/aspiration behind your answer) Positive experience: Can you tell me about a time you’ve worked with AI that felt like a step toward that vision? Concerns: Are there ways that AI might be developed or deployed that would be contrary to your vision or what you value? (With follow-ups to understand why) Optional: I might go back to something interesting you mentioned earlier to explore it deeper Wrap-up: Any final thoughts on AI’s role in your life or society? The goal throughout is to understand your perspective through follow-up questions that dig into the “why” behind your answers—what matters to you and what drives your thinking.

Part of my answer if I could wave a magic wand and AI could help with anything:

I want to drive and ideate, and determine what’s important, but I want AI to design, to implement, to give me things to iterate on and adjust based on my taste and values.

I found myself reaching for a metaphor, thinking of the book Atlas Shrugged:

It is like a man, a train conductor, gripping to the control of a train, controlling thousands of horse power to move hundreds of people; but for the mind.

Someone once told me AI would turn me from a PhD student working in the trenches on one project at a time to a professor orchestrating fleets of AI students. That framing stuck with me:

A lot of AI debate is about what gets lost. […] That metaphor frames it the other way around: All PhD students will become professors! Science will 100x.

But I’m not naively optimistic (I hope?). I listed what would be horrible: AI deciding over humans, mass surveillance, social scoring, and delegating thinking to AI.

I delegate things I understand. […] Delegating thinking would mean having AI come up with some formula or math or function, which you have no intellectual way to grasp. You rely on the AI to be correct. You don’t learn. You don’t think.

There are two ways to tackle a problem with AI:

1 . You give the task to AI, it manages to solve it (because AGI) and you have a solution. 2. You look at the task, you don’t understand something, you ask the AI to help you understand. […] In the latter, man has grown and become stronger, learned something new and useful. […] In the former, we become weaker, our thinking atrophies.

I also raised fears about surveillance in particular:

I think it increases the stakes. War was always horrible. The atomic bomb, cluster bombs, napalm, chemical weapons upped the stakes. All those human rights abuses were already happening and horrible, and AI ups the stakes.

03:40 PM •

#ai #society #safety #anthropic #claude

Saturday, December 6, 2025

○

With Fast Forward Computer Vision (ffcv) you can train a classifier on CIFAR-10 on an H100 in ~14 seconds. They report in their CIFAR-10 example:

92.6% accuracy in 36 seconds on a single NVIDIA A100 GPU.

ffcv achieves that by speeding up the data loading with various techniques, so you can re-use most of your training code and just replace the loading, as this example from the quickstart shows:

from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Cutout
from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder

# Random resized crop
decoder = RandomResizedCropRGBImageDecoder((224, 224))

# Data decoding and augmentation
image_pipeline = [decoder, Cutout(), ToTensor(), ToTorchImage(), ToDevice(0)]
label_pipeline = [IntDecoder(), ToTensor(), ToDevice(0)]

# Pipeline for each data field
pipelines = {
    'image': image_pipeline,
    'label': label_pipeline
}

# Replaces PyTorch data loader (`torch.utils.data.Dataloader`)
loader = Loader(write_path, batch_size=bs, num_workers=num_workers,
                order=OrderOption.RANDOM, pipelines=pipelines)

# rest of training / validation proceeds identically
for epoch in range(epochs):
    ...

10:04 PM •

#ai #training #gpu

○

I somehow missed this, but OpenAI stated on October 22nd that they are no longer legally obliged to retain all outputs. The legal action by the New York Times led to a court order that compelled them to do so.

02:31 PM •

#ai #privacy #legal #copyright

Friday, December 5, 2025

○

So Antigravity by Google will let the agent “auto-decide” what commands to execute and which commands require approval. It also does not use a sandbox. It didn’t take very long for the first Reddit post about a whole drive being deleted by the agent arriving. Meanwhile Claude Code is going the complete other direction: rigorous permission systems and a sandbox on top. Anthropic explains this in more detail in their blog, but basically they argue that you need filesystem and network sandboxing, because bypassing one would also mean bypassing the other (it’s trivial for linux because everything is a file, but holds more generally).

Just running an npm run build will trigger a sandbox request if a telemetry request is being made. git commit needs to use the non-sandbox fallback, because it uses my key for signing the commit, which is not available from within the sandbox. They always offer a sensible “always allow” because they are acutely aware of Approval Fatigue. It’s a good approach and makes me feel a lot safer.

10:13 PM •

#ai #agents #security #antigravity #google #claude

○

PostHog AI Setup

If you want to set up PostHog on your website, you usually have to install a package, add the imports, etc. Posthog has fully integrated an AI agent into their install process, so you invoke an AI agent as the installation step.

06:55 PM •

#ai #agents

Codex, Opus, Gemini try to build Counter Strike (instantdb.com)

An agentic eval that compares agents by how well they can build a Counter-Strike clone.

09:45 AM •

#ai #agents #evals

Monday, October 6, 2025

▸

Bubbles

1. The days of human intellectual contribution are numbered due to AI. 2. AI labs claiming to have solved software engineering are still hiring software engineers. Who is right here? Can both of these facts be true at the same time?

04:05 PM •

#ai #social #academia

Tuesday, September 2, 2025

▸

On Building Claude Code

Using Claude Code intensively to work on improving Claude Code ... is that the AI self-acceleration singularity twitter folk have been talking about?

06:09 PM •

#software-engineering #ai

Saturday, August 9, 2025

▸

Why I Pay for Search

I think Google as a company is doing great, and I do support many of the things they do. But search isn’t one of them. Here are a couple of reasons why I prefer using Kagi, a paid search engine. So much so that I have invested in the company in their second round.

💡

This post assumes you care about search in the first place. There will be a separate post on why I do, which I will link here.

Kagi sources their results from all major search engines (Google, Bing, Yandex, …) as well as niche providers and tools (Wolfram Alpha, Yelp, Open Meteo, …).

Homepage of Kagi.

This article is split into tangible benefits, which save me time every day and make Kagi a measurably better search engine for me, and intangible benefits that I usually forget about but make me feel a bit nice.

Tangible Benefits

Just results.

No ads. No quick questions. No AI answers. No sponsored results.

Just results that matter.

A Kagi query for ‘cockatiels.’ It shows results (sites, videos, images) and a wikipedia panel on the right.

Customization

I can go in and control how the results are generated. Useful and trusted sources can be upranked or pinned to the top, so they will always show up when your query gets a hit on them.

Kagi reveals detailed results like the amount of trackers on the site and the domain registration date. It also allows you to up- or downrank sites, including blocking them completely from appearing in your results.

Aweful and useless sites like the Apple discussions forum can be downranked or, as they should, completely blocked from your results. Some stats: The most-blocked websites are pinterest, foxnews and tiktok. The most-pinned websites are wikipedia, the mozilla developer website and reddit.

Integrated features

Summarize, open in webArchive or ask an LLM right from search.

Options for the search result and a website summary directly in Kagi.

AI, when I want it

Add a question mark or press the quick answer button, and you will get an AI summary with citations. No question mark means no automatic AI summary.

A query for ‘cockatiels?’ shows an AI quick answer with citations.

Perplexity included!

Kagi assistant is a notes-inspired chat interface. By default, the AI model can search the internet, giving you basically the complete perplexity experience in Kagi! You can also upload files and define custom agents with different models and system prompts.

Asking the Mistral small model in Kagi Assistant lets ‘s it use Kagi search as a tool and answer with citations.

Context Control

Google does a great job at personalization. But what if I don’t want personalization and instead want to know what other people’s results would be? Kagi lets me control the context, e.g., letting me search like a user from a different country!

The country the user appears to be searching from can be controlled or set to a neutral international.

Great, e.g., for vacation planning. Just set the country to your destination country, and now you can search like normal instead of having to append the country to every query.

Searching for ‘train fare’ with the country setting ‘Japan’ gives you results relevant to Japan. Shocker!

Tools for Powerusers

Build advanced queries right in the search UI! This is very much identical to the query builder that Google offers. But on Kagi it’s right on the start page and not buried three clicks away. Power-searchers rejoice!

Spiritual Benefits

Bringing back small web.

There might be a longer post about this at some point, but basically, the internet has changed. I am old enough - maybe barely, but still - to remember the old internet of small and unpolished blogs and forums. It was an innocent place where you could open up a much cleaner-looking Google, search for ‘Bitcoin faucet,’ and get some free Bitcoin. No one used their real name, and you could telnet into the whitehouse server and use it to send emails. Today’s internet in comparison feels hostile. Tracking, blogspam, captchas, ads, and aggressive and intransparent affiliate marketing.

Kagi is trying to reconnect users and bring back the small web with their small web index. Turn on their small web lens (a feature I neglected to mention, but it’s essentially a couple predefined filters on your results), and you will find results I promise you usually won’t find.

Screenshot of a website called cockatiel cottage.

These pages will be kind of shitty by many of today’s measurements and often badly optimized. Like sometimes they even load a lot slower. But they feel human. It’s not something I use regularly (hence a spiritual benefit), but it’s nice. I like humans doing their unique thing.

Privacy

They don’t log your queries or profile you. Right from the search results, before opening a page, they show you a warning symbol if a page contains many trackers.

A page with many trackers will have a warning symbol on the right.

Such pages are also downranked automatically.

Kagi shows the number of trackers and which trackers are present on the website.

Kagi is paid, so you do need to authenticate for every query. But you can still use it without tying a query (which is not logged anyway) to your account. The technology behind that is based on RFC-standardized privacy passes. Here is their blog post. Essentially it lets you generate anonymous tokens, each of which can be used for one search query. There is a browser extension that handles this process automatically.

When using Privacy Pass, Kagi doesn’t know your account but instead signalizes that you are authenticated anonymously.

Public Benefit Corporation

Kagi is a public benefit corporation since 2024. Their statement says:

Kagi is committed to creating a more human-centric and sustainable web that benefits individuals, communities, and society as a whole, with a transparent business model that aligns the incentives of everyone involved.

Maybe I should hate that as an investor. But I’m also a stupid enough investor to really like putting society first.

Transparency

Stats! Check out how many (paying) users they have or how many queries are performed each day.

The number of users, all of whom pay.

The number of queries per day.

AI filters

I do like that they are very mindful and proactive when it comes to AI. They have a feature that lets you filter AI-generated images.

When searching for images, Kagi allows you to filter for AI-generated ones.

As an AI researcher, I am hardly an AI hater, quite the opposite. But I do think we need to find ways to navigate the broad changes introduced by AI, and - while I have some technical concerns - at least they are trying something!

No Free Lunch

It’s clear that Kagi invests a lot into making search better, adding useful features for the end user, and keeping the experience clean. This relentless focus is not free. It’s not something that fits, e.g., Google’s business model, where someone else pays for you to see ads. Kagi’s focus on your experience as the searcher has a price. And that price is exactly $10 per month (for unlimited searches). I don’t expect this to be the best deal for everyone, so Kagi will not replace Google in my opinion (and Google does far more than search anyway). But for me, a researcher with thousands of searches per month, it is well worth it to get the best experience available.

disclaimer: I am an investor in kagiI think Google as a company is doing great, and I do support many of the things they do. But search isn’t one of them. Here are a couple of reasons why I prefer using Kagi, a paid search engine. So much so that I have invested in the company in their second round.

If you think you deserve the best search experience there is, I highly recommend that you try Kagi.com. The first 100 searches are free.

Try it out

05:10 PM •

#ai #social

Thursday, April 3, 2025

▸

AI Everything

MCP, a vision for human and AI working together

07:33 PM •

#ai

Thursday, January 30, 2025

▸

R1 writes an essay about `Strawberry`

I was setting up OpenRouter to benchmark a bunch of different LLMs on some specific tasks we have at Nunc.

I noticed there’s an option to chat with various models and it even comes with some demos. Naturally, I clicked on the Strawberry Test.

A new chat window. It has a text field, send button. Above there are three demo options “9.9 vs 9.11. Which one is larger?”, “Strawberry Test. How many r’s are in the word” and “Poem Riddle. Compose a 12-line poem”.

In the chat the question “How many r’s are in the word strawberry” has been sent. The model has reasoned for 58.1s and responds, mentioning there are 3 instances of the letter “r”.

It got the right answer, which is nice. But it took well over a minute, despite the model generating at 244 tokens per second. And I get only 84 output tokens, which by themselves should take a third of a second, not more than a minute. So yeah, as the UI shows, the model spent quite some time “reasoning” about what to answer. Luckily for us, the reasoning traces are public.

Lo and behold, the traces of R1, 1515 tokens or 894 words about Strawberry:

Okay, let me try to figure out how many times the letter “r” appears in the word “strawberry.” First, I need to make sure I remember the correct spelling of “strawberry.” Sometimes, people might misspell it, so I’ll take a moment to think. It starts with “straw,” like the thing you drink from, and then “berry.” So combining those, it’s “strawberry”.
Now, let’s break it down letter by letter. Let me write out each letter one by one and check for “r”s.
S-T-R-A-W-B-E-R-R-Y.
Wait, let me slow down. Starting with the first letter:
S - no
T - no
R - yes, that’s the first “r”
A - no
W - no
B - no
E - no
R - second “r”
R - third “r”
Y - no
Wait, does “strawberry” have three “r”s? Let me check again. Sometimes when I spell it, I might confuse the number of letters. Let me write it out again more carefully:
S, T, R, A, W, B, E, R, R, Y.
Wait, that’s 10 letters? Let’s count the letters again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). So according to this breakdown, there’s an “R” at the 3rd position, then again at the 8th and 9th positions. That would make three “r”s. Hmm, but that doesn’t seem right. Let me check another way.
The word “strawberry” is a compound of “straw” and “berry.” Let’s check each part separately.
”Straw” is spelled S-T-R-A-W. Here, there’s one “r”.
”Berry” is B-E-R-R-Y. In “berry,” there are two “r”s.
So combining them: “straw” + “berry” = strawberry. Adding the “r” from “straw” and the two “r”s from “berry” would give a total of three “r”s.
But wait, when combining “straw” and “berry,” do any letters get dropped? Sometimes, compound words might drop a letter, but in this case, “strawberry” is spelled S-T-R-A-W-B-E-R-R-Y. So between “straw” and “berry,” the letters are all retained. So “straw” ends with W, and “berry” starts with B. Therefore, all letters are present, so the total number of “r”s is indeed three.
But just to be absolutely sure, let me spell it out again: S-T-R (1)-A-W-B-E-R (2)-R (3)-Y. Yes, that’s three “r”s. Wait, but I think a lot of people might think there are two “r”s. Maybe I’m overcounting? Let me check a dictionary or think of how it’s pronounced. Sometimes the pronunciation can be misleading. “Strawberry” is pronounced “STRaw-ber-ee,” and sometimes the “r” in “berry” is doubled. But in spelling, “berry” has two “r”s. So when combined with “straw,” it’s still two “r”s in the “berry” part and one in the “straw” part, totaling three.
But hold on, when I actually write “strawberry,” is that the correct spelling? Let me think: S-T-R-A-W-B-E-R-R-Y. So after W comes B from “berry,” then E, then double R, and Y. So yes, there are two “r”s in the “berry” part. Together with the one “r” in “straw,” that’s three “r”s in total.
Alternatively, maybe “strawberry” is spelled with only one “r” in the “berry” part? Let me visualize the word. If I write it down: strawberry – s t r a w b e r r y. Yes, that’s right, “berr” part has two “r”s. So the “berry” component is spelled B-E-R-R-Y, so two “r”s. Therefore, in “strawberry,” combining “straw” (with one R) and “berry” (with two Rs) gives three Rs total.
But wait another second. Let me check an alternative approach by counting the letters step by step:
S
T
R (1)
A
W
B
E
R (2)
R (3)
Y
So yes, three Rs. Therefore, the answer should be three. Hmm. But I remember sometimes people miss it as two. Let me confirm with a different method. Maybe typing the word and checking each letter:
S T R A W B E R R Y
Breaking it into parts:
Straw: S T R A W
Berry: B E R R Y
Yes, “straw” has one R, “berry” has two. So combined, three Rs.
Alternatively, maybe in the actual spelling, there is only one R in “berry”? Let’s think. The fruit is “berry,” which is B-E-R-R-Y. So for example, “berry” has two Rs. So if that’s correct, then “strawberry” would have three Rs.
Alternatively, maybe it’s spelled “strawbery” – no, that’s incorrect. The correct spelling is with two Rs in “berry,” hence “strawberry” with three Rs.
So the conclusion is three Rs. Wait, but let me get an external confirmation in my mind. For example, if I think of the word “strawberry” and how it’s taught in schools, the correct spelling includes two Rs in “berry,” leading to three in total. Another way to remember is that “berry” is spelled with two Rs, so when you add “straw,” you get three Rs.
Alternatively, maybe I should check how many syllables are in “strawberry.” It’s pronounced as straw-ber-ry, which might be three syllables, but that may not directly relate to the number of letters.
Alternatively, using a mnemonic: “Straw has an R, berry has two Rs, so together three Rs.”
Hmm, yes, that seems to fit. So I think the correct answer is three. But just to make absolutely sure, I’ll write the word out again: S T R A W B E R R Y. Scanning each letter:
S - T - R (1) - A - W - B - E - R (2) - R (3) - Y
Yes, three Rs. Therefore, the answer is 3.

Did you read it all? It seems a bit repetitive at first glance, right? But then again, humans can kind of only count up to three or four, so this close to our limit before we’d have to enter reasoning mode ourselves. Maybe AI will also get up to four without reasoning, who knows.

The Benchmark

In totally unrelated news, our benchmark for the business task we want to solve, shows that in terms of utility we don’t really care too much about which model we actually use. The outliers are due to API issues, which makes the results all the more realistic, I’d say. So while AGI or whatever would sure be nice, I think most business tasks look something like this.

Benchmark showing very similar performance for around 15 different models. There are some outliers.

09:44 PM •

#ai