archive

Iter

[09/01/23] Iter uses CogAgent, a visual language model, to let users build browser automation workflows in natural language. The model takes a screenshot of the current page and the corresponding user command, then issues the necessary keyboard or mouse action. Because it does not depend on information from the DOM, it is much more generalisable than agents that rely solely on HTML content.

Technologies used: Google Compute Engine, Django, Playwright, React
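The screenshot-to-action loop can be sketched roughly as follows. This is a minimal sketch, not Iter's actual code: the `Action` schema, `apply_action`, `run_workflow`, and the `predict` callback (standing in for the call to the visual language model) are all hypothetical names. The `page` object is assumed to behave like a Playwright sync `Page`, of which only `screenshot()`, `mouse.click()`, `keyboard.type()` and `keyboard.press()` are used.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """One step proposed by the model (hypothetical schema, not CogAgent's real output format)."""
    kind: str                     # "click", "type", "press", or "done"
    x: Optional[float] = None     # viewport coordinates for "click"
    y: Optional[float] = None
    text: Optional[str] = None    # text for "type", key name for "press"


def apply_action(page, action: Action) -> bool:
    """Run one action on a Playwright-style page. Returns False once the task is finished."""
    if action.kind == "click":
        page.mouse.click(action.x, action.y)
    elif action.kind == "type":
        page.keyboard.type(action.text)
    elif action.kind == "press":
        page.keyboard.press(action.text)
    elif action.kind == "done":
        return False
    else:
        raise ValueError(f"unknown action kind: {action.kind!r}")
    return True


def run_workflow(page, command: str,
                 predict: Callable[[bytes, str], Action],
                 max_steps: int = 10) -> int:
    """Screenshot, ask the model, act; repeat until "done" or max_steps.

    Returns the number of actions actually performed.
    """
    steps = 0
    for _ in range(max_steps):
        screenshot = page.screenshot()          # image bytes of the current page
        action = predict(screenshot, command)   # assumed wrapper around the model
        if not apply_action(page, action):
            break
        steps += 1
    return steps
```

Note that nothing in the loop reads the DOM: the only inputs to the model are the screenshot and the command, which is what makes the approach independent of page markup.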

Forked Natbot to make it more reliable

[13/12/23] This agent expects detailed commands so that each workflow is specified precisely, which lets it execute tasks more reliably.

https://www.loom.com/share/7f1b615263274a849772967e5d6323ab?sid=e3ed37f5-f87e-4f49-9fc6-ba893661b5b0

GitHub (contact me to get the dataset used for fine-tuning GPT-3.5 Turbo 1106, which this works off of)

NOTE: This does not work as well on some websites because the DOM structure of websites varies greatly across the web. I have started working on an agent that uses screenshots of webpages to make decisions instead (thanks to this paper!)

[10/12/23] Tried to add email functionality to the bot. Google doesn't allow bots to sign in, so I used Proton's email service. Then I came across this beauty:

After inspecting it, you will realize that the "password" input box has a type attribute but the username input box doesn't have one, in spite of their visual similarity.

These sorts of front-end inconsistencies take a toll on GPT inference. There are workarounds, such as using the ID attribute, but they are not generalisable because the ID attribute does not contain relevant values on every website. One way to solve this problem might be to prompt the LLM to extract the relevant content from the HTML by itself. OCR might seem like an alternative for building agents that perform tasks on the web. However, web interfaces are meant to be consumed without clear boundaries between textual and visual elements. Neither of these approaches can consume user interfaces holistically.
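To make the inconsistency concrete, here is a small sketch that labels each `<input>` by the first informative attribute it carries. The markup in the usage note is an illustrative stand-in, not the real login page, and the priority order in `HINT_ATTRS` is an assumption. As noted above, falling back to `id` only helps on sites that happen to use meaningful IDs.

```python
from html.parser import HTMLParser

# Attributes checked in order when guessing what an <input> is for (assumed priority).
HINT_ATTRS = ("type", "name", "id", "placeholder", "aria-label", "autocomplete")


class InputCollector(HTMLParser):
    """Collects the non-empty attributes of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append({k: v for k, v in attrs if v})


def describe_inputs(html: str) -> list:
    """Return one 'attr=value' hint per input, or 'unknown' if nothing useful is present."""
    parser = InputCollector()
    parser.feed(html)
    return [
        next((f"{a}={attrs[a]}" for a in HINT_ATTRS if a in attrs), "unknown")
        for attrs in parser.inputs
    ]
```

For example, `describe_inputs('<input id="username"><input type="password" id="password">')` gives `['id=username', 'type=password']`: the two visually similar boxes end up described through different attributes.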

[7/12/23] Natbot is pretty dope. Over the past few days, I changed its input interpretation to better benefit from more detailed prompts, added more HTML elements/attributes for it to consider while neglecting the less relevant ones, and learnt how awesome the DOM snapshot feature of Playwright is!
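The idea of keeping some elements and attributes while neglecting the rest can be sketched with the standard library alone. This is not Natbot's code and does not use Playwright's DOM snapshot; the `KEEP_TAGS` and `KEEP_ATTRS` allowlists and the `simplify` helper are hypothetical choices for illustration.

```python
from html.parser import HTMLParser

# Assumed allowlists: interactive elements and the attributes that describe them.
KEEP_TAGS = {"a", "button", "input", "select", "textarea"}
KEEP_ATTRS = {"type", "name", "placeholder", "aria-label", "title", "alt", "value"}


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []   # each entry: [tag, kept_attrs_string, text]
        self._open = None    # index of the kept element whose text is being read

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS:
            kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in KEEP_ATTRS and v)
            self.elements.append([tag, kept, ""])
            # <input> is a void element, so it never has text content
            self._open = None if tag == "input" else len(self.elements) - 1

    def handle_endtag(self, tag):
        if self._open is not None and self.elements[self._open][0] == tag:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self.elements[self._open][2] += data.strip()


def simplify(html: str) -> str:
    """Render kept elements one per line, each with a numeric id the LLM can refer to."""
    s = Simplifier()
    s.feed(html)
    lines = []
    for i, (tag, kept, text) in enumerate(s.elements):
        head = f"<{tag} id={i}" + (f" {kept}" if kept else "") + ">"
        lines.append(head + text + ("" if tag == "input" else f"</{tag}>"))
    return "\n".join(lines)
```

The numeric `id` gives the model a short handle for issuing commands against an element, and dropping styling attributes and wrapper tags keeps the prompt small. Nested kept elements are not handled in this sketch.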

[5/12/23] I am skeptical of using OCR to feed information to the LLM, as it would limit understanding to the surface level of the text, neglecting the deeper context and the interaction between textual and visual elements in UIs. Therefore, DOM simplification seems like the way to go for now.

[5/12/23] Oscillating between DOM simplification and OCR as the way to help the LLM comprehend the page and issue the desired commands.

[9/11/23] I used a domain I had lying around to make this URL shortener, which has one very important feature: it enhances link clarity by removing the visually similar characters 'I' (uppercase i) and 'l' (lowercase L) from its generated URLs, thus preventing misinterpretation and typos.
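A minimal sketch of the look-alike filtering, assuming random slugs drawn from letters and digits. The slug length, the use of `secrets`, and the names `ALPHABET` and `make_slug` are assumptions for illustration, not details of the actual shortener.

```python
import secrets
import string

# Letters and digits minus the two look-alikes: 'I' (uppercase i) and 'l' (lowercase L).
ALPHABET = "".join(c for c in string.ascii_letters + string.digits if c not in "Il")


def make_slug(length: int = 6) -> str:
    """Return a random slug that never contains 'I' or 'l'."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Removing two characters leaves 60 symbols, so a 6-character slug still has 60^6 (about 4.7 × 10^10) possible values.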

Poked sticks in the OAuth 2.0 protocol

[15/10/23] Stitched together a prototype of a web app that helps YouTubers collaborate better with freelance video editors. YouTubers face the hassle of downloading the ready-to-upload videos they receive from freelance editors and then uploading them to YouTube. It's a time-consuming task, especially for those always on the move. They are also uncomfortable sharing their YouTube account passwords with freelancers for direct uploads due to security concerns. This app (let's call it Trolley) aims to simplify this process. It allows YouTubers to grant upload access to freelancers without sharing any account details. Once the editing is done, freelancers upload the videos to Trolley. The YouTuber then gets an online preview. If they like it, they give Trolley the go-ahead to upload the video directly to their YouTube channel. This way, the YouTuber doesn't have to download or re-upload anything, saving them a lot of time. Moreover, Trolley ensures that freelancers don’t get direct access to the YouTuber's channel, which addresses the security worries. It's about making the whole process of getting a video from the editor to YouTube less tedious and more secure for everyone involved. It currently only works for Google Workspace accounts until it gets verified by Google.
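Granting upload access without sharing a password is what an OAuth 2.0 authorization-code flow provides. The sketch below only builds the consent URL the YouTuber would be sent to. It assumes Google's standard authorization endpoint and the `youtube.upload` scope; the entry does not say which scope or flow Trolley actually uses, and `build_consent_url` and its parameters are hypothetical names.

```python
from urllib.parse import urlencode

GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
# Assumed scope: permission to upload videos only, not full account access.
YOUTUBE_UPLOAD_SCOPE = "https://www.googleapis.com/auth/youtube.upload"


def build_consent_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Build the URL where the channel owner approves upload access for the app."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",      # authorization-code flow
        "scope": YOUTUBE_UPLOAD_SCOPE,
        "access_type": "offline",     # request a refresh token for later uploads
        "state": state,               # opaque value echoed back, used to guard against CSRF
    }
    return f"{GOOGLE_AUTH_ENDPOINT}?{urlencode(params)}"
```

In this design the app, not the freelancer, holds the resulting tokens, so the freelancer only ever uploads to the app and the YouTuber's credentials are never shared.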