Micromark as Markdown parser for SilverBullet?

mjf · December 27, 2024, 9:19am

A crazy idea to consider… I see many miscellaneous issues and feature requests that all ultimately reduce to fixing or extending SB’s MD parser, so I wonder whether a “battle-tested” and extensible MD parser such as Micromark could be used to make SB’s parsing of MD more “bullet-proof”? According to the project’s description, Micromark can be used both to generate HTML directly (which could serve some “preview” or “print” modes) and to “do really complex things with Markdown” and “give tremendous power, such as access to all tokens with positional info” (which SilverBullet indeed needs for, e.g., indexing, referencing, etc.), providing precise AST output.

P.S.: I know this sounds like a sort of heresy, but I still want to ask whether replacing SB’s parser with a third-party library would be a NO-GO or a GO (with the possible benefit of reducing the burden of fixing SB’s MD parsing layer all the time).

P.S.2: I have absolutely no idea how difficult it would be to replace SB’s current parser. Therefore, it’s just a crazy idea.

zef · December 27, 2024, 1:17pm

My starting point is that everything is possible, but some things are not worth the effort. SilverBullet is built on CodeMirror, which is strongly tied to the Lezer parser system. In principle any parser that can produce a Lezer parse tree can be plugged into CodeMirror and therefore SilverBullet. In fact the markdown parser that SilverBullet is based on (but extends) is a custom parser doing this. If you can somehow adapt what your suggested parser produces to be a Lezer parse tree, it should work. How hard would that be? I don’t know. It’s not trivial for sure, and it’s not anything I’m personally interested to try (I already struggled quite a bit to get SB’s parser to where it is today), but if you want to have a look, the linked pages can be a starting point.
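To make the "adapt it to a Lezer parse tree" step a bit more concrete, here is a minimal sketch, not a working integration. It assumes the other parser can be reduced to nodes with a type name and start/end offsets (the `SourceNode` shape, the `toLezerBuffer` helper and the numeric type ids are all made up for illustration) and flattens them into the flat buffer layout that Lezer's `Tree.build` accepts: four numbers per node (type id, start, end, size), with children written before their parent.

```typescript
// Hypothetical node shape: anything that reports a type name plus
// start/end offsets could be mapped onto it.
interface SourceNode {
  type: string;
  from: number;
  to: number;
  children?: SourceNode[];
}

// Flatten a tree into a Lezer-style buffer: four numbers per node
// (type id, start, end, size), children before their parent.
function toLezerBuffer(
  node: SourceNode,
  typeIds: Map<string, number>,
  out: number[] = [],
): number[] {
  const startIndex = out.length;
  for (const child of node.children ?? []) {
    toLezerBuffer(child, typeIds, out);
  }
  const id = typeIds.get(node.type);
  if (id === undefined) throw new Error(`Unknown node type: ${node.type}`);
  // size = this node's own four slots plus everything its descendants wrote
  out.push(id, node.from, node.to, out.length - startIndex + 4);
  return out;
}

// Made-up ids; in a real setup they would index into a Lezer NodeSet.
const typeIds = new Map<string, number>([
  ["Document", 1],
  ["Paragraph", 2],
  ["Emphasis", 3],
]);

// Node offsets for the text "Hi *you*"
const doc: SourceNode = {
  type: "Document",
  from: 0,
  to: 8,
  children: [
    {
      type: "Paragraph",
      from: 0,
      to: 8,
      children: [{ type: "Emphasis", from: 3, to: 8 }],
    },
  ],
};

const buffer = toLezerBuffer(doc, typeIds);
```

Handing this buffer to `Tree.build` would still require a matching `NodeSet` and a `topID`, and whether the root node belongs in the buffer at all should be checked against the Lezer documentation. The sketch also ignores incremental parsing, which is probably where most of the real effort would go.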

mjf · December 31, 2024, 7:41am

Aha! OK, I completely forgot that the CM parser is based on Lezer (which is inspired by or based on Tree-sitter, AFAIK). So is it really the case that SB can theoretically use the resulting AST for, e.g., indexing and referencing things? Let’s discuss it further, please.

The primary issue I see and have with SB (and now I also see it probably has nothing to do with SB’s parser) is that things like tags and attributes are not “scoped” to the AST leaves (and their ancestors). This is what I think should be fixed and unified for both. Tags and attributes should be scoped and handled in exactly the same way, and a way to reference a particular paragraph, list item or table cell should be provided, if possible, IMHO.

Now that I know SB has a true AST “at hand,” some better way of referencing into a document’s (page’s) AST also seems possible (there is already a stalled discussion about this here, and some issues as well; those were my ideas for extending the referencing capabilities as a fresh SB user at the time, and I can perhaps imagine something better now).

With the AST at hand we can get rid of many limitations, such as not having full UTF-8 support everywhere, which may (?) be achieved just by extending the markdown parser (e.g., tags containing characters like [ěščřžýáíéůú] etc. do not autocomplete now, which is a great limitation for people not using English/ASCII exclusively).
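To illustrate the kind of limitation I mean, here is a small sketch. Both patterns are hypothetical and are not SB’s actual tag grammar; they only show how an ASCII-only character class misses a tag with diacritics, while one built on Unicode property escapes picks it up.

```typescript
// Hypothetical patterns for illustration only; not SilverBullet's real tag syntax.
const asciiTag = /#[A-Za-z0-9_\/-]+/g;
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers (requires the `u` flag)
const unicodeTag = /#[\p{L}\p{M}\p{N}_\/-]+/gu;

// Return every substring of `text` that matches the given tag pattern.
function findTags(text: string, pattern: RegExp): string[] {
  return text.match(pattern) || [];
}

const sample = "Notes on #žluťoučký and #todo";
const asciiResult = findTags(sample, asciiTag);
const unicodeResult = findTags(sample, unicodeTag);

console.log(asciiResult, unicodeResult);
```

With the ASCII-only pattern the first tag is not found at all, so anything built on top of it (autocomplete, indexing) would never see it.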

I am just searching for some way to unify the things mentioned above so that SB behaves in a more intuitive, natural and coherent way. To be honest, I was quite surprised when I realized I cannot access individual table cells by tags, for example. All of these are just my ideas on how I think SB should be extended and “polished”; I mean the very core functionality (indexing, referencing, encoding).

dmick1954 · January 5, 2025, 2:59pm

That would be a huge, time-consuming project. If it is that important to you, perhaps it is something you should consider taking on.

mjf · January 6, 2025, 11:07am

Perhaps I will try one day… For now I mostly lack tags/attributes for paragraphs. Maybe I could try to look into the most painful character encoding issues.