Skip to content

infallible new_root#7657

Open
SATVIKsynopsis wants to merge 13 commits intounicode-org:mainfrom
SATVIKsynopsis:collator_borrowed
Open

infallible new_root#7657
SATVIKsynopsis wants to merge 13 commits intounicode-org:mainfrom
SATVIKsynopsis:collator_borrowed

Conversation

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor

Fixes #6633

This PR adds a const, infallible CollatorBorrowed::new_root() constructor backed by baked root collation data.

The implementation mirrors the invariants enforced in try_new() but avoids locale resolution and fallible paths.

Note on diacritics:
Root diacritics are included explicitly to match the invariants expected by the comparison pipeline and the existing baked data runtime path. also, there is no separate root specific baked diacritics singleton, so the default baked diacritics payload is used.

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

I hit one design question while making up CollatorBorrowed::new_root that root collation requires diacritics data to satisfy the comparison invariants, but CollationDiacriticsV1 isn’t currently exposed as a baked singleton unlike others.

For new_root, should diacritics be treated as a singleton baked value, or loaded via locale fallback even in the root only path?

I paused changes here to align with the intended data model.

@hsivonen
Copy link
Copy Markdown
Member

@robertbastian Does databake support making a const accessor for the root diacritic table even if we also have a locale override for it for Vietnamese?

@robertbastian
Copy link
Copy Markdown
Member

Yes, like so

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

Thanks. I will wire this up using expose_baked_consts = true for CollationDiacriticsV1 and update the PR.

Copy link
Copy Markdown
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please put this behind an unstable feature, I don't think we have decided on the name of the method, and whether it should live on Collator or CollatorBorrowed.

Comment on lines +849 to +852
/// This constructor is:
/// - const
/// - infallible
/// - locale independent
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is all obvious from the signature, I don't think we need to repeat this

you do need to add the "enabled with feature" snippet to the docs though

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gated it behind unstable and added feature snippet in the docs.

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

SATVIKsynopsis commented Feb 16, 2026

Thanks. one clarification on the remaining ci failures.

even with new_root() gated behind unstable, ci still builds the full suite and fails.

should this be resolved by either enabling expose_baked_consts = true for CollationDiacriticsV1, or using a different access pattern for root diacritics in a const context.

happy to adjust the implementation once the intended approach is confirmed.

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

Gentle follow up on this: CI is still failing due to SINGLETON_COLLATION_DIACRITICS_V1 not being available on provider::Baked, even with new_root() gated behind unstable.

To unblock the PR, should I, enable expose_baked_consts = true for CollationDiacriticsV1, or refactor to avoid requiring root diacritics in a const context?

Happy to implement whichever direction you prefer.

@robertbastian
Copy link
Copy Markdown
Member

To unblock the PR, should I, enable expose_baked_consts = true for CollationDiacriticsV1

yes, I already said that last week and you already said that you were going to do it

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

Got it. thanks for confirming.
I will enable expose_baked_consts = true for CollationDiacriticsV1 and update the PR accordingly.

@SATVIKsynopsis SATVIKsynopsis requested a review from a team as a code owner February 24, 2026 15:47
@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

After investigation, expose_baked_consts = true doesn't generate SINGLETON_COLLATION_DIACRITICS_V1 for CollationDiacriticsV1 because it has locale variants and uses a trie based lookup rather than a singleton.
I have two approaches, hardcode the root diacritics bytes directly as a const in new_root() or different approach you may prefer.

Happy to adjust to suggested changes.

@robertbastian
Copy link
Copy Markdown
Member

it doesn't generate a singleton but it generates a const for each locale. have a look at the generated code

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

SATVIKsynopsis commented Feb 24, 2026

I checked the generated data, CollationDiacriticsV1 is non singleton and only accessible via the provider and it has no per locale const like jamo. because of that, new_root() can’t pick und diacritics directly.

should new_root() stay const and defer diacritics loading, or is it ok to make it non const and load diacritics via the provider.

thanks

@hsivonen
Copy link
Copy Markdown
Member

I think to achieve the goal of the issue, new_root should load diacritics as const, so the diacritics of for root need to become const-reachable somehow. ("How" is a question for @robertbastian rather than me.)

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

Thanks, that makes sense.

I agree that new_root should remain const, which implies root diacritics need to become const-reachable.

I will wait for guidance on the preferred mechanism before making further changes.

@robertbastian
Copy link
Copy Markdown
Member

I checked the generated data, CollationDiacriticsV1 is non singleton and only accessible via the provider and it has no per locale const like jamo. because of that, new_root() can’t pick und diacritics directly.

because you haven't regenerated the data since you changed the config. are you aware that CI is failing?

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

I have regenerated the required baked data for CollationDiacriticsV1 and updated new_root() to use COLLATION_DIACRITICS_V1_UND.

The remaining CI failures are in clippy / docs / msrv / full datagen. clippy is currently complaining about missing_docs coming from the generated impl_collation_diacritics_v1!(Baked) constants.

before allowing missing_docs for these generated baked constants, I wanted to check that is this the expected fix here, or is there a preferred pattern for documenting non singleton baked data?

@robertbastian
Copy link
Copy Markdown
Member

I don't have the capacity to answer questions for every tiny step. just do something, we can then see if it's ok

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

I don't have the capacity to answer questions for every tiny step. just do something, we can then see if it's ok

I understand. i will come up with a fix and do this right away.

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

SATVIKsynopsis commented Feb 26, 2026

CI checks are passing now after regenerating the out of sync generated data.

docs, datagen, and msrv are green. really sorry for the earlier back and forth, and thanks for your patience, really appreciate your time.

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

The remaining failing check is full-datagen, which reports out of sync baked data for the new CollationDiacriticsV1 marker.

Running cargo make bakeddata locally is disk heavy on my setup, if someone with the full datagen environment could regenerate and push the updated baked data, that should resolve the final CI failure.

@sffc
Copy link
Copy Markdown
Member

sffc commented Feb 27, 2026

cargo make bakeddata collator runs it for just that component.

@SATVIKsynopsis
Copy link
Copy Markdown
Contributor Author

SATVIKsynopsis commented Feb 27, 2026

Added the baked data for collator.

While fixing clippy/doc/msrv failures, I noticed the generated CollationDiacriticsV1 constants differed from existing patterns like jamo etc and updated the exporter to generate consistent doc(hidden) annotations so generated data satisfies repo lint rules.

This change requires regenerating the full baked data. I am disk constrained locally and can’t run cargo make bakeddata, so if someone with the full datagen setup could regenerate and push the updated baked data, that should resolve the remaining CI failure.

Happy to adjust if this isn’t the preferred approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider making root-collation CollatorBorrowed const-constructible from baked data

4 participants