Add missing DataFrame methods (set operations and query) #1455

@timsaucer

Summary

Several DataFrame methods from upstream DataFusion v53 are not yet exposed in datafusion-python. This issue covers set operations and query-related methods.

Missing Methods

Set operations:

  • distinct_on — deduplicate rows based on specific columns, keeping the first row per group
  • except_distinct — set difference with deduplication (complement to existing except_all)
  • intersect_distinct — set intersection with deduplication (complement to existing intersect)
  • union_by_name — union two DataFrames matching columns by name rather than position
  • union_by_name_distinct — union by name with deduplication
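To make the intended semantics of the set operations above concrete, here is a minimal pure-Python sketch over lists of dicts. It does not use datafusion-python or the upstream Rust API. The helper names simply mirror the method names in this issue, and the behaviors shown are assumptions based on usual SQL semantics and the descriptions above: first row kept per group, bag versus set difference, and missing columns filled with None when unioning by name.

```python
from collections import Counter


def _key(row):
    # Hashable, order-independent identity for a row (dict of column -> value).
    return tuple(sorted(row.items()))


def distinct_on(rows, keys):
    """Keep the first row seen for each distinct combination of `keys`."""
    seen, out = set(), []
    for row in rows:
        k = tuple(row[c] for c in keys)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


def except_all(left, right):
    """Bag difference: each right row cancels at most one matching left row."""
    remaining = Counter(_key(r) for r in right)
    out = []
    for row in left:
        k = _key(row)
        if remaining[k] > 0:
            remaining[k] -= 1
        else:
            out.append(row)
    return out


def except_distinct(left, right):
    """Set difference: drop left rows found in right, then deduplicate."""
    right_keys = {_key(r) for r in right}
    seen, out = set(), []
    for row in left:
        k = _key(row)
        if k not in right_keys and k not in seen:
            seen.add(k)
            out.append(row)
    return out


def union_by_name(left, right):
    """Concatenate rows, aligning columns by name rather than position.

    Assumption: columns missing on one side are filled with None.
    """
    columns = []
    for row in left + right:
        for c in row:
            if c not in columns:
                columns.append(c)
    return [{c: row.get(c) for c in columns} for row in left + right]


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
first_per_id = distinct_on(rows, ["id"])

left = [{"x": 1}, {"x": 1}, {"x": 2}]
right = [{"x": 1}]
bag_diff = except_all(left, right)
set_diff = except_distinct(left, right)

by_name = union_by_name([{"a": 1, "b": 2}], [{"b": 3, "a": 4}])
```

In this sketch, except_all leaves one copy of the duplicated row behind while except_distinct removes it entirely, which is the distinction the two method names are meant to capture. The actual bindings would delegate to the upstream DataFrame methods rather than reimplement any of this.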

Query/display:

  • explain_with_options — explain plan with configurable detail options
  • show_limit — display results with a custom row limit
  • sort_by — sort by column names (a simpler API than sort, which requires Expr)
  • with_param_values — bind parameter values for prepared statements

Upstream Reference

Implementation

  • Rust bindings: crates/core/src/dataframe.rs
  • Python wrappers: python/datafusion/dataframe.py

Note: This gap analysis was performed using an AI agent comparing upstream DataFusion v53 documentation against the current datafusion-python codebase.
