GH-46369: [C++] Add key-value pairs to the FileSystemFactory interface and to S3Options#50044
GH-46369: [C++] Add key-value pairs to the FileSystemFactory interface and to S3Options#50044raulcd wants to merge 12 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends the C++ filesystem factory plumbing to accept an additional set of backend-specific key/value options (in addition to the filesystem URI), aligning with GH-46369’s goal of separating reusable URIs from user-specific configuration.
Changes:
- Extends
FileSystemFactoryto acceptoptionsand adds compatibility construction for existing factories. - Adds new
FileSystemFromUrioverloads that acceptoptions, and threads them through the factory registry call path. - Updates registered filesystem modules/tests (examplefs, S3, localfs tests) to compile with the new factory signature.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/arrow/testing/examplefs.cc | Updates example filesystem factory signature and forwards options to nested FileSystemFromUri call. |
| cpp/src/arrow/filesystem/s3fs.cc | Updates S3 registered factory signature to include options (currently unused). |
| cpp/src/arrow/filesystem/localfs_test.cc | Disambiguates nullptr factory registrations under the new overloaded FileSystemFactory constructors. |
| cpp/src/arrow/filesystem/filesystem.h | Introduces options-aware FileSystemFactory + new FileSystemFromUri overloads and documentation. |
| cpp/src/arrow/filesystem/filesystem.cc | Implements the new FileSystemFromUri overloads and forwards options into the registry-based factory dispatch. |
Comments suppressed due to low confidence (1)
cpp/src/arrow/filesystem/filesystem.cc:925
- The new
optionsargument is only forwarded to registered factories; built-in scheme handling (abfs/gcs/hdfs/mock) ignores it entirely. This makesFileSystemFromUri(..., options, ...)silently drop backend-specific options for these schemes. Consider either (a) plumbingoptionsinto the built-in scheme option parsing / constructors, or (b) rejecting non-emptyoptionsfor schemes that don’t support them yet to avoid surprising behavior.
Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(
const Uri& uri, const std::string& uri_string,
const std::vector<std::pair<std::string, std::any>>& options,
const io::IOContext& io_context, std::string* out_path) {
const auto scheme = uri.scheme();
{
ARROW_ASSIGN_OR_RAISE(
auto* factory,
FileSystemFactoryRegistry::GetInstance()->FactoryForScheme(scheme));
if (factory != nullptr) {
return factory->function(uri, options, io_context, out_path);
}
}
if (scheme == "abfs" || scheme == "abfss") {
#ifdef ARROW_AZURE
ARROW_ASSIGN_OR_RAISE(auto options, AzureOptions::FromUri(uri, out_path));
return AzureFileSystem::Make(options, io_context);
#else
return Status::NotImplemented(
"Got Azure Blob File System URI but Arrow compiled without Azure Blob File "
"System support");
#endif
}
if (scheme == "gs" || scheme == "gcs") {
#ifdef ARROW_GCS
ARROW_ASSIGN_OR_RAISE(auto options, GcsOptions::FromUri(uri, out_path));
return GcsFileSystem::Make(options, io_context);
#else
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@pitrou @kou I've started working on this to hopefully move S3 filesystem out of libarrow and into its own shared library. For the scope of the S3 options (on this PR or a follow up one), would you mimic all S3Options as part of the new k/v pair or only a small subset, example initially only region and credentials? |
|
@pitrou any thoughts about the scope for this PR, as shared above:
|
|
How many S3 specific options do we have? BTW, do we need to use |
|
Thanks @kou for the review, I really appreciate you taking the time!
I counted around 13 that we probably don't want to express as part of the URI (credentials, assume-role params, custom providers, connect/request timeouts, a few behavior flags, default_metadata, retry_strategy, sse_customer_key).
I can do that, I basically started a follow up PR on top of this one where I started working towards adding them: I wanted to have this in isolation for easier initial review but I am happy to push a single PR or add some of those cases here.
The There's also credentials_provider The cost is the type-erasure across the shared-library boundary, which I am trying to validate it won't be an issue for complex types on the other PR I pointed above. I've only tested But now that I think more about the added complexity, I am unsure it is worth the hassle because:
|
|
|
…f FileSystemFromUri, do not silently ignore non-empty options if underlying FS does not support them and use FileSystemFactoryOptions instead of std::vector<std::pair<std::string, std::any>> everywhere
… tests for FileSystemFromUriAndOptions and add a new one to test old behaviour
Can |
|
I don't think it's a good idea to use it for things that do not get serialized. It imposes a string parsing or conversion, loses typing etc. |
Hi @kou I wasn't sure so I decided to take some time to investigate it. I've created this commit (acfc939) on top of this PR to test So, in my opinion it is easy to use for our bindings, what are your thoughts on it? Note: A minor nit, we seem to have an |
|
Thanks for trying it. OK. Let's use |
Fix linting after rebase Add s3 fs module test for FromUriAndOptionsCredentials and refactor S3Options::FromUri and S3Options::FromUriAndOptions Add tests for FromUriAndOptions credentials Add S3RetryStrategy to options Add option for default_metadata Minor fix
|
@bkietz if you happen to have some spare cycles and can take a look |
Rationale for this change
From the issue description:
What changes are included in this PR?
FileSystemFactoryOptions, added toFileSystemFromUriAndOptionswhich allows to pass viaFileSystemFromUriRealthroughFileSystemFactory::functions3andlocal/filefactories are migrated to the new 4-arg signature.NotImplementedrather than silently dropping them.examplefsnow reads two typed options and appends them to the output path, exercised byarrow-filesystem-test, provingstd::anysurvives thelibarrowtoarrow_filesystem_example.soboundary.ci/scripts/cpp_build.shnow forwardsARROW_S3_MODULEto CMake. The conda-cpp image was already settingARROW_S3_MODULE=ON, but the flag was never passed through, sos3fs_module_test(added in GH-40343: [C++] Move S3FileSystem to the registry #41559) had not actually been built or run in CI. This activates it.access_key,secret_key,session_token,retry_strategyanddefault_metadataasFileSystemFactoryOptionsonS3Options::FromUriAndOptions.Are these changes tested?
Yes, new tests added to validate a dynamically-loaded filesystem is able to receive options via
std::anyand tests validating rejection if options passes to Filesystems that don't support it. Tests also added forS3Options::FromUriAndOptionsreceiving the newFileSystemFactoryOptionsand usage on separate s3fs moduleAre there any user-facing changes?
New public
FileSystemFromUriOptionstaking an options list. Existing ones are unchanged.