Words about Arrays and Tables

57 points by todsacerdoti 21 hours ago

Animats 18 hours ago

I get the feeling this guy likes Lua. Lua combines arrays and dicts into "tables", which are usually indexed from 1 but can be indexed by any type as a dict. He never mentions Lua, though.

Multidimensional arrays are a blind spot for modern language designers. Not arrays of arrays, as this author points out. FORTRAN had multidimensional arrays, but C didn't. Go and Rust don't have them. Part of the trouble is slices. Once you have slices, people want slicing in multidimensional arrays along any axis, which complicates the basic array representation. Discussions in this area dissolved into bikeshedding and nothing got done for Go and Rust.

zahlman 17 hours ago

> Lua combines arrays and dicts into "tables", which are usually indexed from 1 but can be indexed by any type as a dict.
I think it's more accurate to just call them hash tables ("dicts") that happen to support some very limited array-like functionality. "Appending" still looks like key insertion if you don't use a named method (https://stackoverflow.com/questions/27434142); `ipairs` misses those non-integer keys but also misses non-contiguous keys, or even a zero key (https://stackoverflow.com/questions/55108794); the `#` operator is not well defined once you add those non-contiguous keys (https://stackoverflow.com/questions/2705793) (so you can't actually cleanly override the choice to "start from 1"); slicing etc. aren't provided as operators (https://stackoverflow.com/questions/24821045); etc.
layer8 16 hours ago

C and C++ do have multidimensional arrays [0], in the sense that they are arrays of arrays where the inner arrays all have the same fixed size, encoded into the multidimensional array type. Likewise in Go and Rust. This is different from, say, Java, where if you have an array of arrays, the inner arrays can all have different sizes, and there is no array type that could express that they shouldn’t.
Being able to express slices across arbitrary dimensions natively is another matter.
[0] https://en.cppreference.com/w/cpp/language/array.html#Multid...
marcosdumay 16 hours ago

> Go and Rust
The trade-offs are way too severe for people to agree to any implementation in a low level language like Rust or one that wants to be low level like Go.
We could get them in something like Python.

zahlman 18 hours ago

2000 words, specifically. Thanks again, title filter. Even just "Arrays and Tables" would have been better.

I sincerely believe this aspect of the filter is a misfeature. Submissions that have bad reasons to put a number in the title are generally submissions that are just bad (and should be flagged) anyway.

mvc 21 hours ago

XSLT/XPath is an example of a platform that provides multiple axis through which to access your data structure.

https://developer.mozilla.org/en-US/docs/Web/XML/XPath/Refer...

teleforce 13 hours ago

>2000 words about arrays and tables

Merely 2000 words, we have a full complete book for that [1].

Joking aside, D4M has seamlessly combined spreadsheet, table, database and graph concepts based on associative array mathematics [2].

On one extreme people are going to bolt on everything on Postgresql database, and another extreme of integrating clunky disparate systems, D4M is a breath of fresh air that is based on mathematics not unlike the venerable SQL relational database concepts [3].

[1] Mathematics of Big Data Spreadsheets, Databases, Matrices, and Graphs:

https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...

[2] D4M: Dynamic Distributed Dimensional Data Model:

https://d4m.mit.edu/

[3] Just Use Postgres for Everything: How to reduce complexity and move faster:

https://www.amazingcto.com/postgres-for-everything/

[4] Just Use Postgres for Everything (430 comments):

https://news.ycombinator.com/item?id=33934139

nyrikki 19 hours ago

One thing that may help is to drop the typing for generic words or pidgin holes, that will help with the FORTRAN example.

The IBM 704 had 3 index registers, which could be combined together with a bitwise OR, then was subtracted from a base address, that's where FORTRAN for its 7D arrays from.

When you mix that concept of decrementing index registers with hypercubes, you will naturally get to the extract format of BI tools like classic Tableau.

mpweiher 19 hours ago

This might be interesting:

Structuring Arrays with Algebraic Shapes

https://dl.acm.org/doi/abs/10.1145/3736112.3736141

https://news.ycombinator.com/item?id=44399757

kccqzy 20 hours ago

I'm not sure what's the insight in this article here. All the things mentioned seem pretty straightforward. But I'll comment on a few:

> Really any finite set can be a "dimension".

This is absolutely true and yet typical programming languages don't do a good job of unleashing this. When is the last time you wish you could define an array where the indices are from a custom finite set, perhaps a range 100 to 200, perhaps an enum? I only know of two languages, Pascal and Haskell, that are sufficiently flexible. All other languages insist that indices for arrays must be integers either starting from 0 or 1. Haskell has a mechanism to let you control what is an array index: https://hackage.haskell.org/package/base-4.18.1.0/docs/Data-... The lack of this ability makes programmers use more general hash maps by default and therefore leave performance on the table.

> We can also transform Row -> PowerSet(Col) into Row -> Col -> Bool, aka a boolean matrix.

I mean sure. Does the author know type signatures and their operations like this are isomorphic to algebraic operations on size of the possibly infinite sets? The function A -> B has exactly |B|^|A| implementations so that's why Col -> Bool is in this sense "same" as PowerSet(Col). And the function operator associates to the right so adding Row -> doesn't change anything. Wait till you learn why sum types are called sum types and why product types are called product types. Here's another thing that might be mind-blowing: arrays of an unknown length can be thought of as the infinite union of arrays of lengths of all natural numbers: Array<T, 0> | Array<T, 1> | … so using the above notation the cardinality is 1 + |T| + |T|^2 + …; but they can also be represented as a standard linked list List<T>=Nil|(T, List<T>) and using the above notation we have |List<T>|=1+|T|*|List<T>| which simplifies to |List<T>|=1/(1-|T|). And guess what, the Taylor expansion of that is just the earlier 1 + |T| + |T|^2 + … precisely.

Jtsummers 20 hours ago

> I only know of two languages, Pascal and Haskell, that are sufficiently flexible.
Julia, Fortran, some BASICs allow custom integer ranges.
Ada allows you to use any discrete range (integer, character, enums).
https://en.wikipedia.org/wiki/Comparison_of_programming_lang... - A more comprehensive list
The two columns of interest are "Specifiable index type" and "Specifiable base index".
- cestith 19 hours ago
  
  Older versions of Perl allowed you to set a lexically-scoped array base integer. It defaulted to zero and the docs mentioned 1-based arrays, but IIRC it could be any integer. This was deprecated several years ago in 5.12 as a harmful practice. Specifying using features of a Perl version of 5.16 or newer actually makes assigning to it (except for 0) a compile-time error.
  Interestingly, though, Perl’s native arrays are decidedly not an array of primitives. They are arrays of scalar values, and a scalar value can not only contain a character, a string, an integer, a float, or a reference to another item (scalar, hash, array, blessed object, filehandle, etc) but it can return different values when called in a numeric context vs a string context. There are certainly ways to get at primitive values, but it’s not in Perl’s native array semantics.
- pklausler 18 hours ago
  
  Non-default lower bounds in Fortran are a famous pitfall, however; they don't persist in many cases where one might think that they do, and are not 100% portable across compilers.
zahlman 17 hours ago
> When is the last time you wish you could define an array where the indices are from a custom finite set, perhaps a range 100 to 200, perhaps an enum?
> ...
> The lack of this ability makes programmers use more general hash maps by default and therefore leave performance on the table.
You can simulate this in any language by hash-mapping the custom set elements to indices. (And in the general case, you won't be able to determine the indices faster than a hash table lookup.)
In fact, I imagine there are languages and implementations where hash table lookup is the fastest way to map 'a':'z' to 0:25.
In fact, let me try a (surely totally bogus) micro-benchmark right now:
```
  $ python -m timeit --setup 'lookup = dict(zip("abcdefghijklmnopqrstuvwxyz", range(26)))' '[lookup[x] for x in "abcdefghijklmnopqrstuvwxyz"]'
  500000 loops, best of 5: 884 nsec per loop
  $ python -m timeit '[ord(x) - 97 for x in "abcdefghijklmnopqrstuvwxyz"]'
  200000 loops, best of 5: 1.21 usec per loop
```
Huh.
...And surely if you use the hash map directly, instead of using it to grab an index into another container, overall performance only gets better (unless maybe you need to save per-instance memory, and you have multiple containers using the same custom index scheme).
Hash tables are a pretty neat idea.

galaxyLogic 19 hours ago

The characteristic feature of Arays is that all but the last element have their next element and all but the first element have their previous element. So what would be the value of having or using indexes other than Integers to index Arrays?

You can also define integer-valued named constants, and use those if you prefer.

larrik 19 hours ago

That sounds more like a linked list than an array.
In C, an array is just contiguous bytes, and you reach an index through math (starting location + (index * size)).
Pretty sure lots of languages still do this to some degree, as your lookup times are effectively zero, though altering the number of elements is way more complex/expensive than in a linked list.
- deepsun 19 hours ago
  
  > and you reach an index through math (starting location + (index * size)).
  By the way, that's the only reason why C-derived languages use unintuitive zero-based indexing. There's really no other reason to call the first element a[0] instead of a[1].
  - dpassens 17 hours ago
    
    Ah, but is it unintuitive though? Once one understands that in the vast majority of cases, C deals in pointers, not arrays, it makes perfect sense that indexing uses an offset rather than the nth element.
  - tom_ 15 hours ago
    
    The index of the first element is always 0 plus the index of the first element. This means it sort-of doesn't matter what the index of the first element of the array is - but, viewed another way, it's an excellent argument for 0. Starting at 0 means you don't need to add a constant in the (very common) case of dealing with a subarray that starts at the first element of the containing array.
Jtsummers 19 hours ago
> So what would be the value of having or using indexes other than Integers to index Arrays?
It removes a level of indirection, wasted space, and mental overhead. Suppose you have some table of 'a'..'z' -> whatever. What are your options?
- Switch/case. This may be optimized and fast, but that's not guaranteed across all languages and compilers.
- Hash map. This introduces memory overhead even if it is technically O(1) for access times.
- If in C, you could do:
```
  T* table = (T*)malloc(sizeof(T) * 26) - 'a'
```
So that the characters can be used naturally without later modification (not a feature in every language, though).
- Manually calculate each offset every time:
```
  table[c - 'a'] = ...
```
With language appropriate type conversions, if needed.
Or you can allow any custom range to work, and do the math in that last option automatically because it's trivial for a compiler to figure it out and it removes unnecessary mental overhead from the program reader and writer.
```
  table['a']
```
Just works.
> You can also define integer-valued named constants, and use those if you prefer.
Because everyone wants to see `lower_case_a = 0` and the 25 other cases in their code along with all other possible constants. At least use an enum, sane languages even make them type-safe (or at least safer).
mcphage 18 hours ago

> The characteristic feature of Arays is that all but the last element have their next element and all but the first element have their previous element.
Where are you getting that from? That's not the characteristic feature of Arrays, that's the characteristic feature of Doubly Linked Lists.
- galaxyLogic 6 hours ago
  
  Use the index 'n' of any element of the array. Unless that is the first element you can always take the elment n - 1 of the same array. That is the "previous element" of the elment at n. And so on.
  I'm not saying that if you have value of any element you could derive the value of the previous or next element from it. I'm saying "it has a previous element" (unless it is the 1st element) meaning there is such a previous element and it is also possible to access that previous element if you know the index of the current element under investigation.
- zahlman 18 hours ago
  
  I wouldn't call it a characteristic feature of either. Any ordered, finite sequence of elements has those properties. Doubly linked lists simply offer O(1) access to "next elements" and "previous elements" given an element.