Monday, December 21, 2015

Rust is proving very difficult to learn

Once upon a time, I wrote and shipped C code used by millions of users.

I now write Java code professionally and have shipped code to even more users than that. (In fact, in the almost ancient past, I worked on Blogger itself, but I've gotten way better at Java since then.)

However, Rust is proving kind of difficult for me to learn. See my previous post where I did write a pretty simple Rust program, to only print unique lines from stdin. This "unique" program is my "hello world" as it is a useful real world program.

Since I have lots of experience writing compilers (back in my C days and at "university"), I thought it would be interesting to create an interpreted language using Rust. The problem itself is very well known to me, so writing an interpreter would give me a chance to write a real Rust program where I could concentrate on learning the language rather than on learning a new problem domain.

So, I've spent maybe 3-4 hours writing a "tokenizer" and now I'm trying to write a parser (a recursive descent parser) in Rust. These additional 2 hours have not gotten me very far - basically nothing to show for it except questions!

Whereas the tokenizer went ok (mostly because I found this: https://github.com/servo/rust-cssparser/blob/master/src/tokenizer.rs), the parser is proving much more difficult!

What I'm going to do now is some rubber duck debugging and let you guys see how this works out. (See https://en.wikipedia.org/wiki/Rubber_duck_debugging ).


Question 1

Why do I need to use Token::WhiteSpace instead of just WhiteSpace in a match statement (for an enum) like the code I was looking at?

I've now answered this question thanks to the rubber duck: I was simply missing this critical construct:

use self::Token::*;

What's kind of funny is that in Java, when you switch on an enum, you are forced to use the bare enum constant name and can't fully qualify it even if you wanted to. Here I was expecting the language with better "inference-ing" to do this on my behalf (because an example or two looked like it did). This isn't a major criticism, though certainly a minor one. (Rust seems to use scope in lots of ways to make stuff less ambiguous, and this probably fits well with how the language designers thought about it. It's just unusual coming from Java.)

By the way, that use statement at first was a disaster! I got a whole bunch of errors I didn't have before when using it.

Hmm, why:

Before:

#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    WhiteSpace(Box<String>),
    Comment(Box<String>),
    Delimiter(Box<String>),
    Symbol(Box<String>),
    String(Box<String>),
    Word(Box<String>),
    Error(Box<String>),
}

After:

use self::Token::*;

#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    WhiteSpace(Box<String>),
    Comment(Box<String>),
    Delimiter(Box<String>),
    Symbol(Box<String>),
    String(Box<String>), // Doh! Look at the name of the token!
    Word(Box<String>),
    Error(Box<String>),
}

Obviously I should rename the enum value String to, say, StringLiteral so that the standard String type doesn't get confused with my enum value.

It seems like "technically" I didn't need to make this change.

Fixed!
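
To make the collision concrete, here is a minimal sketch of the renamed enum with the glob import in place. The variant list is trimmed to keep it short, and `describe` is a hypothetical helper that isn't in my program; it's only there to show variant names used without the `Token::` prefix.

```rust
// Glob-import the variants so match arms can say `Word` instead of `Token::Word`.
// With the variant renamed to `StringLiteral`, the name `String` in this module
// refers to the standard library type again.
use self::Token::*;

#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    WhiteSpace(Box<String>),
    StringLiteral(Box<String>),
    Word(Box<String>),
}

// Hypothetical helper, only here to show unqualified variant names in a match.
pub fn describe(token: &Token) -> &'static str {
    match token {
        WhiteSpace(_) => "whitespace",
        StringLiteral(_) => "string literal",
        Word(_) => "word",
    }
}
```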


Question 2

I was actually hoping that fixing that would make a world of difference, but it really didn't.

My current compiler error is:

main.rs:19:15: 19:20 error: cannot move out of borrowed content
main.rs:19         match *self {
                         ^~~~~
main.rs:26:25: 26:26 note: attempting to move value to here
main.rs:26             Token::Word(x) => *x == keyword
                                   ^
main.rs:26:25: 26:26 help: to prevent the move, use `ref x` or `ref mut x` to capture value by reference
error: aborting due to previous error
Could not compile `xyz`.

I'm sort of wondering if this isn't part of the issue:

https://doc.rust-lang.org/book/box-syntax-and-patterns.html

Specifically, maybe using #![feature(box_syntax, box_patterns)] would help?

Here's what I am trying to get working:

impl Token {
    pub fn is_keyword(&self, keyword: String) -> bool {
        match *self {
            WhiteSpace(_) => false,
            Comment(_) => false,
            Delimiter(_) => false ,
            Symbol(_) => false,
            StringLiteral(_) => false,
            Error(_) => false,
            Word(x) => *x == keyword
        }
    }
}

Let me put this in "high level terms" -- I want a convenient way to see if a particular token is a word that happens to match the passed in String.


Question 2 Interlude

BTW, there is another more radical option: I could actually add lots of enum values to Token - one for each keyword I expect to see:

#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    WhiteSpace(Box<String>),
    Comment(Box<String>),
    Delimiter(Box<String>),
    Symbol(Box<String>),
    String(Box<String>),
    KeywordIf,               // At tokenization time, I could recognize
    KeywordElse,             // each keyword and add its value here
    KeywordSet,              //
    Word(Box<String>),
    Error(Box<String>),
}

This would potentially make them far easier to match against later, may make them smaller to store, etc. Maybe even this is the ideal representation - I obviously don't know yet - I tend to favor the least number of "things" at first. In modeling this hypothetical language, I actually only want a keyword to matter if it is the first token in a "command", so "if" should be left as a simple Word sometimes, hence my original model seems right.
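
If I did go the keyword route, the tokenizer side might look something like the sketch below. `classify_word` is a hypothetical function name, the variant list is trimmed, and the keyword variants are written as plain unit variants.

```rust
#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    KeywordIf,
    KeywordElse,
    KeywordSet,
    Word(Box<String>),
}

// Hypothetical helper: the tokenizer would call this once it has scanned a
// complete word, turning known keywords into their own variants.
pub fn classify_word(text: &str) -> Token {
    match text {
        "if" => Token::KeywordIf,
        "else" => Token::KeywordElse,
        "set" => Token::KeywordSet,
        // Anything else stays an ordinary word.
        _ => Token::Word(Box::new(text.to_string())),
    }
}
```

The catch is the one I just described: this turns every "if" into a keyword, even when it isn't the first token in a command.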


Question 2 continued...

In Java, what I really want is something like this:

if ((x instanceof Word) && ((Word) x).getString().equals("if")) {
  // blah blah blah
}

This is obviously "runtime" Java since I'm using "instanceof".

In Rust, I'm expecting the enum Token to act something like a "tagged union". 

Oh, looks like this compiles:

impl Token {
    pub fn is_keyword(self, keyword: String) -> bool {
        match self {
            WhiteSpace(_) => false,
            Comment(_) => false,
            Delimiter(_) => false ,
            Symbol(_) => false,
            StringLiteral(_) => false,
            Error(_) => false,
            Word(x) => x == Box::new(keyword)
        }
    }
}

In this case, maybe I'm forcing the keyword string onto the heap even though that is kind of the opposite of what I would expect to have to do...

BTW, is == even the right thing to use here? I don't know. Let's find out.

I basically did this:

    for token in tokens {
        println!("Token {:?}", token);
        println!("Function is export {:?}", token.is_keyword(
            "export".to_string()));
    }

So this works great "functionally" - my program compiles and gives the correct result. I'm not really all that concerned about the performance impact just yet; I'm more worried about how badly I know Rust right now.
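
For the record, the compiler's `ref x` hint from the first error also seems to be a way out, and it avoids both consuming the token and boxing the keyword. This is a sketch under assumptions rather than what my program currently does: the variant list is trimmed, and taking the keyword as `&str` instead of `String` is a change of my own.

```rust
use self::Token::*;

#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    StringLiteral(Box<String>),
    Word(Box<String>),
    Error(Box<String>),
}

impl Token {
    // Borrow the token (`&self`) so the caller can keep using it afterwards.
    pub fn is_keyword(&self, keyword: &str) -> bool {
        match *self {
            // `ref x` binds x as `&Box<String>` instead of moving the box out
            // of the borrowed token, which is what the compiler objected to.
            Word(ref x) => x.as_str() == keyword,
            // Every other kind of token is never a keyword.
            _ => false,
        }
    }
}
```

With this version the call in the loop would become `token.is_keyword("export")`, with no `to_string()` and no extra `Box::new`.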


Where should I go next?

(Rubber ducky), where should I go next?
  1. keep with it! - let's create some parse trees and learn more lessons along the way - the possible performance penalty of the extra boxing for the current solu-kludgetion I finally found probably won't matter for what I'm after anyway (using my interpreted language!) and is therefore second order
  2. switch to another language - I can re-evaluate the reasons I was using Rust in the first place - like stand-alone executables - and choose another language.
    1. Swift is open source and available on Linux now... It's suddenly a new choice. 
    2. C was rejected because I don't understand its unicode story and other stories, but I like C a lot in many ways
    3. C++ was rejected because it was too complex, even if I could get a better unicode story by using, say, an external library
    4. Go, and other garbage collected languages were rejected because I don't understand their "fork" story though maybe with modern posix I won't need to actually fork
    5. Go doesn't have sexy enums like Rust (nor traits, nor generic collections) which maybe I don't really need anyways
    6. Most other languages were rejected because they don't have good stories about stand-alone executables and utf8. There are some Scheme systems that will compile to C but they don't come with great utf8 stories (yet?)
  3. BTW, I would probably convert to Javascript if there was a stand-alone executable generator from the Node JS guys - and that supported all the node process stuff - even if the default executable were a bit on the heavy side.
So, I guess let's keep up the spirit of pushing me to learn Rust more. I'll just pretend I'm playing an advanced version of Sudoku. At the very least, this will help me better understand the good and bad parts of Rust and help others make similar decisions.

BTW, I think I need to learn how to write unit tests in Rust that don't compile into the main binary - I see "annotations" on functions that say they are tests, so this is already thought out - probably something "cargo" can do once I use the Google or Bing search engines some more.
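
From what I can tell, the usual pattern is a `tests` module marked `#[cfg(test)]`, which is only compiled when running `cargo test` and so stays out of the normal binary. A minimal sketch, with a made-up function `is_blank` standing in for real code:

```rust
// Hypothetical function under test.
pub fn is_blank(line: &str) -> bool {
    line.trim().is_empty()
}

// Only compiled for `cargo test`; left out of a normal `cargo build`.
#[cfg(test)]
mod tests {
    use super::*;

    // The #[test] attribute marks a function for the test runner to call.
    #[test]
    fn blank_lines_are_detected() {
        assert!(is_blank("   \t"));
        assert!(!is_blank("line 0"));
    }
}
```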







Saturday, December 5, 2015

Wrote my first Rust Program

Thought I would share my first Rust program, along with some thoughts about the experience.

This is a version of a Python script that I actually use about once a month. The goal of the program is to read all of the lines from stdin and print only the first occurrence of each one, so that you end up with just the unique lines.

For example, given a file that looks like this:

line with some stuff
line 0
line 1
line 1
line 0
line 2

The program should just print:

line with some stuff
line 0
line 1
line 2

Note, there is a standard Unix utility called "uniq" that does something partially similar, but it only filters out duplicate lines when they are adjacent. uniq is typically used in conjunction with sort, which destroys the original order of the lines, so with this example, "line with some stuff" would end up at the bottom instead of at the top.

Here is the "finished" program:

use std::io;
use std::io::prelude::*;
use std::collections::HashSet;

fn main() {
    let mut seen = HashSet::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        // For "fun", try removing the & to see what the compiler error message is
        if !seen.contains(&line) {
            // For "fun", try swapping these two statements to get a compile error
            println!("{}", line);
            seen.insert(line);
        }
    }
}

The resulting binary size for a "release" build (x86-64) is about 526K. Presumably this binary has no external dependencies so it would be easy to move to other Linux boxes with the same architecture.

Here are some thoughts on this experience:
  1. my most important "high level" observation is that type inference is bad for understanding what you "cargo culted" (this is a different use of the word cargo from the cargo tool that comes with Rust!). I didn't realize that stdin was actually protected by a "mutex", which now explains what lock() does. As for the statement "let line = line.unwrap();", I still don't understand it! What was the type before and after the unwrap() call?
  2. installation of the rust compiler on Linux Mint 17.2 was trivial using the provided instructions from rust-lang.org
  3. the cargo build command seems to follow the route of other new languages like Go in being "Java" like in not requiring writing a Makefile (even though compiling a single rust file seems pretty trivial too)
  4. I got at least one error message while trying to write this which wasn't "GNU" enough for Emacs's compilation mode to parse correctly (some kind of macro error looked like a file:line:column but file was <asdf>.x.y or whatever)
  5. Rust is fairly new but has been in use for a while by the kind of people that use Stack Overflow so there are some "non current" answers out there which slowed me down a bit
  6. Compared to trying to use Vala, I would say this was a positive experience tooling wise
  7. Compared to trying to use Go, I would say this experience was pretty similar to writing the same program in Go
  8. Compared to C, I would still be writing my own HashSet! (more likely I would have written some Bloom filter like code with dynamic chaining to larger Bloom filters as the load increased, just to be manly)
  9. It would be interesting to write the same program in Swift now that Swift is open source
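
On the unwrap() question in the first point: as far as I can tell from the standard library docs, `lines()` yields `io::Result<String>` values, and `unwrap()` turns each one into a plain `String` (or panics on a read error). The sketch below writes those types out by hand. `unique_lines` is a hypothetical function, and reading from any `BufRead` and returning a `Vec` instead of printing from stdin are changes of my own so that it's easy to try on a byte string.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};

// Hypothetical variant of the program that returns the unique lines in order.
pub fn unique_lines<R: BufRead>(reader: R) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out: Vec<String> = Vec::new();
    for line in reader.lines() {
        // Before unwrap(): a Result that may hold an I/O error instead of a line.
        let line: io::Result<String> = line;
        // After unwrap(): the String itself (this panics if the read failed).
        let line: String = line.unwrap();
        // contains() only needs to borrow the line, hence the `&`.
        if !seen.contains(&line) {
            // Clone for the output first; insert() then takes ownership of
            // `line`, so it can't be used after that call.
            out.push(line.clone());
            seen.insert(line);
        }
    }
    out
}
```

Passing a locked stdin handle as the reader should give the same lines the original loop prints.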

Why are you learning Rust?

This is a great question. An obvious short answer is that I am a programming language enthusiast, and learning new languages can help shape how I go about my day job and how I think about programming. Honestly though, I usually spend more time reading about alternative languages than actually sitting down to write code in them, so Rust has me intrigued enough to go the extra step.

The real answer is that I would like to write my own "shell" (shells are programs like bash, zsh, etc.) and using a non-VM oriented language seems paramount so I can have a minimal binary. Of course, I definitely want strong support for unicode (especially utf-8) in any language I touch these days which Rust seems to have.

Suggested exercises to help you learn Rust

If you are writing your first Rust program that isn't "hello world", I put some comments into the sample so you can see two real error messages I got while trying to "cargo cult" my way towards this program (no pun intended).

Disclaimer

These are my opinions and not the opinions of my employer, etc.