stdin/stdout, Large Data, etc.

ecbrown
Posts: 7
Joined: Fri May 18, 2012 3:11 am

stdin/stdout, Large Data, etc.

Post by ecbrown »

Hi Dave,

I have taken Rel for a spin, and I like it! I have been using the DBrowser to interact with a custom database specified with the -f switch.

Now, I wish to plan ahead for use-cases that I encounter at work. I would like to "data mine" gigabytes and potentially terabytes of data. It looks like Berkeley DB provides the persistence, and so I think that gigabytes are possible.

Q1. Is there anything I should watch out for? For example, I am tempted to pass -Xmx4g (to make Java use 4 gigabytes of heap space).

Q2. I would like to have another program synthesize the input to Rel, such as gigabytes of INSERT statements. Then I would like to use Tutorial D to query, and then parse the resulting output. Is there a mode of operation that uses a) pipes, such as stdin and stdout, and/or b) a network socket?

e.g.:

cat inserts.d | java -Xmx16g -jar Rel.jar -f/tmp/mydb > /dev/null
cat query.d | java -Xmx16g -jar Rel.jar -f/tmp/mydb > query.out

I can get this to work, but I would like to ask in case there is danger ahead: known limitations and whatnot.

Anyway, these are my questions. I am not much concerned with computing time or efficiency; I care mostly about making it possible, and as easy as possible, to get correct results from big data and then move the results down the pipeline.

Many compliments and thanks for this program!

Best regards,
Eric
Dave
Site Admin
Posts: 372
Joined: Sun Nov 27, 2005 7:19 pm

Re: stdin/stdout, Large Data, etc.

Post by Dave »

Sounds like an interesting project!

Regarding things to watch out for: yes, there is a risk of running out of memory. Most relational operators are internally pipelined, so they require little more memory than is needed to hold a few tuples at a time. However, if a tuple has relation-valued attributes, an entire attribute value may be held in RAM. Some operators (ORDER(), for example) and some internal mechanisms (such as one used to improve JOIN performance) will consume as much RAM as is required to hold an entire relation.
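
To illustrate, assuming a hypothetical relvar BigLog with attributes severity and timestamp:

BigLog WHERE severity > 3       // pipelined; only a few tuples need be in RAM at once
BigLog ORDER (ASC timestamp)    // must hold the entire result in RAM to sort it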

So, invoking Java with -Xmx4g is a good idea if you're dealing with gigabytes of data. Terabytes of data can probably be successfully handled with some queries but not others!

Removing these limitations is on my "to do" list, but as Rel is primarily intended as a teaching tool (where, in typical use, a few dozen tuples might be considered a large number) it hasn't been a high priority. Let me know how you get on, though; if you run into memory limitations, I'll try to quickly release an update that fixes them.

Regarding having another program interact with Rel: by default it uses stdin/stdout, but it can run as a background daemon accessible via the host's IP address on a specified port. It's intended to support the methods defined in relclient.jar, but the protocol is simple enough to implement in any language that supports sockets. Running java -jar Rel.jar --? will list the options. If you like, I'll document the protocol.
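
In the meantime, here's a rough sketch of the shape such a client might take in Java. The port number is arbitrary, and the plain-text request/response format is only an assumption until the protocol is documented:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RelSocketClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and port; use whatever the daemon is configured with.
        try (Socket socket = new Socket("localhost", 6543);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            // Assumes the server accepts a Tutorial D statement as plain text
            // and streams the result back; the real framing may differ.
            out.println("myRelvar WHERE x > 3;");
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}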