Eric, I'm really disappointed. Rather than writing anything at all actually, I opted to run the RandomBatchWriter example program.
It wasn't 35x faster. It was 52x faster. After all the excellent posts I've seen from you, I really expected a more precise guesstimate from you. ;-)

Thanks for the gentle nudge to do better than python and the accumulo shell. At a million rows inserted in 13 seconds, I'm certain the Accumulo cluster I've set up can handle the 2-5K records per second max we expect to throw at it.

Thanks again!

On Tue, Apr 30, 2013 at 1:47 PM, Eric Newton <[email protected]> wrote:

> I've probably written more python than Java, so I understand. :-)
>
> I've used Jython for scripting tests. In unreleased versions (1.4.4 &
> 1.5.0) the Proxy will let you use the language of your choice.
>
> -Eric
>
> On Tue, Apr 30, 2013 at 2:43 PM, Terry P. <[email protected]> wrote:
>
>> Hi Eric,
>> Thanks for the info. You've inspired me to dive into it in Java -- I had
>> been using the accumulo shell because I had a python data generation
>> script already in place and it was "faster" that way. But if a small java
>> program is going to be 35x "faster" than that, it makes no sense to
>> bother with the shell!
>>
>> Thanks,
>> Terry
>>
>> On Tue, Apr 30, 2013 at 11:01 AM, Eric Newton <[email protected]> wrote:
>>
>>> There's no need to flush... the shell is flushing after every single
>>> line.
>>>
>>> The flush you are invoking causes a minor compaction.
>>>
>>> If you wrote a quick java program to ingest the data, the data would
>>> load about 35x faster.
>>>
>>> -Eric
>>>
>>> On Mon, Apr 29, 2013 at 6:40 PM, Terry P. <[email protected]> wrote:
>>>
>>>> Perhaps having a configuration item to limit the size of the
>>>> shell_history.txt file would help avoid this in future?
>>>>
>>>> On Mon, Apr 29, 2013 at 5:37 PM, Terry P. <[email protected]> wrote:
>>>>
>>>>> You hit it John -- on the NameNode the shell_history.txt file is
>>>>> 128MB, and same thing on the DataNode that 99% of the data went to
>>>>> due to the key structure. On the other two datanodes it was tiny, and
>>>>> both could login fine (just my luck that the only datanode I tried
>>>>> after the load was the fat one).
>>>>>
>>>>> So is --disable-tab-completion supposed to skip reading the
>>>>> shell_history.txt file? It appears that is not the case with 1.4.2 as
>>>>> it still dies with OOM error.
>>>>>
>>>>> I now see that a better way to go would probably be to use
>>>>> --execute-file switch to read the load file rather than pipe it to
>>>>> the shell. Correct?
>>>>>
>>>>> On Mon, Apr 29, 2013 at 5:04 PM, John Vines <[email protected]> wrote:
>>>>>
>>>>>> Depending on your answer to Eric's question, I wonder if your
>>>>>> history is enough to blow it up. You may also want to check the size
>>>>>> of ~/.accumulo/shell_history.txt and see if that is ginormous.
>>>>>>
>>>>>> On Mon, Apr 29, 2013 at 5:07 PM, Terry P. <[email protected]> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>> I attempted to start the shell with --disable-tab-completion but it
>>>>>>> still failed in an identical manner. What is that feature/option?
>>>>>>>
>>>>>>> The ACCUMULO_OTHER_OPTS var was set to "-Xmx256m -Xms64m" via the
>>>>>>> 2GB example config script. I upped the -Xmx256m to 512m and the
>>>>>>> shell started successfully, so thanks!
>>>>>>>
>>>>>>> What would cause the shell to need more than 256m of memory just to
>>>>>>> start? I'd like to understand how to determine an appropriate value
>>>>>>> to set ACCUMULO_OTHER_OPTS to.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Terry
>>>>>>>
>>>>>>> On Mon, Apr 29, 2013 at 2:21 PM, John Vines <[email protected]> wrote:
>>>>>>>
>>>>>>>> The shell gets its memory config from the accumulo-env file from
>>>>>>>> ACCUMULO_OTHER_OPTS. If, for some reason, the value was low or
>>>>>>>> there was a lot of data being loaded for the tab completion stuff
>>>>>>>> in the shell, it could die. You can try upping that value in the
>>>>>>>> file or try running the shell with "--disable-tab-completion" to
>>>>>>>> see if that helps.
>>>>>>>>
>>>>>>>> On Mon, Apr 29, 2013 at 3:02 PM, Terry P. <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Greetings folks,
>>>>>>>>> I have stood up our 8-node Accumulo 1.4.2 cluster consisting of 3
>>>>>>>>> ZooKeepers, 1 NameNode (also runs Accumulo Master, Monitor, and
>>>>>>>>> GC), and 3 DataNodes / TabletServers (Secondary NameNode with
>>>>>>>>> Alternate Accumulo Master process will follow). The initial
>>>>>>>>> config files were copied from the 2GB/native-standalone
>>>>>>>>> directory.
>>>>>>>>>
>>>>>>>>> For a quick test I have a text file I generated to load 500,000
>>>>>>>>> rows of sample data using the Accumulo shell. For lack of a
>>>>>>>>> better place to run it this first time, I ran it on the NameNode.
>>>>>>>>> The script performs flushes every 10,000 records (about 30,000
>>>>>>>>> entries). After the load finished, when I attempt to login to the
>>>>>>>>> Accumulo Shell on the NameNode, I get the error:
>>>>>>>>>
>>>>>>>>> [root@edib-namenode ~]# /usr/lib/accumulo/bin/accumulo shell -u $AUSER -p $AUSERPWD
>>>>>>>>> #
>>>>>>>>> # java.lang.OutOfMemoryError: Java heap space
>>>>>>>>> # -XX:OnOutOfMemoryError="kill -9 %p"
>>>>>>>>> # Executing /bin/sh -c "kill -9 24899"...
>>>>>>>>> Killed
>>>>>>>>>
>>>>>>>>> The performance of that test was pretty poor at about 160/second
>>>>>>>>> (somewhat expected, as it was just one thread) so to keep moving
>>>>>>>>> I generated 3 different load files and ran one on each of the 3
>>>>>>>>> DataNodes / TabletServers. Performance was much better,
>>>>>>>>> sustaining 1,400 per second. Again, the test data load files have
>>>>>>>>> flush commands every 10,000 records (30,000 entries), including
>>>>>>>>> at the end of the file.
>>>>>>>>>
>>>>>>>>> However, as with the NameNode, now I cannot login to the Accumulo
>>>>>>>>> shell on any of the DataNodes either, as I get the same
>>>>>>>>> OutOfMemoryError.
>>>>>>>>>
>>>>>>>>> My /etc/security/limits.conf file is set with 64000 for nofile
>>>>>>>>> and 32000 for nproc for the hdfs user (which is also running
>>>>>>>>> Accumulo, I haven't split accumulo out yet).
>>>>>>>>>
>>>>>>>>> I don't see any errors in the tserver or logger logs (standard
>>>>>>>>> and debug) or any info related to the shell failing to load. I'm
>>>>>>>>> at a loss with respect to where to look. The servers have 16GB of
>>>>>>>>> memory, and each has about 14GB currently free.
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Terry
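
For anyone skimming the thread for the numbers, here is a rough back-of-the-envelope check of the ingest rates. It is only a sketch: the helper names are made up for illustration, and the inputs are just the figures quoted in the messages above (a million rows in 13 seconds, and an expected peak of 5K records per second).

```python
# Back-of-the-envelope throughput check using only figures quoted in the thread.
# The function and variable names are hypothetical, chosen for illustration.

def rows_per_second(rows: int, seconds: float) -> float:
    """Average ingest rate for a load of `rows` rows that took `seconds` seconds."""
    return rows / seconds


def headroom(measured_rate: float, required_rate: float) -> float:
    """How many times larger the measured rate is than the required rate."""
    return measured_rate / required_rate


# RandomBatchWriter run reported at the top of the thread.
batch_writer_rate = rows_per_second(1_000_000, 13)

# Upper end of the "2-5K records per second max" requirement.
peak_requirement = 5_000

print(f"BatchWriter ingest rate: {batch_writer_rate:,.0f} rows/s")
print(f"Headroom over peak requirement: {headroom(batch_writer_rate, peak_requirement):.1f}x")
```

On those figures the BatchWriter run averages roughly 77K rows per second, about 15 times the stated peak requirement. The thread does not say which shell run the 52x comparison was measured against, so that ratio is left out.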
