On Mon, Sep 26, 2011 at 3:51 PM, Marcel Bruch <[email protected]> wrote:
> Thanks Stefan. I gave it a try. Could you or someone else comment on
> the code and its performance?
>
> I wrote a fairly ad-hoc dump of the 5900 data files into Jackrabbit.
> Storing ~240 MB took roughly 3 minutes. Is this the expected time such
> an operation takes? Is it possible to improve the performance somehow?

the performance seems rather poor. it's hard to tell what's wrong
without having the test data. i noticed that you're storing the
content of the .json files as string properties. why aren't you
storing the json data as nodes & properties?
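to illustrate what storing the json as nodes & properties might mean, here is a minimal, hypothetical sketch (class and method names are made up; it only models the mapping — real code would call Node.addNode() for nested objects and Node.setProperty() for scalars):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// hypothetical sketch: flatten parsed json (nested maps) into
// node-path -> value pairs, i.e. the structure a jcr import would
// create with addNode()/setProperty() instead of one big string property
public class JsonToNodes {

    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(String path, Map<String, Object> json,
                                       Map<String, Object> out) {
        for (Map.Entry<String, Object> e : json.entrySet()) {
            String childPath = path + "/" + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                // a nested json object becomes a child node
                flatten(childPath, (Map<String, Object>) value, out);
            } else {
                // scalars (and, in this sketch, arrays) become properties
                out.put(childPath, value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> method = new LinkedHashMap<>();
        method.put("returnType", "void");
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("name", "Foo");
        json.put("method", method);
        System.out.println(flatten("", json, new LinkedHashMap<>()));
    }
}
```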

anyway, i quickly ran an adapted ad hoc test on my machine
(macbook pro 2.66 ghz, standard harddisk). the test imports
an 'svn export' of jackrabbit/trunk.

importing ~6500 files takes ~30s which is IMO decent.

cheers
stefan


/////////////////////////////////////////////////////////////////////////////////////////////////////////
import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import java.io.File;
import java.io.FileInputStream;
import java.util.Calendar;

public class JcrArtifactStoreTest {

    static final String FILE_ROOT = "/Users/stefan/tmp/jackrabbit-src/";

    static final boolean STORE_BINARY = false;

    static int count = 0;
    static long size = 0;
    static long ts = 0;

    public static void main(String[] args) throws Exception {

        TransientRepository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));

        ts = System.currentTimeMillis();
        long ts0 = ts;

        importNode(new File(FILE_ROOT), session.getRootNode());

        session.save();

        long ts1 = System.currentTimeMillis();
        System.out.printf("%d ms: %d units persisted. data %s\n", ts1 - ts, count,
                FileUtils.byteCountToDisplaySize(size));
        ts = ts1;

        System.out.println("Total time: " + (ts1 - ts0) + " ms");
    }

    static void importNode(File file, Node parent) throws Exception {
        if (file.isDirectory()) {
            Node newNode = parent.addNode(file.getName(), "nt:folder");
            File[] children = file.listFiles();
            if (children != null) {
                for (int i = 0; i < children.length; i++) {
                    importNode(children[i], newNode);
                }
            }
        } else {
            Node newNode = parent.addNode(file.getName(), "nt:file");
            String nt = STORE_BINARY ? "nt:resource" : "nt:unstructured";
            Node content = newNode.addNode("jcr:content", nt);
            if (STORE_BINARY) {
                // ensure the stream is closed once the value has been set
                FileInputStream in = new FileInputStream(file);
                try {
                    content.setProperty("jcr:data", in);
                } finally {
                    in.close();
                }
            } else {
                content.setProperty("jcr:data", FileUtils.readFileToString(file));
            }
            content.setProperty("jcr:lastModified", Calendar.getInstance());
            content.setProperty("jcr:mimeType", "application/octet-stream");

            size += file.length();
            if (++count % 500 == 0) {
                parent.getSession().save();

                long ts1 = System.currentTimeMillis();

                System.out.printf("%d ms: %d units persisted. data %s\n", ts1 - ts, count,
                        FileUtils.byteCountToDisplaySize(size));
                ts = ts1;
            }
        }
    }
}
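stefan's package-space suggestion further down in the thread (org.apache.jackrabbit.core.TransientRepository -> /org/apache/jackrabbit/core/TransientRepository) boils down to a one-line path derivation; a hypothetical sketch, with a made-up class and method name:

```java
// hypothetical sketch: map a fully qualified java class name to a
// jcr path along the package hierarchy
public class PackagePaths {

    static String toJcrPath(String fullyQualifiedName) {
        // org.apache.jackrabbit.core.TransientRepository
        //   -> /org/apache/jackrabbit/core/TransientRepository
        return "/" + fullyQualifiedName.replace('.', '/');
    }

    public static void main(String[] args) {
        System.out.println(toJcrPath("org.apache.jackrabbit.core.TransientRepository"));
    }
}
```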



>
> The code I used to persist data is given below. The pure IO time w/o
> jackrabbit is ~1 second w/ a solid state disk.
>
> Thanks for your comments,
> Marcel
>
> Mon Sep 26 15:39:05 CEST 2011: 200 units persisted.  data 5 MB
> Mon Sep 26 15:39:11 CEST 2011: 400 units persisted.  data 13 MB
> Mon Sep 26 15:39:21 CEST 2011: 600 units persisted.  data 21 MB
> Mon Sep 26 15:39:31 CEST 2011: 800 units persisted.  data 28 MB
> Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted.  data 33 MB
> Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted.  data 42 MB
> Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted.  data 49 MB
> Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted.  data 57 MB
> Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted.  data 65 MB
> Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted.  data 72 MB
> Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted.  data 88 MB
> Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted.  data 94 MB
> Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted.  data 102 MB
> Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted.  data 107 MB
> Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted.  data 113 MB
> Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted.  data 123 MB
> Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted.  data 129 MB
> Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted.  data 136 MB
> Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted.  data 140 MB
> Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted.  data 143 MB
> Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted.  data 154 MB
> Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted.  data 164 MB
> Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted.  data 185 MB
> Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted.  data 193 MB
> Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted.  data 204 MB
> Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted.  data 211 MB
> Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted.  data 218 MB
> Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted.  data 226 MB
> Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted.  data 235 MB
> Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted
>
>
> public class JcrArtifactStoreTest {
>
>    private TransientRepository repository;
>    private Session session;
>
>    @Before
>    public void setup() throws RepositoryException {
>
>        final File basedir = new File("recommenders/").getAbsoluteFile();
>        basedir.mkdir();
>        repository = new TransientRepository(basedir);
>        session = repository.login(
>                new SimpleCredentials("username", "password".toCharArray()));
>    }
>
>    @Test
>    public void test2() throws ConfigurationException, RepositoryException, IOException {
>
>        int i = 0;
>        int size = 0;
>        final Iterator<File> it = findDataFiles();
>        final Node rootNode = session.getRootNode();
>
>        while (it.hasNext()) {
>            final File file = it.next();
>            Node activeNode = rootNode;
>            for (final String segment : new Path(file.getAbsolutePath()).segments()) {
>                activeNode = JcrUtils.getOrAddNode(activeNode, segment);
>            }
>            // System.out.println(activeNode.getPath());
>            final String content = Files.toString(file, Charsets.UTF_8);
>            size += content.getBytes().length;
>            activeNode.setProperty("cu", content);
>            if (++i % 200 == 0) {
>                session.save();
>                System.out.printf("%s: %d units persisted.  data %s\n", new Date(), i,
>                        FileUtils.byteCountToDisplaySize(size));
>            }
>        }
>        session.save();
>        System.out.printf("%s: %d units persisted\n", new Date(), i);
>    }
>
>    @SuppressWarnings("unchecked")
>    private Iterator<File> findDataFiles() {
>        return FileUtils.iterateFiles(
>                new File("/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
>                FileFilterUtils.suffixFileFilter(".json"), TrueFileFilter.TRUE);
>    }
> }
>
>
>
>
> 2011/9/26 Stefan Guggisberg <[email protected]>:
>> hi marcel,
>>
>> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm looking for some advice whether Jackrabbit might be a good choice for 
>>> my problem. Any comments on this are greatly appreciated.
>>>
>>>
>>> = Short description of the challenge =
>>>
>>> We've built an Eclipse-based tool that analyzes Java source files and stores
>>> its analysis results in additional files. The workspace potentially has
>>> hundreds of projects, and each project may have up to a few thousand
>>> files. Say there are 200 projects and 1000 Java source files per
>>> project in a single workspace; then there will be 200*1000 = 200,000 files.
>>>
>>> On a full workspace build, all these 200k files have to be compiled (by the
>>> IDE) and analyzed (by our tool) at once, and the analysis results have to be
>>> written to disk rather quickly.
>>> But the most common use case is that a single file is changed several times
>>> per minute and thus gets analyzed frequently.
>>>
>>> At the moment, the analysis results are dumped to disk as plain JSON files,
>>> one JSON file per Java class. Each JSON file is around 5 to 100 KB in
>>> size; some files grow to several megabytes (<10 MB). These files have a
>>> few hundred complex JSON nodes (which might map perfectly to nodes in JCR).
>>>
>>> = Question =
>>>
>>> We would like to replace the simple file system approach with a more
>>> sophisticated one, and I wonder whether Jackrabbit may be a suitable
>>> backend for this use case. Since we map all our data to JSON already, it
>>> looks like Jackrabbit/JCR is a perfect fit, but I can't say for
>>> sure.
>>>
>>> What's your suggestion? Is Jackrabbit capable of quickly loading and storing
>>> JSON-like data, even if 200k files (nodes plus their sub-nodes) have to be
>>> updated in a very short time?
>>
>> absolutely. if the data is reasonably structured/organized, jackrabbit
>> should be a perfect fit.
>> i suggest leveraging the java package space hierarchy for organizing the data
>> (i.e. org.apache.jackrabbit.core.TransientRepository ->
>> /org/apache/jackrabbit/core/TransientRepository).
>> for further data modeling recommendations see [0].
>>
>> cheers
>> stefan
>>
>> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>>
>>>
>>>
>>> Thanks for your suggestions. If you need more details on what operations
>>> are performed or what the data looks like, I would be glad to take your
>>> questions.
>>>
>>> Marcel
>>>
>
